Stereo R-CNN based 3D Object Detection for Autonomous Driving (1902.09738v2)

Published 26 Feb 2019 in cs.CV and cs.RO

Abstract: We propose a 3D object detection method for autonomous driving by fully exploiting the sparse and dense, semantic and geometry information in stereo imagery. Our method, called Stereo R-CNN, extends Faster R-CNN for stereo inputs to simultaneously detect and associate object in left and right images. We add extra branches after stereo Region Proposal Network (RPN) to predict sparse keypoints, viewpoints, and object dimensions, which are combined with 2D left-right boxes to calculate a coarse 3D object bounding box. We then recover the accurate 3D bounding box by a region-based photometric alignment using left and right RoIs. Our method does not require depth input and 3D position supervision, however, outperforms all existing fully supervised image-based methods. Experiments on the challenging KITTI dataset show that our method outperforms the state-of-the-art stereo-based method by around 30% AP on both 3D detection and 3D localization tasks. Code has been released at https://github.com/HKUST-Aerial-Robotics/Stereo-RCNN.

Authors (3)

Peiliang Li (15 papers)
Xiaozhi Chen (18 papers)
Shaojie Shen (121 papers)

Citations (480)


Summary

The paper presents a Stereo R-CNN framework that simultaneously detects and associates objects across stereo images to form coarse 3D bounding boxes.
It estimates semantic keypoints to constrain the coarse 3D box by minimizing reprojection error, then refines the box with dense, region-based photometric alignment of the left and right RoIs.
Evaluation on the KITTI dataset shows a gain of around 30% AP in both 3D detection and 3D localization over the state-of-the-art stereo-based method.

Stereo R-CNN based 3D Object Detection for Autonomous Driving

The paper presents an approach to 3D object detection for autonomous driving built on a Stereo R-CNN architecture. It exploits the geometric and semantic information available in stereo imagery, avoiding both the cost of LiDAR and the depth ambiguity of monocular cameras.

Methodology and Network Architecture

The proposed Stereo R-CNN method extends the Faster R-CNN framework to handle stereo image pairs, detecting and associating objects across left and right images simultaneously. The architecture is composed of a Stereo Region Proposal Network (RPN) followed by additional branches for predicting sparse keypoints, viewpoints, and object dimensions. These outputs aid in constructing a coarse 3D bounding box, further refined using a dense photometric alignment technique.
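To illustrate why associated left-right boxes carry depth information, here is a minimal sketch assuming a rectified stereo pair with focal length f (in pixels) and baseline b (in metres), where depth follows from horizontal disparity as z = f * b / d. The function name and all numeric values are illustrative assumptions, not taken from the paper, which solves for the full coarse 3D box using keypoints, viewpoint, and dimensions as well.

```python
def depth_from_box_disparity(left_box, right_box, focal_px, baseline_m):
    """Coarse object depth from the horizontal shift between paired 2D boxes.

    Boxes are (u_min, v_min, u_max, v_max) in pixels on rectified images.
    """
    left_center_u = 0.5 * (left_box[0] + left_box[2])
    right_center_u = 0.5 * (right_box[0] + right_box[2])
    disparity = left_center_u - right_center_u
    if disparity <= 0:
        raise ValueError("disparity must be positive for an object in front of the cameras")
    return focal_px * baseline_m / disparity


# Hypothetical boxes and camera parameters, chosen only for illustration.
left = (600.0, 180.0, 700.0, 240.0)
right = (580.0, 180.0, 680.0, 240.0)
z = depth_from_box_disparity(left, right, focal_px=720.0, baseline_m=0.54)
print(round(z, 2))
```

Because a one-pixel error in box position shifts the disparity directly, an estimate of this kind is only coarse, which is the motivation for the later pixel-level refinement.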

Notably, the method requires neither depth input nor 3D position supervision. The architecture uses ResNet-101 with FPN as its feature-extraction backbone, concatenates left and right features for the stereo-specific stages, and balances the losses of its multiple tasks during training.
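The summary does not say how the task losses are balanced. One common scheme is uncertainty weighting, in which each task has a learnable log-variance s and contributes exp(-s) * loss + s to the total. The sketch below uses that simplified form as an assumption. The task names and loss values are hypothetical, and in real training the log-variances would be optimized together with the network weights.

```python
import math


def balanced_loss(task_losses, log_vars):
    """Combine per-task losses using uncertainty weighting (simplified form).

    A larger log-variance down-weights its task, while the additive
    log-variance term keeps the weights from collapsing to zero.
    """
    total = 0.0
    for name, loss in task_losses.items():
        s = log_vars[name]
        total += math.exp(-s) * loss + s
    return total


# Hypothetical per-task losses for one training step.
losses = {"rpn": 0.8, "box": 0.5, "viewpoint": 0.3, "dimension": 0.2, "keypoint": 0.6}
log_vars = {name: 0.0 for name in losses}  # zero log-variance gives a plain sum
print(balanced_loss(losses, log_vars))
```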

Key Contributions

Stereo R-CNN Framework: Object instances are detected and associated between the left and right images simultaneously, so association requires no additional computation.
3D Box Estimation: Semantic keypoints are combined with the left-right 2D boxes to estimate a coarse 3D bounding box by minimizing reprojection error under multiple geometric constraints.
Dense Photometric Alignment: The left and right RoIs are aligned at the pixel level to refine the object's depth. Because the object is treated as a regular shape, depth is solved for the box as a whole rather than from independently estimated per-pixel disparities, which makes the result less sensitive to errors at individual pixels.
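The idea behind dense photometric alignment can be sketched on a single scanline. Each RoI pixel has a known depth offset from the box center (the regular-shape assumption), so one unknown, the center depth, determines every pixel's disparity. The sketch picks the center depth that minimizes the summed squared intensity difference between the left pixels and their warped positions in the right image. This is a simplified brute-force illustration on synthetic data. The function names, the 1D setting, and all values are assumptions, not the paper's implementation.

```python
import math


def sample_linear(row, u):
    """Linearly interpolate a 1D intensity row at fractional column u."""
    u0 = int(math.floor(u))
    t = u - u0
    return (1.0 - t) * row[u0] + t * row[u0 + 1]


def photometric_error(left_row, right_row, roi_cols, depth_offsets, z_center, fb):
    """Sum of squared differences for one candidate center depth.

    fb is focal length (pixels) times baseline (metres).
    """
    err = 0.0
    for u, dz in zip(roi_cols, depth_offsets):
        disparity = fb / (z_center + dz)
        diff = left_row[u] - sample_linear(right_row, u - disparity)
        err += diff * diff
    return err


def align_depth(left_row, right_row, roi_cols, depth_offsets, candidates, fb):
    """Return the candidate center depth with the lowest photometric error."""
    return min(
        candidates,
        key=lambda z: photometric_error(left_row, right_row, roi_cols, depth_offsets, z, fb),
    )


# Synthetic scanline: the left RoI is the right image warped at a known depth.
fb = 720.0 * 0.54
right_row = [math.sin(0.3 * u) + 0.5 * math.sin(0.11 * u) for u in range(200)]
roi_cols = list(range(100, 140))
depth_offsets = [0.02 * (u - 120) for u in roi_cols]  # slanted object face
z_true = 20.0
left_row = [0.0] * 200
for u, dz in zip(roi_cols, depth_offsets):
    left_row[u] = sample_linear(right_row, u - fb / (z_true + dz))

candidates = [10.0 + 0.5 * k for k in range(41)]  # 10 m to 30 m in 0.5 m steps
z_best = align_depth(left_row, right_row, roi_cols, depth_offsets, candidates, fb)
print(z_best)
```

Because all pixels vote for a single depth value, noise at any one pixel has limited influence on the result, unlike a per-pixel disparity map where each estimate stands alone.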

Experimental Evaluation

The paper evaluates the method on the KITTI dataset, where it outperforms existing image-based methods in 3D detection and localization. The margin over the state-of-the-art stereo-based method is around 30% AP on both tasks.

Implications and Future Directions

The results have compelling implications for real-world autonomous driving systems, primarily due to the cost-effectiveness and potential scalability of stereo vision systems compared to LiDAR. This approach could lead to more affordable autonomous systems without sacrificing depth accuracy critical for navigation.

Future research could explore the extension of this framework to multi-object tracking and the integration of instance segmentation for more accurate RoI refinement. Additionally, incorporating learned shape priors could improve the generalizability of the system across different classes of objects and scenes.

In conclusion, the Stereo R-CNN approach marks a substantial step forward in leveraging stereo imagery for 3D object detection in autonomous driving, offering a promising alternative to traditional sensor configurations.


GitHub

GitHub - HKUST-Aerial-Robotics/Stereo-RCNN: Code for 'Stereo R-CNN based 3D Object Detection for Autonomous Driving' (CVPR 2019) (699 stars)