- The paper introduces DSGN, a novel network that employs a differentiable 3D geometric volume to jointly learn depth estimation and 3D object detection from stereo images.
- The end-to-end pipeline integrates pixel-level stereo matching with high-level semantic extraction, narrowing the performance gap with LiDAR-based methods.
- Results show a significant 10 AP point improvement over previous stereo detectors, demonstrating potential for cost-effective 3D object detection.
DSGN: Deep Stereo Geometry Network for 3D Object Detection
The paper presents DSGN (Deep Stereo Geometry Network), a novel methodology for 3D object detection from stereo vision that significantly narrows the performance gap between image-based and LiDAR-based detection methods. Image-based 3D detectors have traditionally lagged behind LiDAR-based ones in accuracy because they struggle to form an effective representation for prediction in 3D space. By addressing this representational gap directly, the authors provide developments that could lead to more practical image-based 3D object detection solutions.
Overview of DSGN
DSGN achieves 3D detection through a differentiable volumetric representation, the 3D geometric volume (3DGV), which encodes the geometric structure of the 3D scene. The network learns depth information and semantic cues simultaneously, establishing the first one-stage stereo-based 3D detection pipeline in which depth estimation and object detection are optimized jointly and end to end. It demonstrates marked improvements over previous stereo-based detectors, scoring approximately 10 points higher in Average Precision (AP), and performs comparably to several LiDAR-based methods on the KITTI leaderboard, underscoring its practical significance.
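To make the one-stage structure concrete, the sketch below shows how stereo features, a plane-sweep cost volume, a depth head, and a detection head can be coupled in a single differentiable forward pass. This is a minimal PyTorch illustration assuming toy module sizes, a crude shift-based warp, and simplified heads; none of the module names or hyperparameters come from the authors' implementation.

```python
# Minimal sketch of a DSGN-style one-stage stereo detection pipeline.
# All module sizes and heads are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StereoDetectionPipeline(nn.Module):
    def __init__(self, feat_channels=32, num_depth_bins=64):
        super().__init__()
        self.num_depth_bins = num_depth_bins
        # Shared 2D backbone whose features serve BOTH stereo matching and
        # semantic recognition (a stand-in for the paper's modified backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_channels, 3, stride=4, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
        )
        # 3D convolutions over the plane-sweep cost volume.
        self.cost_net = nn.Sequential(
            nn.Conv3d(2 * feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_channels, feat_channels, 3, padding=1),
        )
        # Toy heads: per-pixel depth distribution and a dense detection map.
        self.depth_head = nn.Conv3d(feat_channels, 1, 1)
        self.det_head = nn.Conv2d(feat_channels, 7, 1)  # e.g. (x, y, z, w, l, h, score)

    def build_plane_sweep(self, left_feat, right_feat):
        # Pair left features with right features shifted by each candidate
        # disparity, giving a (B, 2C, D, H, W) volume.
        volumes = []
        for d in range(self.num_depth_bins):
            shifted = torch.roll(right_feat, shifts=d, dims=-1)  # crude warp stand-in
            volumes.append(torch.cat([left_feat, shifted], dim=1))
        return torch.stack(volumes, dim=2)

    def forward(self, left_img, right_img):
        left_feat = self.backbone(left_img)
        right_feat = self.backbone(right_img)
        cost = self.cost_net(self.build_plane_sweep(left_feat, right_feat))
        depth_logits = self.depth_head(cost).squeeze(1)   # (B, D, H, W)
        depth_prob = F.softmax(depth_logits, dim=1)       # per-pixel depth distribution
        # Collapse the depth axis into a bird's-eye-style feature map before
        # detection (the real model first warps the volume to a 3D world grid).
        bev_feat = cost.mean(dim=2)                        # (B, C, H, W)
        detections = self.det_head(bev_feat)
        return depth_prob, detections
```

In the actual method the cost volume is warped into a regular world-space grid (the 3DGV) before the detection head; the sketch collapses the depth axis only to keep the example short, but it preserves the key property that both outputs share one differentiable backbone.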
Methodological Innovations
- 3D Geometric Volume: The key innovation is transforming a plane-sweep volume constructed in the camera frustum into regular 3D world space, so that 3D geometric and semantic cues can be used for prediction in a regular 3D grid through a fully differentiable transformation (a minimal sketch of this warping follows the list).
- End-to-End Pipeline: Using stereo images, DSGN integrates the pixel-level features needed for stereo matching with the high-level features needed for object recognition, performing depth estimation and semantic feature extraction jointly in one network.
- Network Architecture: The model adopts a modified feature-extraction backbone whose features serve both robust semantic recognition and stereo correspondence, a critical prerequisite for accurate depth estimation.
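As referenced in the first item, the following sketch shows one plausible realization of the differentiable frustum-to-world warping using trilinear sampling via torch.nn.functional.grid_sample. The grid extents, depth binning, and the assumption that the camera intrinsics are already scaled to the volume's spatial resolution are illustrative choices, not the paper's exact formulation.

```python
# Hedged sketch of warping a plane-sweep (camera-frustum) volume into a
# regular 3D world-space grid via trilinear sampling. Grid ranges, bin
# counts, and intrinsics handling are assumptions for illustration.
import torch
import torch.nn.functional as F

def frustum_to_world_volume(psv, intrinsics, depth_min, depth_max,
                            x_range=(-30.0, 30.0), y_range=(-1.0, 3.0),
                            z_range=(2.0, 40.0), grid_size=(64, 8, 64)):
    """psv: plane-sweep volume (B, C, D, H, W); intrinsics: (fx, fy, cx, cy),
    assumed scaled to the volume's spatial resolution.
    Returns a volume defined on a regular world-space grid, shape (B, C, X, Y, Z)."""
    B, C, D, H, W = psv.shape
    fx, fy, cx, cy = intrinsics
    nx, ny, nz = grid_size
    device = psv.device

    xs = torch.linspace(*x_range, nx, device=device)
    ys = torch.linspace(*y_range, ny, device=device)
    zs = torch.linspace(*z_range, nz, device=device)
    X, Y, Z = torch.meshgrid(xs, ys, zs, indexing="ij")   # voxel centers (nx, ny, nz)

    # Pinhole projection of every voxel center into the image plane.
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    # Locate each voxel's depth within the plane-sweep depth bins.
    d = (Z - depth_min) / (depth_max - depth_min) * (D - 1)

    # Normalize (u, v, d) to [-1, 1] as grid_sample expects; last dim is
    # ordered (x -> W, y -> H, z -> D). Out-of-view voxels receive zeros.
    grid = torch.stack([
        2.0 * u / (W - 1) - 1.0,
        2.0 * v / (H - 1) - 1.0,
        2.0 * d / (D - 1) - 1.0,
    ], dim=-1)                                            # (nx, ny, nz, 3)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1, -1)

    # Trilinear sampling keeps the whole mapping differentiable.
    return F.grid_sample(psv, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Because grid_sample is differentiable with respect to its input, gradients from the detection loss flow back through the sampled world-space volume into the stereo cost volume, which is what enables joint training of stereo matching and detection.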
Numerical Results and Claims
The authors substantiate their claims with numerical evidence, showing that DSGN outperforms contemporary stereo-based methods by a significant margin in Average Precision, where gains of this size denote substantial advancement. Notably, DSGN surpasses previous stereo-based detectors and reaches accuracy comparable to several LiDAR-based detectors, reflecting its efficacy across multiple challenging scenarios.
Implications and Future Directions
The implications of this research are considerable: the ability to exploit stereo information effectively could yield cost-effective alternatives to LiDAR in scenarios where dense stereo correspondences are viable.
The work also reveals potential paths for future research, such as refining the volumetric transformation, improving model efficiency, and strengthening the feature-extraction backbone. Exploring how stereo-matching quality interacts with detection accuracy could provide core insights for building more robust networks.
Conclusion
While DSGN makes significant contributions, further work is needed to leverage stereo-based detection efficiently in real-world applications. Still, the demonstrated advancements provide a promising platform for developing economically viable 3D detection systems, bringing stereo vision methods closer to the performance standards traditionally set by LiDAR-based approaches. The research marks a pivotal step toward democratizing 3D object detection technology, aligning it with practical applications that demand efficient, reliable detection without the high costs associated with LiDAR sensors.