DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection
The paper "DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection" presents a sophisticated approach to enhance 3D object detection in autonomous driving by integrating data from lidar and camera sensors. This research focuses on refining the fusion of camera features with deep lidar features, deviating from previous practices of merely decorating lidar point clouds at the input level. The authors introduce innovative techniques, InverseAug and LearnableAlign, to address the critical challenge of effectively aligning transformed features across modalities for improved detection performance.
Contribution and Techniques
The paper outlines the limitations of prevalent multi-modal methods that fuse camera information with lidar at the raw input stage. The authors argue that fusing the two modalities at the deep feature level, although harder because the features must first be aligned, can yield significantly better results. They propose two techniques to make this alignment tractable:
- InverseAug: Geometry-related augmentations such as rotation move lidar points away from the coordinates for which the camera calibration is valid. InverseAug reverses these augmentations before projecting into the image, so augmented lidar data can still be matched to the correct camera pixels.
- LearnableAlign: Leveraging cross-attention, this technique lets each lidar feature dynamically weight the camera features it corresponds to, improving alignment quality during fusion (illustrative sketches of both techniques appear below).
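As a rough illustration of the first idea, the sketch below undoes a single z-axis rotation before projecting lidar points into the image. The function names, shapes, and the assumption that rotation is the only geometric augmentation are illustrative choices, not the paper's implementation.

```python
# Minimal InverseAug-style sketch (assumed names and shapes, not the authors' code):
# undo the geometric augmentation, then project with the original calibration.
import numpy as np

def rotate_z(points, angle_rad):
    """Rotate (N, 3) lidar points about the z axis -- the augmentation itself."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def inverse_aug_and_project(aug_points, angle_rad, lidar_to_image):
    """Recover original coordinates, then project them into the camera image.

    aug_points     : (N, 3) lidar points *after* augmentation
    angle_rad      : rotation angle applied during augmentation
    lidar_to_image : (3, 4) projection matrix (camera intrinsics composed with
                     the lidar-to-camera extrinsics)
    returns        : (N, 2) pixel coordinates used to look up camera features
    """
    # InverseAug step: the camera calibration only holds for unaugmented
    # coordinates, so apply the inverse rotation first.
    original = rotate_z(aug_points, -angle_rad)

    # Standard homogeneous pinhole projection (points behind the camera would
    # be filtered out in practice).
    homogeneous = np.concatenate([original, np.ones((len(original), 1))], axis=1)
    proj = homogeneous @ lidar_to_image.T
    return proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)
```

In the full method, the same idea extends to the other geometry-related augmentations, reversed in the opposite order to how they were applied.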
These techniques form the foundation of a family of multi-modal 3D detection models termed "DeepFusion," which are demonstrated to be more accurate and robust than their predecessors.
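To make the second technique concrete, the cross-attention at the core of LearnableAlign can be pictured as a single-head attention step in which each voxel's lidar feature queries the camera features it projects to. The sketch below is a minimal PyTorch illustration under assumed dimensions and module names; the concatenation-based fusion at the end is likewise an assumption rather than the authors' exact architecture.

```python
# Minimal LearnableAlign-style cross-attention sketch (assumed shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableAlignSketch(nn.Module):
    def __init__(self, lidar_dim=128, cam_dim=64, attn_dim=64):
        super().__init__()
        self.q_proj = nn.Linear(lidar_dim, attn_dim)    # query from the lidar feature
        self.k_proj = nn.Linear(cam_dim, attn_dim)      # keys from camera features
        self.v_proj = nn.Linear(cam_dim, attn_dim)      # values from camera features
        self.out_proj = nn.Linear(attn_dim, lidar_dim)  # fold attended camera info back

    def forward(self, lidar_feat, cam_feats):
        """
        lidar_feat : (B, lidar_dim)     one deep feature per voxel/pillar
        cam_feats  : (B, M, cam_dim)    the M camera features that voxel projects to
                                        (found via InverseAug-corrected projection)
        returns    : (B, 2 * lidar_dim) lidar feature fused with aligned camera info
        """
        q = self.q_proj(lidar_feat).unsqueeze(1)   # (B, 1, attn_dim)
        k = self.k_proj(cam_feats)                 # (B, M, attn_dim)
        v = self.v_proj(cam_feats)                 # (B, M, attn_dim)

        # Attention weights: how relevant each camera feature is to this voxel.
        attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        aligned = self.out_proj((attn @ v).squeeze(1))  # (B, lidar_dim)

        # Fuse by concatenation; downstream detector layers consume the result.
        return torch.cat([lidar_feat, aligned], dim=-1)
```

Roughly, this fused representation stands in for the lidar-only deep feature that a baseline such as PointPillars or CenterPoint would otherwise feed to its detection head.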
Numerical Results and Performance
The DeepFusion models are evaluated extensively and show substantial improvements over existing methods. On the Waymo Open Dataset, Pedestrian detection improves by 6.7, 8.9, and 6.2 LEVEL_2 APH over the PointPillars, CenterPoint, and 3D-MAN baselines, respectively. The models achieve state-of-the-art results and remain robust to input corruptions and out-of-distribution data, with particular gains highlighted for long-range object detection.
Implications and Future Directions
The implications of this research are twofold. Practically, the gains in accuracy and robustness matter directly for real-world deployment of autonomous driving systems. Theoretically, the introduction of feature-level alignment techniques could inspire new lines of inquiry into cross-modal correlation mechanisms.
Future developments could explore the integration of additional modalities or the refinement of alignment techniques to handle even more complex transformations and interactions between sensor inputs. There is also potential for adapting these methods into different applications beyond autonomous vehicles, such as robotics and augmented reality.
Conclusion
The paper presents a compelling argument for deep feature-level fusion in multi-modal 3D detection systems, substantiated by robust experimentation and significant performance improvements. By proposing InverseAug and LearnableAlign techniques, the authors address a critical challenge in sensor fusion, positioning their work as a valuable contribution to the field of computer vision and autonomous systems.