Overview of "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers"
The paper, "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers," addresses the integration of LiDAR and camera data for enhancing 3D object detection within autonomous driving applications. The authors propose a novel fusion mechanism that alleviates issues stemming from noise and misalignment in inferior image conditions.
Key Contributions and Methodology
The core contribution is a soft-association mechanism, built on a transformer architecture, for robust LiDAR-camera fusion. Unlike the hard association used by earlier methods, which maps LiDAR points to image pixels through fixed calibration matrices, the association here is learned by attention, so the model degrades gracefully under calibration errors and poor lighting.
TransFusion Architecture:
- Transformer-Based Detector: The detection head consists of two transformer decoder layers. The first layer predicts initial 3D bounding boxes from LiDAR features using a small set of sparse object queries; the second layer adaptively refines each query by attending to relevant image features (see the decoder sketch after this list).
- Soft-Association Mechanism: The cross-attention inherent in transformers lets each object query select image features based on learned contextual and spatial relationships rather than a fixed pixel correspondence, which keeps performance stable under degraded image quality.
- Image-Guided Query Initialization: To further improve detection, an image-guided strategy initializes the object queries by projecting image features onto the bird's-eye-view (BEV) plane, helping the detector find objects that are hard to see in the LiDAR point cloud alone (see the second sketch below).
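To make the two-layer decoding and the soft association concrete, here is a minimal PyTorch sketch. It is not the authors' implementation: the module names, feature dimensions, and box parameterization are illustrative assumptions, the real model initializes queries from a heatmap rather than a learned embedding (see the next sketch), and it adds positional encodings, feed-forward sublayers, and a spatial prior on the attention.

```python
import torch
import torch.nn as nn

class TwoStageDecoderSketch(nn.Module):
    """Minimal sketch of TransFusion-style two-layer decoding.
    All dimensions, heads, and head outputs are illustrative assumptions."""

    def __init__(self, d_model=128, num_queries=200, num_heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # sparse object queries
        # Layer 1: queries attend to LiDAR BEV features -> initial boxes.
        self.lidar_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Layer 2: queries attend to image features; the attention weights
        # form the soft association between queries and pixels.
        self.image_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.box_head = nn.Linear(d_model, 10)  # e.g. center, size, yaw, velocity

    def forward(self, lidar_bev, image_feats):
        # lidar_bev:   (B, H*W, d_model)   flattened LiDAR BEV feature map
        # image_feats: (B, N_pix, d_model) flattened image feature map
        b = lidar_bev.size(0)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        # Stage 1: LiDAR-only predictions from sparse queries.
        q, _ = self.lidar_attn(q, lidar_bev, lidar_bev)
        initial_boxes = self.box_head(q)
        # Stage 2: soft association -- each query takes a learned, weighted
        # mixture of image features instead of one calibrated pixel.
        q, attn_weights = self.image_attn(q, image_feats, image_feats)
        refined_boxes = self.box_head(q)
        return initial_boxes, refined_boxes, attn_weights
```

The key point is stage 2: each query receives a weighted mixture over all image positions, so a small calibration error merely shifts attention weights rather than breaking a one-to-one pixel lookup.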
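A sketch of the image-guided query initialization in the same spirit. The collapse-along-height step and the heatmap averaging follow the paper's description; `cross_attn`, `heatmap_head`, and all shapes are hypothetical placeholders.

```python
import torch

def image_guided_query_init(lidar_bev, image_feats, lidar_heatmap,
                            cross_attn, heatmap_head, num_queries=200):
    """Sketch of image-guided query initialization; the helper modules and
    shapes are assumptions, not the paper's code.

    lidar_bev:     (B, H*W, C)    flattened LiDAR BEV features
    lidar_heatmap: (B, K, H*W)    per-class LiDAR-only heatmap logits
    image_feats:   (B, C, Hi, Wi) image feature map
    cross_attn:    e.g. nn.MultiheadAttention(C, 8, batch_first=True)
    heatmap_head:  e.g. nn.Linear(C, K)
    """
    B, C, Hi, Wi = image_feats.shape
    # Collapse image features along the vertical axis: objects in one image
    # column lie along one ray in BEV, so each column can act as one token.
    collapsed = image_feats.max(dim=2).values.permute(0, 2, 1)   # (B, Wi, C)
    # Cross attention projects the collapsed image evidence onto the BEV grid.
    fused_bev, _ = cross_attn(lidar_bev, collapsed, collapsed)   # (B, H*W, C)
    # Predict a heatmap from the fused features and average it with the
    # LiDAR-only heatmap.
    fused_heatmap = heatmap_head(fused_bev).permute(0, 2, 1)     # (B, K, H*W)
    heatmap = (lidar_heatmap.sigmoid() + fused_heatmap.sigmoid()) / 2
    # The highest-scoring BEV locations become the object queries, so queries
    # start near likely objects instead of at input-agnostic learned spots.
    scores = heatmap.max(dim=1).values                           # (B, H*W)
    top_idx = scores.topk(num_queries, dim=1).indices            # (B, num_queries)
    queries = torch.gather(fused_bev, 1,
                           top_idx.unsqueeze(-1).expand(-1, -1, C))
    return queries, heatmap
```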
Experimental Results
The proposed TransFusion model is validated on the large-scale nuScenes and Waymo datasets, where it achieves state-of-the-art performance. Notably, TransFusion is markedly more robust to degraded image conditions than existing fusion methods.
- Detection Accuracy: TransFusion consistently achieves higher mAP and NDS than leading LiDAR-only and LiDAR-camera methods (the NDS metric is summarized after this list).
- Robustness: The method maintains high detection performance under calibration errors and low-quality images, scenarios where hard-association methods falter.
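For context on the metrics above, the nuScenes Detection Score (NDS) blends mAP with five true-positive error terms, each clipped to [0, 1]. A direct transcription of the official definition (the function name is ours):

```python
def nuscenes_nds(map_score, mate, mase, maoe, mave, maae):
    """nuScenes Detection Score: NDS = (5*mAP + sum(1 - min(1, mTP))) / 10,
    where the five mTP terms are the mean translation, scale, orientation,
    velocity, and attribute errors of true positives."""
    tp_errors = [mate, mase, maoe, mave, maae]
    return (5 * map_score + sum(1 - min(1.0, e) for e in tp_errors)) / 10
```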
Implications and Future Directions
The innovative fusion strategy outlined in the paper has notable implications for the development of multi-modal perception systems in autonomous vehicles. By addressing the limitations of previous fusion methods, TransFusion enhances the reliability and accuracy of 3D object detection systems.
Theoretical Implications:
The structure of TransFusion encourages further exploration into transformer architectures for multi-modal data integration, potentially influencing future developments in sensor fusion technologies beyond autonomous vehicles.
Practical Implications:
From a practical standpoint, the robustness of TransFusion against sensor imperfections and environmental challenges suggests its utility in real-world autonomous driving applications, where varied lighting and misalignments are prevalent.
Future Developments:
Ongoing investigations could extend the TransFusion framework to other sensor modalities and explore the application of the soft-association strategy to additional tasks such as 3D segmentation or real-time tracking in dynamic environments.
In summary, TransFusion marks a significant advance in 3D object detection through its transformer-based soft association, setting a new benchmark for LiDAR-camera fusion methodologies.