- The paper introduces a decoupled framework that processes camera and LiDAR streams independently, ensuring robust 3D detection even under sensor failures.
- It employs Lift-Splat-Shoot for camera data and PointPillars/CenterPoint for LiDAR data to efficiently transform inputs into bird’s eye view representations.
- On nuScenes, BEVFusion surpasses state-of-the-art methods under normal settings and improves by 15.7% to 28.9% mAP under robustness settings that simulate LiDAR malfunctions.
Overview of "BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework"
The paper "BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework" by Liang et al. introduces a novel LiDAR-camera fusion framework aimed at enhancing 3D object detection tasks in autonomous driving. The proposed system, BEVFusion, addresses the limitations of existing frameworks that are heavily reliant on LiDAR input, making them vulnerable to LiDAR malfunctions.
Key Contributions
- Decoupling Dependencies: BEVFusion separates the camera and LiDAR input streams so that each functions independently. This design preserves detection capability when LiDAR data is unavailable or corrupted.
- Performance Improvement: The authors show that BEVFusion surpasses state-of-the-art methods under both normal and robustness training settings, improving by 15.7% to 28.9% mAP under conditions that simulate LiDAR malfunctions.
- Framework Generalization: The framework can incorporate various known architectures for individual modalities, enhancing its adaptability and application scope. It utilizes Lift-Splat-Shoot for the camera stream and popular models like PointPillars and CenterPoint for the LiDAR stream.
- Robustness Under Malfunctions: BEVFusion shows impressive resilience against LiDAR failures, outperforming existing methods when faced with limited LiDAR fields-of-view or when objects fail to reflect LiDAR points.
Methodology
The architecture of BEVFusion comprises two independent streams that process camera and LiDAR inputs separately. Each stream transforms its raw sensor data into bird's eye view (BEV) features, which a later fusion stage combines to strengthen detection.
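To make the decoupling concrete, the following is a minimal PyTorch-style skeleton of the two-stream layout. The class and module names are illustrative assumptions rather than the authors' code, and each stream is assumed to return a BEV feature map of matching spatial size.

```python
# Illustrative skeleton of the decoupled two-stream design (assumed interfaces).
import torch.nn as nn

class TwoStreamBEVDetector(nn.Module):
    def __init__(self, camera_stream, lidar_stream, fusion, det_head):
        super().__init__()
        self.camera_stream = camera_stream  # multi-view images -> (B, C_cam, H, W) BEV features
        self.lidar_stream = lidar_stream    # point clouds      -> (B, C_lidar, H, W) BEV features
        self.fusion = fusion                # two BEV maps -> one fused BEV map
        self.det_head = det_head            # fused BEV map -> 3D detections

    def forward(self, images, points):
        cam_bev = self.camera_stream(images)
        lidar_bev = self.lidar_stream(points)
        # Because the streams never feed into each other, a corrupted or missing
        # LiDAR input only degrades lidar_bev; cam_bev is produced as usual.
        return self.det_head(self.fusion(cam_bev, lidar_bev))
```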
- Camera Stream: An adapted version of Lift-Splat-Shoot lifts multi-view 2D image features into 3D and splats them onto the BEV plane (a minimal sketch of this view transform follows the list below).
- LiDAR Stream: PointPillars or CenterPoint backbones convert LiDAR point clouds into BEV feature maps (see the pillar-encoding sketch below).
- Fusion Module: A dynamic fusion module integrates the features from both streams, using channel and spatial fusion followed by adaptive feature selection to generate a robust, unified representation for downstream detection (see the fusion sketch below).
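For the camera stream, the core of the LSS-style view transform is to predict a per-pixel depth distribution, take its outer product with per-pixel context features ("lift"), and sum-pool the resulting frustum features into BEV cells ("splat"). Below is a minimal sketch under the assumption that the camera geometry has already been precomputed into a `bev_index` tensor mapping every (pixel, depth-bin) point to a BEV cell; layer sizes and names are illustrative, not the paper's exact configuration.

```python
# Minimal LSS-style "lift" and "splat" sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn

class LiftSplat(nn.Module):
    def __init__(self, in_ch=256, ctx_ch=80, n_depth_bins=64, bev_h=128, bev_w=128):
        super().__init__()
        self.n_depth, self.ctx_ch = n_depth_bins, ctx_ch
        self.bev_h, self.bev_w = bev_h, bev_w
        # A single 1x1 conv predicts both the depth distribution and the context features.
        self.head = nn.Conv2d(in_ch, n_depth_bins + ctx_ch, kernel_size=1)

    def forward(self, img_feats, bev_index):
        # img_feats: (B, in_ch, Hf, Wf) image-plane features from the 2D backbone.
        # bev_index: (B, n_depth, Hf, Wf) long tensor giving the flattened BEV cell
        #            of every (pixel, depth-bin) frustum point, precomputed from
        #            camera intrinsics/extrinsics (assumed given here).
        B = img_feats.shape[0]
        x = self.head(img_feats)
        depth = x[:, :self.n_depth].softmax(dim=1)       # (B, D, Hf, Wf)
        ctx = x[:, self.n_depth:]                        # (B, C, Hf, Wf)
        # "Lift": outer product of depth probabilities and context features.
        frustum = depth.unsqueeze(1) * ctx.unsqueeze(2)  # (B, C, D, Hf, Wf)
        # "Splat": sum-pool every frustum feature into its BEV cell.
        bev = img_feats.new_zeros(B, self.ctx_ch, self.bev_h * self.bev_w)
        src = frustum.reshape(B, self.ctx_ch, -1)                      # (B, C, D*Hf*Wf)
        idx = bev_index.reshape(B, 1, -1).expand(-1, self.ctx_ch, -1)  # (B, C, D*Hf*Wf)
        bev.scatter_add_(2, idx, src)
        return bev.view(B, self.ctx_ch, self.bev_h, self.bev_w)
```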
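The LiDAR stream's pillar-to-BEV step can be sketched in a similarly compact way: points are grouped into vertical pillars, each pillar is encoded by a small shared MLP with max-pooling, and the pillar features are scattered onto a dense BEV canvas. This is a simplified, single-sample illustration of the PointPillars idea, not the exact model plugged into the framework.

```python
# Simplified PointPillars-style pillar encoding and BEV scatter (single sample).
import torch
import torch.nn as nn

class PillarBEVEncoder(nn.Module):
    def __init__(self, point_dim=4, feat_ch=64, bev_h=128, bev_w=128):
        super().__init__()
        self.feat_ch, self.bev_h, self.bev_w = feat_ch, bev_h, bev_w
        # Tiny per-point MLP shared across all points (PointNet-style).
        self.mlp = nn.Sequential(nn.Linear(point_dim, feat_ch), nn.ReLU())

    def forward(self, pillar_points, pillar_coords):
        # pillar_points: (P, N, point_dim) points grouped into P pillars of up to
        #                N points each (zero-padded), e.g. x, y, z, intensity.
        # pillar_coords: (P, 2) long tensor with the (row, col) BEV cell of each pillar.
        feats = self.mlp(pillar_points)            # (P, N, feat_ch)
        pillar_feat = feats.max(dim=1).values      # (P, feat_ch) max-pool over points
        # Scatter pillar features onto a dense BEV canvas.
        canvas = pillar_feat.new_zeros(self.feat_ch, self.bev_h * self.bev_w)
        flat = pillar_coords[:, 0] * self.bev_w + pillar_coords[:, 1]  # (P,)
        canvas[:, flat] = pillar_feat.t()          # place (feat_ch, P) at pillar cells
        return canvas.view(1, self.feat_ch, self.bev_h, self.bev_w)
```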
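For the fusion step, the paper describes concatenating the two BEV feature maps, applying a convolutional (static) fusion, and then adaptively selecting channels. The sketch below implements that flavor of fusion with a 3x3 convolution and an SE-style channel gate; the specific layer sizes and normalization choices are assumptions.

```python
# Dynamic-fusion-style module: concatenate, convolve, then gate channels (assumed sizes).
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, cam_ch=80, lidar_ch=64, out_ch=128):
        super().__init__()
        # "Static" fusion: mix the concatenated camera and LiDAR channels spatially.
        self.static = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        # "Adaptive" selection: SE-style channel gate computed from the fused map.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev: (B, cam_ch, H, W); lidar_bev: (B, lidar_ch, H, W); same H and W.
        fused = self.static(torch.cat([cam_bev, lidar_bev], dim=1))
        return fused * self.gate(fused)  # channel-wise re-weighting
```

Because the gate is learned from the fused features themselves, the module decides which channels to emphasize without any hand-tuned weighting between the two sensors.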
Experimental Results
On the nuScenes dataset, BEVFusion outperformed contemporary fusion methods while remaining robust: detailed experiments showed that it maintains high precision even when the LiDAR input is compromised.
- In terms of mAP, BEVFusion improved over the baseline by 15.7% to 28.9% under the robustness settings that simulate LiDAR malfunctions.
- Data augmentation during training further strengthened its robustness, helping the model cope with scenarios that simulate sensor failures.
Implications and Future Directions
The implications of BEVFusion are significant for the field of autonomous driving. By decoupling camera and LiDAR modalities, the system ensures reliable performance across diverse operational conditions. This approach paves the way for flexible deployment in real-world applications where sensor reliability may fluctuate.
Looking ahead, integrating more sophisticated temporal and spatial alignment techniques may further improve performance, and extending the framework to additional sensor modalities could broaden its applicability.
Conclusion
BEVFusion represents a substantial advancement in LiDAR-camera fusion frameworks, providing a robust solution to a longstanding problem of sensor dependency. Its ability to function effectively under various failure conditions without complex post-processing makes it a versatile choice for autonomous systems. Future research could focus on optimizing computational efficiency and exploring scalability to additional sensor types.