Overview of M²BEV: Multi-Camera Joint 3D Detection and Segmentation
"M²BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Bird's-Eye View Representation" presents a novel framework designed to enhance 3D perception in autonomous vehicles through the integration of joint 3D object detection and map segmentation using a unified Bird's-Eye View (BEV) representation. This paper introduces several innovative techniques to optimize computational efficiency, accuracy, and scalability in multi-camera setups, addressing key challenges in the domain of autonomous driving.
Core Contributions and Methodology
M²BEV distinguishes itself through its unified treatment of 3D perception, built around a shared BEV representation derived from multi-camera image inputs. Whereas many prior approaches handle detection and segmentation with separate, task-specific networks, M²BEV performs both tasks within a single model, reducing redundancy and improving efficiency.
- Unified Framework: M²BEV transforms multi-view 2D image features into 3D BEV features expressed in the ego-car's coordinate frame. This view transformation is the cornerstone of the unified design: detection and segmentation share the same BEV encoder, which significantly improves computational efficiency (see the projection sketch after this list).
- Efficient BEV Encoder: An efficient BEV encoder keeps the voxel feature maps compact along the spatial dimensions, yielding substantial memory and compute savings without sacrificing accuracy (an encoder sketch follows the list).
- Dynamic Box Assignment: A learning-to-match strategy assigns ground-truth 3D boxes to predictions dynamically rather than by a fixed rule, which improves accuracy when camera-based depth estimates are noisy (a simplified sketch appears after the list).
- BEV Centerness Re-Weighting: The training loss is re-weighted so that BEV locations farther from the ego vehicle receive more emphasis, counteracting the difficulty of long-range perception (see the weighting sketch below).
- Large-Scale Pre-Training: The 2D backbone benefits from extensive 2D detection pre-training on large datasets such as nuImages, together with auxiliary 2D supervision, which significantly improves 3D performance and label efficiency.
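To make the view transformation in the Unified Framework point concrete, here is a minimal sketch of projecting a voxel grid into a single camera and sampling 2D features, the general camera-to-BEV unprojection the paper builds on. The function name, argument layout, and the bilinear-sampling choice are illustrative assumptions, not the paper's implementation; in practice, features from all cameras are fused into the same voxel grid.

```python
import torch
import torch.nn.functional as F

def image_to_voxel_features(feat_2d, intrinsics, ego_to_cam, voxel_centers, img_size):
    """Sketch: fill a voxel grid by projecting each voxel center into one
    camera image and bilinearly sampling the 2D feature map there.

    feat_2d:       (C, Hf, Wf) feature map from one camera
    intrinsics:    (3, 3) camera intrinsic matrix K
    ego_to_cam:    (4, 4) transform from the ego frame to the camera frame
    voxel_centers: (N, 3) voxel center coordinates in the ego frame
    img_size:      (H, W) of the original input image
    """
    # Move voxel centers from the ego frame into the camera frame.
    ones = torch.ones(voxel_centers.shape[0], 1)
    pts_cam = (ego_to_cam @ torch.cat([voxel_centers, ones], dim=1).T)[:3]  # (3, N)

    # Pinhole projection onto the image plane.
    pix = intrinsics @ pts_cam
    depth = pix[2].clamp(min=1e-5)
    u, v = pix[0] / depth, pix[1] / depth

    # Keep voxels that are in front of the camera and inside the image.
    H, W = img_size
    valid = (pts_cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Normalize pixel coordinates to [-1, 1] and sample the feature map.
    grid = torch.stack([u / (W - 1) * 2 - 1, v / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(feat_2d[None], grid.view(1, 1, -1, 2), align_corners=True)

    return sampled[0, :, 0] * valid.float()  # (C, N) voxel features, zeros where invalid
```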
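For the efficient BEV encoder, one plausible reading of "limiting the spatial dimension" is a spatial-to-channel reshape that folds the voxel height dimension into channels so the encoder can use 2D convolutions instead of costly 3D ones. The sketch below assumes that reading; the layer sizes are illustrative.

```python
import torch.nn as nn

class EfficientBEVEncoder(nn.Module):
    """Illustrative BEV encoder: flatten the height (Z) axis of the voxel
    volume into channels, then apply ordinary 2D convolutions over the
    BEV plane. Not the paper's exact architecture."""

    def __init__(self, in_channels, num_z, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels * num_z, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_feats):
        # voxel_feats: (B, C, Z, Y, X) volumetric features in the ego frame
        b, c, z, y, x = voxel_feats.shape
        bev = voxel_feats.reshape(b, c * z, y, x)  # spatial-to-channel flatten
        return self.conv(bev)                      # (B, out_channels, Y, X)
```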
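The learning-to-match box assignment is more involved than can be shown compactly; the snippet below is a deliberately simplified cost-based matching that only illustrates the core idea of letting the assignment depend on the network's current predictions rather than a fixed geometric rule. The cost matrices and the `topk` parameter are assumptions.

```python
import torch

def dynamic_assign(cls_cost, loc_cost, topk=1):
    """Simplified stand-in for learning-to-match assignment.

    cls_cost, loc_cost: (num_gt, num_pred) cost matrices, e.g. a classification
    cost and a 3D-box localization cost computed from the current predictions.
    Returns, for each ground-truth box, the indices of the topk predictions
    with the lowest combined cost.
    """
    total_cost = cls_cost + loc_cost
    return torch.topk(total_cost, k=topk, dim=1, largest=False).indices
```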
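Finally, the BEV centerness re-weighting scales the loss at each BEV cell by its distance from the ego vehicle, so that far-away regions, which are covered by fewer image pixels, receive more emphasis. The weighting below, growing from 1 near the car to 2 at the farthest cell, is one simple instantiation of that idea rather than the paper's exact formula.

```python
import torch

def bev_centerness_weight(y_coords, x_coords):
    """Distance-based loss weight over a BEV grid (assumed layout: 1D tensors
    of cell coordinates in meters, centered on the ego vehicle)."""
    yy, xx = torch.meshgrid(y_coords, x_coords, indexing="ij")
    dist = torch.sqrt(xx ** 2 + yy ** 2)
    return 1.0 + dist / dist.max()  # 1.0 at the ego car, up to 2.0 at the far corner
```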
Empirical Evaluation
Experiments on the nuScenes dataset show that M²BEV achieves state-of-the-art performance in both 3D object detection and BEV segmentation, reaching 42.5 mAP and 57.0 mIoU respectively. These results demonstrate that M²BEV outperforms contemporary camera-based approaches on the key evaluation metrics, including mAP and the nuScenes detection score (NDS).
Critical Analysis and Implications
M²BEV's introduction of a unified BEV representation marks a significant step forward in the integration of multi-task perception systems for autonomous driving. By addressing inefficiencies inherent in separate task-specific networks, M²BEV paves the way for streamlined perception systems that can be more readily scaled and refined for real-world applications.
Design choices such as dynamic box assignment and the BEV-specific re-weighting reflect a careful treatment of the challenges of camera-only 3D perception, which is particularly valuable in settings where LiDAR sensors are unavailable or impractical.
Future Directions
The M²BEV framework invites further exploration of its adaptability to perception tasks beyond 3D detection and segmentation, such as trajectory forecasting and planning. Its current limitations, particularly with complex road scenarios and camera calibration errors, point to avenues for refinement. Extending the approach to incorporate temporal information is another natural direction, enabling tasks such as 3D object tracking and motion prediction.
With M²BEV, the paper lays a robust foundation for next-generation multi-camera perception systems in autonomous vehicles and offers an opportunity to rethink conventional task-specific pipelines in favor of more unified, efficient solutions.