SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos (2308.09244v2)

Published 18 Aug 2023 in cs.CV

Abstract: Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other side, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than the dense counterparts. In this paper, we find that the key to mitigate this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV.

Citations (74)

Summary

  • The paper introduces SparseBEV, a fully sparse 3D detection framework leveraging adaptive self-attention to efficiently aggregate multi-scale features in BEV space.
  • The paper employs adaptive spatio-temporal sampling and dynamic mixing to align multi-frame features and decode robust object representations.
  • The paper demonstrates state-of-the-art performance on the nuScenes dataset, achieving 67.5 NDS on the test split and 55.8 NDS on the val split at a real-time 23.5 FPS, with robust adaptability.

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

The paper explores the domain of camera-based 3D object detection, focusing on the Bird's Eye View (BEV) space, which has garnered notable interest due to its advantages over traditional LiDAR-based methods, notably in cost and sensing range. It introduces SparseBEV, a novel fully sparse 3D object detection framework that aims to close the performance gap between sparse detectors and their dense counterparts, with particular emphasis on efficiency and adaptability in both BEV and image space.

The sparse detection paradigm traditionally follows a query-based approach that avoids the computational overhead of dense BEV feature construction, but it has historically lagged behind dense methods in accuracy. SparseBEV addresses this gap through three key components, each illustrated with a minimal code sketch after the list:

  1. Scale-Adaptive Self Attention (SASA): This mechanism aggregates features with an adaptive receptive field in BEV space. Instead of a dedicated BEV encoder, SASA applies self-attention among queries whose receptive field is scaled per query by coefficients predicted from the query features, enabling efficient multi-scale aggregation where vanilla multi-head self-attention is limited to a fixed receptive field.
  2. Adaptive Spatio-Temporal Sampling: Rather than sampling a single reference point per query, SparseBEV lets each query generate its own set of 3D sampling locations, accommodating objects of various scales and dynamics, and aligns multi-frame inputs temporally through ego-motion and velocity-based corrections.
  3. Adaptive Mixing: This component decodes the spatio-temporally sampled features with dynamic weights generated from the queries themselves, combining channel mixing and point mixing to produce a richer semantic object representation.
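
To make the first component concrete, here is a minimal PyTorch sketch of scale-adaptive self attention under one natural reading of the description above: standard multi-head self-attention between object queries, with attention logits penalized by the pairwise BEV distance of the query centers, scaled by a per-query, per-head coefficient tau predicted from the query features (larger tau shrinks the receptive field). The module name and tensor layout are illustrative, not the authors' code.

```python
# Hypothetical sketch (PyTorch), not the authors' implementation:
# scale-adaptive self attention as distance-penalized attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAdaptiveSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One receptive-field coefficient tau per head, predicted per query.
        self.tau = nn.Linear(dim, num_heads)

    def forward(self, x: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) query features; centers: (B, N, 2) BEV centers in meters.
        B, N, C = x.shape
        h, d = self.num_heads, C // self.num_heads
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, h, d).transpose(1, 2)           # (B, h, N, d)
        k = k.view(B, N, h, d).transpose(1, 2)
        v = v.view(B, N, h, d).transpose(1, 2)

        dist = torch.cdist(centers, centers)             # (B, N, N) pairwise distances
        tau = self.tau(x).transpose(1, 2).unsqueeze(-1)  # (B, h, N, 1)

        # Larger tau -> stronger distance penalty -> smaller receptive field.
        logits = q @ k.transpose(-2, -1) / d ** 0.5 - tau * dist.unsqueeze(1)
        out = F.softmax(logits, dim=-1) @ v              # (B, h, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))
```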
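The second component can be sketched along the same lines. In this simplified, hypothetical version, each query predicts its own 3D offsets around its box center; the points are warped to a past frame under a constant-velocity assumption (ego-motion compensation is assumed to be folded into the supplied projection matrices); and the warped points are projected into every camera and sampled bilinearly. The single-past-frame signature, the `lidar2img` convention, and names like `AdaptiveSampling` are assumptions for illustration.

```python
# Hypothetical sketch of query-guided spatio-temporal sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSampling(nn.Module):
    def __init__(self, dim: int, num_points: int = 8):
        super().__init__()
        self.num_points = num_points
        # Each query predicts its own 3D sampling offsets.
        self.offset = nn.Linear(dim, num_points * 3)

    def forward(self, query, centers, velocities, feats, lidar2img, dt):
        # query:      (B, N, C)       query features
        # centers:    (B, N, 3)       query box centers in ego coordinates
        # velocities: (B, N, 2)       predicted (vx, vy) per query
        # feats:      (B, V, C, H, W) image features for V cameras, one frame
        # lidar2img:  (B, V, 4, 4)    projection matrices for this frame
        # dt:         scalar time offset of this frame relative to "now"
        B, N, C = query.shape
        P = self.num_points
        offsets = self.offset(query).view(B, N, P, 3)
        pts = (centers.unsqueeze(2) + offsets).clone()   # (B, N, P, 3)

        # Temporal alignment: move points backward along the predicted
        # velocity (constant-velocity assumption). Ego-motion compensation
        # is assumed folded into lidar2img here for brevity.
        pts[..., :2] -= velocities.unsqueeze(2) * dt

        # Project 3D points to pixel coordinates in each camera.
        # Points behind the camera are not masked here, for brevity.
        hom = F.pad(pts, (0, 1), value=1.0)              # (B, N, P, 4)
        cam = torch.einsum('bvij,bnpj->bvnpi', lidar2img, hom)
        uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)

        # Normalize to [-1, 1] and bilinearly sample the feature maps.
        H, W = feats.shape[-2:]
        grid = torch.stack([uv[..., 0] / W * 2 - 1,
                            uv[..., 1] / H * 2 - 1], dim=-1)
        V = feats.shape[1]
        sampled = F.grid_sample(
            feats.flatten(0, 1),                         # (B*V, C, H, W)
            grid.flatten(0, 1),                          # (B*V, N, P, 2)
            align_corners=False)                         # (B*V, C, N, P)
        return sampled.view(B, V, C, N, P)
```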
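Finally, a minimal sketch of the third component as AdaMixer-style dynamic decoding: the query generates one set of dynamic weights to mix the sampled features across channels and another to mix them across sampling points, after which a linear layer folds the result back into the query dimension. The sketch assumes the sampled features have already been aggregated over cameras and frames into a (B, N, P, C) tensor; the exact ordering and normalization in SparseBEV may differ.

```python
# Hypothetical sketch of adaptive mixing with query-generated weights.
import torch
import torch.nn as nn

class AdaptiveMixing(nn.Module):
    def __init__(self, dim: int, num_points: int):
        super().__init__()
        # Dynamic mixing parameters are generated per query.
        self.gen_channel = nn.Linear(dim, dim * dim)
        self.gen_point = nn.Linear(dim, num_points * num_points)
        self.out = nn.Linear(dim * num_points, dim)

    def forward(self, query: torch.Tensor, sampled: torch.Tensor):
        # query:   (B, N, C)
        # sampled: (B, N, P, C) features gathered by adaptive sampling,
        #          assumed already aggregated over cameras and frames.
        B, N, P, C = sampled.shape
        # Channel mixing: a per-query (C x C) matrix applied to each point.
        Wc = self.gen_channel(query).view(B, N, C, C)
        x = torch.einsum('bnpc,bncd->bnpd', sampled, Wc).relu()
        # Point mixing: a per-query (P x P) matrix applied across points.
        Wp = self.gen_point(query).view(B, N, P, P)
        x = torch.einsum('bnpc,bnpq->bnqc', x, Wp).relu()
        return self.out(x.reshape(B, N, P * C))          # (B, N, C)
```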

The empirical validation of SparseBEV on the nuScenes dataset demonstrates its state-of-the-art performance: 67.5 NDS on the test split, and 55.8 NDS on the validation split at a real-time 23.5 FPS. The framework also shows flexibility, adapting to different input configurations without performance degradation, indicating its robustness in real-world conditions.

One of the bolder claims made by the authors is that SparseBEV can match or even surpass the accuracy of dense 3D object detectors, a significant assertion given the computational savings afforded by the sparse query-based paradigm. This positions SparseBEV as a compelling option for autonomous driving applications, where computational efficiency and adaptability to diverse operating environments are crucial.

While SparseBEV sets a new benchmark in camera-based 3D object detection, the paper also acknowledges limitations, particularly concerning its reliance on accurate ego-pose estimates, which may not always be available or reliable in real-world scenarios. The paper aptly identifies future directions, emphasizing the need for further decoupling of spatial and temporal information and exploring broader applications in 3D spatial perception tasks.

In conclusion, SparseBEV represents a significant advancement in achieving high-performance, fully sparse 3D object detection from multi-camera inputs, effectively balancing accuracy and computational efficiency. Its adaptability and performance offer promising avenues for further research and application in advanced driver-assistance systems and autonomous vehicle technology.