- The paper introduces a fully sparse fusion framework that integrates 2D and 3D instance segmentation to enhance multi-modal 3D object detection.
- It eliminates reliance on dense BEV feature maps, significantly reducing computational overhead while achieving state-of-the-art performance.
- The two-stage assignment strategy effectively aligns LiDAR and camera instances, improving object localization in challenging long-range scenarios.
Fully Sparse Fusion for 3D Object Detection
The paper presents Fully Sparse Fusion (FSF), a multi-modal 3D object detection approach built on an entirely sparse architecture. In autonomous driving, efficient and accurate 3D object detection is vital and is typically achieved with LiDAR and camera sensors. Most current detectors rely on dense Bird's-Eye-View (BEV) feature maps, which are computationally expensive, and their cost grows rapidly as the detection range extends. Fully sparse LiDAR-only architectures offer a promising route to high efficiency and effectiveness, yet how to integrate them into robust multi-modal frameworks remains inadequately addressed.
FSF moves beyond the limitations of LiDAR-only models by combining well-established 2D instance segmentation with 3D instance segmentation, producing a fully sparse detector that can exploit multi-modal sensor data. This design avoids dense BEV feature maps entirely, reducing computational overhead while preserving strong detection performance. The framework consists of two primary modules: Bi-modal Instance Generation and Bi-modal Instance-based Prediction.
Bi-modal Instance Generation produces both LiDAR-derived and camera-derived instances. LiDAR instances come from 3D instance segmentation, which extracts foreground points and clusters them by spatial proximity. On the camera side, instances are built from 2D instance segmentation outputs: each 2D mask is lifted into a 3D frustum, and the LiDAR points falling inside it are gathered. This dual-instance approach lets FSF combine the depth information of LiDAR with the semantic richness of cameras, covering scenarios where LiDAR-only detection might falter.
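To make the two instance sources concrete, here is a minimal NumPy sketch of both steps. It assumes a per-point foreground classifier and a camera projection helper; the function names (`cluster_foreground_points`, `gather_frustum_points`, `project_fn`), the KD-tree proximity clustering, and all thresholds are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def cluster_foreground_points(points, fg_scores, score_thr=0.5, radius=0.6, min_pts=5):
    """LiDAR-side instances (sketch): keep predicted foreground points and group
    them by spatial proximity with a simple connected-components-style clustering.

    points:    (N, 3) LiDAR points.
    fg_scores: (N,) per-point foreground probabilities from a point classifier.
    Returns a list of point-index arrays, one per LiDAR instance.
    """
    fg_idx = np.where(fg_scores > score_thr)[0]
    if fg_idx.size == 0:
        return []
    tree = cKDTree(points[fg_idx, :3])
    pairs = tree.query_pairs(r=radius)  # set of (i, j) index pairs within `radius`

    # Union-find over proximity edges to form clusters.
    parent = np.arange(fg_idx.size)
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    roots = np.array([find(i) for i in range(fg_idx.size)])
    return [fg_idx[roots == r] for r in np.unique(roots)
            if np.sum(roots == r) >= min_pts]

def gather_frustum_points(points, mask_2d, project_fn):
    """Camera-side instances (sketch): lift one 2D instance mask into a frustum by
    collecting the LiDAR points that project inside the mask.

    mask_2d:    (H, W) boolean instance mask from an off-the-shelf 2D segmenter.
    project_fn: assumed helper mapping (M, 3) points to (M, 2) pixel coords and (M,) depths.
    """
    uv, depth = project_fn(points[:, :3])
    h, w = mask_2d.shape
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    inside = np.zeros(len(points), dtype=bool)
    inside[valid] = mask_2d[v[valid], u[valid]]
    return np.where(inside)[0]
```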
Bi-modal Instance-based Prediction improves detection by aligning instance shapes and fusing information from both modalities. After the initial instances are generated, this module normalizes their shapes so that features can be extracted and exchanged reliably, yielding high-quality object localization. During training, a two-stage assignment strategy associates instances with ground-truth bounding boxes: LiDAR instances are matched by 3D center proximity, while camera instances are matched by 2D similarity in the image plane, accounting for the different spatial distributions of the two instance types.
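The two assignment branches described above can be illustrated with a small sketch. The greedy matching, the distance and IoU thresholds, and the box formats are assumptions made here for illustration, not the paper's exact assigner.

```python
import numpy as np

def assign_lidar_instances(instance_centers, gt_centers, max_dist=2.0):
    """LiDAR branch (sketch): assign each LiDAR instance to the ground-truth box
    whose 3D center is nearest, within a distance threshold.

    instance_centers: (I, 3) centroids of LiDAR instances.
    gt_centers:       (G, 3) ground-truth box centers.
    Returns an (I,) array of GT indices, -1 for unassigned instances.
    """
    if len(gt_centers) == 0:
        return np.full(len(instance_centers), -1, dtype=int)
    d = np.linalg.norm(instance_centers[:, None, :] - gt_centers[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    hit = d[np.arange(len(instance_centers)), nearest] < max_dist
    return np.where(hit, nearest, -1)

def assign_camera_instances(instance_boxes_2d, gt_boxes_2d, iou_thr=0.5):
    """Camera branch (sketch): assign each camera instance to the ground-truth
    object with the highest 2D IoU in the image plane, since frustum instances
    may have poorly localized 3D centers.

    Boxes are (x1, y1, x2, y2). Returns an (I,) array of GT indices, -1 if unassigned.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    out = np.full(len(instance_boxes_2d), -1, dtype=int)
    for i, box in enumerate(instance_boxes_2d):
        ious = [iou(box, g) for g in gt_boxes_2d]
        if ious and max(ious) > iou_thr:
            out[i] = int(np.argmax(ious))
    return out
```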
The paper reports state-of-the-art performance on popular datasets such as nuScenes, Waymo Open Dataset, and Argoverse 2, underlining FSF's efficiency and accuracy even in long-range perception. FSF is notably strong in cases where traditional methods struggle, such as objects covered by only a few LiDAR points or objects positioned close to other instances.
The FSF framework is a notable step toward more efficient and scalable multi-modal 3D object detection systems. By eliminating dense BEV feature maps and operating on fully sparse representations, it demonstrates that rapid, high-fidelity environmental perception in dynamic and complex settings, indispensable for autonomous driving, is feasible.
Future research could optimize the two-stage assignment strategy and investigate richer bi-modal interaction mechanisms. Additionally, combining FSF's paradigm with other emerging sparse computation techniques could enable further gains in data-efficient 3D object detection across domains.