SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception (2403.10036v1)
Abstract: Multi-modal 3D object detection has made significant progress in recent years. However, most existing methods scale poorly to long-range scenarios because of their reliance on dense 3D features, which substantially escalates computational demands and memory usage. In this paper, we introduce SparseFusion, a novel multi-modal fusion framework built entirely upon sparse 3D features to enable efficient long-range perception. The core of our method is the Sparse View Transformer module, which selectively lifts regions of interest in 2D image space into the unified 3D space. The module introduces sparsity from both semantic and geometric perspectives, filling only the grid cells in which foreground objects may reside. Comprehensive experiments verify the efficiency and effectiveness of our framework in long-range 3D perception. Remarkably, on the long-range Argoverse2 dataset, SparseFusion roughly halves the memory footprint and doubles the inference speed compared to dense detectors, while achieving state-of-the-art performance with an mAP of 41.2% and a CDS of 32.1%. The versatility of SparseFusion is further validated on the temporal object detection and 3D lane detection tasks. Code will be released upon acceptance.
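To make the lifting step concrete, the following is a minimal PyTorch-style sketch of the kind of sparse 2D-to-3D lifting the abstract describes. It is an illustrative reading, not the paper's released implementation: the function name sparse_lift, the keep_ratio parameter, the top-k foreground selection, and the pixel_to_voxel projection hook are all assumptions. Semantic sparsity keeps only pixels with high predicted foreground scores; geometric sparsity places each kept feature at its most likely depth bin rather than splatting it along the whole camera ray.

```python
import torch

def sparse_lift(img_feats, fg_scores, depth_probs, pixel_to_voxel, keep_ratio=0.1):
    """Illustrative sparse 2D-to-3D lifting (hypothetical, not the official code).

    img_feats:      (N, C) flattened per-pixel image features
    fg_scores:      (N,)   predicted foreground probability per pixel
    depth_probs:    (N, D) categorical depth distribution per pixel
    pixel_to_voxel: callable (pixel_indices, depth_bins) -> (K, 3) voxel coords
    """
    # Semantic sparsity: keep only the most object-like pixels.
    k = max(1, int(keep_ratio * img_feats.shape[0]))
    keep = torch.topk(fg_scores, k).indices

    feats = img_feats[keep]                       # (K, C)
    probs = depth_probs[keep]                     # (K, D)

    # Geometric sparsity: commit each kept pixel to its most likely
    # depth bin instead of filling every bin along the camera ray.
    best_bin = probs.argmax(dim=1)                # (K,)
    conf = probs.gather(1, best_bin[:, None])     # (K, 1) depth confidence

    voxels = pixel_to_voxel(keep, best_bin)       # (K, 3) integer grid coords
    return voxels, feats * conf                   # sparse 3D features
```

Under this reading, only the K kept points ever touch the 3D grid, which is where the memory and latency savings over dense lift-splat-style pooling (all pixels times all depth bins) would come from.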
- TransFusion: Robust lidar-camera fusion for 3d object detection with transformers. In CVPR, pages 1090–1099, 2022.
- nuScenes: A multimodal dataset for autonomous driving. In CVPR, pages 11621–11631, 2020.
- BEVFusion4D: Learning lidar-camera fusion under bird’s-eye-view via cross-modality guidance and temporal aggregation. arXiv preprint arXiv:2303.17099, 2023.
- PersFormer: 3d lane detection via perspective transformer and the openlane benchmark. In ECCV, 2022.
- FUTR3D: A unified sensor fusion framework for 3d detection. In CVPR, pages 172–181, 2023a.
- VoxelNeXt: Fully sparse voxelnet for 3d object detection and tracking. In CVPR, 2023b.
- Embracing single stride 3d object detector with sparse transformer. In CVPR, 2022a.
- Fully sparse 3d object detection. In NeurIPS, 2022b.
- FSD V2: Improving fully sparse 3d object detection with virtual voxels. arXiv preprint arXiv:2308.03755, 2023.
- 3D-LaneNet: end-to-end 3d multiple lane detection. In ICCV, pages 2921–2930, 2019.
- MetaBEV: Solving sensor failures for 3d detection and map segmentation. In ICCV, pages 8721–8731, 2023.
- 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, pages 9224–9232, 2018.
- Exploring recurrent long-term temporal fusion for multi-view 3d perception. arXiv preprint arXiv:2303.05970, 2023.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- FusionFormer: A multi-sensory fusion in bird’s-eye-view and temporal consistent transformer for 3d object detection. arXiv preprint arXiv:2309.05257, 2023a.
- EA-LSS: Edge-aware lift-splat-shot framework for 3d bev object detection. arXiv preprint arXiv:2303.17895, 2023b.
- BEVDet4D: Exploit temporal cues in multi-camera 3d object detection. arXiv preprint arXiv:2203.17054, 2022a.
- BEVPoolv2: A cutting-edge implementation of bevdet toward deployment. arXiv preprint arXiv:2211.17111, 2022b.
- BEVDet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
- Anchor3DLane: Learning to regress 3d anchors for monocular 3d lane detection. In CVPR, 2023.
- Far3D: Expanding the horizon for surround-view 3d object detection. arXiv preprint arXiv:2308.09616, 2023.
- PointPillars: Fast encoders for object detection from point clouds. In CVPR, pages 12697–12705, 2019.
- BEVStereo: Enhancing depth estimation in multi-view 3d object detection with dynamic temporal stereo. arXiv preprint arXiv:2209.10248, 2022a.
- Fully sparse fusion for 3d object detection. arXiv preprint arXiv:2304.12310, 2023a.
- BEVDepth: Acquisition of reliable depth for multi-view 3d object detection. In AAAI, pages 1477–1485, 2023b.
- Fast-BEV: A fast and strong bird’s-eye view perception baseline. arXiv preprint arXiv:2301.12511, 2023c.
- BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022b.
- BEVFusion: A simple and robust lidar-camera fusion framework. NeurIPS, 35:10421–10434, 2022.
- Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
- Sparse4D: Multi-view 3d object detection with sparse spatial-temporal fusion. arXiv preprint arXiv:2211.10581, 2022.
- SparseBEV: High-performance sparse 3d object detection from multi-camera videos. In ICCV, pages 18580–18590, 2023a.
- PETR: Position embedding transformation for multi-view 3d object detection. In ECCV, pages 531–548. Springer, 2022a.
- PETRv2: A unified framework for 3d perception from multi-camera images. arXiv preprint arXiv:2206.01256, 2022b.
- Swin Transformer: Hierarchical vision transformer using shifted windows. In CVPR, pages 10012–10022, 2021.
- BEVFusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, pages 2774–2781. IEEE, 2023b.
- FlatFormer: Flattened window attention for efficient point cloud transformer. In CVPR, pages 1200–1211, 2023c.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- M²-3DLaneNet: Exploring multi-modal 3d lane detection. arXiv preprint arXiv:2209.05996, 2022.
- LATR: 3d lane detection from monocular images with transformer. In ICCV, pages 7941–7952, 2023.
- Cluster-Former: Cluster-based transformer for 3d object detection in point clouds. In ICCV, pages 6664–6673, 2023.
- MegDet: A large mini-batch object detector. In CVPR, pages 6181–6189, 2018.
- Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210. Springer, 2020.
- PointNet: Deep learning on point sets for 3d classification and segmentation. In CVPR, pages 652–660, 2017.
- You Only Look Once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
- Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
- Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, pages 2446–2454, 2020.
- FCOS: Fully convolutional one-stage object detection. In ICCV, 2019.
- Attention is all you need. NeurIPS, 30, 2017.
- DSVT: Dynamic sparse voxel transformer with rotated sets. In CVPR, pages 13520–13529, 2023a.
- UniTR: A unified and efficient multi-modal transformer for bird’s-eye-view representation. In ICCV, pages 6792–6802, 2023b.
- BEV-LaneDet: Fast lane detection on bev ground. arXiv preprint arXiv:2210.06006, 2022.
- UniBEV: Multi-modal 3d object detection with uniform bev encoders for robustness against missing sensor modalities. arXiv preprint arXiv:2309.14516, 2023c.
- Exploring object-centric temporal modeling for efficient multi-view 3d object detection. arXiv preprint arXiv:2303.11926, 2023d.
- DETR3D: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, 2021.
- Object as query: Equipping any 2d object detector with 3d detection ability. arXiv preprint arXiv:2301.02364, 2023e.
- Argoverse 2: Next generation datasets for self-driving perception and forecasting. In NeurIPS, 2021.
- M²BEV: Multi-camera joint 3d detection and segmentation with unified bird’s-eye view representation. arXiv preprint arXiv:2204.05088, 2022.
- Cross Modal Transformer: Towards fast and robust 3d object detection. In ICCV, pages 18268–18278, 2023.
- SECOND: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
- BEVFormer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In CVPR, pages 17830–17839, 2023.
- Sparse point guided 3d lane detection. In ICCV, pages 8363–8372, 2023.
- Center-based 3d object detection and tracking. In CVPR, pages 11784–11793, 2021.
- SA-BEV: Generating semantic-aware bird’s-eye-view feature for multi-view 3d object detection. In ICCV, pages 3348–3357, 2023.
- VoxelNet: End-to-end learning for point cloud based 3d object detection. In CVPR, pages 4490–4499, 2018.
- End-to-end multi-view fusion for 3d object detection in lidar point clouds. In CoRL, pages 923–932. PMLR, 2020.
- Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492, 2019.
- Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.