PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest (2403.09212v2)
Abstract: In this work, we present PoIFusion, a conceptually simple yet effective multi-modal 3D object detection framework to fuse the information of RGB images and LiDAR point clouds at the points of interest (PoIs). Different from the most accurate methods to date that transform multi-sensor data into a unified view or leverage the global attention mechanism to facilitate fusion, our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and interpolation. In particular, our PoIFusion follows the paradigm of query-based object detection, formulating object queries as dynamic 3D boxes and generating a set of PoIs based on each query box. The PoIs serve as the keypoints to represent a 3D object and play the role of the basic units in multi-modal fusion. Specifically, we project PoIs into the view of each modality to sample the corresponding feature and integrate the multi-modal features at each PoI through a dynamic fusion block. Furthermore, the features of PoIs derived from the same query box are aggregated together to update the query feature. Our approach prevents information loss caused by view transformation and eliminates the computation-intensive global attention, making the multi-modal 3D object detector more applicable. We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach. Remarkably, the proposed approach achieves state-of-the-art results on both datasets without any bells and whistles, i.e., 74.9% NDS and 73.4% mAP on nuScenes, and 31.6% CDS and 40.6% mAP on Argoverse2. The code will be made available at https://djiajunustc.github.io/projects/poifusion.
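The projection-and-interpolation step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and every function name is hypothetical. It rests on several assumptions the abstract does not state: the PoIs are taken to be the box center plus its eight corners, the camera shares the LiDAR origin and differs only by an axis permutation, feature maps are nested lists of shape H x W x C with H and W at least 2, and plain concatenation followed by averaging stands in for the dynamic fusion block.

```python
import math

def box_to_pois(center, size, yaw):
    """Box center plus 8 corners as PoIs (an assumed layout, not the paper's)."""
    cx, cy, cz = center
    length, width, height = size
    c, s = math.cos(yaw), math.sin(yaw)
    pois = [(cx, cy, cz)]
    for dx in (-0.5, 0.5):
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                x, y = dx * length, dy * width
                pois.append((cx + c * x - s * y, cy + s * x + c * y, cz + dz * height))
    return pois

def lidar_to_camera(p):
    """Assumed extrinsics: same origin, LiDAR (x fwd, y left, z up) to camera (x right, y down, z fwd)."""
    x, y, z = p
    return (-y, -z, x)

def project_to_image(p_cam, fx, fy, cx, cy):
    """Pinhole projection; returns None for points behind the camera."""
    x, y, z = p_cam
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

def project_to_bev(p, x_min, y_min, cell):
    """Fractional (column, row) of a LiDAR-frame point in a BEV grid."""
    return ((p[0] - x_min) / cell, (p[1] - y_min) / cell)

def bilinear(fmap, u, v):
    """Bilinear interpolation of fmap[row][col][channel] at (u=col, v=row); None if outside."""
    h, w = len(fmap), len(fmap[0])
    if not (0 <= u <= w - 1 and 0 <= v <= h - 1):
        return None
    u0, v0 = min(int(u), w - 2), min(int(v), h - 2)
    du, dv = u - u0, v - v0
    out = []
    for k in range(len(fmap[0][0])):
        top = fmap[v0][u0][k] * (1 - du) + fmap[v0][u0 + 1][k] * du
        bot = fmap[v0 + 1][u0][k] * (1 - du) + fmap[v0 + 1][u0 + 1][k] * du
        out.append(top * (1 - dv) + bot * dv)
    return out

def fuse_query(center, size, yaw, img_feat, bev_feat, intr, bev_cfg):
    """Sample both views at each PoI, concatenate, then average over PoIs.

    Concatenation and averaging are placeholders for the learned fusion and
    aggregation; PoIs that fall outside a view contribute zeros for that view.
    """
    c_img, c_bev = len(img_feat[0][0]), len(bev_feat[0][0])
    fused = []
    for p in box_to_pois(center, size, yaw):
        uv = project_to_image(lidar_to_camera(p), *intr)
        f_img = bilinear(img_feat, *uv) if uv else None
        f_bev = bilinear(bev_feat, *project_to_bev(p, *bev_cfg))
        fused.append((f_img or [0.0] * c_img) + (f_bev or [0.0] * c_bev))
    n = len(fused)
    return [sum(f[k] for f in fused) / n for k in range(c_img + c_bev)]
```

Each modality keeps its own view in this sketch: the image features are read in pixel space and the LiDAR features in BEV space, so no dense view transformation and no global attention is involved. The per-query cost grows only with the number of PoIs.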