VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework for Multi-Modal 3D Object Detection (2401.02702v1)
Abstract: LiDAR-camera fusion can enhance 3D object detection performance by exploiting the complementary information between depth-aware LiDAR points and semantically rich images. Existing voxel-based methods face significant challenges when fusing sparse voxel features with dense image features in a one-to-one manner: the advantages of images, namely their semantic and continuity information, are lost, leading to sub-optimal detection performance, especially at long distances. In this paper, we present VoxelNextFusion, a multi-modal 3D object detection framework designed specifically for voxel-based methods, which effectively bridges the gap between sparse point clouds and dense images. In particular, we propose a voxel-based image pipeline that projects point clouds onto images to obtain both pixel- and patch-level features; these features are then fused using a self-attention mechanism to obtain a combined representation. Moreover, to address the background features present in patches, we propose a feature importance module that distinguishes foreground from background features, minimizing the impact of the latter. Extensive experiments were conducted on the widely used KITTI and nuScenes 3D object detection benchmarks. Notably, our VoxelNextFusion achieves around +3.20% AP@0.7 improvement for car detection at the hard difficulty level compared to the Voxel R-CNN baseline on the KITTI test set.
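The pipeline described in the abstract can be sketched end to end: project LiDAR points into the image plane, gather a pixel-level feature plus patch-level context per point, fuse the streams with self-attention, and down-weight likely-background context with an importance gate. This is a minimal NumPy illustration of those four steps, not the paper's implementation; all function names, the calibration matrix `P`, the 3x3 patch size, mean patch pooling, and the sigmoid gate parameters `w`/`b` are illustrative assumptions.

```python
import numpy as np

def project_to_image(points, P):
    """Project Nx3 LiDAR points to pixel coordinates with a 3x4 projection
    matrix P (a stand-in for a KITTI-style camera calibration product)."""
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous coords
    uvw = pts_h @ P.T                                           # N x 3
    return uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)        # perspective divide

def gather_pixel_and_patch(feat_map, uv, k=3):
    """For each projected point, take the single pixel feature and the mean of
    a k x k patch around it as patch-level context. feat_map: H x W x C."""
    H, W, C = feat_map.shape
    r = k // 2
    pixel, patch = [], []
    for u, v in uv:
        x, y = int(np.clip(u, 0, W - 1)), int(np.clip(v, 0, H - 1))
        pixel.append(feat_map[y, x])
        y0, y1 = max(0, y - r), min(H, y + r + 1)
        x0, x1 = max(0, x - r), min(W, x + r + 1)
        patch.append(feat_map[y0:y1, x0:x1].reshape(-1, C).mean(axis=0))
    return np.array(pixel), np.array(patch)

def self_attention_fuse(voxel_feat, pixel_feat, patch_feat):
    """Fuse the three per-point feature streams by scaled dot-product
    self-attention over the 3 tokens, then average the attended tokens."""
    tokens = np.stack([voxel_feat, pixel_feat, patch_feat], axis=1)  # N x 3 x C
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)         # N x 3 x 3
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)                        # row softmax
    return (weights @ tokens).mean(axis=1)                           # N x C

def importance_gate(fused, w, b):
    """A crude stand-in for the feature importance module: a sigmoid score per
    point scales the fused feature, suppressing likely-background context."""
    s = 1.0 / (1.0 + np.exp(-(fused @ w + b)))  # importance in (0, 1)
    return fused * s[:, None]
```

In the actual framework the fusion and gating would be learned layers operating on sparse voxel features inside the detector backbone; this sketch only makes the data flow of the abstract concrete.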