
Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection (2212.05265v2)

Published 10 Dec 2022 in cs.CV

Abstract: LiDAR-camera fusion techniques are promising for achieving 3D object detection in autonomous driving. Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds to enhance detection accuracy. However, the restricted resolution of 2D feature maps impedes accurate re-projection and often induces a pronounced boundary-blurring effect, primarily attributable to erroneous semantic segmentation. To address this limitation, we propose Multi-Sem Fusion (MSF), a general multi-modal fusion framework that fuses semantic information from both 2D image and 3D point-cloud scene-parsing results. Specifically, we employ 2D/3D semantic segmentation methods to generate parsing results for the 2D images and 3D point clouds, and re-project the 2D semantic information onto the 3D point clouds using calibration parameters. To handle the misalignment between the 2D and 3D parsing results, we propose an Adaptive Attention-based Fusion (AAF) module that fuses them by learning an adaptive fusion score. The point cloud with the fused semantic labels is then fed to the downstream 3D object detector. Furthermore, we propose a Deep Feature Fusion (DFF) module that aggregates deep features at different levels to boost the final detection performance. The effectiveness of the framework has been verified on two public large-scale 3D object detection benchmarks against different baselines. The experimental results show that the proposed fusion strategies significantly improve detection performance over methods using only point clouds or only 2D semantic information. Most importantly, the proposed approach significantly outperforms other approaches and sets state-of-the-art results on the nuScenes testing benchmark.
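The pipeline the abstract describes (re-projecting 2D semantic scores onto LiDAR points, then fusing them with 3D parsing scores via a learned weight) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `paint_points` is a simplified PointPainting-style projection with a hypothetical 3x4 camera projection matrix `P`, and `fuse_semantics` stands in for the AAF module with a per-point scalar weight `alpha` in place of the learned attention score.

```python
import numpy as np

def paint_points(points, seg_scores, P):
    """Project LiDAR points into the image plane and append the per-pixel
    2D semantic scores to each point (simplified PointPainting-style sketch).

    points:     (N, D) array, columns 0..2 are x, y, z in the camera frame
    seg_scores: (H, W, C) per-pixel class scores from a 2D segmenter
    P:          (3, 4) camera projection matrix (assumed calibration)
    """
    N = points.shape[0]
    homo = np.hstack([points[:, :3], np.ones((N, 1))])  # homogeneous coords
    uvw = homo @ P.T                                    # (N, 3) image-plane coords
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    H, W, C = seg_scores.shape
    # Keep only points in front of the camera that land inside the image.
    valid = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted = np.zeros((N, C))
    painted[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, painted])

def fuse_semantics(scores_2d, scores_3d, alpha):
    """Per-point convex combination of 2D and 3D semantic scores.
    alpha (N,) in [0, 1] is a stand-in for the AAF module's learned
    adaptive fusion score (hypothetical simplification)."""
    a = alpha[:, None]
    return a * scores_2d + (1.0 - a) * scores_3d
```

The fused scores would then be concatenated to the point features before the 3D detector, so misaligned 2D labels near object boundaries can be down-weighted in favor of the 3D parsing result.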

Authors (6)
  1. Shaoqing Xu (11 papers)
  2. Fang Li (142 papers)
  3. Ziying Song (23 papers)
  4. Jin Fang (23 papers)
  5. Sifen Wang (1 paper)
  6. Zhi-Xin Yang (16 papers)
Citations (5)
