Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection (2403.07372v1)

Published 12 Mar 2024 in cs.CV

Abstract: Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space. However, our empirical findings indicate that previous methods have limitations in generating fused BEV features free of cross-modal conflicts. These conflicts encompass extrinsic conflicts caused by BEV feature construction and inherent conflicts stemming from heterogeneous sensor signals. Therefore, we propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space and produce improved multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based Alignment (SFA) module to resolve extrinsic conflicts by unifying the spatial distribution in BEV space before fusion. Moreover, we design a Dissolved Query Recovering (DQR) mechanism to remedy inherent conflicts by preserving objectness clues that are lost in the fused BEV feature. In general, our method maximizes the effective information utilization of each modality and leverages inter-modal complementarity. Our method achieves state-of-the-art performance on the highly competitive nuScenes 3D object detection benchmark. The code is released at https://github.com/fjhzhixi/ECFusion.
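To make the SFA idea concrete, the following is a minimal sketch (not the paper's implementation) of warping one modality's BEV feature map by a per-cell flow field before fusing it with the other modality. In the actual method, the flow is predicted by a learned, semantics-guided network and the fusion is learned; here the flow is given as integer offsets and fusion is a simple element-wise mean. All function names and the nearest-neighbor warp are illustrative assumptions.

```python
def warp_bev(feat, flow_y, flow_x):
    """Warp a BEV feature grid (H x W, nested lists) by a per-cell flow
    field of integer offsets, using nearest-neighbor sampling with
    border clamping. A simplified stand-in for flow-based alignment."""
    H, W = len(feat), len(feat[0])
    out = [[0.0] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            # Sample from the flow-displaced source cell, clamped to the grid.
            sy = min(max(y + flow_y[y][x], 0), H - 1)
            sx = min(max(x + flow_x[y][x], 0), W - 1)
            out[y][x] = feat[sy][sx]
    return out


def fuse_bev(lidar_bev, cam_bev, flow_y, flow_x):
    """Align the camera BEV grid to the LiDAR BEV grid, then fuse.
    Element-wise mean is a placeholder for the paper's learned fusion."""
    cam_aligned = warp_bev(cam_bev, flow_y, flow_x)
    H, W = len(lidar_bev), len(lidar_bev[0])
    return [[0.5 * (lidar_bev[y][x] + cam_aligned[y][x]) for x in range(W)]
            for y in range(H)]
```

The key point is that warping happens before fusion, so both modalities agree on where in BEV space each object's evidence sits.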
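The DQR mechanism can likewise be sketched on top of detection heatmaps: object queries are normally initialized from peaks of the fused BEV heatmap, but a peak that is strong in a single modality may be "dissolved" (suppressed) after fusion. A hedged, simplified version recovers such positions from the per-modality heatmaps. The flattened-heatmap representation, top-k selection, and function names below are illustrative assumptions, not the paper's exact procedure.

```python
def top_positions(heatmap, k):
    """Indices of the k highest-scoring cells of a flattened BEV heatmap."""
    return set(sorted(range(len(heatmap)), key=lambda i: -heatmap[i])[:k])


def recover_queries(fused_hm, lidar_hm, cam_hm, k):
    """Return (fused top-k positions, recovered positions): single-modality
    peaks that no longer survive in the fused heatmap's top-k."""
    fused_q = top_positions(fused_hm, k)
    single_q = top_positions(lidar_hm, k) | top_positions(cam_hm, k)
    return fused_q, single_q - fused_q
```

Queries initialized at the recovered positions let the detector keep objectness clues that only one sensor observed, which is the complementarity the abstract argues for.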
