IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection (2403.15241v1)

Published 22 Mar 2024 in cs.CV

Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which poses significant challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures instance- and scene-level contextual information. IS-Fusion differs from existing approaches that focus only on BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating instance-centric tasks such as 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all published multimodal methods to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
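
The two-stage idea in the abstract (scene-level fusion followed by instance-guided refinement of the BEV feature) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that): the class names, grid/region window size, the top-k heatmap used to mine instance candidates, and all shapes and hyperparameters below are illustrative assumptions standing in for the HSF and IGF modules.

```python
# Minimal PyTorch sketch of the two fusion stages described in the abstract:
# a scene-level module (HSF-like) that fuses LiDAR and camera BEV features at
# grid and region granularity, and an instance-level module (IGF-like) that
# mines instance candidates and writes their context back into the BEV map.
# All class names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalSceneFusion(nn.Module):
    """Fuse per-cell multimodal features, then let cells attend within larger
    regions (a stand-in for the Point-to-Grid / Grid-to-Region transformers)."""

    def __init__(self, dim: int = 128, region: int = 4, heads: int = 4):
        super().__init__()
        self.region = region
        self.grid_fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.region_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lidar_bev: torch.Tensor, cam_bev: torch.Tensor) -> torch.Tensor:
        # lidar_bev, cam_bev: (B, C, H, W) BEV feature maps from each modality.
        B, C, H, W = lidar_bev.shape
        grids = torch.cat([lidar_bev, cam_bev], dim=1).permute(0, 2, 3, 1)  # (B, H, W, 2C)
        fused = self.grid_fuse(grids)                                       # (B, H, W, C)
        # Partition the BEV map into non-overlapping region windows and run
        # self-attention inside each window.
        r = self.region
        x = fused.reshape(B, H // r, r, W // r, r, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, r * r, C)
        x, _ = self.region_attn(x, x, x)
        x = x.reshape(B, H // r, W // r, r, r, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return (fused + x).permute(0, 3, 1, 2)  # residual; back to (B, C, H, W)


class InstanceGuidedFusion(nn.Module):
    """Select the top-k most salient BEV cells as instance candidates, relate
    them with self-attention, then let them guide every BEV cell via
    cross-attention, yielding an instance-aware BEV feature."""

    def __init__(self, dim: int = 128, num_instances: int = 200, heads: int = 4):
        super().__init__()
        self.num_instances = num_instances
        self.score = nn.Conv2d(dim, 1, kernel_size=1)  # candidate heatmap
        self.inst_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, scene_bev: torch.Tensor) -> torch.Tensor:
        B, C, H, W = scene_bev.shape
        flat = scene_bev.flatten(2).transpose(1, 2)               # (B, H*W, C)
        scores = self.score(scene_bev).flatten(1)                 # (B, H*W)
        top_idx = scores.topk(self.num_instances, dim=1).indices  # (B, K)
        inst = torch.gather(flat, 1, top_idx.unsqueeze(-1).expand(-1, -1, C))
        inst, _ = self.inst_attn(inst, inst, inst)    # instance-instance relations
        out, _ = self.scene_attn(flat, inst, inst)    # instances guide the scene
        return (flat + out).transpose(1, 2).reshape(B, C, H, W)


if __name__ == "__main__":
    B, C, H, W = 2, 128, 32, 32
    lidar_bev, cam_bev = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
    bev = HierarchicalSceneFusion(dim=C)(lidar_bev, cam_bev)
    bev = InstanceGuidedFusion(dim=C)(bev)
    print(bev.shape)  # torch.Size([2, 128, 32, 32]); fed to a detection head
```

In this sketch the instance candidates are simply the highest-scoring BEV cells from a learned heatmap; the paper's IGF additionally aggregates local multimodal context around each candidate before enhancing the scene feature.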
