SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection (2303.16818v4)

Published 29 Mar 2023 in cs.CV

Abstract: Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging and may lead to inferior performance. Although distilling precise 3D geometry knowledge from LiDAR data could help tackle this challenge, the benefits of LiDAR information could be greatly hindered by the significant modality gap between different sensory modalities. To address this issue, we propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy. Specifically, we devise multi-modal architectures for both teacher and student models, including a LiDAR-camera fusion-based teacher and a simulated fusion-based student. Owing to the "identical" architecture design, the student can mimic the teacher to generate multi-modal features with merely multi-view images as input, where a geometry compensation module is introduced to bridge the modality gap. Furthermore, we propose a comprehensive multi-modal distillation scheme that supports intra-modal, cross-modal, and multi-modal fusion distillation simultaneously in the Bird's-eye-view space. Incorporating them together, our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment. Extensive experiments validate the effectiveness and superiority of SimDistill over state-of-the-art methods, achieving an improvement of 4.8% mAP and 4.1% NDS over the baseline detector. The source code will be released at https://github.com/ViTAE-Transformer/SimDistill.
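
The three distillation terms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the use of mean squared error, the pairing of student and teacher branches, the flattened-list feature layout, and all names (`mse`, `distillation_loss`, the dictionary keys, the weights) are assumptions made purely for illustration. The released repository should be consulted for the actual losses.

```python
from typing import Dict, List

# Hypothetical layout: each BEV feature map is flattened into a list of floats.
Feature = List[float]


def mse(student: Feature, teacher: Feature) -> float:
    """Mean squared error between two equally sized flattened feature maps."""
    if len(student) != len(teacher):
        raise ValueError("feature maps must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def distillation_loss(student: Dict[str, Feature],
                      teacher: Dict[str, Feature],
                      weights: Dict[str, float]) -> float:
    """Weighted sum of three feature-mimicking terms (assumed form)."""
    # Intra-modal (assumed pairing): student camera branch vs. teacher camera branch.
    intra = mse(student["camera"], teacher["camera"])
    # Cross-modal (assumed pairing): student simulated-LiDAR branch vs. teacher LiDAR branch.
    cross = mse(student["sim_lidar"], teacher["lidar"])
    # Multi-modal fusion: fused BEV features of the student vs. those of the teacher.
    fusion = mse(student["fused"], teacher["fused"])
    return (weights["intra"] * intra
            + weights["cross"] * cross
            + weights["fusion"] * fusion)


# Toy features with two cells each, only to show the call shape.
teacher_feats = {"camera": [1.0, 2.0], "lidar": [0.0, 0.0], "fused": [1.0, 1.0]}
student_feats = {"camera": [1.0, 2.0], "sim_lidar": [1.0, 1.0], "fused": [1.0, 3.0]}
unit_weights = {"intra": 1.0, "cross": 1.0, "fusion": 1.0}
total = distillation_loss(student_feats, teacher_feats, unit_weights)
```

In this sketch the teacher features would be treated as fixed targets and only the student would be updated, which matches the abstract's description of a camera-only student mimicking a LiDAR-camera fusion teacher.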

Authors (6)
  1. Haimei Zhao (9 papers)
  2. Qiming Zhang (31 papers)
  3. Shanshan Zhao (39 papers)
  4. Zhe Chen (237 papers)
  5. Jing Zhang (731 papers)
  6. Dacheng Tao (829 papers)
