Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers (2312.14919v3)

Published 22 Dec 2023 in cs.CV and cs.LG

Abstract: Combining complementary sensor modalities is crucial to providing robust perception for safety-critical robotics applications such as autonomous driving (AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on monocular depth estimation, which is a notoriously difficult task compared to using depth information from the lidar directly. Here, we find that this approach does not leverage depth as expected and show that naively improving depth estimation does not lead to improvements in object detection performance. Strikingly, we also find that removing depth estimation altogether does not degrade object detection performance substantially, suggesting that relying on monocular depth could be an unnecessary architectural bottleneck during camera-lidar fusion. In this work, we introduce a novel fusion method that bypasses monocular depth estimation altogether and instead selects and fuses camera and lidar features in a bird's-eye-view grid using a simple attention mechanism. We show that our model can modulate its use of camera features based on the availability of lidar features, and that it yields better 3D object detection on the nuScenes dataset than baselines relying on monocular depth estimation.
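To make the fusion mechanism described in the abstract concrete, below is a minimal PyTorch sketch of attention-based camera-lidar fusion on a bird's-eye-view grid: lidar-derived BEV cells act as queries that attend over camera image features, so no per-pixel monocular depth is ever estimated. This is an illustrative sketch, not the paper's Lift-Attend-Splat implementation (which structures attention along projected camera columns rather than globally); the module name, shapes, and residual design here are assumptions.

```python
# Hedged sketch of camera-lidar BEV fusion via cross-attention.
# NOT the authors' implementation: global attention is used for
# simplicity, and all names/shapes are illustrative assumptions.
import torch
import torch.nn as nn

class BEVCameraLidarFusion(nn.Module):
    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_bev: torch.Tensor, cam_feats: torch.Tensor) -> torch.Tensor:
        """
        lidar_bev: (B, H*W, C) lidar features flattened over the BEV grid (queries).
        cam_feats: (B, N, C)   camera image features flattened over views/pixels
                               (keys and values).
        """
        # Each BEV cell selects the camera features relevant to it via attention.
        # Because lidar features form the queries, the model can modulate its use
        # of camera evidence based on what the lidar already provides.
        fused, _ = self.attn(query=lidar_bev, key=cam_feats, value=cam_feats)
        return self.norm(lidar_bev + fused)  # residual fusion of the two modalities

if __name__ == "__main__":
    # Toy usage: a 100x100 BEV grid and 2,000 camera feature tokens.
    fusion = BEVCameraLidarFusion()
    bev = torch.randn(2, 100 * 100, 128)
    cams = torch.randn(2, 2000, 128)
    print(fusion(bev, cams).shape)  # torch.Size([2, 10000, 128])
```

The key design point the abstract argues for is visible here: camera features reach the BEV grid through learned attention weights rather than through an explicit, error-prone depth prediction step.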
