BEVCar: Camera-Radar Fusion for BEV Map and Object Segmentation (2403.11761v2)
Abstract: Semantic scene segmentation from a bird's-eye-view (BEV) perspective plays a crucial role in facilitating planning and decision-making for mobile robots. Although recent vision-only methods have demonstrated notable performance gains, they often struggle under adverse environmental conditions such as rain or nighttime. While active sensors offer a solution to this challenge, the prohibitively high cost of LiDARs remains a limiting factor. Fusing camera data with automotive radars offers a less expensive alternative but has received comparatively little attention in prior research. In this work, we aim to advance this promising avenue by introducing BEVCar, a novel approach for joint BEV object and map segmentation. The core novelty of our approach lies in first learning a point-based encoding of raw radar data, which is then leveraged to efficiently initialize the lifting of image features into BEV space. We perform extensive experiments on the nuScenes dataset and demonstrate that BEVCar outperforms the current state of the art. Moreover, we show that incorporating radar information significantly enhances robustness in challenging environmental conditions and improves segmentation performance for distant objects. To foster future research, we provide the weather split of the nuScenes dataset used in our experiments, along with our code and trained models, at http://bevcar.cs.uni-freiburg.de.
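The core idea of the abstract, encoding raw radar points and using the result to seed the image-to-BEV lifting, can be illustrated with a short sketch. Everything below is a hypothetical illustration rather than the architecture from the paper: the class names, feature sizes, grid resolution, PointNet-style per-point MLP, and max-pooling scatter are all assumptions.

```python
import torch
import torch.nn as nn

class RadarPointEncoder(nn.Module):
    """Hypothetical per-point MLP (PointNet-style); layer sizes are assumed."""
    def __init__(self, in_dim: int = 5, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, in_dim) raw radar returns, e.g. x, y, z, RCS, Doppler velocity
        return self.mlp(points)  # (N, feat_dim) per-point features

def scatter_to_bev(point_feats, points_xy, bev_size=200, cell_m=0.5):
    """Max-pool per-point features into an ego-centric BEV grid (assumed layout)."""
    feat_dim = point_feats.shape[1]
    bev = point_feats.new_zeros(feat_dim, bev_size, bev_size)
    # Metric (x, y) in meters -> integer grid indices, ego vehicle at the grid center.
    idx = (points_xy / cell_m + bev_size // 2).long().clamp_(0, bev_size - 1)
    for feat, (col, row) in zip(point_feats, idx):  # naive loop; use a scatter-max op in practice
        bev[:, row, col] = torch.maximum(bev[:, row, col], feat)
    return bev  # (feat_dim, H, W); cells without a radar return stay zero

# Toy usage: 100 synthetic radar returns spread over roughly +/-60 m.
encoder = RadarPointEncoder()
points = torch.randn(100, 5) * 20.0
radar_bev = scatter_to_bev(encoder(points), points[:, :2])
print(radar_bev.shape)  # torch.Size([64, 200, 200])
```

In a full model, such a radar BEV map could initialize the queries of an attention-based lifting module (e.g., deformable attention as in Deformable DETR), so that image features are sampled preferentially where the radar provides geometric evidence; that wiring is omitted here.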
- H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuScenes: A multimodal dataset for autonomous driving,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11618–11628.
- D. Barnes, M. Gadd, P. Murcutt, P. Newman, and I. Posner, “The Oxford Radar RobotCar dataset: A radar extension to the Oxford RobotCar dataset,” in IEEE International Conference on Robotics and Automation, 2020, pp. 6433–6438.
- L. Zheng, Z. Ma, X. Zhu, B. Tan, S. Li, K. Long, W. Sun, S. Chen, L. Zhang, M. Wan, L. Huang, and J. Bai, “TJ4DRadSet: A 4D radar dataset for autonomous driving,” in IEEE International Conference on Intelligent Transportation Systems, 2022, pp. 493–498.
- Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “BEVFormer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European Conference on Computer Vision, 2022.
- T. Liang, H. Xie, K. Yu, Z. Xia, Z. Lin, Y. Wang, T. Tang, B. Wang, and Z. Tang, “BEVFusion: A simple and robust LiDAR-camera fusion framework,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 10421–10434.
- A. W. Harley, Z. Fang, J. Li, R. Ambrus, and K. Fragkiadaki, “Simple-BEV: What really matters for multi-sensor BEV perception?” in IEEE International Conference on Robotics and Automation, 2023, pp. 2759–2765.
- X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C.-L. Tai, “TransFusion: Robust LiDAR-camera fusion for 3D object detection with transformers,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1080–1089.
- Y. Man, L.-Y. Gui, and Y.-X. Wang, “BEV-guided multi-modality fusion for driving perception,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21960–21969.
- Y. Kim, J. Shin, S. Kim, I.-J. Lee, J. W. Choi, and D. Kum, “CRN: Camera radar net for accurate, robust, efficient 3D perception,” in IEEE/CVF International Conference on Computer Vision, 2023, pp. 17569–17580.
- J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D,” in European Conference on Computer Vision, 2020, pp. 194–210.
- Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “BEVerse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022.
- N. Hendy, C. Sloan, F. Tian, P. Duan, N. Charchut, Y. Xie, C. Wang, and J. Philbin, “FISHING Net: Future inference of semantic heatmaps in grids,” arXiv preprint arXiv:2006.09917, 2020.
- C. Lu, M. J. G. van de Molengraft, and G. Dubbelman, “Monocular semantic occupancy grid mapping with convolutional variational encoder–decoder networks,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 445–452, 2019.
- B. Pan, J. Sun, H. Y. T. Leung, A. Andonian, and B. Zhou, “Cross-view semantic segmentation for sensing surroundings,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4867–4873, 2020.
- T. Roddick and R. Cipolla, “Predicting semantic map representations from images using pyramid occupancy networks,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11135–11144.
- N. Gosala and A. Valada, “Bird’s-eye-view panoptic segmentation using monocular frontal view images,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1968–1975, 2022.
- Y. You, Y. Wang, W.-L. Chao, D. Garg, G. Pleiss, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudo-LiDAR++: Accurate depth for 3D object detection in autonomous driving,” in International Conference on Learning Representations, 2020.
- Y. Li, Z. Yu, C. Choy, C. Xiao, J. M. Alvarez, S. Fidler, C. Feng, and A. Anandkumar, “VoxFormer: Sparse voxel transformer for camera-based 3D semantic scene completion,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9087–9098.
- A. Saha, O. Mendez, C. Russell, and R. Bowden, “Translating images into maps,” in IEEE International Conference on Robotics and Automation, 2022, pp. 9200–9206.
- B. Zhou and P. Krähenbühl, “Cross-view transformers for real-time map-view semantic segmentation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13750–13759.
- L. Peng, Z. Chen, Z. Fu, P. Liang, and E. Cheng, “BEVSegFormer: Bird’s eye view semantic segmentation from arbitrary camera rigs,” in IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5924–5932.
- X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2021.
- J. Huang, G. Huang, Z. Zhu, Y. Ye, and D. Du, “BEVDet: High-performance multi-camera 3D object detection in bird-eye-view,” arXiv preprint arXiv:2112.11790, 2021.
- N. Gosala, K. Petek, P. L. J. Drews-Jr, W. Burgard, and A. Valada, “SkyEye: Self-supervised bird’s-eye-view semantic mapping using monocular frontal view images,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14901–14910.
- M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, et al., “DINOv2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
- Y. Zhou and O. Tuzel, “VoxelNet: End-to-end learning for point cloud based 3D object detection,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 4490–4499.
- I. T. Kurniawan and B. R. Trilaksono, “ClusterFusion: Leveraging radar spatial features for radar-camera 3D object detection in autonomous vehicles,” IEEE Access, vol. 11, pp. 121511–121528, 2023.
- Z. Yu, W. Wan, M. Ren, X. Zheng, and Z. Fang, “SparseFusion3D: Sparse sensor fusion for 3D object detection by radar and camera in environmental perception,” in IEEE Intelligent Vehicles Symposium, 2023.
- O. Schumann, M. Hahn, J. Dickmann, and C. Wöhler, “Semantic segmentation on radar point clouds,” in International Conference on Information Fusion, 2018, pp. 2179–2186.
- M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- M. Käppeler, K. Petek, N. Vödisch, W. Burgard, and A. Valada, “Few-shot panoptic segmentation with foundation models,” in IEEE International Conference on Robotics and Automation, 2024.
- M. Xu, Z. Zhang, F. Wei, H. Hu, and X. Bai, “SAN: Side adapter network for open-vocabulary semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15546–15561, 2023.
- Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, “Vision transformer adapter for dense predictions,” in International Conference on Learning Representations, 2023.
- T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 2, pp. 318–327, 2020.
- J. V. Hurtado and A. Valada, “Semantic scene segmentation for robotics,” in Deep Learning for Robot Perception and Cognition, 2022, pp. 279–311.