Improved Single Camera BEV Perception Using Multi-Camera Training
Abstract: Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks such as trajectory prediction. Typically, it relies on a sophisticated sensor configuration that captures a surround view from multiple cameras. However, cost efficiency is an optimization goal in large-scale production, so reducing the number of cameras becomes increasingly relevant, even though fewer input images lead to a drop in performance. This raises the problem of developing a BEV perception model that performs sufficiently well on a low-cost sensor setup. While this cost restriction applies primarily to production cars at inference time, it is far less problematic on a test vehicle during training. The objective of our approach is therefore to reduce the aforementioned performance drop as much as possible by training a modern multi-camera surround-view model and reducing it to single-camera inference. The approach combines three components: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss that supervises the transition from six-camera input to one-camera input during training. For single-camera inference, our method outperforms versions trained strictly with one camera or strictly with six-camera surround view, resulting in reduced hallucination and a higher-quality BEV map.
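The abstract does not include code, so below is a minimal, hypothetical PyTorch sketch of how these three ingredients could fit together. Everything here is an illustrative assumption rather than the authors' implementation: `ToyBEVEncoder`, the grid sizes, and the LR bounds are invented stand-ins. A frozen six-camera "teacher" produces BEV features that a camera-masked "student" is trained to reconstruct, while a cyclic LR schedule (in the spirit of Smith, WACV 2017) drives the optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CAMS = 6  # nuScenes-style surround view

class ToyBEVEncoder(nn.Module):
    """Stand-in for a surround-view BEV encoder: per-camera CNN features
    are pooled onto a coarse grid and averaged over the unmasked cameras."""
    def __init__(self, bev_ch=64, bev_size=50):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, bev_ch, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((bev_size, bev_size)),  # crude "BEV" grid
        )

    def forward(self, imgs, cam_mask):
        # imgs: (B, NUM_CAMS, 3, H, W); cam_mask: (B, NUM_CAMS) in {0, 1}
        B, N = imgs.shape[:2]
        feats = self.backbone(imgs.flatten(0, 1))   # (B*N, C', h, w)
        feats = feats.view(B, N, *feats.shape[1:])  # (B, N, C', h, w)
        w = cam_mask.view(B, N, 1, 1, 1)
        return (feats * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)

teacher = ToyBEVEncoder()  # stands in for the six-camera surround-view model
student = ToyBEVEncoder()
for p in teacher.parameters():
    p.requires_grad_(False)  # teacher is frozen

opt = torch.optim.SGD(student.parameters(), lr=1e-4, momentum=0.9)
# Cyclic LR schedule; base/max bounds and step size are guesses.
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=1e-3, step_size_up=500)

imgs = torch.randn(2, NUM_CAMS, 3, 128, 128)  # dummy surround-view batch
front_only = torch.zeros(2, NUM_CAMS)
front_only[:, 0] = 1.0  # mask out all cameras except the front one
all_cams = torch.ones(2, NUM_CAMS)

# Feature reconstruction loss: BEV features produced from the single
# remaining camera should match the six-camera teacher's BEV features.
loss = F.mse_loss(student(imgs, front_only), teacher(imgs, all_cams))
loss.backward()
opt.step()
sched.step()
opt.zero_grad()
```

In this sketch the masking is applied at the camera level (zeroing whole views), which lets the same weights train on six-camera batches and run single-camera inference; the paper's actual masking, view transformation, and loss weighting may differ.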