
Improved Single Camera BEV Perception Using Multi-Camera Training

Published 4 Sep 2024 in cs.CV (arXiv:2409.02676v1)

Abstract: Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks like trajectory prediction. In the past, this was accomplished with a sophisticated sensor configuration that captured a surround view from multiple cameras. In large-scale production, however, cost efficiency is an optimization goal, making setups with fewer cameras more relevant. But fewer input images come with a performance drop. This raises the problem of developing a BEV perception model that delivers sufficient performance on a low-cost sensor setup. While this cost restriction applies primarily at inference time on production cars, it is less problematic on a test vehicle during training. The objective of our approach is therefore to reduce the aforementioned performance drop as much as possible by taking a modern multi-camera surround-view model and reducing it to single-camera inference. The approach comprises three features: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss that supervises the transition from six-camera input to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with the six-camera surround view for single-camera inference, resulting in reduced hallucination and a better-quality BEV map.
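The abstract describes the training recipe only at a high level. As a rough illustration of how the three ingredients could fit together, the PyTorch sketch below masks out surround cameras at random while always keeping the target camera, drives the optimizer with a cyclical LR schedule (Smith, WACV 2017), and applies an L2 feature reconstruction loss against a frozen six-camera teacher. This is a minimal sketch under stated assumptions, not the authors' implementation: `ToyBEVEncoder`, `camera_dropout_mask`, and the masking-and-fusion scheme are hypothetical stand-ins (the paper builds on a BEVFormer-style model).

```python
# Illustrative sketch only -- not the paper's code. A toy encoder stands in
# for the real multi-camera BEV backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBEVEncoder(nn.Module):
    """Hypothetical stand-in for a multi-camera BEV backbone."""
    def __init__(self, in_ch=3, bev_ch=64):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, bev_ch, 3, stride=2, padding=1)

    def forward(self, images, cam_mask):
        # images: (B, N, C, H, W); cam_mask: (N,), 1.0 = camera visible
        B, N, C, H, W = images.shape
        feats = self.conv(images.flatten(0, 1)).unflatten(0, (B, N))
        feats = feats * cam_mask.view(1, N, 1, 1, 1)          # mask cameras
        return feats.sum(1) / cam_mask.sum().clamp(min=1.0)   # fuse to "BEV"

def camera_dropout_mask(num_cams=6, keep_cam=0, p_drop=0.5):
    """Randomly hide surround cameras; the target camera always stays."""
    mask = (torch.rand(num_cams) > p_drop).float()
    mask[keep_cam] = 1.0
    return mask

# Frozen six-camera teacher and maskable student share one architecture.
teacher, student = ToyBEVEncoder(), ToyBEVEncoder()
teacher.load_state_dict(student.state_dict())
teacher.requires_grad_(False)

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
# Cyclic LR schedule; cycle_momentum=False is required for Adam-type optimizers.
sched = torch.optim.lr_scheduler.CyclicLR(
    opt, base_lr=1e-5, max_lr=2e-4, step_size_up=1000, cycle_momentum=False)

images = torch.randn(2, 6, 3, 64, 112)        # dummy surround-view batch
teacher_bev = teacher(images, torch.ones(6))  # six-camera reference features

student_bev = student(images, camera_dropout_mask())
recon_loss = F.mse_loss(student_bev, teacher_bev)  # feature reconstruction loss
recon_loss.backward()
opt.step(); sched.step(); opt.zero_grad()
```

At inference, the same student would be called with a mask that keeps only the single production camera, which is the setting the reconstruction loss prepares it for during training.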

