RoadBEV: Road Surface Reconstruction in Bird's Eye View (2404.06605v3)
Abstract: Road surface conditions, especially geometry profiles, enormously affect the driving performance of autonomous vehicles. Vision-based online road reconstruction promises to capture road information in advance. Existing solutions such as monocular depth estimation and stereo matching suffer from modest performance. The recent technique of Bird's-Eye-View (BEV) perception offers immense potential for more reliable and accurate reconstruction. This paper proposes, in a unified framework, two simple yet effective models for road elevation reconstruction in BEV, named RoadBEV-mono and RoadBEV-stereo, which estimate road elevation with monocular and stereo images, respectively. The former directly fits elevation values based on voxel features queried from the image view, while the latter efficiently recognizes road elevation patterns based on a BEV volume representing the correlation between left and right voxel features. Insightful analyses reveal their consistency with, and differences from, the perspective view. Experiments on a real-world dataset verify the models' effectiveness and superiority: RoadBEV-mono and RoadBEV-stereo achieve elevation errors of 1.83 cm and 0.50 cm, respectively. Our models are promising for practical road preview, providing essential information for promoting the safety and comfort of autonomous vehicles. The code is released at https://github.com/ztsrxh/RoadBEV
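To make the two mechanisms concrete, here is a minimal PyTorch sketch, not the authors' released implementation: it illustrates how BEV voxel centers can be projected into the image plane and used to query per-voxel features (the ingredient RoadBEV-mono regresses elevation from), followed by a simplified dot-product correlation between left and right voxel features as a stand-in for the BEV volume that RoadBEV-stereo classifies over. All function names, tensor shapes, and the single-camera pinhole setup are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def query_voxel_features(img_feat, voxel_xyz, K, img_hw):
    """Sample image features at the projections of BEV voxel centers.

    img_feat:  (1, C, Hf, Wf) feature map from the image backbone
    voxel_xyz: (N, 3) voxel centers in the camera frame (x right, y down, z forward)
    K:         (3, 3) camera intrinsic matrix
    img_hw:    (H, W) size of the original image the features were computed from
    """
    # Pinhole projection of voxel centers into pixel coordinates.
    uvw = voxel_xyz @ K.T                               # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)       # (N, 2) pixel (u, v)

    # Normalize pixel coordinates to [-1, 1]; grid_sample maps this range
    # onto the full extent of the feature map, so normalizing by the image
    # size is valid when the feature map covers the whole image.
    H, W = img_hw
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)                       # (1, 1, N, 2)

    feat = F.grid_sample(img_feat, grid, align_corners=True)  # (1, C, 1, N)
    return feat.view(img_feat.shape[1], -1).T           # (N, C): one feature per voxel

def bev_correlation(left_vox, right_vox):
    """Per-voxel correlation between left and right voxel features (N, C) -> (N,),
    a simplified stand-in for the BEV volume used in the stereo model."""
    return (left_vox * right_vox).sum(dim=-1)
```

In this sketch, the monocular path would feed the (N, C) queried features to an elevation-regression head, while the stereo path would feed the correlation scores to a classifier over discretized elevation candidates, mirroring the fitting-versus-pattern-recognition distinction drawn in the abstract.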