Addressing Diverging Training Costs using BEVRestore for High-resolution Bird's Eye View Map Construction (2405.01016v4)
Abstract: Recent advances in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, the deep and bulky architectures of these methods incur substantial backpropagation memory and computing latency. This poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, because their large features significantly increase costs such as GPU memory consumption and computing latency; we call this the diverging training costs issue. Because of it, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components such as road lanes and sidewalks. Since this imprecision makes motion planning tasks such as collision avoidance risky, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel BEVRestore mechanism. Specifically, our proposed model encodes the features of each sensor to LR BEV space and restores them to HR space to establish a memory-efficient map constructor. To this end, we introduce the BEV restoration strategy, which corrects the aliasing and blocky artifacts of the up-scaled BEV features and narrows the width of the labels. Our extensive experiments show that the proposed mechanism provides a plug-and-play, memory-efficient pipeline, enabling HR map construction with a broad BEV scope.
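To make the cost argument concrete, the following is a minimal sketch of how BEV feature memory grows with resolution, and of why naively up-scaled LR features look blocky. It is not the BEVRestore implementation: the function names, the 100 m extent, the 0.5 m and 0.125 m cell sizes, the 64 channels, and the use of nearest-neighbour up-scaling are all illustrative assumptions that do not come from the paper.

```python
# Illustrative sketch only; every name and number here is a hypothetical assumption.

def bev_grid_cells(extent_m: float, cell_m: float) -> int:
    """Number of cells in a square BEV grid covering extent_m x extent_m metres."""
    side = round(extent_m / cell_m)
    return side * side


def feature_bytes(extent_m: float, cell_m: float, channels: int,
                  bytes_per_value: int = 4) -> int:
    """Bytes needed to store one dense BEV feature map (float32 by default)."""
    return bev_grid_cells(extent_m, cell_m) * channels * bytes_per_value


def upsample_nearest(grid, factor: int):
    """Nearest-neighbour up-scaling of a 2D list.

    Each LR cell is copied into a factor x factor block, which is the kind of
    blocky output a restoration step would then have to clean up.
    """
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Assumed setting: 100 m x 100 m scope, 64 feature channels.
lr_bytes = feature_bytes(100.0, 0.5, 64)     # 200 x 200 grid
hr_bytes = feature_bytes(100.0, 0.125, 64)   # 800 x 800 grid
ratio = hr_bytes // lr_bytes                 # 4x finer cells -> 16x the memory

lr_map = [[1, 2],
          [3, 4]]
hr_map = upsample_nearest(lr_map, 2)
```

Under these assumptions, making the cells four times finer multiplies the size of every dense BEV feature map by sixteen, which is why doing the heavy encoding in LR space and up-scaling only at the end saves memory.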