MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction (2404.00876v1)
Abstract: High-definition (HD) map construction is shifting toward lightweight online generation, which aims to provide timely and reliable road scene information. Map elements carry strong shape priors, yet their annotations are subtle and sparse, which makes it hard for current detection-based frameworks to locate the relevant feature regions and leads to the loss of detailed structures in predictions. To alleviate these problems, we propose MGMap, a mask-guided approach that highlights informative regions and achieves precise map element localization by introducing learned masks. Specifically, MGMap applies learned masks to enhanced multi-scale BEV features at two levels. At the instance level, we propose the mask-activated instance (MAI) decoder, which incorporates global instance and structural information into instance queries through the activation of instance masks. At the point level, a novel position-guided mask patch refinement (PG-MPR) module refines point locations from a finer-grained perspective by extracting point-specific patch information. Compared to the baselines, MGMap achieves a notable improvement of around 10 mAP across different input modalities. Extensive experiments also demonstrate strong robustness and generalization. Our code is available at https://github.com/xiaolul2/MGMap.
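The two levels of mask guidance described in the abstract can be illustrated with a minimal sketch. This is not the MGMap implementation: it assumes that instance-level activation can be approximated by a sigmoid-weighted average of BEV features under a predicted instance mask, and that point-level refinement starts from a small zero-padded patch cropped around each predicted point. The function names `mask_activated_query` and `extract_patch`, the nested-list data layout, and the `eps` parameter are all hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a mask logit to an activation in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mask_activated_query(bev_features, mask_logits, eps=1e-6):
    """Pool BEV features into one instance descriptor (assumed formulation).

    bev_features: list of C channels, each an H x W nested list of floats.
    mask_logits:  H x W nested list of logits for a single instance mask.
    Returns a list of C floats: the mask-weighted average of each channel.
    """
    weights = [[sigmoid(v) for v in row] for row in mask_logits]
    total = sum(sum(row) for row in weights) + eps  # eps avoids division by zero
    query = []
    for channel in bev_features:
        acc = 0.0
        for feat_row, weight_row in zip(channel, weights):
            for feat, weight in zip(feat_row, weight_row):
                acc += feat * weight
        query.append(acc / total)
    return query


def extract_patch(grid, center_row, center_col, half_size):
    """Crop a (2 * half_size + 1)-wide square patch around a point.

    Cells that fall outside the grid are filled with 0.0 (zero padding).
    """
    height, width = len(grid), len(grid[0])
    patch = []
    for r in range(center_row - half_size, center_row + half_size + 1):
        row = []
        for c in range(center_col - half_size, center_col + half_size + 1):
            inside = 0 <= r < height and 0 <= c < width
            row.append(grid[r][c] if inside else 0.0)
        patch.append(row)
    return patch


if __name__ == "__main__":
    features = [[[1.0, 2.0], [3.0, 4.0]]]      # C=1, H=2, W=2
    logits = [[20.0, -20.0], [-20.0, -20.0]]   # mask active at the top-left cell
    print(mask_activated_query(features, logits))
    print(extract_patch([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 0, 0, 1))
```

In this sketch a confident mask concentrates the pooled descriptor on the cells it activates, while a flat mask reduces to plain average pooling. A learned decoder would operate on tensors and combine such descriptors with the instance queries; that part is outside the scope of this illustration.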