MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction (2404.00876v1)

Published 1 Apr 2024 in cs.CV

Abstract: Currently, high-definition (HD) map construction leans towards a lightweight online generation tendency, which aims to preserve timely and reliable road scene information. However, map elements contain strong shape priors. Subtle and sparse annotations make current detection-based frameworks ambiguous in locating relevant feature scopes and cause the loss of detailed structures in prediction. To alleviate these problems, we propose MGMap, a mask-guided approach that effectively highlights the informative regions and achieves precise map element localization by introducing the learned masks. Specifically, MGMap employs learned masks based on the enhanced multi-scale BEV features from two perspectives. At the instance level, we propose the Mask-activated instance (MAI) decoder, which incorporates global instance and structural information into instance queries by the activation of instance masks. At the point level, a novel position-guided mask patch refinement (PG-MPR) module is designed to refine point locations from a finer-grained perspective, enabling the extraction of point-specific patch information. Compared to the baselines, our proposed MGMap achieves a notable improvement of around 10 mAP for different input modalities. Extensive experiments also demonstrate that our approach showcases strong robustness and generalization capabilities. Our code can be found at https://github.com/xiaolul2/MGMap.

References (51)
  1. nuScenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  2. Structured bird’s-eye-view traffic scene understanding from onboard images. In ICCV, pages 15661–15670, 2021.
  3. Per-pixel classification is not all you need for semantic segmentation. In NeurIPS, pages 17864–17875, 2021.
  4. Masked-attention mask transformer for universal image segmentation. In CVPR, pages 1290–1299, 2022a.
  5. Boundary-preserving mask r-cnn. In ECCV, pages 660–676. Springer, 2020.
  6. Sparse instance activation for real-time instance segmentation. In CVPR, pages 4433–4442, 2022b.
  7. Path-aware graph attention for hd maps in motion prediction. In ICRA, pages 6430–6436. IEEE, 2022.
  8. Hgformer: Hierarchical grouping transformer for domain generalized semantic segmentation. In CVPR, pages 15413–15423, 2023a.
  9. Pivotnet: Vectorized pivot learning for end-to-end hd map construction. In ICCV, pages 3672–3682, 2023b.
  10. Superfusion: Multilevel lidar-camera fusion for long-range hd map generation and prediction. arXiv preprint arXiv:2211.15656, 2022.
  11. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In CVPR, pages 11525–11533, 2020.
  12. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  13. Mask r-cnn. In ICCV, pages 2961–2969, 2017.
  14. Egovm: Achieving precise ego-localization using lightweight vectorized maps. arXiv preprint arXiv:2307.08991, 2023.
  15. Planning-oriented autonomous driving. In CVPR, pages 17853–17862, 2023.
  16. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790, 2021.
  17. Perceive, interact, predict: Learning dynamic and static clues for end-to-end motion prediction. arXiv preprint arXiv:2212.02181, 2022.
  18. Map-based precision vehicle localization in urban environments. In Robotics: Science and Systems, 2007.
  19. Hdmapnet: An online hd map construction and evaluation framework. In ICRA, pages 4628–4634. IEEE, 2022a.
  20. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022b.
  21. Polytransform: Deep polygon transformer for instance segmentation. In CVPR, pages 9131–9140, 2020.
  22. Bevfusion: A simple and robust lidar-camera fusion framework. In NeurIPS, 2022.
  23. Maptr: Structured modeling and learning for online vectorized hd map construction. In ICLR, 2022.
  24. Maptrv2: An end-to-end framework for online vectorized hd map construction. arXiv preprint arXiv:2308.05736, 2023.
  25. Petr: Position embedding transformation for multi-view 3d object detection. In ECCV, pages 531–548. Springer, 2022.
  26. Vectormapnet: End-to-end vectorized hd map learning. In ICML, pages 22352–22369. PMLR, 2023a.
  27. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, pages 2774–2781. IEEE, 2023b.
  28. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565–571. IEEE, 2016.
  29. Cross-view semantic segmentation for sensing surroundings. IEEE Robotics and Automation Letters, 5(3):4867–4873, 2020.
  30. Bevsegformer: Bird’s eye view semantic segmentation from arbitrary camera rigs. In WACV, pages 5935–5943, 2023.
  31. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In ECCV, pages 194–210. Springer, 2020.
  32. End-to-end vectorized hd-map construction with piecewise bezier curve. In CVPR, pages 13218–13228, 2023.
  33. Unifusion: Unified multi-view fusion transformer for spatial-temporal representation in bird’s-eye-view. In ICCV, pages 8690–8699, 2023.
  34. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019.
  35. A sim2real deep learning approach for the transformation of images from multiple vehicle-mounted cameras to a semantically segmented image in bird’s eye view. In ITSC, pages 1–7. IEEE, 2020.
  36. Predicting semantic map representations from images using pyramid occupancy networks. In CVPR, pages 11138–11147, 2020.
  37. Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain. In IROS, pages 4758–4765. IEEE, 2018.
  38. Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping. In IROS, pages 5135–5142. IEEE, 2020.
  39. Instagram: Instance-level graph modeling for vectorized hd map learning. In CVPRW, 2023.
  40. Gated-scnn: Gated shape cnns for semantic segmentation. In ICCV, pages 5229–5238, 2019.
  41. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114. PMLR, 2019.
  42. Look closer to segment better: Boundary patch refinement for instance segmentation. In CVPR, pages 13926–13935, 2021.
  43. Lidar2map: In defense of lidar-based semantic map construction using online camera distillation. In CVPR, pages 5186–5195, 2023.
  44. Fcos3d: Fully convolutional one-stage monocular 3d object detection. In ICCV, pages 913–922, 2021.
  45. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In CoRL, pages 180–191. PMLR, 2022.
  46. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In NeurIPS, 2021.
  47. Cbam: Convolutional block attention module. In ECCV, pages 3–19, 2018.
  48. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  49. Online map vectorization for autonomous driving: A rasterization perspective. In NeurIPS, 2023.
  50. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv preprint arXiv:2205.09743, 2022.
  51. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR, 2020.

Summary

  • The paper presents a novel mask-guided framework that refines HD map construction by enhancing mask-activated decoding and point-level localization.
  • The approach integrates multi-scale BEV feature extraction with attention mechanisms to overcome challenges from sparse annotations, achieving approximately a 10 mAP improvement.
  • MGMap offers a scalable, real-time mapping solution crucial for autonomous driving, accurately delineating lanes, road boundaries, and pedestrian crossings.

An Analysis of MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction

The paper "MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction" presents a framework for high-precision, real-time vectorized high-definition (HD) map construction, a capability essential for autonomous driving. The work addresses inherent challenges in current detection-based map construction pipelines, primarily the sparse and subtle annotations that lead to ambiguous feature localization and the loss of detailed structures in predictions.

Methodological Framework and Innovations

The proposed MGMap introduces a mask-guided learning framework that employs learned masks for more precise localization of map elements. The framework comprises three major components: BEV feature extraction with an enhanced neck, the Mask-Activated Instance (MAI) decoder, and the Position-Guided Mask Patch Refinement (PG-MPR) module.

  1. BEV Feature Extraction: The methodology employs a pyramidal structure, the Enhanced Multi-Level (EML) neck, to fuse multi-scale Bird’s-Eye-View (BEV) features into a comprehensive representation of the driving environment. Channel and spatial attention mechanisms within the EML neck capture the diverse semantic and positional cues required for precise localization of complex map structures.
  2. Mask-Activated Instance Decoder: The technique introduces mask-activated queries initialized from dynamically generated instance masks. These masks enhance the process of harnessing instance-specific information, allowing the decoder to augment query embeddings with both shape priors and global instance attributes, promoting a refined understanding of map elements.
  3. Position-Guided Mask Patch Refinement: The PG-MPR module addresses localized point-level refinement by drawing on dense patch features extracted from binary masks. This step refines point positions through interaction with local mask patches, thus boosting the fidelity of structure delineation and precise localization.
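The attention gating inside the EML neck follows the familiar channel-then-spatial pattern of CBAM (reference 47). Below is a minimal NumPy sketch of that pattern, with toy shapes and the learned MLP/convolution weights omitted, so it illustrates only the gating structure rather than the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W). Pool spatially, then gate each channel.
    avg = feat.mean(axis=(1, 2))          # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))            # (C,) max-pooled descriptor
    gate = sigmoid(avg + mx)              # shared MLP omitted for brevity
    return feat * gate[:, None, None]

def spatial_attention(feat):
    # feat: (C, H, W). Pool over channels, then gate each BEV location.
    avg = feat.mean(axis=0)               # (H, W)
    mx = feat.max(axis=0)                 # (H, W)
    gate = sigmoid(avg + mx)              # 7x7 conv omitted for brevity
    return feat * gate[None, :, :]

bev = np.random.rand(64, 50, 100)         # toy single-scale BEV feature map
refined = spatial_attention(channel_attention(bev))
print(refined.shape)                      # (64, 50, 100)
```

In the real network the pooled descriptors pass through learned layers before the sigmoid; the sketch keeps only the pool-gate-rescale skeleton.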
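How a learned instance mask "activates" a query can be illustrated with masked average pooling: each soft mask weights the BEV features, and the weighted mean becomes that instance's query embedding. The function name and shapes below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mask_activated_queries(bev_feat, inst_masks, eps=1e-6):
    """Pool BEV features under each soft instance mask into one query each.

    bev_feat:   (C, H, W) BEV feature map
    inst_masks: (N, H, W) per-instance soft masks in [0, 1]
    returns:    (N, C) mask-activated query embeddings
    """
    C, H, W = bev_feat.shape
    flat_feat = bev_feat.reshape(C, -1)                   # (C, H*W)
    flat_mask = inst_masks.reshape(len(inst_masks), -1)   # (N, H*W)
    weighted = flat_mask @ flat_feat.T                    # (N, C) mask-weighted sums
    return weighted / (flat_mask.sum(axis=1, keepdims=True) + eps)

bev = np.random.rand(256, 50, 100)
masks = np.random.rand(20, 50, 100)        # 20 soft instance masks
queries = mask_activated_queries(bev, masks)
print(queries.shape)                       # (20, 256)
```

Because the pooling is mask-weighted, each query aggregates global shape and location evidence from exactly the region its mask highlights, which is the intuition behind the MAI decoder's mask activation.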

Numerical Results and Performance Insights

Experimental results on the nuScenes and Argoverse 2 datasets underline MGMap's robustness and state-of-the-art performance across input modalities, including camera, LiDAR, and their fusion. MGMap achieves an improvement of roughly 10 mean Average Precision (mAP) over baselines such as MapTR on both Chamfer-distance-based and raster-based metrics. Notably, MGMap maintains high performance across diverse weather and lighting conditions, demonstrating substantial generalization capability.
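The Chamfer-distance matching underlying these mAP metrics compares sampled points on predicted and ground-truth polylines; a prediction counts as a true positive when the symmetric Chamfer distance falls under a threshold (0.5 m, 1.0 m, and 1.5 m are the thresholds commonly used in this line of work). A small NumPy sketch with toy polylines:

```python
import numpy as np

def chamfer_distance(pred_pts, gt_pts):
    """Symmetric Chamfer distance between two (N, 2) point sets."""
    # Pairwise Euclidean distances: (N_pred, N_gt)
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    # Average nearest-neighbor distance in both directions.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

# Toy example: a predicted lane divider laterally offset from ground truth.
gt = np.stack([np.linspace(0, 10, 20), np.zeros(20)], axis=1)
pred = gt + np.array([0.0, 0.4])           # shifted 0.4 m sideways
cd = chamfer_distance(pred, gt)
print(round(cd, 3))                        # 0.4 -> a match under a 0.5 m threshold
```

The evaluation code in the referenced benchmarks additionally resamples each polyline to a fixed number of points before computing the distance; that resampling step is omitted here.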

Implications and Future Directions

The implications of this research are manifold. Practically, MGMap offers a scalable solution for generating detailed, real-time vectorized HD maps essential for self-driving vehicles, capable of providing accurate lane, road boundary, and pedestrian crossing information. Theoretically, the mask-guided framework introduces a paradigm shift from coarse instance-level to fine-grained point-level representations, leveraging attention-based mechanisms to refine these representations in an end-to-end manner.

In future research, MGMap can be integrated with multi-modal and temporal data sources, facilitating even richer data representations, which can be pivotal in maneuvering complex driving scenarios. Exploring unsupervised or semi-supervised learning strategies could also enable the framework to adapt to unseen environments or novel geographical regions, further extending its practical applicability.

Conclusion

This paper delivers significant contributions to online HD map construction by harnessing mask-guided learning to address challenges posed by sparse annotations. MGMap sets a new benchmark in map vectorization, reflecting progress in the domain of autonomous driving, and offering avenues for widespread application in dynamic mapping and real-time navigation systems.
