HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction (2403.08639v2)
Abstract: Vectorized High-Definition (HD) map construction requires predicting the category and point coordinates of map elements (e.g. road boundary, lane divider, pedestrian crossing). State-of-the-art methods are mainly based on point-level representation learning for regressing accurate point coordinates. However, this pipeline is limited in obtaining element-level information and in handling element-level failures, e.g. erroneous element shape or entanglement between elements. To tackle these issues, we propose a simple yet effective HybrId framework named HIMap that learns both point-level and element-level information and lets the two interact. Concretely, we introduce a hybrid representation called HIQuery to represent all map elements, and propose a point-element interactor to interactively extract and encode the hybrid information of elements, e.g. point position and element shape, into the HIQuery. Additionally, we present a point-element consistency constraint to enhance the consistency between the point-level and element-level information. Finally, the output point-element integrated HIQuery can be directly converted into map elements' class, point coordinates, and mask. We conduct extensive experiments and consistently outperform previous methods on both the nuScenes and Argoverse2 datasets. Notably, our method achieves $77.8$ mAP on the nuScenes dataset, outperforming previous state-of-the-art methods by at least $8.3$ mAP.
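To make the idea of a hybrid query concrete, the following is a minimal NumPy sketch, not the HIMap implementation. It assumes a query that stores one embedding per point and one per element, decodes it into class logits, point coordinates, and a mask, and scores point-element agreement with a simple mean-pooling penalty. The function names, tensor shapes, linear heads, and the pooling-based penalty are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import numpy as np

def make_hiquery(num_elements, points_per_element, dim, seed=0):
    """Hypothetical hybrid query: per-point and per-element embeddings."""
    rng = np.random.default_rng(seed)
    point_q = rng.normal(size=(num_elements, points_per_element, dim))
    element_q = rng.normal(size=(num_elements, dim))
    return point_q, element_q

def decode(point_q, element_q, bev_feat, w_cls, w_pts):
    """Convert queries to class logits, normalized coordinates, and mask logits.

    Assumed shapes: bev_feat (dim, H, W), w_cls (dim, num_classes), w_pts (dim, 2).
    """
    class_logits = element_q @ w_cls                       # (E, num_classes)
    coords = 1.0 / (1.0 + np.exp(-(point_q @ w_pts)))      # (E, P, 2), sigmoid
    mask_logits = np.einsum("ed,dhw->ehw", element_q, bev_feat)  # (E, H, W)
    return class_logits, coords, mask_logits

def consistency_penalty(point_q, element_q):
    """Assumed stand-in for a consistency term: mean squared gap between
    the mean-pooled point embeddings and the element embedding."""
    pooled = point_q.mean(axis=1)                          # (E, dim)
    return float(np.mean((pooled - element_q) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    point_q, element_q = make_hiquery(num_elements=3, points_per_element=5, dim=8)
    cls, pts, msk = decode(
        point_q,
        element_q,
        bev_feat=rng.normal(size=(8, 4, 6)),
        w_cls=rng.normal(size=(8, 3)),
        w_pts=rng.normal(size=(8, 2)),
    )
    print(cls.shape, pts.shape, msk.shape)
    print(consistency_penalty(point_q, element_q))
```

The penalty is zero exactly when each element embedding equals the mean of its point embeddings, which is one simple way to express that the two levels of information should agree.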