MapNeXt: Revisiting Training and Scaling Practices for Online Vectorized HD Map Construction (2401.07323v1)

Published 14 Jan 2024 in cs.CV

Abstract: High-Definition (HD) maps are pivotal to autopilot navigation. Integrating lightweight, runtime HD map construction into a self-driving system has recently emerged as a promising direction. In this surge, vision-only perception stands out: a camera rig can still perceive stereo information, on top of its appealing portability and economy. The recent MapTR architecture solves online HD map construction in an end-to-end fashion, but its potential is yet to be fully explored. In this work, we present a full-scale upgrade of MapTR and propose MapNeXt, the next generation of HD map learning architecture, delivering major contributions from the model training and scaling perspectives. After shedding light on the training dynamics of MapTR and exploiting the supervision from map elements more thoroughly, MapNeXt-Tiny raises the mAP of MapTR-Tiny from 49.0% to 54.8% without any architectural modifications. Benefiting from map segmentation pre-training, MapNeXt-Base further lifts the mAP to 63.9%, already outperforming the prior art, a multi-modality MapTR, by 1.4% while being $\sim1.8\times$ faster. Toward pushing the performance frontier to the next level, we draw two conclusions on practical model scaling: an increased number of queries favors a larger decoder network for adequate digestion, and a larger backbone steadily improves the final accuracy without bells and whistles. Building upon these two rules of thumb, MapNeXt-Huge achieves state-of-the-art performance on the challenging nuScenes benchmark. Specifically, we push the mapless, vision-only, single-model performance above 78% for the first time, exceeding the best model from existing methods by 16%.
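
The scaling rule of thumb stated in the abstract (a larger query budget calls for a larger decoder to "digest" it) can be pictured with a small, hypothetical PyTorch sketch. `ToyMapDecoder`, its argument names, and all numbers below are illustrative assumptions for a generic DETR-style map decoder, not the authors' MapNeXt implementation or its exact settings.

```python
# Hypothetical sketch of a DETR-style map decoder whose query count and
# decoder capacity are scaled together, illustrating the paper's rule of
# thumb that more map-element queries need a larger decoder network.
import torch
import torch.nn as nn


class ToyMapDecoder(nn.Module):
    def __init__(self, num_queries: int, embed_dim: int, num_layers: int, num_points: int = 20):
        super().__init__()
        # Learnable map-element queries, one per candidate polyline.
        self.queries = nn.Embedding(num_queries, embed_dim)
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=8, dim_feedforward=4 * embed_dim, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Each query regresses a polyline of `num_points` BEV (x, y) vertices.
        self.point_head = nn.Linear(embed_dim, num_points * 2)

    def forward(self, bev_features: torch.Tensor) -> torch.Tensor:
        # bev_features: (batch, num_bev_tokens, embed_dim), e.g. a flattened BEV grid.
        b = bev_features.size(0)
        q = self.queries.weight.unsqueeze(0).repeat(b, 1, 1)
        decoded = self.decoder(q, bev_features)
        return self.point_head(decoded)  # (batch, num_queries, num_points * 2)


# Illustrative scaling: when the query budget grows, grow decoder depth/width with it.
tiny = ToyMapDecoder(num_queries=50, embed_dim=256, num_layers=2)
huge = ToyMapDecoder(num_queries=200, embed_dim=512, num_layers=6)

bev = torch.randn(2, 100 * 50, 256)  # toy flattened 100x50 BEV feature grid
print(tiny(bev).shape)  # torch.Size([2, 50, 40])
```

The second rule of thumb (a larger image backbone steadily improves accuracy) would enter upstream of this sketch, in whatever network produces `bev_features`.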
