VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning (2402.13243v1)
Abstract: Learning a human-like driving policy from large-scale driving demonstrations is promising, but the uncertainty and non-deterministic nature of planning make it challenging. To cope with this uncertainty, we propose VADv2, an end-to-end driving model based on probabilistic planning. VADv2 takes multi-view image sequences as input in a streaming manner, transforms the sensor data into environmental token embeddings, outputs a probability distribution over actions, and samples one action to control the vehicle. Using only camera sensors, VADv2 achieves state-of-the-art closed-loop performance on the CARLA Town05 benchmark, significantly outperforming all existing methods. It runs stably in a fully end-to-end manner, even without a rule-based wrapper. Closed-loop demos are presented at https://hgao-cv.github.io/VADv2.
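The probabilistic-planning idea in the abstract (score a set of candidate actions against the scene, form a distribution, then sample one action to execute) can be sketched minimally as below. This is not the paper's implementation: the token sizes, the mean-pooled context, the dot-product scoring head, and the discretized action vocabulary are all illustrative assumptions standing in for VADv2's learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative, not taken from the paper.
NUM_ENV_TOKENS, TOKEN_DIM = 16, 32   # environmental token embeddings
VOCAB_SIZE = 64                      # discretized candidate-action vocabulary


def score_actions(env_tokens, action_embeddings):
    """Stand-in for a learned planning head: pool the environmental
    tokens into a scene context and score every candidate action
    against it with a dot product."""
    context = env_tokens.mean(axis=0)        # (TOKEN_DIM,)
    return action_embeddings @ context       # (VOCAB_SIZE,) logits


def action_distribution(logits):
    """Numerically stable softmax over the action vocabulary."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()


# Fake features standing in for encoded multi-view image sequences.
env_tokens = rng.standard_normal((NUM_ENV_TOKENS, TOKEN_DIM))
action_emb = rng.standard_normal((VOCAB_SIZE, TOKEN_DIM))

probs = action_distribution(score_actions(env_tokens, action_emb))
action = rng.choice(VOCAB_SIZE, p=probs)  # sample one action to control the vehicle
```

Sampling from the full distribution (rather than always taking the argmax) is what lets such a model represent the non-deterministic nature of planning that the abstract highlights: several distinct actions can be plausible in the same scene.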