EMMA: End-to-End Multimodal Model for Autonomous Driving (2410.23262v2)
Abstract: We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model (LLM) foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from its pre-trained LLM by representing all non-sensor inputs (e.g., navigation instructions and ego vehicle status) and outputs (e.g., trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space and to generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small number of image frames, does not incorporate accurate 3D sensing modalities such as LiDAR or radar, and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.
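To make the text-interface idea concrete, the sketch below illustrates how non-sensor inputs (ego status, a navigation instruction) and a planned trajectory could be serialized to and from plain text for a multimodal LLM. This is a minimal illustration of the unified-language-space approach described in the abstract, not the paper's actual prompt format; the function names (`format_planning_prompt`, `parse_waypoints`) and the waypoint encoding are hypothetical assumptions.

```python
# Hypothetical sketch of the text-in / text-out interface the abstract describes:
# non-sensor inputs (ego status, navigation command) and outputs (future
# waypoints) are plain strings, so one multimodal LLM can serve several
# driving tasks through task-specific prompts. Names and formats here are
# illustrative assumptions, not EMMA's actual API.
from typing import List, Tuple


def format_planning_prompt(ego_xy: Tuple[float, float],
                           ego_speed_mps: float,
                           command: str,
                           history_xy: List[Tuple[float, float]]) -> str:
    """Serialize ego status and a navigation instruction as natural-language text."""
    history = "; ".join(f"({x:.2f}, {y:.2f})" for x, y in history_xy)
    return (
        f"Ego position: ({ego_xy[0]:.2f}, {ego_xy[1]:.2f}) m. "
        f"Speed: {ego_speed_mps:.1f} m/s. "
        f"Navigation command: {command}. "
        f"Past trajectory: {history}. "
        "Predict the future trajectory as waypoints (x, y) in meters."
    )


def parse_waypoints(model_text: str) -> List[Tuple[float, float]]:
    """Decode a trajectory written as text, e.g. '(1.2, 0.0); (2.5, 0.1); ...'."""
    points = []
    for token in model_text.split(";"):
        token = token.strip().strip("()")
        if not token:
            continue
        x_str, y_str = token.split(",")
        points.append((float(x_str), float(y_str)))
    return points


# Usage sketch: the real model also consumes camera frames; only the text
# side of the interface is shown here.
prompt = format_planning_prompt((0.0, 0.0), 5.4, "continue straight",
                                [(-4.0, 0.0), (-2.0, 0.0)])
fake_model_output = "(1.1, 0.0); (2.3, 0.1); (3.6, 0.1)"
trajectory = parse_waypoints(fake_model_output)
```

The same pattern would extend to other tasks (e.g., 3D object detection or road graph estimation) by swapping in a different task-specific prompt and output parser, which is what lets one model handle all tasks in a single language space.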
Authors: Jyh-Jing Hwang, Runsheng Xu, Hubert Lin, Wei-Chih Hung, Jingwei Ji, Kristy Choi, Di Huang, Tong He, Paul Covington, Benjamin Sapp, James Guo, Dragomir Anguelov, Mingxing Tan, Yin Zhou