
EMMA: End-to-End Multimodal Model for Autonomous Driving (2410.23262v2)

Published 30 Oct 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.RO

Abstract: We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multi-modal LLM foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained LLMs, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA's effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on the Waymo Open Motion Dataset (WOMD). EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA's potential as a generalist model for autonomous driving applications. However, EMMA also exhibits certain limitations: it can process only a small amount of image frames, does not incorporate accurate 3D sensing modalities like LiDAR or radar and is computationally expensive. We hope that our results will inspire further research to mitigate these issues and to further evolve the state of the art in autonomous driving model architectures.

Authors (14)
  1. Jyh-Jing Hwang
  2. Runsheng Xu
  3. Hubert Lin
  4. Wei-Chih Hung
  5. Jingwei Ji
  6. Kristy Choi
  7. Di Huang
  8. Tong He
  9. Paul Covington
  10. Benjamin Sapp
  11. James Guo
  12. Dragomir Anguelov
  13. Mingxing Tan
  14. Yin Zhou

Summary

Overview of "EMMA: End-to-End Multimodal Model for Autonomous Driving"

The paper "EMMA: End-to-End Multimodal Model for Autonomous Driving," introduces an innovative approach leveraging Multimodal LLMs (MLLMs) for developing autonomous driving systems. EMMA, the proposed system, integrates a diverse set of driving tasks into a unified framework, exploiting the robust capabilities of the Gemini model, a prominent MLLM architecture. This model seeks to improve upon conventional autonomous vehicle systems by directly transforming raw camera sensor data into driving-specific outputs such as planner trajectories, perception objects, and road graphs through a coherent language-based approach.

Methodological Contributions

The EMMA model adapts a large vision-language model to autonomous driving in order to bring the extensive "world knowledge" such models acquire during pre-training to bear on driving behavior. In contrast to traditional modular systems that split perception, mapping, and planning into distinct components, EMMA processes the various driving tasks in a shared language space: all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) are represented as text, and task-specific outputs are generated from tailored prompts. This end-to-end formulation removes the symbolic interfaces between modules found in earlier pipelines, which supports better scalability and more graceful handling of novel or rare conditions.
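
To make the unified language interface concrete, the following is a minimal Python sketch of how such prompt-based planning could be wired up. It is an illustration under stated assumptions, not the authors' implementation: `call_mllm` is a hypothetical placeholder for a pre-trained multimodal LLM, and the prompt wording, history window, and output format are invented for the example.

```python
import re
from typing import List, Tuple

# All names below are illustrative assumptions; this is not the authors' code.
Waypoint = Tuple[float, float]  # (x, y) offset in the ego frame, in metres


def call_mllm(images, prompt: str) -> str:
    """Hypothetical placeholder for a pre-trained Gemini-class multimodal LLM."""
    raise NotImplementedError("swap in an actual multimodal LLM call here")


def build_planning_prompt(nav_command: str, ego_history: List[Waypoint]) -> str:
    """Serialize the non-sensor inputs (navigation intent, ego status) as text."""
    history_txt = "; ".join(f"({x:.2f}, {y:.2f})" for x, y in ego_history)
    return (
        "Task: motion planning.\n"
        f"Navigation command: {nav_command}\n"
        f"Ego waypoints over the last 2 s: {history_txt}\n"
        "Answer with the ego waypoints for the next 5 s as (x, y) pairs."
    )


def parse_waypoints(answer: str) -> List[Waypoint]:
    """Decode the model's text answer back into numeric waypoints."""
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", answer)
    return [(float(x), float(y)) for x, y in pairs]


def plan(camera_frames, nav_command: str, ego_history: List[Waypoint]) -> List[Waypoint]:
    """One forward pass: build the task prompt, query the MLLM, decode the text."""
    prompt = build_planning_prompt(nav_command, ego_history)
    return parse_waypoints(call_mllm(images=camera_frames, prompt=prompt))
```

Because switching tasks only changes the prompt and the output parser, object detection or road graph queries could reuse the same model and decoding path, which is the property the co-training experiments exploit.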

Strong Numerical Results and Competitiveness

Empirically, the EMMA framework demonstrates noteworthy performance on well-regarded benchmarks: it achieves state-of-the-art motion planning results on nuScenes, competitive results on the Waymo Open Motion Dataset (WOMD), and competitive camera-primary 3D object detection on the Waymo Open Dataset (WOD). Notably, these results are obtained from camera inputs alone, without the lidar or radar data on which many prior systems rely. This camera-only operation indicates the potential to reduce dependence on expensive sensor hardware, which matters for cost-effective autonomous vehicle development.

Complementary Insights and Future Directions

The paper also highlights limitations of the EMMA system: it can process only a small number of image frames, it does not yet incorporate accurate 3D sensing modalities such as lidar or radar, and it is computationally expensive. These limitations point to promising avenues for future research on reducing the model's computational cost and extending its multimodal interface to additional sensor types.

Furthermore, the paper explores chain-of-thought prompting within the EMMA framework as a means of improving model reasoning and explainability, and it reports that adding this reasoning step significantly improves motion planning quality.
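
A chain-of-thought variant fits the same text interface by asking the model to articulate scene understanding and a high-level decision before emitting the trajectory. The sketch below assumes the same hypothetical `call_mllm` placeholder as above; the prompt wording and the "Trajectory:" delimiter are illustrative assumptions rather than the paper's exact prompts.

```python
from typing import Tuple


def call_mllm(images, prompt: str) -> str:
    """Hypothetical placeholder for the multimodal LLM (same assumption as above)."""
    raise NotImplementedError("swap in an actual multimodal LLM call here")


# Illustrative prompt: reasoning first, trajectory last, separated by a fixed marker.
COT_PLANNING_PROMPT = (
    "Task: motion planning with reasoning.\n"
    "First, describe the critical objects in the scene and how they affect the ego vehicle.\n"
    "Then state the high-level driving decision (e.g. keep lane, yield, stop).\n"
    "Finally, output the ego waypoints for the next 5 s as (x, y) pairs on a line "
    "beginning with 'Trajectory:'."
)


def plan_with_reasoning(camera_frames) -> Tuple[str, str]:
    """Return (reasoning_text, trajectory_text) split from a single model answer."""
    answer = call_mllm(images=camera_frames, prompt=COT_PLANNING_PROMPT)
    reasoning, _, trajectory = answer.partition("Trajectory:")
    return reasoning.strip(), trajectory.strip()
```

Keeping the reasoning and the final trajectory in one generated answer keeps the interface purely textual while still exposing an interpretable intermediate.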

Implications for Autonomous Driving and AI Development

The implications for autonomous driving are substantial: EMMA illustrates the viability of generalist models that handle diverse, complex driving tasks without intricate hand-crafted interfaces between subsystems. It also offers a foundational framework that could accelerate the development of more adaptive and intelligent vehicular systems.

Theoretically, the work exemplifies the strength of utilizing generalized models for domain-specific applications, indicating that MLLMs can be beneficially adapted to specialized tasks such as autonomous driving. The research advocates for a broadened perspective on the application of LLMs, suggesting their utility beyond purely linguistic or general vision tasks.

In conclusion, EMMA marks a significant contribution to advancing autonomous driving technologies by illustrating a novel, streamlined framework that coalesces multiple modalities and task types into a unified, language-guided paradigm. While constraints regarding computational efficiency and real-world deployment remain, the paper sets a precedent for further exploration into robust, versatile AI systems in the autonomous driving domain. Future developments aligned with this research could lead to practical and highly adaptable autonomous vehicle architectures.
