
A Survey for Foundation Models in Autonomous Driving

Published Feb 2, 2024 in cs.LG, cs.CV, and cs.RO


The advent of foundation models has revolutionized the fields of natural language processing and computer vision, paving the way for their application in autonomous driving (AD). This survey presents a comprehensive review of more than 40 research papers, demonstrating the role of foundation models in enhancing AD. Large language models contribute to planning and simulation in AD, particularly through their proficiency in reasoning, code generation and translation. In parallel, vision foundation models are increasingly adapted for critical tasks such as 3D object detection and tracking, as well as creating realistic driving scenarios for simulation and testing. Multi-modal foundation models, integrating diverse inputs, exhibit exceptional visual understanding and spatial reasoning, crucial for end-to-end AD. This survey not only provides a structured taxonomy, categorizing foundation models based on their modalities and functionalities within the AD domain but also delves into the methods employed in current research. It identifies the gaps between existing foundation models and cutting-edge AD approaches, thereby charting future research directions and proposing a roadmap for bridging these gaps.


  • Foundation Models (FMs) represent a significant shift in autonomous driving (AD): trained on extensive datasets, they improve how vehicles perceive, interpret, and interact with their environment.

  • LLMs contribute to AD by improving reasoning, planning, and user interface through techniques like prompt engineering and in-context learning.

  • Vision Foundation Models face challenges in AD perception tasks but offer potential in creating realistic virtual driving scenarios through generative and world models.

  • Multi-modal Foundation Models improve AD by leveraging data across different modalities for advanced reasoning, planning, and visual understanding.

Introduction: Evolving Landscape of Foundation Models in Autonomous Driving

The development and integration of Foundation Models (FMs) into Autonomous Driving (AD) mark a major shift in how vehicles perceive, interpret, and interact with their environment. Unlike traditional models, which often rely on manually annotated data, FMs are trained on extensive datasets and can therefore understand and respond to complex driving scenarios. This overview examines the role that these models, especially LLMs, Vision Foundation Models, and Multi-modal Foundation Models, play in enhancing AD systems.

LLMs in AD

Reasoning and Planning

LLMs have significantly contributed to AD, especially in reasoning and planning tasks. Their ability to assimilate common-sense knowledge from diverse web data has made them instrumental in making nuanced driving decisions. Through techniques such as prompt engineering and in-context learning, LLMs can provide recommendations and elucidate their reasoning, enhancing the explainability of autonomous driving decisions.
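
As a rough illustration of in-context learning for driving decisions, the sketch below assembles a few-shot prompt from example scenes and leaves the final scene open for the model to complete. It is not taken from any surveyed paper: the `DrivingExample` class, the `build_driving_prompt` function, the prompt wording, and the example scenes are all hypothetical, and the call to an actual LLM is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DrivingExample:
    """One in-context demonstration (hypothetical structure)."""
    scene: str
    action: str
    reasoning: str


def build_driving_prompt(examples: List[DrivingExample], scene: str) -> str:
    """Assemble a few-shot prompt asking for an action and its justification."""
    parts = [
        "You are a driving assistant. For each scene, give an action "
        "and a one-sentence reasoning."
    ]
    # Each demonstration shows the model the expected answer format.
    for ex in examples:
        parts.append(
            f"Scene: {ex.scene}\nAction: {ex.action}\nReasoning: {ex.reasoning}"
        )
    # The query scene ends at "Action:" so the model fills in the rest.
    parts.append(f"Scene: {scene}\nAction:")
    return "\n\n".join(parts)


# Illustrative demonstrations; the scenes and actions are made up.
examples = [
    DrivingExample(
        scene="A pedestrian is waiting at a crosswalk 20 m ahead.",
        action="Slow down and prepare to stop.",
        reasoning="The pedestrian may step into the road.",
    ),
    DrivingExample(
        scene="The traffic light ahead has just turned green and the road is clear.",
        action="Proceed at the speed limit.",
        reasoning="There is no obstacle and the signal permits driving.",
    ),
]

prompt = build_driving_prompt(
    examples, "A ball rolls onto the road from between parked cars."
)
print(prompt)
```

Because the demonstrations pair every action with a reasoning line, a model completing this prompt is nudged to explain its recommendation, which is the explainability benefit described above. The resulting string would be passed to whichever LLM interface is in use.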

Prediction and User Interface

LLMs are also applied to predicting the future trajectories and intents of traffic participants, offering an alternative to conventional trajectory prediction methods. They likewise change how Autonomous Vehicles (AVs) understand and execute user commands, moving away from a limited set of pre-defined commands toward interpreting free-form instructions.

Vision Foundation Models in Perception and Beyond

Perception Challenges

While Vision Foundation Models like DINO and SAM have shown promise across computer vision tasks, their application in AD, particularly in perception tasks like 3D object detection, faces challenges. These models struggle to handle the sparse and noisy data produced by AD sensors, which highlights the gap between current capabilities and the requirements of real-world AD applications.

Video Generation and World Models

Notably, the utilization of generative models and world models for creating realistic virtual driving scenarios presents a promising avenue for AD simulation and testing. The combination of camera images, text descriptions, and control signals to produce high-fidelity driving scenes underscores the potential of vision foundation models in enhancing AD technologies.

Multi-modal Foundation Models: Bridging Gaps Between Modalities

Advanced Reasoning and Planning

Integrating multi-modal inputs into foundation models opens new possibilities for AD. By processing data across different modalities, these models demonstrate strong spatial reasoning and visual understanding. They have been applied both to navigating long-tail scenarios using common sense and to performing unified perception and planning tasks.

The Road Ahead: Limitations and Future Directions

Despite these advancements, integrating foundation models into AD still faces significant limitations. LLMs can hallucinate wrong or misleading information, the models are computationally heavy, and they depend on accurate perception systems. Furthermore, the simulated environments used in most research do not fully capture the complexity of real-world driving scenarios, leaving a critical realism gap.

Conclusion: Envisioning the Future of Autonomous Driving With Foundation Models

The exploration of foundation models in AD heralds a future where vehicles not only respond to their immediate environment but also understand and predict complex scenarios with a level of discernment akin to human drivers. Bridging the current gaps through domain-specific pre-training, model optimization, and real-world data acquisition will be pivotal. As we chart the course for future research, the promise of foundation models in realizing fully autonomous driving systems that are safe, efficient, and user-centric remains an exciting frontier in automotive technology.
