A Survey on Multimodal LLMs for Autonomous Driving
The research paper "A Survey on Multimodal LLMs for Autonomous Driving" presents a comprehensive overview of the emerging field that combines Multimodal LLMs (MLLMs) with autonomous driving technologies. The survey highlights the convergence of LLMs capable of complex reasoning with multimodal sensory data processing, and how applying this combination to autonomous vehicles is reshaping the landscape of intelligent transportation systems.
Advancements in MLLMs and Autonomous Driving
The authors trace the evolution of LLMs, noting in particular the transformative impact of modality alignment, whereby LLMs integrate image, video, and audio data and thereby broaden the range of tasks they can perform. This alignment is crucial for autonomous driving systems because it mirrors human-like perception and decision-making.
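To make the idea of modality alignment concrete, the sketch below shows a projection layer that maps features from a vision encoder into the token-embedding space of a language model, the general pattern behind many MLLMs. It is a minimal illustration assuming PyTorch and arbitrary dimensions, not the architecture of any specific model covered by the survey.

```python
# Minimal sketch of modality alignment: projecting features from a frozen
# vision encoder into the token-embedding space of a language model so that
# image patches can be consumed as "soft tokens". The dimensions are
# illustrative assumptions, not taken from any particular model.
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small MLP is a common choice for the alignment layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        # returns:        (batch, num_patches, llm_dim), ready to be concatenated
        # with text token embeddings before the LLM forward pass.
        return self.proj(patch_features)

# Example: 196 ViT patch embeddings mapped into the LLM's embedding space.
image_tokens = VisionToTextProjector()(torch.randn(1, 196, 1024))
print(image_tokens.shape)  # torch.Size([1, 196, 4096])
```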
Concurrently, the paper tracks progress in autonomous driving, noting key developments such as localization and classification systems and emphasizing the shift from traditional software pipelines to more comprehensive cognitive architectures fostered by MLLMs. The rise of semantic segmentation and 3D object detection has been pivotal in making perception more robust, particularly within Bird's-Eye View (BEV) frameworks.
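As a simple illustration of the BEV idea, the snippet below lifts a camera pixel onto the road surface with a flat-ground inverse perspective mapping. Modern BEV perception frameworks use learned, multi-camera transforms; the pinhole intrinsics and camera height here are assumptions made purely for the sketch.

```python
# Flat-ground inverse perspective mapping: one simple way to place camera
# pixels on a Bird's-Eye View grid. The intrinsics and camera height are
# illustrative assumptions.
import numpy as np

K = np.array([[1000.0, 0.0, 640.0],   # assumed pinhole intrinsics
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
CAM_HEIGHT = 1.5  # metres above the road surface (assumption)

def pixel_to_bev(u: float, v: float) -> tuple[float, float]:
    """Project an image pixel onto the ground plane (x right, z forward, metres)."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in the camera frame
    if ray[1] <= 1e-6:                              # ray points at or above the horizon
        raise ValueError("pixel does not hit the ground plane")
    t = CAM_HEIGHT / ray[1]                         # scale so the ray reaches the ground
    x, _, z = t * ray
    return x, z

# A pixel below the image centre maps to a point roughly 10 m ahead of the car.
print(pixel_to_bev(640.0, 500.0))
```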
Applications of MLLMs in Autonomous Driving
A significant portion of the survey is dedicated to applications of MLLMs in the autonomous driving sector. Integrating LLMs into perception systems allows for richer interpretation of complex environments by combining linguistic context with dense visual inputs. Models like CLIP and its successors are revisited for how they leverage visual-linguistic connections to improve scene understanding. Moreover, driving-focused LLMs are engineered to support both spoken-language interfaces and navigational aids such as GPS enhancements.
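The snippet below is a hedged sketch of the kind of zero-shot scene understanding CLIP enables, using the Hugging Face transformers implementation; the image path and the prompt set are assumptions made for the example rather than anything specified in the survey.

```python
# Zero-shot driving-scene classification with CLIP: score one camera frame
# against a handful of natural-language scene descriptions. The dashcam image
# path and the prompts are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a highway scene with heavy traffic",
    "an empty residential street at night",
    "a pedestrian crossing in front of the vehicle",
]
image = Image.open("front_camera.jpg")  # hypothetical front-camera frame

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=1)  # (1, num_prompts)

for prompt, p in zip(prompts, probs[0].tolist()):
    print(f"{p:.2f}  {prompt}")
```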
For planning and control, the synthesis of motion information with textual understanding provides a novel interface for user engagement with the autonomous system. Here, the ability of LLMs to reason under uncertain scenarios and to generate safe, compliant trajectories in real time is a notable advancement.
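A rough sketch of how such a planner could be wired up is given below: the scene is serialised into a prompt, the model returns candidate waypoints as JSON, and a rule-based check vets them before they reach the controller. The query_llm stub and the speed constraint are hypothetical placeholders, not the survey's proposed pipeline.

```python
# Hypothetical LLM-planner interface: prompt in, waypoints out, with a simple
# safety check before the trajectory is handed to the controller.
import json
import math

def query_llm(prompt: str) -> str:
    # Placeholder: in practice this would call an actual MLLM endpoint.
    return json.dumps({"waypoints": [[0.0, 0.0], [0.5, 5.0], [1.0, 10.0]]})

def plan(scene_description: str, max_speed_mps: float = 15.0, dt: float = 0.5):
    prompt = (
        "You are a driving planner. Given the scene below, return JSON with a "
        "'waypoints' list of [x, y] positions in metres, spaced 0.5 s apart.\n"
        f"Scene: {scene_description}"
    )
    waypoints = json.loads(query_llm(prompt))["waypoints"]

    # Reject trajectories that imply an unsafe speed between consecutive points.
    for (x0, y0), (x1, y1) in zip(waypoints, waypoints[1:]):
        if math.hypot(x1 - x0, y1 - y0) / dt > max_speed_mps:
            raise ValueError("planned segment exceeds the speed limit")
    return waypoints

print(plan("two-lane road, lead vehicle 20 m ahead braking, speed limit 50 km/h"))
```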
Datasets and Benchmarks
The paper highlights the vital role of diverse and robust datasets in training MLLMs for autonomous driving tasks. It acknowledges that current datasets lack the variety needed to comprehensively train and test these models. The authors advocate for datasets that capture a wide array of traffic scenarios, and they highlight the role of workshops like LLVM-AD, which aim to enrich the dataset repertoire and to bring industry and academia together to advance the practical deployment of MLLMs in autonomous driving.
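To illustrate what a language-annotated driving sample might look like, the sketch below defines a hypothetical record that pairs sensor frames with a question-answer annotation and scenario tags; the field names are assumptions for the example, not the schema of any cited benchmark.

```python
# An illustrative (hypothetical) record layout for a language-annotated driving
# dataset: sensor frames paired with a natural-language question, an answer,
# and scenario tags.
from dataclasses import dataclass, field

@dataclass
class DrivingQASample:
    frame_id: str                      # links to camera / lidar files on disk
    camera_paths: list[str]            # multi-view RGB images
    lidar_path: str | None             # optional point-cloud sweep
    ego_speed_mps: float               # ego state at capture time
    question: str                      # natural-language query about the scene
    answer: str                        # free-form or templated response
    scenario_tags: list[str] = field(default_factory=list)  # e.g. ["rain", "merge"]

sample = DrivingQASample(
    frame_id="scene-0001_frame-042",
    camera_paths=["cam_front.jpg", "cam_left.jpg", "cam_right.jpg"],
    lidar_path="sweep_042.bin",
    ego_speed_mps=12.4,
    question="Is it safe to change lanes to the left?",
    answer="No, a motorcycle is overtaking in the left lane.",
    scenario_tags=["urban", "dense_traffic"],
)
print(sample.question, "->", sample.answer)
```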
Implications and Future Directions
The survey concludes with a discussion of future directions, pointing to the potential of MLLMs to revolutionize in-vehicle interactive systems by offering more personalized user experiences and by making automated systems more adaptable and safer. Challenges such as real-time processing constraints, data heterogeneity, and robustness across diverse environments are recognized as areas requiring further exploration. The role of these models in enhancing operational safety through real-time decision-making guided by sophisticated reasoning is also underscored as holding significant promise for the future of autonomous vehicles.
Fundamentally, the survey serves as a point of convergence for multiple streams of current research, establishing a groundwork upon which future comprehensive, multimodal-driven autonomous systems can be built. As the field matures, the authors urge academia and industry to focus on creating more adaptive, scalable, and well-integrated models that support the evolving needs of high-level autonomous driving systems.