DriveMLM: Integrating Multi-Modal LLMs for Enhanced Autonomous Driving
The paper "DriveMLM: Aligning Multi-Modal LLMs with Behavioral Planning States for Autonomous Driving" presents an ambitious exploration into the potential of LLMs in transforming autonomous driving (AD) systems, specifically through the proposed DriveMLM framework. This framework leverages the cognitive capabilities of LLMs to execute closed-loop autonomous driving in virtual environments. The research delineates a new horizon for AD systems by combining linguistic prowess with vehicular control in real-time scenarios.
Core Contributions
The paper identifies three principal innovations:
- Behavioral Planning State Alignment: This addresses the core challenge of translating language-based decisions into executable vehicle control. DriveMLM aligns the LLM's decision outputs with the decision states of a conventional behavioral planning module, such as the one in the Apollo framework, so that free-form language outputs map onto a fixed set of actionable planning states (a minimal sketch of this idea follows the list).
- Integration of a Multi-Modal LLM Planner: At the heart of the framework is a planner that uses a Multi-Modal LLM (MLLM) to process heterogeneous inputs, including camera images, LiDAR data, traffic rules, and user commands, and to predict driving decisions together with natural-language explanations for them.
- Efficient Data Collection Strategy: A bespoke data generation pipeline curates a large dataset of diverse scenarios paired with driving decisions and their linguistic explanations. This dataset is central to training DriveMLM, covering both the decision labels and the context needed to explain them.
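To make the alignment idea concrete, the sketch below shows one way free-form MLLM output could be constrained to a small set of Apollo-style speed and path decision states. The state names, the keyword lookup, and the `align_llm_output` helper are illustrative assumptions rather than the paper's actual interface; in DriveMLM the MLLM is trained to emit such decision states directly instead of being post-processed this way.

```python
from enum import Enum

# Hypothetical decision-state vocabulary, loosely modeled on the speed/path
# decision split used by Apollo-style behavioral planners. The exact state
# names used in DriveMLM may differ.
class SpeedDecision(Enum):
    KEEP = "keep"
    ACCELERATE = "accelerate"
    DECELERATE = "decelerate"
    STOP = "stop"

class PathDecision(Enum):
    FOLLOW = "follow"
    LEFT_CHANGE = "left_change"
    RIGHT_CHANGE = "right_change"

def align_llm_output(llm_text: str) -> tuple[PathDecision, SpeedDecision]:
    """Map free-form MLLM output onto a discrete planning state.

    This keyword lookup only illustrates the idea of constraining language
    output to a fixed state space that a downstream motion planner can consume.
    """
    text = llm_text.lower()
    path = (PathDecision.LEFT_CHANGE if "change lane to the left" in text
            else PathDecision.FOLLOW)
    if "stop" in text:
        speed = SpeedDecision.STOP
    elif "slow down" in text or "decelerate" in text:
        speed = SpeedDecision.DECELERATE
    else:
        speed = SpeedDecision.KEEP
    return path, speed

# Example: the MLLM explains its reasoning and decides.
decision = align_llm_output(
    "An ambulance is approaching from behind, so I will change lane to the left and slow down."
)
print(decision)  # -> (LEFT_CHANGE, DECELERATE)
```

The key design point is that the language model never drives the vehicle directly: its output is projected onto the same discrete states a traditional behavioral planner already knows how to execute, so the rest of the stack remains unchanged.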
Experimental Results and Analysis
Extensive evaluations underscore the model's proficiency: DriveMLM reaches a 76.1 driving score on the CARLA Town05 Long benchmark, a 4.7-point improvement over the Apollo baseline, and its Miles Per Intervention (MPI) is 1.25 times that of the same baseline, indicating robust decision-making in complex or novel driving situations. Such performance highlights the model's handling of decision transitions, yielding safer driving behavior in simulation.
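For readers unfamiliar with the metric, MPI is typically the total distance driven divided by the number of safety interventions, so a 1.25x gain means the vehicle drives about 25% farther between takeovers. The snippet below uses purely illustrative numbers (not figures from the paper) to show how such a ratio is computed.

```python
def miles_per_intervention(miles_driven: float, num_interventions: int) -> float:
    """MPI: average distance driven between safety interventions."""
    return miles_driven / max(num_interventions, 1)

# Illustrative numbers only, not results reported in the paper:
baseline_mpi = miles_per_intervention(100.0, 131)   # ~0.76 miles per takeover
drivemlm_mpi = miles_per_intervention(100.0, 105)   # ~0.95 miles per takeover
print(drivemlm_mpi / baseline_mpi)                  # ~1.25x improvement
```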
Interestingly, DriveMLM outperformed traditional rule-based systems and recent data-driven methods by offering enhanced adaptability to varying road situations and nuanced user commands, such as yielding to an ambulance. The framework's ability to explain its driving rationale further aids in demystifying autonomous decision-making processes for end-users, thus enhancing the trust and transparency of AD systems.
Implications and Future Directions
The implications of aligning LLM capabilities with autonomous systems are substantial. On a practical level, this advancement promises to enhance the adaptability and robustness of AD systems in real-world applications. The ability to handle unique and unstructured driving scenarios with linguistic and situational intelligence marks a significant progression from purely data-driven vehicle control paradigms.
Theoretically, this integration signals a shift in how autonomous systems perceive, plan, and interact with their environments. Combining LLMs with AD systems paves the way for autonomous agents that bridge high-level cognitive reasoning with low-level operational tasks, potentially leading to systems that learn and adapt continuously from both human input and empirical data.
For future research, expanding the scope of DriveMLM to include real-world driving conditions could validate and enhance the simulation results. Advancements could focus on improving the real-time processing efficiency of multi-modal inputs and further refining decision-making accuracy in diverse environmental conditions. Ultimately, the journey towards truly autonomous vehicles requires continual innovation and interdisciplinary collaboration, harnessing the combined potential of artificial intelligence domains such as NLP and computer vision.