
LMDrive: Closed-Loop End-to-End Driving with Large Language Models (2312.07488v2)

Published 12 Dec 2023 in cs.CV, cs.AI, and cs.RO

Abstract: Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, LLMs have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, models, and datasets can be found at https://github.com/opendilab/LMDrive

Overview of LMDrive: Closed-Loop End-to-End Driving with LLMs

The paper "LMDrive: Closed-Loop End-to-End Driving with LLMs" presents a novel approach to autonomous driving by integrating LLMs into a closed-loop, end-to-end driving system. Recognizing the limitations of existing methods, which predominantly rely on fixed-format inputs like sensor data and navigation waypoints, LMDrive innovatively harnesses the reasoning capabilities of LLMs to allow for natural language interaction and control.

Framework Description

LMDrive is designed to process multi-modal sensor data along with natural language instructions to generate control signals in real time. It operates in a closed-loop setting, unlike prior approaches that primarily utilize LLMs in open-loop configurations. The framework consists of:

  1. Vision Encoder: This module processes multi-view, multi-modal sensor data, including camera images and LiDAR input, to produce visual tokens. It is pre-trained on perception tasks such as object detection and waypoint prediction to build comprehensive scene understanding.
  2. LLM Integration: A pre-trained LLaMA model serves as the core of the system, responsible for understanding instructions and generating driving actions. It consumes the encoded visual tokens and predicts, in closed loop, the vehicle's control signals and whether the given instruction has been completed.

The inclusion of a Q-Former and learnable adapters enhances the interaction between vision-encoded data and the LLM, ensuring efficient token processing and accurate action generation.
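A minimal PyTorch-style sketch of this data flow is given below. The module names, dimensions, and interfaces are illustrative assumptions made for exposition, not the authors' released implementation.

```python
# Illustrative sketch of an LMDrive-style pipeline in PyTorch.
# Module names, dimensions, and interfaces are assumptions for exposition,
# not the authors' released code.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Fuses multi-view camera and LiDAR features into a sequence of visual tokens."""

    def __init__(self, cam_dim=256, lidar_dim=256, token_dim=256):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, token_dim)
        self.lidar_proj = nn.Linear(lidar_dim, token_dim)
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, cam_feats, lidar_feats):
        # cam_feats: (B, N_cam, cam_dim); lidar_feats: (B, N_lidar, lidar_dim)
        tokens = torch.cat([self.cam_proj(cam_feats), self.lidar_proj(lidar_feats)], dim=1)
        return self.fuse(tokens)  # (B, N_cam + N_lidar, token_dim)


class QFormerAdapter(nn.Module):
    """Compresses visual tokens with learnable queries and maps them to the LLM width."""

    def __init__(self, token_dim=256, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)
        self.adapter = nn.Linear(token_dim, llm_dim)

    def forward(self, visual_tokens):
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.adapter(fused)  # (B, num_queries, llm_dim)


class DrivingHeads(nn.Module):
    """Predicts future waypoints and an instruction-completion flag from LLM hidden states."""

    def __init__(self, llm_dim=4096, num_waypoints=5):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.waypoint_head = nn.Linear(llm_dim, num_waypoints * 2)  # (x, y) per waypoint
        self.done_head = nn.Linear(llm_dim, 1)

    def forward(self, llm_hidden):
        # llm_hidden: (B, seq_len, llm_dim) hidden states from the LLaMA backbone
        pooled = llm_hidden.mean(dim=1)
        waypoints = self.waypoint_head(pooled).view(-1, self.num_waypoints, 2)
        done_logit = self.done_head(pooled)
        return waypoints, done_logit
```

In such a setup, each closed-loop step would combine the compressed visual tokens for the latest frames with the tokenized instruction, pass them through the (frozen or adapter-tuned) LLaMA backbone, and convert the predicted waypoints into low-level control with a downstream controller, while the completion flag signals when to move on to the next instruction.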

Dataset and Benchmark

To support the development and evaluation of LMDrive, the authors introduce a dataset of approximately 64,000 instruction-following clips collected in the CARLA simulator. Each clip pairs multi-modal sensor data with navigation and notice instructions. The authors also introduce LangAuto, a benchmark that evaluates the system's ability to follow complex language instructions in challenging driving scenarios.
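The structure below is a hypothetical sketch of what a single instruction-following clip could contain; the field names and layout are illustrative, and the authoritative schema is the one defined by the released dataset.

```python
# Hypothetical structure of one instruction-following clip
# (field names are illustrative; see the released dataset for the real schema).
clip = {
    "navigation_instruction": "Turn left at the next intersection, then keep straight.",
    "notice_instruction": "Watch out for the pedestrian crossing ahead.",
    "frames": [
        {
            "camera_images": ["front.png", "left.png", "right.png", "rear.png"],
            "lidar": "lidar_0001.npy",
            "ego_state": {"speed": 6.2, "throttle": 0.4, "steer": -0.05, "brake": 0.0},
            "waypoints": [[1.2, 0.1], [2.5, 0.3], [3.9, 0.8]],  # future ego positions (m)
        },
        # ... one entry per simulation frame in the clip
    ],
    "instruction_completed": True,
}
```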

Experimental Results

Extensive experiments on the LangAuto benchmark demonstrate LMDrive's ability to execute driving tasks specified in natural language under diverse and challenging scenarios. Its driving-score metrics, which combine route completion with infraction penalties, indicate that the system remains effective in complex, realistic traffic situations evaluated in closed-loop simulation.
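For reference, CARLA-style driving scores scale route completion by an infraction penalty that multiplies one coefficient per committed infraction. The sketch below shows that computation with assumed penalty coefficients; the exact values used by the LangAuto benchmark may differ.

```python
def driving_score(route_completion, infractions, penalties=None):
    """CARLA-leaderboard-style driving score: route completion (in [0, 1]) scaled by
    an infraction penalty that multiplies one coefficient per infraction.
    The default coefficients below are illustrative, not quoted from the paper."""
    if penalties is None:
        penalties = {
            "collision_pedestrian": 0.50,
            "collision_vehicle": 0.60,
            "collision_static": 0.65,
            "red_light": 0.70,
        }
    infraction_penalty = 1.0
    for kind, count in infractions.items():
        infraction_penalty *= penalties.get(kind, 1.0) ** count
    return route_completion * infraction_penalty


# Example: 85% route completion with one vehicle collision and one red-light violation
print(driving_score(0.85, {"collision_vehicle": 1, "red_light": 1}))  # ≈ 0.357
```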

Implications and Future Directions

The integration of LLMs into autonomous driving systems, as exemplified by LMDrive, carries significant implications. The ability to interpret and act on natural language instructions enables improved human-vehicle interaction and adaptability to unforeseen urban challenges. This advancement opens avenues for further exploration in human-machine collaboration within autonomous systems.

Future efforts could focus on leveraging reinforcement learning to enhance the model's adaptability, expanding datasets to cover a broader range of real-world conditions, and refining LLM architectures to improve processing efficiency and control accuracy in dynamic environments.

In conclusion, LMDrive marks an important step towards more interactive and cognitively aware autonomous driving systems, with the potential to influence both theoretical research trajectories and practical applications in the field of autonomous vehicles.

Authors (6)
  1. Hao Shao
  2. Yuxuan Hu
  3. Letian Wang
  4. Steven L. Waslander
  5. Yu Liu
  6. Hongsheng Li