DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning over non-textual data such as images and videos. This study extends the application of MLLMs to autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. DriveGPT4 processes multi-frame video inputs and textual queries, interprets vehicle actions, provides the corresponding reasoning, and answers a diverse range of user questions. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These capabilities are achieved through a bespoke visual instruction tuning dataset tailored for autonomous driving, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the first effort to leverage LLMs for an interpretable end-to-end autonomous driving solution. Evaluations on the BDD-X dataset demonstrate the superior qualitative and quantitative performance of DriveGPT4. Additionally, fine-tuning on domain-specific data enables DriveGPT4 to achieve results on autonomous driving grounding that are close to, or even better than, those of GPT4-V.
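The abstract describes an interface that takes multi-frame video and a textual query and returns both an interpretable textual answer and low-level control signals. The following is a minimal sketch of that input/output shape only; the class names, fields (e.g. speed and turning angle), and function signature are illustrative assumptions, not the authors' actual API or implementation.

```python
# Hypothetical sketch of the DriveGPT4 interface outlined in the abstract:
# multi-frame video + a textual query in, a textual answer plus low-level
# control signals out. All names and fields here are assumptions.
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ControlSignal:
    speed: float          # predicted vehicle speed (assumed unit: m/s)
    turning_angle: float  # predicted turning angle (assumed unit: degrees)


@dataclass
class DriveGPT4Response:
    answer: str             # textual interpretation / reasoning about the action
    control: ControlSignal  # end-to-end low-level control prediction


def query_drivegpt4(frames: Sequence[bytes], question: str) -> DriveGPT4Response:
    """Placeholder for the model call: a real system would encode the video
    frames, tokenize the question, and decode both text and control values.
    This stub only illustrates the expected shape of inputs and outputs."""
    answer = f"(model answer to {question!r}, conditioned on {len(frames)} frames)"
    return DriveGPT4Response(answer=answer,
                             control=ControlSignal(speed=0.0, turning_angle=0.0))


if __name__ == "__main__":
    dummy_frames = [b""] * 8  # e.g. eight front-camera frames
    resp = query_drivegpt4(dummy_frames, "Why is the ego vehicle slowing down?")
    print(resp.answer, resp.control)
```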