DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model (2310.01412v5)

Published 2 Oct 2023 in cs.CV and cs.RO

Abstract: Multimodal large language models (MLLMs) have emerged as a prominent area of interest within the research community, given their proficiency in handling and reasoning with non-textual data, including images and videos. This study seeks to extend the application of MLLMs to the realm of autonomous driving by introducing DriveGPT4, a novel interpretable end-to-end autonomous driving system based on LLMs. Capable of processing multi-frame video inputs and textual queries, DriveGPT4 facilitates the interpretation of vehicle actions, offers pertinent reasoning, and effectively addresses a diverse range of questions posed by users. Furthermore, DriveGPT4 predicts low-level vehicle control signals in an end-to-end fashion. These advanced capabilities are achieved through the utilization of a bespoke visual instruction tuning dataset, specifically tailored for autonomous driving applications, in conjunction with a mix-finetuning training strategy. DriveGPT4 represents the pioneering effort to leverage LLMs for the development of an interpretable end-to-end autonomous driving solution. Evaluations conducted on the BDD-X dataset showcase the superior qualitative and quantitative performance of DriveGPT4. Additionally, the fine-tuning of domain-specific data enables DriveGPT4 to yield close or even improved results in terms of autonomous driving grounding when contrasted with GPT4-V.

DriveGPT4: An Advance in Interpretable End-to-End Autonomous Driving

In recent years, advancements in multimodal large language models (MLLMs) have ushered in new possibilities for a variety of domains, including autonomous driving. The paper "DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model" details a significant development in leveraging MLLMs for autonomous vehicle systems. DriveGPT4 integrates LLMs with video processing capabilities to create an interpretable and efficient end-to-end autonomous driving system. This paper elaborates on the design, training, and evaluation of the DriveGPT4 model, underscoring its potential impact on autonomous driving technology.

System Architecture and Methodology

DriveGPT4 stands out for its end-to-end operation: it interprets multi-frame video inputs together with textual queries, predicts low-level vehicle control signals, and provides detailed explanations to facilitate human understanding. The system not only generates vehicle control signals but also answers human questions about vehicle actions and the reasoning behind them. The hallmark of DriveGPT4 is its use of a specially designed visual instruction tuning dataset tailored for autonomous driving, together with a distinct mix-finetuning training strategy that enhances its capabilities.

The input to DriveGPT4 consists of video frames processed by a video tokenizer that converts the visual data into tokens the LLM can consume. The model uses LLaMA 2 as its foundational LLM, benefiting from its pretrained weights for text generation. Vehicle control predictions are likewise embedded in text form, so the model produces textual answers and action outputs in a single, integrated stream. The bespoke dataset, derived from the BDD-X dataset with assistance from ChatGPT, is critical for refining DriveGPT4's performance. This dataset, enriched with additional fine-tuning tasks, gives the model a robust basis for addressing diverse real-world driving scenarios.
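To make the idea of embedding control predictions in text concrete, the following is a minimal sketch (not the authors' code) of how control values might be serialized inside, and parsed back out of, the model's textual reply. The stubbed generate function, the "Speed"/"Turning angle" field names, and the exact formatting are illustrative assumptions; DriveGPT4's actual tokenization and units follow the paper.

```python
import re

# Stub standing in for the multimodal model (video tokens + text query -> text).
# In DriveGPT4 the backbone is LLaMA 2 fed by a video tokenizer; here the call is
# mocked so the text-parsing step can run on its own.
def generate(video_frames, question: str) -> str:
    return ("The car slows down because the traffic light ahead turns red. "
            "Speed: 2.4 Turning angle: -0.5")

def parse_control_signals(response: str) -> tuple[float, float]:
    """Recover the low-level control values the model emitted as plain text."""
    speed = float(re.search(r"Speed:\s*(-?\d+(?:\.\d+)?)", response).group(1))
    angle = float(re.search(r"Turning angle:\s*(-?\d+(?:\.\d+)?)", response).group(1))
    return speed, angle

if __name__ == "__main__":
    reply = generate(video_frames=None, question="What is the car doing, and why?")
    speed, angle = parse_control_signals(reply)
    print(reply)
    print(f"Parsed control signals: speed={speed}, turning angle={angle}")
```

Representing actions as text in this way lets a single autoregressive decoder produce both the natural-language explanation and the control values, which is the integration the paragraph above describes.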

Evaluation and Performance

Evaluations on the BDD-X dataset reveal DriveGPT4's superior performance across several metrics and conditions. The model predicts vehicle actions and control signals more effectively than existing state-of-the-art frameworks. Notably, DriveGPT4 delivers improved results on complex driving scenarios, thereby enhancing the reliability and applicability of end-to-end learning systems in real-world autonomous driving tasks. Moreover, the model's capacity to offer detailed reasoning in natural language sets a new benchmark for interpretability in vehicular AI systems.
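For the control-signal side of such an evaluation, prediction error is commonly summarized with a metric such as root-mean-square error (RMSE) between predicted and ground-truth values. The short sketch below, with toy arrays, shows how that comparison can be computed; it illustrates the metric rather than reproducing the paper's evaluation code, and the numbers and units are placeholders.

```python
import numpy as np

def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    """Root-mean-square error between predicted and ground-truth signals."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Toy predicted vs. ground-truth control signals for four frames
# (speed and turning angle; units depend on the dataset's conventions).
pred_speed = np.array([2.4, 3.1, 0.0, 5.2])
true_speed = np.array([2.2, 3.0, 0.1, 5.5])
pred_angle = np.array([-0.5, 0.0, 12.0, -3.2])
true_angle = np.array([-0.4, 0.1, 10.5, -3.0])

print(f"Speed RMSE: {rmse(pred_speed, true_speed):.3f}")
print(f"Turning-angle RMSE: {rmse(pred_angle, true_angle):.3f}")
```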

Discussion and Future Implications

From a theoretical perspective, DriveGPT4 bridges the gap between the broad reasoning capabilities of LLMs and the practical requirements of autonomous driving systems. It suggests the feasibility of training models that are not only proficient in vehicle control but also adept at articulating their decision-making processes. Practically, this development can strongly influence the design of autonomous vehicles, making them safer and more understandable for both expert users and the general public.

Moving forward, the paper points to potential extensions of this research toward closed-loop systems for real-time vehicle control, where the ability to continuously interpret and adapt to the dynamics of driving environments could be transformative. Additionally, adapting such models for deployment across a broader spectrum of autonomous applications will require further work, not least on the nuanced ethical and legal concerns inherent in autonomous driving.

In conclusion, DriveGPT4 marks a significant step in harnessing MLLMs for interpretable autonomous driving, paving the way for future models that combine interpretability with practical efficacy. This research underscores both the promise of AI-driven advancements in transportation and the continued evolution required to realize their full potential.

Authors (8)
  1. Zhenhua Xu (22 papers)
  2. Yujia Zhang (37 papers)
  3. Enze Xie (84 papers)
  4. Zhen Zhao (85 papers)
  5. Yong Guo (67 papers)
  6. Zhenguo Li (195 papers)
  7. Hengshuang Zhao (117 papers)
  8. Kwan-Yee K. Wong (2 papers)
Citations (162)