LLM4Drive: A Survey of Large Language Models for Autonomous Driving (2311.01043v4)

Published 2 Nov 2023 in cs.AI

Abstract: Autonomous driving technology, a catalyst for revolutionizing transportation and urban mobility, is transitioning from rule-based systems to data-driven strategies. Traditional module-based systems are constrained by cumulative errors among cascaded modules and by inflexible pre-set rules. In contrast, end-to-end autonomous driving systems can avoid error accumulation thanks to their fully data-driven training process, although they often lack transparency due to their "black box" nature, complicating the validation and traceability of decisions. Recently, LLMs have demonstrated abilities including context understanding, logical reasoning, and answer generation. A natural thought is to utilize these abilities to empower autonomous driving. Combining LLMs with foundation vision models could open the door to open-world understanding, reasoning, and few-shot learning, which current autonomous driving systems lack. In this paper, we systematically review the research line of LLMs for Autonomous Driving (LLM4AD). The study evaluates the current state of technological advancements, distinctly outlining the principal challenges and prospective directions for the field. For the convenience of researchers in academia and industry, we provide real-time updates on the latest advances in the field as well as relevant open-source resources via the designated link: https://github.com/Thinklab-SJTU/Awesome-LLM4AD.

Overview of LLM4Drive: A Survey of LLMs for Autonomous Driving

The paper "LLM4Drive: A Survey of LLMs for Autonomous Driving" by Yang et al., provides a comprehensive examination of leveraging LLMs to enhance autonomous driving (AD). The research systematically reviews the potential integration of LLMs into AD systems, addressing technological advancements, challenges, and future directions.

Key Insights and Contributions

The paper underscores a pivotal shift from traditional module-based systems to data-driven, end-to-end autonomous driving solutions. However, these end-to-end systems often lack decision transparency because of their "black box" nature. Introducing LLMs into autonomous driving systems could bridge this gap by improving decision-making, context understanding, and reasoning.

The authors organize LLM applications in autonomous driving into four primary areas:

  1. Planning and Control: LLMs can enhance vehicle decision-making, with approaches classified into fine-tuning pre-trained models and prompt engineering. Representative systems such as DriveMLM and LMDrive leverage multi-modal inputs to generate high-level decision commands (see the sketch after this list).
  2. Perception: Incorporating LLMs is expected to enhance tasks such as prediction, detection, and tracking. For example, HiLM-D integrates high-resolution information for risk-object localization, demonstrating the potential of LLMs to elevate perception in dynamic environments.
  3. Question Answering (QA): LLMs contribute significantly to QA systems by providing in-depth scene interpretation and decision rationalization. These capabilities are crucial for human-centric systems where understanding and interaction are central.
  4. Generation: The application of diffusion models to generate realistic datasets provides an avenue for creating synthetic driving scenarios under various conditions. This can serve as a resource for testing and validation, reducing data collection and annotation costs.
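To make the prompt-engineering route in the planning category concrete, the following is a minimal, hypothetical sketch of how perception outputs might be serialized into a text prompt and mapped back to a discrete high-level command. The scene fields, action vocabulary, and `query_llm` callable are illustrative assumptions, not the interface of DriveMLM, LMDrive, or any other surveyed system.

```python
# Hypothetical sketch of the prompt-engineering route to high-level driving
# decisions. The scene fields, action vocabulary, and `query_llm` helper are
# illustrative assumptions, not the interface of any specific surveyed system.
from dataclasses import dataclass

ACTIONS = ["KEEP_LANE", "CHANGE_LANE_LEFT", "CHANGE_LANE_RIGHT", "SLOW_DOWN", "STOP"]

@dataclass
class SceneDescription:
    ego_speed_mps: float        # ego vehicle speed in m/s
    lead_vehicle_gap_m: float   # distance to the lead vehicle in meters
    traffic_light: str          # e.g. "green", "red", "none"
    navigation_goal: str        # e.g. "turn left at the next intersection"

def build_prompt(scene: SceneDescription) -> str:
    """Serialize perception outputs into a textual prompt for the LLM."""
    return (
        f"You are a driving assistant. Choose exactly one action from {ACTIONS}.\n"
        f"Ego speed: {scene.ego_speed_mps:.1f} m/s\n"
        f"Gap to lead vehicle: {scene.lead_vehicle_gap_m:.1f} m\n"
        f"Traffic light: {scene.traffic_light}\n"
        f"Navigation goal: {scene.navigation_goal}\n"
        "Answer with the action name only."
    )

def decide(scene: SceneDescription, query_llm) -> str:
    """Query the LLM and fall back to a conservative action on an invalid reply."""
    reply = query_llm(build_prompt(scene)).strip().upper()
    return reply if reply in ACTIONS else "SLOW_DOWN"
```

A fine-tuned variant would instead adapt the model's weights on paired scene-action data, so that the command format is learned rather than enforced through instructions at inference time.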

Implications and Future Directions

The integration of LLMs in autonomous driving is poised to offer several theoretical and practical advancements. Theoretically, the ability of LLMs to process multi-modal data and generate coherent responses enhances the overall understanding and interpretation of driving situations. Practically, these models can improve safety, efficiency, and the adaptability of autonomous vehicles to new environments.

The work also highlights the importance of datasets suited for LLM applications in autonomous driving. The exploration of datasets like NuScenes-QA and Reason2Drive expands the scope of LLM4AD by providing intricate driving scenarios and QA pairs essential for training and evaluation.
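As a rough illustration, a driving QA benchmark of this kind pairs questions about a sensor scene with short ground-truth answers. The sketch below uses assumed field names and a simple exact-match score, not the actual NuScenes-QA or Reason2Drive schema or official evaluation protocol, to show how such samples might be represented and scored.

```python
# Illustrative sketch of a driving question-answering sample and a simple
# evaluation loop. Field names and the exact-match metric are assumptions
# for exposition, not the benchmarks' actual schema or protocol.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class DrivingQASample:
    scene_token: str   # identifier of the underlying sensor scene
    question: str      # e.g. "How many pedestrians are crossing ahead?"
    answer: str        # ground-truth short answer, e.g. "two"

def exact_match_accuracy(
    samples: List[DrivingQASample],
    predict: Callable[[str, str], str],
) -> float:
    """Score model answers by normalized exact match against the ground truth."""
    if not samples:
        return 0.0
    correct = sum(
        predict(s.scene_token, s.question).strip().lower() == s.answer.strip().lower()
        for s in samples
    )
    return correct / len(samples)
```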

For future developments, continuous advancements in LLM architectures and their training paradigms hold promise for enhanced performance in AD tasks. The potential for LLMs to address the "long-tail problem" in perception and decision-making remains a critical area for ongoing research.

In conclusion, the survey presented in this paper provides a pivotal understanding of where and how LLMs can be integrated into the autonomous driving pipeline. While challenges such as model interpretability and ethical considerations persist, the intersection of LLMs and autonomous driving offers compelling avenues for innovation and improvement within the domain.

References (130)
  1. Spice: Semantic propositional image caption evaluation, 2016.
  2. Explainable artificial intelligence for autonomous driving: A comprehensive overview and field guide for future research directions, 2023.
  3. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare Voss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
  4. Multiple object tracking in recent times: A literature review, 2022.
  5. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process., 2008, 2008.
  6. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  7. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
  8. Rt-2: Vision-language-action models transfer web knowledge to robotic control, 2023.
  9. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  10. nuscenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
  11. Learning from all vehicles. In CVPR, 2022.
  12. End-to-end autonomous driving: Challenges and frontiers. arXiv preprint arXiv:2306.16927, 2023.
  13. Driving with llms: Fusing object-level vector modality for explainable autonomous driving, 2023.
  14. Masked-attention mask transformer for universal image segmentation. 2022.
  15. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Pattern Analysis and Machine Intelligence (PAMI), 2023.
  16. Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles. arXiv preprint arXiv:2309.10228, 2023.
  17. Receive, reason, and react: Drive as you say with large language models in autonomous vehicles. arXiv preprint arXiv:2310.08034, 2023.
  18. Large language models for autonomous driving: Real-world experiments, 2023.
  19. Parting with misconceptions about learning-based vehicle motion planning. In CoRL, 2023.
  20. Multimodal trajectory prediction conditioned on lane-graph traversals, 2021.
  21. Talk2car: Taking control of your self-driving car. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, 2019.
  22. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  23. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving, 2023.
  24. Hilm-d: Towards high-resolution understanding in multimodal large language models for autonomous driving, 2023.
  25. Palm-e: An embodied multimodal language model, 2023.
  26. Panoptic nuscenes: A large-scale benchmark for lidar panoptic segmentation and tracking. arXiv preprint arXiv:2109.03805, 2021.
  27. Drive like a human: Rethinking autonomous driving with large language models, 2023.
  28. Gptscore: Evaluate as you desire, 2023.
  29. Magicdrive: Street view generation with diverse 3d geometry control, 2023.
  30. Gohome: Graph-oriented heatmap output for future motion estimation, 2021.
  31. Generative adversarial networks, 2014.
  32. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  33. End-to-end training of object class detectors for mean average precision, 2017.
  34. Denoising diffusion probabilistic models, 2020.
  35. Sim2real in robotics and automation: Applications and challenges. IEEE transactions on automation science and engineering, 18(2):398–400, 2021.
  36. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
  37. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
  38. Gpt-4v takes the wheel: Evaluating promise and challenges for pedestrian behavior prediction, 2023.
  39. The detection and rectification for identity-switch based on unfalsified control, 2023.
  40. Autonomy 2.0: Why is self-driving always 5 years away? arXiv preprint arXiv:2107.08142, 2021.
  41. Ide-net: Interactive driving event and pattern extraction from human data. IEEE Robotics and Automation Letters, 6(2):3065–3072, 2021.
  42. Towards capturing the temporal dynamics for trajectory prediction: a coarse-to-fine approach. In CoRL, 2022.
  43. Multi-agent trajectory prediction by combining egocentric and allocentric views. In Conference on Robot Learning, pages 1434–1443. PMLR, 2022.
  44. Adriver-i: A general world model for autonomous driving, 2023.
  45. Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving, 2023.
  46. Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023.
  47. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving, 2023.
  48. Adapt: Action-aware driving caption transformer, 2023.
  49. Surrealdriver: Designing generative driver agent simulation framework in urban contexts based on large language model, 2023.
  50. Can you text what is happening? integrating pre-trained language encoders into trajectory prediction models for autonomous driving, 2023.
  51. Text2video-zero: Text-to-image diffusion models are zero-shot video generators, 2023.
  52. Textual explanations for self-driving vehicles. Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  53. Grounding human-to-vehicle advice for self-driving vehicles, 2019.
  54. Grounding human-to-vehicle advice for self-driving vehicles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  55. Auto-encoding variational bayes, 2022.
  56. Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe. arXiv preprint arXiv:2209.05324, 2022.
  57. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
  58. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, pages 1–18. Springer, 2022.
  59. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  60. Graph-based topology reasoning for driving scenes. arXiv preprint arXiv:2304.05277, 2023.
  61. Drivingdiffusion: Layout-guided multi-view driving scene video generation with latent diffusion model. arXiv preprint arXiv:2310.07771, 2023.
  62. Pnpnet: End-to-end perception and prediction with tracking in the loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11553–11562, 2020.
  63. Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models, 2023.
  64. Visual instruction tuning, 2023.
  65. Mtd-gpt: A multi-task decision-making gpt model for autonomous driving at unsignalized intersections, 2023.
  66. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023.
  67. Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation. In ICRA, 2023.
  68. Fast and furious: Real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 3569–3577, 2018.
  69. Valley: Video assistant with large language model enhanced ability, 2023.
  70. Videofusion: Decomposed diffusion models for high-quality video generation, 2023.
  71. Dolphins: Multimodal language model for driving, 2023.
  72. Lampilot: An open benchmark dataset for autonomous driving with language model programs, 2023.
  73. Drama: Joint risk localization and captioning in driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1043–1052, 2023.
  74. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
  75. A language agent for autonomous driving, 2023.
  76. Lingoqa: Video question answering for autonomous driving, 2023.
  77. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving, 2023.
  78. OpenAI. Gpt-4 technical report, 2023.
  79. Training language models to follow instructions with human feedback, 2022.
  80. Proto-clip: Vision-language prototypical network for few-shot learning, 2023.
  81. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  82. On aliased resizing and surprising subtleties in gan evaluation, 2022.
  83. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023.
  84. Improving language understanding by generative pre-training. 2018.
  85. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  86. Learning transferable visual models from natural language supervision, 2021.
  87. Hierarchical text-conditional image generation with clip latents, 2022.
  88. Generalized intersection over union: A metric and a loss for bounding box regression, 2019.
  89. Variational inference with normalizing flows, 2016.
  90. High-resolution image synthesis with latent diffusion models, 2021.
  91. U-net: Convolutional networks for biomedical image segmentation, 2015.
  92. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023.
  93. Perceive, predict, and plan: Safe motion planning through interpretable semantic representations. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pages 414–430. Springer, 2020.
  94. Languagempc: Large language models as decision makers for autonomous driving. arXiv preprint arXiv:2310.03026, 2023.
  95. Lmdrive: Closed-loop end-to-end driving with large language models, 2023.
  96. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 35:6531–6543, 2022.
  97. ep-alm: Efficient perceptual augmentation of language models, 2023.
  98. Drivelm: Driving with graph visual question answering. arXiv preprint arXiv:2312.14150, 2023.
  99. Street-view image generation from a bird’s-eye view layout. arXiv preprint arXiv:2301.04634, 2023.
  100. Evaluation of large language models for decision making in autonomous driving, 2023.
  101. Domain knowledge distillation from large language model: An empirical study in the autonomous driving domain, 2023.
  102. Performance evaluation of deep learning networks for semantic segmentation of traffic stereo-pair images. In Proceedings of the 19th International Conference on Computer Systems and Technologies. ACM, sep 2018.
  103. Llama: Open and efficient foundation language models, 2023.
  104. Congested traffic states in empirical observations and microscopic simulations. Physical review E, 62(2):1805, 2000.
  105. Towards accurate generative models of video: A new metric and challenges, 2019.
  106. Cider: Consensus-based image description evaluation, 2015.
  107. Is chatgpt a good nlg evaluator? a preliminary study, 2023.
  108. Chatgpt as your vehicle co-pilot: An initial attempt. IEEE Transactions on Intelligent Vehicles, pages 1–17, 2023.
  109. Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv preprint arXiv:2312.09245, 2023.
  110. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
  111. Empowering autonomous driving with large language models: A safety perspective, 2023.
  112. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving, 2023.
  113. Dilu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023.
  114. On the road with gpt-4v(ision): Early explorations of visual-language model on autonomous driving, 2023.
  115. Policy pre-training for autonomous driving via self-supervised geometric modeling. In The Eleventh International Conference on Learning Representations, 2022.
  116. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline, 2022.
  117. Language prompt for autonomous driving, 2023.
  118. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9878–9888, June 2021.
  119. Sutd-trafficqa: A question answering benchmark and an efficient network for video reasoning over traffic events, 2021.
  120. Bits: Bi-level imitation for traffic simulation, 2022.
  121. Drivegpt4: Interpretable end-to-end autonomous driving via large language model, 2023.
  122. Bevcontrol: Accurately controlling street-view elements with multi-perspective consistency via bev sketch layout. arXiv preprint arXiv:2308.01661, 2023.
  123. Human-centric autonomous systems with llms for user command reasoning, 2023.
  124. The dawn of lmms: Preliminary explorations with gpt-4v(ision), 2023.
  125. Center-based 3d object detection and tracking, 2021.
  126. Motr: End-to-end multiple-object tracking with transformer. In European Conference on Computer Vision (ECCV), 2022.
  127. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  128. Trafficgpt: Viewing, processing and interacting with traffic foundation models. arXiv preprint arXiv:2309.06719, 2023.
  129. Guided conditional diffusion for controllable traffic simulation, 2022.
  130. Language-guided traffic simulation via scene-level diffusion, 2023.
Authors (4)
  1. Zhenjie Yang (7 papers)
  2. Xiaosong Jia (21 papers)
  3. Hongyang Li (99 papers)
  4. Junchi Yan (241 papers)
Citations (53)