
LMDrive: Closed-Loop End-to-End Driving with Large Language Models (2312.07488v2)

Published 12 Dec 2023 in cs.CV, cs.AI, and cs.RO

Abstract: Despite significant recent progress in the field of autonomous driving, modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand, LLMs have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand, previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints), restricting the vehicle's ability to understand language information and interact with humans. To this end, this paper introduces LMDrive, a novel language-guided, end-to-end, closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions, enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving, we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips, and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge, we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. Codes, models, and datasets can be found at https://github.com/opendilab/LMDrive

Overview of LMDrive: Closed-Loop End-to-End Driving with LLMs

The paper "LMDrive: Closed-Loop End-to-End Driving with LLMs" presents a novel approach to autonomous driving by integrating LLMs into a closed-loop, end-to-end driving system. Recognizing the limitations of existing methods, which predominantly rely on fixed-format inputs like sensor data and navigation waypoints, LMDrive innovatively harnesses the reasoning capabilities of LLMs to allow for natural language interaction and control.

Framework Description

LMDrive is designed to process multi-modal sensor data along with natural language instructions to generate control signals in real time. It operates in a closed-loop setting, unlike prior approaches that primarily utilize LLMs in open-loop configurations. The framework consists of:

  1. Vision Encoder: This module processes multi-view, multi-modal sensor data, including camera images and LiDAR input, to produce visual tokens. It is pre-trained on perception tasks such as object detection and waypoint prediction to build comprehensive scene understanding.
  2. LLM Integration: A pre-trained LLaMA model serves as the core of the system, responsible for understanding instructions and generating driving actions. It consumes the encoded visual tokens and predicts, in closed loop, the vehicle's control signals and whether the given instruction has been completed.

The inclusion of a Q-Former and learnable adapters enhances the interaction between vision-encoded data and the LLM, ensuring efficient token processing and accurate action generation.
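A minimal PyTorch-style sketch of this data flow is given below. The module names, dimensions, and interfaces are illustrative assumptions made for exposition, not the authors' released implementation.

```python
# Illustrative sketch of an LMDrive-style pipeline in PyTorch.
# Module names, dimensions, and interfaces are assumptions for exposition,
# not the authors' released code.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """Fuses multi-view camera and LiDAR features into a sequence of visual tokens."""

    def __init__(self, cam_dim=256, lidar_dim=256, token_dim=256):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, token_dim)
        self.lidar_proj = nn.Linear(lidar_dim, token_dim)
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.fuse = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, cam_feats, lidar_feats):
        # cam_feats: (B, N_cam, cam_dim); lidar_feats: (B, N_lidar, lidar_dim)
        tokens = torch.cat([self.cam_proj(cam_feats), self.lidar_proj(lidar_feats)], dim=1)
        return self.fuse(tokens)  # (B, N_cam + N_lidar, token_dim)


class QFormerAdapter(nn.Module):
    """Compresses visual tokens with learnable queries and maps them to the LLM width."""

    def __init__(self, token_dim=256, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads=8, batch_first=True)
        self.adapter = nn.Linear(token_dim, llm_dim)

    def forward(self, visual_tokens):
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.adapter(fused)  # (B, num_queries, llm_dim)


class DrivingHeads(nn.Module):
    """Predicts future waypoints and an instruction-completion flag from LLM hidden states."""

    def __init__(self, llm_dim=4096, num_waypoints=5):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.waypoint_head = nn.Linear(llm_dim, num_waypoints * 2)  # (x, y) per waypoint
        self.done_head = nn.Linear(llm_dim, 1)

    def forward(self, llm_hidden):
        # llm_hidden: (B, seq_len, llm_dim) hidden states from the LLaMA backbone
        pooled = llm_hidden.mean(dim=1)
        waypoints = self.waypoint_head(pooled).view(-1, self.num_waypoints, 2)
        done_logit = self.done_head(pooled)
        return waypoints, done_logit
```

In such a setup, each closed-loop step would combine the compressed visual tokens for the latest frames with the tokenized instruction, pass them through the (frozen or adapter-tuned) LLaMA backbone, and convert the predicted waypoints into low-level control with a downstream controller, while the completion flag signals when to move on to the next instruction.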

Dataset and Benchmark

To support the development and evaluation of LMDrive, the authors introduce a dataset of approximately 64,000 instruction-following clips collected in the CARLA simulator. Each clip pairs multi-modal sensor data with navigation and notice instructions. The authors also introduce LangAuto, a benchmark that evaluates the system's ability to follow complex language instructions in challenging driving scenarios.
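The structure below is a hypothetical sketch of what a single instruction-following clip could contain; the field names and layout are illustrative, and the authoritative schema is the one defined by the released dataset.

```python
# Hypothetical structure of one instruction-following clip
# (field names are illustrative; see the released dataset for the real schema).
clip = {
    "navigation_instruction": "Turn left at the next intersection, then keep straight.",
    "notice_instruction": "Watch out for the pedestrian crossing ahead.",
    "frames": [
        {
            "camera_images": ["front.png", "left.png", "right.png", "rear.png"],
            "lidar": "lidar_0001.npy",
            "ego_state": {"speed": 6.2, "throttle": 0.4, "steer": -0.05, "brake": 0.0},
            "waypoints": [[1.2, 0.1], [2.5, 0.3], [3.9, 0.8]],  # future ego positions (m)
        },
        # ... one entry per simulation frame in the clip
    ],
    "instruction_completed": True,
}
```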

Experimental Results

Extensive experiments on the LangAuto benchmark demonstrate LMDrive's ability to execute driving tasks specified in natural language under diverse and challenging scenarios. Its driving-score metrics, which combine route completion with infraction penalties, indicate that the system remains effective in complex, realistic traffic situations evaluated in closed-loop simulation.
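For reference, CARLA-style driving scores scale route completion by an infraction penalty that multiplies one coefficient per committed infraction. The sketch below shows that computation with assumed penalty coefficients; the exact values used by the LangAuto benchmark may differ.

```python
def driving_score(route_completion, infractions, penalties=None):
    """CARLA-leaderboard-style driving score: route completion (in [0, 1]) scaled by
    an infraction penalty that multiplies one coefficient per infraction.
    The default coefficients below are illustrative, not quoted from the paper."""
    if penalties is None:
        penalties = {
            "collision_pedestrian": 0.50,
            "collision_vehicle": 0.60,
            "collision_static": 0.65,
            "red_light": 0.70,
        }
    infraction_penalty = 1.0
    for kind, count in infractions.items():
        infraction_penalty *= penalties.get(kind, 1.0) ** count
    return route_completion * infraction_penalty


# Example: 85% route completion with one vehicle collision and one red-light violation
print(driving_score(0.85, {"collision_vehicle": 1, "red_light": 1}))  # ≈ 0.357
```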

Implications and Future Directions

The integration of LLMs into autonomous driving systems, as exemplified by LMDrive, carries significant implications. The ability to interpret and act on natural language instructions enables improved human-vehicle interaction and adaptability to unforeseen urban challenges. This advancement opens avenues for further exploration in human-machine collaboration within autonomous systems.

Future efforts could focus on leveraging reinforcement learning to enhance the model's adaptability, expanding datasets to cover a broader range of real-world conditions, and refining LLM architectures to improve processing efficiency and control accuracy in dynamic environments.

In conclusion, LMDrive marks an important step towards more interactive and cognitively aware autonomous driving systems, with the potential to influence both theoretical research trajectories and practical applications in the field of autonomous vehicles.

Authors (6)
  1. Hao Shao
  2. Yuxuan Hu
  3. Letian Wang
  4. Steven L. Waslander
  5. Yu Liu
  6. Hongsheng Li