
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving (2310.01957v2)

Published 3 Oct 2023 in cs.RO, cs.AI, cs.CL, and cs.CV

Abstract: LLMs have shown promise in the autonomous driving sector, particularly in generalization and interpretability. We introduce a unique object-level multimodal LLM architecture that merges vectorized numeric modalities with a pre-trained LLM to improve context understanding in driving situations. We also present a new dataset of 160k QA pairs derived from 10k driving scenarios, paired with high-quality control commands collected with an RL agent and question-answer pairs generated by a teacher LLM (GPT-3.5). A distinct pretraining strategy is devised to align numeric vector modalities with static LLM representations using vector captioning language data. We also introduce an evaluation metric for Driving QA and demonstrate our LLM-driver's proficiency in interpreting driving scenarios, answering questions, and decision-making. Our findings highlight the potential of LLM-based driving action generation in comparison to traditional behavioral cloning. We make our benchmark, datasets, and model available for further exploration.

An Overview of "Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving"

The paper "Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving" proposes a cutting-edge framework for integrating LLMs with traditional autonomous driving systems to enhance interpretability and generalization capabilities. The methodology centers on marrying object-level vector modalities with pre-trained LLMs using a novel multimodal architecture, effectively enabling these models to better comprehend and react to driving scenarios.

The authors introduce an object-level vector modality that augments the LLM's decision-making. Vectorized representations of the driving context, such as nearby vehicles, pedestrians, and traffic signals, are embedded into the LLM's input space. This allows the model to perform spatial reasoning and infer actions while producing a coherent natural-language explanation of those decisions.
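The paper's exact module is not reproduced here, but the fusion idea can be illustrated with a short sketch: per-object numeric vectors are encoded and projected into the LLM's token-embedding space, yielding a fixed number of "vector tokens" that can be prepended to the text prompt. The class name `ObjectVectorEncoder`, the dimensions, and the latent-query cross-attention design below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class ObjectVectorEncoder(nn.Module):
    """Illustrative encoder: each object (vehicle, pedestrian, traffic light,
    route point) arrives as a fixed-size numeric feature vector."""
    def __init__(self, obj_dim: int = 32, d_model: int = 4096, n_latents: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(obj_dim, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )
        # Learned latent queries attend over a variable number of objects and
        # return a fixed number of "vector tokens" for the LLM.
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, num_objects, obj_dim)
        kv = self.proj(objects)                                   # (B, N, d_model)
        q = self.latents.unsqueeze(0).expand(objects.size(0), -1, -1)
        vector_tokens, _ = self.cross_attn(q, kv, kv)             # (B, n_latents, d_model)
        return vector_tokens

# Example: 2 scenes, 40 surrounding objects each, 32 features per object.
encoder = ObjectVectorEncoder()
vector_tokens = encoder(torch.randn(2, 40, 32))                   # (2, 64, 4096)
```

The latent-query design is one common way to cope with a variable number of objects per scene; whatever the paper's exact choice, the key point is that the encoder's output lives in the same embedding space as the LLM's text tokens.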

Methodology and Contributions

The framework is structured around several key contributions:

  1. Novel Multimodal Architecture: The authors develop an architecture that fuses object-level vector modalities with a pre-trained LLM. A two-stage pretraining and fine-tuning process ensures that the numeric vector data integrates cleanly with textual representations (a minimal sketch of the fused training step follows this list).
  2. Extensive Dataset and Driving QA Task: The team assembled a dataset of 160,000 question-answer pairs derived from 10,000 driving scenarios. This dataset acts as a benchmark for the driving scenarios explored in the paper and supports the Driving QA evaluations.
  3. Evaluation with Driving QA: A novel evaluation method for Driving QA is introduced, presenting robust benchmarks and an initial pretrained baseline to guide further research in the domain.
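To make the integration in contribution 1 concrete, the following sketch shows one plausible way the projected vector tokens combine with embedded text during training: the vector tokens are prepended to the prompt embeddings and the model is optimized with the usual next-token cross-entropy on the caption or answer. A Hugging Face-style causal LM interface (`inputs_embeds`, `labels`, `get_input_embeddings`) is assumed; this is not the authors' training code.

```python
import torch

def fused_lm_step(llm, vector_encoder, batch, optimizer):
    """One illustrative training step: prepend vector tokens to the embedded
    text prompt and minimise next-token cross-entropy on the target tokens."""
    vector_tokens = vector_encoder(batch["objects"])              # (B, K, d_model)
    text_embeds = llm.get_input_embeddings()(batch["input_ids"])  # (B, T, d_model)
    inputs = torch.cat([vector_tokens, text_embeds], dim=1)

    # Vector-token positions carry no language-modelling loss (-100 is the
    # conventional "ignore" label index).
    ignore = torch.full(
        (batch["labels"].size(0), vector_tokens.size(1)), -100,
        dtype=batch["labels"].dtype, device=batch["labels"].device)
    labels = torch.cat([ignore, batch["labels"]], dim=1)

    out = llm(inputs_embeds=inputs, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

In the two-stage recipe, the first stage would run this step on vector-captioning data with the LLM frozen so only the vector encoder is updated, and the second stage would fine-tune on the Driving QA data.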

In terms of methodology, the paper employs reinforcement learning (RL) to collect high-quality training data within a driving simulation environment. The RL agent, acting as a pseudo-expert driver, generates realistic control commands across numerous procedurally generated scenarios. This approach circumvents the need for human expert drivers and accelerates data acquisition.
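As an illustration of this data-collection loop (the `sim` and `rl_expert` APIs below are hypothetical placeholders, not the paper's code), each step pairs the object-level observation with the RL expert's control command; the teacher LLM can later turn these records into question-answer pairs.

```python
def collect_episode(sim, rl_expert, max_steps=200):
    """Hypothetical loop pairing object-level observations with expert actions."""
    records = []
    obs = sim.reset()
    for _ in range(max_steps):
        action = rl_expert.act(obs)                # e.g. steering, acceleration, brake
        records.append({
            "objects": obs["object_vectors"],      # nearby vehicles, pedestrians, lights
            "ego_state": obs["ego_state"],         # speed, heading, position
            "action": action,                      # expert control command
        })
        obs, done = sim.step(action)
        if done:
            break
    return records
```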

The paper further describes a pretraining strategy in which the object-level vector and language modalities are aligned using pseudo-captioning data. This process, combined with fine-tuning on the Driving QA dataset, equips the model to perform complex decision-making and respond to nuanced driving queries.
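The pseudo-captioning (vector captioning) data can be thought of as rule-based text rendered directly from the numeric state, as in this illustrative helper (the field names and templates are assumptions, not the paper's exact ones); during the alignment stage, the model learns to reproduce such captions from the vector tokens alone.

```python
def pseudo_caption(ego_state, objects):
    """Illustrative rule-based caption of an object-level scene; the real
    templates and attribute set used in the paper may differ."""
    parts = [f"The ego vehicle is driving at {ego_state['speed_mps']:.1f} m/s."]
    for obj in objects:
        parts.append(
            f"There is a {obj['type']} {obj['distance_m']:.0f} m away "
            f"at {obj['angle_deg']:.0f} degrees, moving at {obj['speed_mps']:.1f} m/s."
        )
    return " ".join(parts)

# Example
print(pseudo_caption(
    {"speed_mps": 8.2},
    [{"type": "pedestrian", "distance_m": 12.0, "angle_deg": -15.0, "speed_mps": 1.4}],
))
```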

Results and Implications

The empirical results highlight the model's proficiency across several dimensions. Key metrics regarding the accuracy of action prediction and driving question-answering tasks indicate substantial improvement over baseline behavior cloning methods, although challenges remain in spatial perception tasks. The model's superior performance in action-based reasoning accentuates the benefits of integrating the semantic depth of LLMs with numerically rich autonomous driving data.
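Evaluating Driving QA requires grading free-form answers against reference answers. One common way to do this, and a reasonable reading of the paper's GPT-based setup, is to have a grader LLM score each answer, roughly as sketched below; the prompt wording and score scale are assumptions.

```python
GRADER_PROMPT = """You are grading an autonomous-driving question-answering system.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with a single integer score from 0 (completely wrong) to 10 (correct and well explained)."""

def grade_answer(llm_call, question, reference, answer):
    """`llm_call` is any function mapping a prompt string to a reply string
    (e.g. a wrapper around a GPT-style grader)."""
    reply = llm_call(GRADER_PROMPT.format(
        question=question, reference=reference, answer=answer))
    for token in reply.split():
        if token.strip(".").isdigit():
            return int(token.strip("."))
    return 0
```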

The work fundamentally enhances the interpretability of autonomous systems, addressing traditional limitations in behavior transparency and out-of-distribution reasoning. The introduction of a structured language generator, capable of translating complex vector data into narrative form, represents a significant methodological advancement with potential applications beyond simulated environments.

Conclusion and Future Directions

The paper lays the groundwork for future explorations in embedding pre-trained language understanding into vehicular operations, aspiring to tackle both theoretical challenges and practical hurdles in the field. Enhanced by this framework, autonomous systems could gain higher levels of context awareness and decision-making clarity, leading to improved safety and public trust. Future research could explore refining the grounding process for numeric vectors, scaling the approach to real-world scenarios, and reducing the computational complexity of LLMs during closed-loop evaluations.

Overall, the paper's findings underscore a pivotal shift toward explainable AI in autonomous systems, potentially steering the development trajectory in a direction that champions accountability and human-friendly interfaces.

Authors (8)
  1. Long Chen
  2. Oleg Sinavski
  3. Jan Hünermann
  4. Alice Karnsund
  5. Andrew James Willmott
  6. Danny Birch
  7. Daniel Maund
  8. Jamie Shotton