
Probing Multimodal LLMs as World Models for Driving (2405.05956v2)

Published 9 May 2024 in cs.RO and cs.CV

Abstract: We provide a sober look at the application of Multimodal LLMs (MLLMs) in autonomous driving, challenging common assumptions about their ability to interpret dynamic driving scenarios. Despite advances in models like GPT-4o, their performance in complex driving environments remains largely unexplored. Our experimental study assesses various MLLMs as world models using in-car camera perspectives and reveals that while these models excel at interpreting individual images, they struggle to synthesize coherent narratives across frames, leading to considerable inaccuracies in understanding (i) ego vehicle dynamics, (ii) interactions with other road actors, (iii) trajectory planning, and (iv) open-set scene reasoning. We introduce the Eval-LLM-Drive dataset and DriveSim simulator to enhance our evaluation, highlighting gaps in current MLLM capabilities and the need for improved models in dynamic real-world environments.

Evaluating Multimodal LLMs for Autonomous Driving

Introduction

In the domain of AI and autonomous driving, the potential role of Multimodal LLMs (MLLMs) such as GPT-4V has drawn both excitement and scrutiny. The primary goal of this study is to determine whether MLLMs can act as world models in autonomous driving scenarios, specifically through their ability to process, and make decisions based on, sequential imagery from a car's camera view.

Core Challenge in Dynamic Driving Environments

The allure of employing MLLMs in autonomous vehicles lies in their sophisticated ability to integrate and interpret multimodal data (such as images and text). However, when these models are tested in dynamic, less controlled environments such as driving, their efficacy degrades markedly.

Sequential Frame Analysis

The experiments explored how well these AI models could stitch together coherent narratives from sequences of driving images. The dynamic aspects, including ego-vehicle motion, other moving objects, and rapid changes in the environment, proved particularly challenging for the models.
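
To make this probing setup concrete, the sketch below shows one way to pose a sequential-frame question to a vision-capable chat model through the OpenAI Python SDK. It is an illustrative assumption rather than the paper's actual pipeline: the frame file names, the question wording, and the choice of the gpt-4o model are placeholders.

```python
import base64
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_frame(path: str) -> str:
    """Base64-encode a single in-car camera frame for the chat API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def probe_ego_dynamics(frame_paths: list[str], question: str) -> str:
    """Send an ordered sequence of camera frames plus a question about
    ego-vehicle dynamics, and return the model's free-form answer."""
    content = [{"type": "text", "text": question}]
    for path in frame_paths:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_frame(path)}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model; an assumption here
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content


# Hypothetical usage with placeholder frame names:
# answer = probe_ego_dynamics(
#     ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg"],
#     "Across these consecutive frames, is the ego vehicle moving forward, "
#     "backward, or standing still? Answer with one word.",
# )
```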

Key Findings

One surprising discovery was the models' overall weakness in logical sequence synthesis and dynamic reasoning:

  • Basic vehicle dynamics predictions like forward or backward movement were often flawed, showing biases toward certain actions irrespective of the scenario (e.g., near-constant prediction of forward movement); a minimal scoring sketch that surfaces this kind of bias follows this list.
  • Performance deteriorated further when the models were asked to interpret complex interactions with other vehicles or unexpected road events.
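
The sketch below is one minimal way to quantify the kind of action bias reported above: it compares predicted ego actions against ground truth and reports the raw prediction distribution alongside accuracy. The function name and label vocabulary are hypothetical and not taken from the paper's evaluation code.

```python
from collections import Counter


def summarize_predictions(preds: list[str], labels: list[str]) -> dict:
    """Compare predicted ego actions against ground truth and report both
    accuracy and the raw prediction distribution; heavy skew in the
    distribution (e.g. always "forward") exposes action bias directly."""
    assert len(preds) == len(labels) and preds, "need paired, non-empty lists"
    correct = sum(p == t for p, t in zip(preds, labels))
    return {
        "accuracy": correct / len(labels),
        "prediction_counts": Counter(preds),   # skew here reveals the bias
        "label_counts": Counter(labels),       # compare against the true mix
    }


# A model that always answers "forward" can still look decent on accuracy
# when the data is forward-heavy, but its prediction_counts expose the bias:
print(summarize_predictions(
    ["forward", "forward", "forward", "forward"],
    ["forward", "backward", "forward", "stopped"],
))
# -> accuracy 0.5, prediction_counts Counter({'forward': 4})
```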

The Role of Simulation

To test these models effectively, the paper introduced DriveSim, a specialized driving simulator that can generate a wide range of road situations, together with the Eval-LLM-Drive evaluation dataset. These tools allowed the researchers to rigorously challenge the predictive and reasoning powers of MLLMs under diverse conditions.
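
DriveSim's actual interface is not described in this summary, so the sketch below only illustrates, with hypothetical names such as Scenario and sample_scenario, how a simulator-driven evaluation sweep might parameterize the four probed capabilities: ego dynamics, interactions with other actors, trajectory planning, and open-set events.

```python
import random
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical description of one generated test scenario (not DriveSim's API)."""
    ego_action: str   # e.g. "forward", "reverse", "stopped"
    other_actor: str  # e.g. "crossing pedestrian", "merging car", "none"
    event: str        # open-set surprise, e.g. "fallen tree", "none"
    num_frames: int   # length of the rendered camera sequence


def sample_scenario(rng: random.Random) -> Scenario:
    """Sample one scenario configuration covering the probed skill categories."""
    return Scenario(
        ego_action=rng.choice(["forward", "reverse", "stopped", "turning left"]),
        other_actor=rng.choice(["none", "crossing pedestrian", "merging car"]),
        event=rng.choice(["none", "fallen tree", "sudden braking ahead"]),
        num_frames=rng.randint(4, 8),
    )


# A small sweep: sample many controlled scenarios, render each one, query the
# MLLM on the resulting frame sequence, and score its answers.
rng = random.Random(0)
scenarios = [sample_scenario(rng) for _ in range(100)]
```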

Future Outlook

Despite the current limitations, the practical value of improving MLLMs for driving applications remains significant. Enhanced models could transform how autonomous vehicles interpret their surroundings, make decisions, and learn from diverse driving conditions. However, substantial improvements in model training, including better dataset representation and more capable simulation, are necessary next steps.

Conclusion

While MLLMs like GPT-4V have showcased impressive abilities in controlled settings, their application as reliable world models for autonomous driving still faces significant hurdles. The paper sheds light on critical gaps, primarily in dynamic reasoning and in forming coherent narratives across driving frames. Addressing these challenges will be pivotal to advancing the reliability and safety of AI-driven autonomous vehicles in real-world scenarios.

Authors (6)
  1. Shiva Sreeram (3 papers)
  2. Tsun-Hsuan Wang (37 papers)
  3. Alaa Maalouf (27 papers)
  4. Guy Rosman (42 papers)
  5. Sertac Karaman (77 papers)
  6. Daniela Rus (181 papers)