
Will GPT-4 Run DOOM? (2403.05468v1)

Published 8 Mar 2024 in cs.CL, cs.AI, and cs.CV

Abstract: We show that GPT-4's reasoning and planning capabilities extend to the 1993 first-person shooter Doom. This LLM is able to run and play the game with only a few instructions, plus a textual description--generated by the model itself from screenshots--about the state of the game being observed. We find that GPT-4 can play the game to a passable degree: it is able to manipulate doors, combat enemies, and perform pathing. More complex prompting strategies involving multiple model calls provide better results. While further work is required to enable the LLM to play the game as well as its classical, reinforcement learning-based counterparts, we note that GPT-4 required no training, leaning instead on its own reasoning and observational capabilities. We hope our work pushes the boundaries on intelligent, LLM-based agents in video games. We conclude by discussing the ethical implications of our work.


Summary

  • The paper demonstrates GPT-4's ability to play Doom by converting screenshots into textual descriptions of the game state, with no additional training.
  • The paper employs a two-component setup (a Vision model and an Agent model) with varied prompting strategies, revealing strengths in navigation and combat alongside limitations in memory and planning.
  • The paper highlights the potential of LLM agents in gaming and simulations, emphasizing the need for stronger reasoning and long-term strategic planning.

Investigating the Capabilities of GPT-4 in Playing Doom

Introduction

In a novel approach to probing the planning and reasoning capabilities of LLMs, this paper demonstrates that GPT-4 can engage with and play the 1993 first-person shooter Doom. The goal is to understand how well LLMs, specifically GPT-4, can process complex environments and make decisions in a gaming context. Unlike traditional game-playing AI agents, which typically rely on extensive training or task-specific fine-tuning, GPT-4 requires no additional training: it interprets game dynamics and makes strategic decisions from textual descriptions generated from game screenshots.

Methodology

The research leverages a two-component setup consisting of a Vision component, which processes screenshots from Doom and provides textual descriptions of the game state, and an Agent model that decides on the actions to take based on these descriptions. The system is further enhanced with a Planner for generating a fine-grained plan of action and Experts for offering specialized advice, thereby creating a more sophisticated prompting strategy for GPT-4 to navigate the game. The game itself is interfaced through a Python binding of the original Doom engine, allowing seamless integration with the GPT-4 models.
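To make the setup concrete, the following is a minimal sketch of such a Vision-to-Agent loop, assuming a ViZDoom-style Python binding and the OpenAI chat API. The function names, prompt wording, action set, and model choices here are illustrative assumptions, not the authors' actual implementation.

```python
import base64

import imageio.v3 as iio
import vizdoom as vzd
from openai import OpenAI

client = OpenAI()

VISION_PROMPT = "Describe this Doom frame: enemies, doors, items, and obstacles."
AGENT_PROMPT = (
    "You are playing Doom. Given the description of the current frame, "
    "reply with exactly one action from: MOVE_FORWARD, TURN_LEFT, "
    "TURN_RIGHT, ATTACK."
)
# Each action maps to a button vector for the binding (order matches
# set_available_buttons below).
ACTIONS = {
    "MOVE_FORWARD": [1, 0, 0, 0],
    "TURN_LEFT":    [0, 1, 0, 0],
    "TURN_RIGHT":   [0, 0, 1, 0],
    "ATTACK":       [0, 0, 0, 1],
}

def describe_frame(frame) -> str:
    """Vision component: screenshot in, textual description of the game state out."""
    png = iio.imwrite("<bytes>", frame, extension=".png")  # encode RGB array as PNG bytes
    b64 = base64.b64encode(png).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works here
        messages=[{"role": "user", "content": [
            {"type": "text", "text": VISION_PROMPT},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    return resp.choices[0].message.content

def choose_action(description: str, history: list[str]) -> str:
    """Agent component: textual state (plus a short history window) in, one action out."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": AGENT_PROMPT},
            {"role": "user", "content": "Recent frames:\n" + "\n".join(history[-5:])
                                        + "\n\nCurrent frame:\n" + description},
        ],
    )
    return resp.choices[0].message.content.strip()

# Game loop: screenshot -> description -> action, holding each action a few tics.
game = vzd.DoomGame()  # running E1M1 requires the original DOOM.WAD
game.set_doom_map("E1M1")
game.set_screen_format(vzd.ScreenFormat.RGB24)  # frames come back as HxWx3 arrays
game.set_available_buttons([vzd.Button.MOVE_FORWARD, vzd.Button.TURN_LEFT,
                            vzd.Button.TURN_RIGHT, vzd.Button.ATTACK])
game.init()

history: list[str] = []
while not game.is_episode_finished():
    desc = describe_frame(game.get_state().screen_buffer)
    action = choose_action(desc, history)
    history.append(f"{desc[:120]} -> {action}")
    game.make_action(ACTIONS.get(action, ACTIONS["MOVE_FORWARD"]), 8)  # 8 tics per step
```

Keeping only the last few frame descriptions in the Agent prompt mirrors the bounded context window that, as the Results below note, limits the model's longer-term recall.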

Multiple prompting strategies were employed to assess GPT-4's gameplay performance, ranging from a naïve approach with minimal instruction to more complex methods involving walkthroughs and k-level planning. By adjusting these strategies, the paper aims to dissect the planning and reasoning intricacies of LLMs in a dynamic gaming environment.
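As a hedged illustration of how those strategies might differ in code (reusing the `client` from the sketch above; all prompt wording here is hypothetical, not the paper's):

```python
def llm(prompt: str) -> str:
    """One GPT-4 text call, using the OpenAI client from the previous sketch."""
    resp = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def act_naive(state: str) -> str:
    # Naive strategy: a single call with minimal instruction.
    return llm(f"You are playing Doom. State:\n{state}\nReply with one action.")

def act_with_walkthrough(state: str, walkthrough: str) -> str:
    # Walkthrough strategy: the same single call, conditioned on a level guide.
    return llm(f"You are playing Doom. Level walkthrough:\n{walkthrough}\n"
               f"State:\n{state}\nReply with one action.")

def act_plan_then_execute(state: str, experts: list[str]) -> str:
    # Multi-call strategy: a Planner call drafts a short plan, optional Expert
    # calls critique it, and a final Agent call picks the next concrete action.
    plan = llm(f"Given this Doom state, write a numbered 3-step plan:\n{state}")
    advice = [llm(f"As a {role}, give one sentence of advice.\n"
                  f"State:\n{state}\nPlan:\n{plan}") for role in experts]
    return llm("You are playing Doom.\nPlan:\n" + plan +
               "\nAdvice:\n" + "\n".join(advice) +
               f"\nState:\n{state}\nReply with the single next action.")
```

A k-level variant would, in the same spirit, add further calls that anticipate how enemies might respond to each candidate plan before committing to an action; this sketch omits that step for brevity.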

Results

The findings reveal GPT-4's capability to play Doom at a basic level, including navigating environments, engaging enemies, and managing game resources. More intricate prompting strategies, particularly those involving multiple calls to GPT-4 for planning and advice, yielded better gameplay results. However, the LLM exhibited limitations in memory recall and the depth of reasoning, impacting its ability to perform long-term strategic planning.

Discussion

The paper underscores the potential of LLMs to process complex environments and make informed decisions without explicit training on the specific task. The success of GPT-4 in navigating the game environment of Doom suggests a promising avenue for developing intelligent agents capable of tackling a wide range of problem-solving and planning tasks. The work also sheds light on the challenges faced by LLMs in terms of memory retention and reasoning depth, pointing to areas for future improvement.

Implications for AI Development

This research contributes to a deeper understanding of the capabilities of LLMs in unconventional applications beyond text processing. The ability of GPT-4 to interact with and make decisions in a video game environment opens up new possibilities for employing LLMs in game testing, simulation-based learning, and possibly even in developing non-player characters (NPCs) in games. Moreover, the paper highlights the need for further exploration into enhancing the memory and reasoning capabilities of LLMs, which could lead to more sophisticated AI agents capable of complex decision-making and problem-solving.

Ethical Considerations

The paper briefly discusses the ethical implications of employing LLMs in gaming contexts, especially in scenarios that might simulate real-world activities. As the technology advances, it is crucial to develop and apply LLMs responsibly, ensuring they contribute positively to society and do not inadvertently facilitate harmful behaviors.

Conclusion

This investigation into GPT-4's gameplay capabilities in Doom is a stepping stone towards understanding the potential and limitations of LLMs in dynamic and complex environments. While GPT-4 exhibits impressive planning and decision-making abilities, its performance underlines the necessity for advancements in memory and reasoning faculties. Future research in this domain can pave the way for more versatile and intelligent AI agents, expanding the horizons of AI application in gaming and beyond.


HackerNews

  1. Will GPT-4 Run Doom? (5 points, 2 comments)
  2. GPT-4 can run and play DOOM (5 points, 0 comments)

Reddit

  1. [R] [2403.05468] Will GPT-4 Run DOOM? (58 points, 19 comments)