
SmartPlay: A Benchmark for LLMs as Intelligent Agents (2310.01557v5)

Published 2 Oct 2023 in cs.LG and cs.AI

Abstract: Recent LLMs have demonstrated great potential toward intelligent agents and next-gen automation, but a systematic benchmark for evaluating LLMs' abilities as agents is still lacking. We introduce SmartPlay: both a challenging benchmark and a methodology for evaluating LLMs as agents. SmartPlay consists of 6 different games, including Rock-Paper-Scissors, Tower of Hanoi, and Minecraft. Each game features a unique setting, providing up to 20 evaluation settings and infinite environment variations. Each game in SmartPlay uniquely challenges a subset of 9 important capabilities of an intelligent LLM agent, including reasoning with object dependencies, planning ahead, spatial reasoning, learning from history, and understanding randomness. The distinction between the sets of capabilities each game tests allows us to analyze each capability separately. SmartPlay serves not only as a rigorous testing ground for evaluating the overall performance of LLM agents but also as a road-map for identifying gaps in current methodologies. We release our benchmark at github.com/Microsoft/SmartPlay

SmartPlay: A Comprehensive Benchmark for Assessing LLM Capabilities as Intelligent Agents

The paper "SmartPlay: A Benchmark for LLMs as Intelligent Agents" presents a systematic effort to evaluate the capabilities of LLMs as intelligent agents. Despite recent advances in LLMs, a standardized benchmark for assessing their interaction with dynamic environments and their decision-making in agent-based settings has been lacking. This work addresses that gap by introducing SmartPlay, a suite of tests designed to evaluate LLMs across a diverse array of capabilities using game-based scenarios.

Summary and Contributions

SmartPlay is a meticulously structured benchmark involving six games — Two-Armed Bandits, Rock Paper Scissors, Tower of Hanoi, Messenger, Crafter, and Minecraft. Each game is selected to challenge specific aspects of LLM capabilities, including reasoning, planning, spatial reasoning, learning from history, and understanding of randomness. The games represent varied complexities, from simple probabilistic reasoning tasks in Two-Armed Bandits to complex 3D spatial reasoning challenges in Minecraft.
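To make the interactive setup concrete, here is a minimal, hypothetical sketch of the kind of gym-style text interaction loop such a benchmark implies: the environment emits text observations, and a policy (here a simple greedy stand-in for an LLM) returns actions. The `TwoArmedBandit` class, the `greedy_agent` policy, and all names are illustrative assumptions, not the actual SmartPlay API.

```python
import random

class TwoArmedBandit:
    """Minimal two-armed bandit environment with text observations."""
    def __init__(self, probs=(0.3, 0.7), horizon=20, seed=0):
        self.probs = probs          # payout probability of each arm
        self.horizon = horizon      # number of pulls per episode
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return "Two slot machines. Pull arm 0 or arm 1."

    def step(self, action):
        # Bernoulli reward drawn from the chosen arm's payout probability.
        reward = 1 if self.rng.random() < self.probs[action] else 0
        self.t += 1
        done = self.t >= self.horizon
        obs = f"Pulled arm {action}, got reward {reward}."
        return obs, reward, done

def greedy_agent(history):
    """Stand-in for an LLM policy: pick the arm with the best observed mean.

    Untried arms get an optimistic value of 1.0 so both arms are explored.
    """
    means = []
    for arm in (0, 1):
        rewards = [r for a, r in history if a == arm]
        means.append(sum(rewards) / len(rewards) if rewards else 1.0)
    return max((0, 1), key=lambda a: means[a])

env = TwoArmedBandit()
obs = env.reset()
history, total, done = [], 0, False
while not done:
    action = greedy_agent(history)
    obs, reward, done = env.step(action)
    history.append((action, reward))
    total += reward
```

An LLM-based agent would replace `greedy_agent` with a prompt built from the observation and interaction history, which is exactly what makes "learning from history" a measurable capability.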

A major contribution of the paper is its structured capability analysis. It delineates nine key abilities crucial for intelligent agents and rates the degree of challenge each game poses to each ability. For example, Rock Paper Scissors emphasizes understanding the odds, while Messenger stresses spatial reasoning and comprehension of syntax variation. This granularity allows for a detailed assessment of LLMs' strengths and limitations.
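The per-game capability ratings can be viewed as a weighted matrix, which makes per-capability scores a simple weighted average over games. The sketch below shows one way such an aggregation could work; the weights, game scores, and capability names are invented for illustration and do not reproduce the paper's actual numbers.

```python
# Hypothetical per-game capability weights (1-3 = degree of challenge).
capability_weights = {
    "Rock Paper Scissors": {"understanding_odds": 3, "learning_from_history": 2},
    "Messenger": {"spatial_reasoning": 3, "text_understanding": 3},
    "Tower of Hanoi": {"planning": 3, "object_dependencies": 2},
}
# Hypothetical normalized performance of one model on each game.
game_scores = {"Rock Paper Scissors": 0.8, "Messenger": 0.4, "Tower of Hanoi": 0.5}

def per_capability_scores(weights, scores):
    """Weighted average of game scores, grouped by capability."""
    totals, norms = {}, {}
    for game, caps in weights.items():
        for cap, w in caps.items():
            totals[cap] = totals.get(cap, 0.0) + w * scores[game]
            norms[cap] = norms.get(cap, 0.0) + w
    return {cap: totals[cap] / norms[cap] for cap in totals}

capability_scores = per_capability_scores(capability_weights, game_scores)
```

Because each capability is probed by a distinct subset of games, weak spots (say, low `planning`) can be traced back to specific environments rather than a single aggregate number.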

The paper provides a comprehensive evaluation of various LLMs, including GPT-4 variants, text-davinci-003, Claude, Bard, and open-source models like LLaMA. The results underscore significant performance disparities between models, particularly highlighting the superior performance of GPT-4 variants. However, even state-of-the-art LLMs show substantial gaps in planning and spatial reasoning capabilities when compared to human baselines.

Implications and Future Directions

The introduction of SmartPlay has profound implications for future AI research. It provides a standardized approach to evaluate and improve the agentive capabilities of LLMs, which could accelerate their deployment in real-world applications requiring interactive decision-making. The benchmark identifies current gaps in LLMs, such as challenges in learning from interactions and executing long-horizon planning, thus directing future research towards these areas.

SmartPlay also contributes to robustness in evaluation by using games with procedurally generated environments, minimizing issues of data contamination found in static datasets. This supports fair assessments of LLM generalization capabilities, especially in complex environments like Minecraft.
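The contamination argument rests on seeding: each seed deterministically yields a distinct episode, so a model cannot have memorized the test instances. A minimal sketch of that idea, with an invented episode format (the `make_episode` function and its fields are assumptions, not SmartPlay code):

```python
import random

def make_episode(seed):
    """Deterministically derive one environment variation from a seed."""
    rng = random.Random(seed)
    # e.g. randomize the bandit payout probabilities per episode
    p0 = round(rng.uniform(0.1, 0.9), 2)
    return {"seed": seed, "arm_probs": (p0, round(1 - p0, 2))}

# Fresh seeds produce fresh instances; reusing a seed reproduces an episode.
episodes = [make_episode(s) for s in range(5)]
```

The same mechanism gives reproducibility: publishing the seed list lets others rerun the exact evaluation while still drawing on an effectively unbounded instance pool.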

In terms of future development, SmartPlay offers a flexible framework for incorporating additional games, allowing it to evolve alongside advances in AI. The benchmark can be extended to cover newer models and to probe further capabilities, such as error correction and contextual adaptability, that are vital for next-generation automation.

Conclusion

This paper establishes SmartPlay as a rigorous, multifaceted benchmark for evaluating LLMs as intelligent agents. By leveraging the interactive nature of games, it provides a thorough investigation into crucial areas of LLM functionalities, notably planning, spatial reasoning, and interaction-based learning. The findings not only reveal current model limitations but also chart a path for future innovations in the field of autonomous intelligent agents, enhancing the applicability of LLMs across diverse sectors of AI-driven automation.

Authors (4)
  1. Yue Wu
  2. Xuan Tang
  3. Tom M. Mitchell
  4. Yuanzhi Li
Citations (46)