
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments (2403.11807v4)

Published 18 Mar 2024 in cs.AI and cs.CL

Abstract: Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating LLMs. Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available via https://github.com/CUHK-ARISE/GAMABench.

Evaluating LLMs' Decision-Making in Multi-Agent Environments

Overview

LLMs have demonstrated remarkable capabilities across various tasks. However, evaluating these models' decision-making abilities, especially in complex scenarios involving multiple agents, remains a challenging frontier. This paper introduces $\gamma$-Bench, a comprehensive framework designed to assess LLMs in the context of Game Theory. It comprises eight classical multi-agent games, grouped by category to analyze models' performance across strategic settings involving cooperation, competition, and mixed motives. The framework not only evaluates the decision-making prowess of LLMs but also provides insights into their robustness, generalizability, and potential improvement strategies. Notably, it reveals that while models like GPT-3.5 exhibit robust decision-making capabilities, their generalizability across different games remains constrained. The paper also highlights the steady improvement in decision-making across successive LLM versions, with GPT-4 outperforming its predecessors.

Methodological Approach

The research meticulously crafts a scoring scheme tailored for quantitatively measuring LLMs' performance in the game-theoretic context. Key elements of the methodology include:

  • Framework Design: $\gamma$-Bench incorporates eight strategically selected games, allowing for a nuanced analysis of LLMs' decision-making in scenarios involving cooperation, competition, and a blend of both.
  • Scoring Scheme: A novel scoring system is proposed to quantitatively assess LLMs' performance, focusing on their strategic soundness and effectiveness in various gaming contexts (see the sketch after this list).
  • Robustness and Generalizability: The framework evaluates models' robustness in game strategy execution and their generalizability across different gaming setups.
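
To make the multi-agent setup concrete, here is a minimal sketch of one round of "Guess 2/3 of the Average", one of the eight games in $\gamma$-Bench. The agent callables, helper names, and the normalized distance-based score below are illustrative assumptions for exposition, not the paper's actual scoring formula or implementation.

```python
import random
from statistics import mean

def play_guess_two_thirds(agents, low=0.0, high=100.0):
    """One round of 'Guess 2/3 of the Average' with a simple normalized score.

    `agents` maps a player name to a callable returning a guess in [low, high].
    The target is 2/3 of the group average; closer guesses score higher.
    """
    guesses = {name: agent(low, high) for name, agent in agents.items()}
    target = (2 / 3) * mean(guesses.values())
    # Illustrative scoring only (an assumption, not the paper's formula):
    # distance to the target, normalized to [0, 100] and clipped at zero so
    # scores remain comparable when the (low, high) game parameters change.
    span = high - low
    scores = {
        name: max(0.0, 100 * (1 - abs(guess - target) / span))
        for name, guess in guesses.items()
    }
    return guesses, target, scores

# Stand-in random agents; in gamma-Bench each decision would instead come from
# prompting an LLM with the game rules and the history of previous rounds.
agents = {f"player_{i}": (lambda low, high: random.uniform(low, high)) for i in range(10)}
guesses, target, scores = play_guess_two_thirds(agents)
print(f"target = {target:.2f}, best player = {max(scores, key=scores.get)}")
```

Repeating such rounds under different `(low, high)` settings is the kind of parameter variation the dynamic scoring scheme is intended to accommodate.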

Experimental Findings

The paper presents a thorough comparative analysis of twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2, through $\gamma$-Bench. Key experimental findings include:

  • Performance Rankings: Across the twelve evaluated models, Gemini-1.5-Pro ranks first with a score of $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$); within the GPT family, GPT-4 clearly outperforms its predecessors, showcasing notable advancements in LLMs' decision-making abilities.
  • Robustness vs. Generalizability: While models like GPT-3.5 demonstrate substantial robustness in their strategic implementations, they exhibit limited generalizability across diverse game setups.
  • Version-wise Improvement: Sequential versions of GPT-3.5 show a progressive enhancement in intelligence and decision-making capability, illustrating the rapid evolution of LLMs.

Theoretical and Practical Implications

The paper makes significant contributions both theoretically and practically. Theoretically, it extends the evaluation of LLMs into the field of Game Theory, offering a new perspective on assessing artificial intelligence. Practically, the findings shed light on the strengths and limitations of current LLMs in complex decision-making scenarios, indicating areas for further enhancement. Moreover, the improvement strategies identified, such as Chain-of-Thought prompting, suggest actionable paths to improve LLMs' performance; a minimal prompt sketch follows.
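
As an illustration of how such a strategy might be layered onto a game interaction, the sketch below appends a zero-shot chain-of-thought cue to a per-round game prompt. The template wording and the `build_game_prompt` helper are assumptions for exposition; the paper's actual prompt templates live in the GAMABench repository.

```python
def build_game_prompt(game_rules: str, history: str, use_cot: bool = True) -> str:
    """Assemble a hypothetical per-round prompt for an LLM player."""
    prompt = (
        "You are one of 10 players in the following game.\n"
        f"Rules: {game_rules}\n"
        f"Results of previous rounds: {history}\n"
        "Choose your action for this round and reply with a single number."
    )
    if use_cot:
        # Zero-shot chain-of-thought cue: ask the model to reason explicitly
        # before committing to its final action.
        prompt += "\nLet's think step by step, then state your final choice."
    return prompt
```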

Future Directions

Looking ahead, the paper posits several avenues for future research:

  • Expanding the Framework: Incorporating more diverse and complex games could further deepen the understanding of LLMs' decision-making capabilities.
  • Cross-model Evaluations: Comparative studies across a broader range of models could unearth more insights into the generalizable aspects of LLM intelligence.
  • Enhancement Strategies: Exploring additional strategies for improving LLMs' generalizability and robustness in strategic decision-making remains a promising research domain.

In summary, this examination of LLMs' decision-making in multi-agent environments, through the lens of Game Theory, unveils critical insights into the capabilities and limitations of current models. It not only benchmarks their performance but also paves the way for future enhancements, promising a trajectory of rapid advancement in LLM intelligence and its applicability in complex decision-making scenarios.

Authors (10)
  1. Jen-tse Huang
  2. Eric John Li
  3. Man Ho Lam
  4. Tian Liang
  5. Wenxuan Wang
  6. Youliang Yuan
  7. Wenxiang Jiao
  8. Xing Wang
  9. Zhaopeng Tu
  10. Michael R. Lyu