
How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments (2403.11807v4)

Published 18 Mar 2024 in cs.AI and cs.CL

Abstract: Decision-making is a complex process requiring diverse abilities, making it an excellent framework for evaluating LLMs. Researchers have examined LLMs' decision-making through the lens of Game Theory. However, existing evaluations mainly focus on two-player scenarios where an LLM competes against another. Additionally, previous benchmarks suffer from test set leakage due to their static design. We introduce GAMA($\gamma$)-Bench, a new framework for evaluating LLMs' Gaming Ability in Multi-Agent environments. It includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to quantitatively assess LLMs' performance. $\gamma$-Bench allows flexible game settings and adapts the scoring system to different game parameters, enabling comprehensive evaluation of robustness, generalizability, and strategies for improvement. Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought. We also evaluate twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2. Gemini-1.5-Pro outperforms the others, scoring $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$). All code and experimental results are publicly available via https://github.com/CUHK-ARISE/GAMABench.

Evaluating LLMs' Decision-Making in Multi-Agent Environments

Overview

LLMs have demonstrated remarkable capabilities across various tasks. However, evaluating these models' decision-making abilities, especially in complex scenarios involving multiple agents, remains a challenging frontier. This paper introduces $\gamma$-Bench, a comprehensive framework designed to assess LLMs in the context of Game Theory. It comprises eight classical multi-agent games, grouped by category to analyze models' performance across strategic settings involving cooperation, competition, and mixed motives. The framework not only evaluates the decision-making prowess of LLMs but also provides insights into their robustness, generalizability, and potential improvement strategies. Notably, it reveals that while models like GPT-3.5 exhibit robust decision-making capabilities, their generalizability across different games remains constrained. The paper also highlights the steady improvement in decision-making across successive LLM versions, with GPT-4 outperforming its predecessors.

Methodological Approach

The research meticulously crafts a scoring scheme tailored for quantitatively measuring LLMs' performance in the game-theoretic context. Key elements of the methodology include:

  • Framework Design: $\gamma$-Bench incorporates eight strategically selected games, allowing for a nuanced analysis of LLMs' decision-making in scenarios involving cooperation, competition, and a blend of both.
  • Scoring Scheme: A novel scoring system is proposed to quantitatively assess LLMs' performance, focusing on their strategic soundness and effectiveness in various gaming contexts (see the sketch after this list).
  • Robustness and Generalizability: The framework evaluates models' robustness in game strategy execution and their generalizability across different gaming setups.
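
To make the multi-agent setup concrete, here is a minimal sketch of one round of "Guess 2/3 of the Average", one of the eight games in $\gamma$-Bench. The agent callables, helper names, and the normalized distance-based score below are illustrative assumptions for exposition, not the paper's actual scoring formula or implementation.

```python
import random
from statistics import mean

def play_guess_two_thirds(agents, low=0.0, high=100.0):
    """One round of 'Guess 2/3 of the Average' with a simple normalized score.

    `agents` maps a player name to a callable returning a guess in [low, high].
    The target is 2/3 of the group average; closer guesses score higher.
    """
    guesses = {name: agent(low, high) for name, agent in agents.items()}
    target = (2 / 3) * mean(guesses.values())
    # Illustrative scoring only (an assumption, not the paper's formula):
    # distance to the target, normalized to [0, 100] and clipped at zero so
    # scores remain comparable when the (low, high) game parameters change.
    span = high - low
    scores = {
        name: max(0.0, 100 * (1 - abs(guess - target) / span))
        for name, guess in guesses.items()
    }
    return guesses, target, scores

# Stand-in random agents; in gamma-Bench each decision would instead come from
# prompting an LLM with the game rules and the history of previous rounds.
agents = {f"player_{i}": (lambda low, high: random.uniform(low, high)) for i in range(10)}
guesses, target, scores = play_guess_two_thirds(agents)
print(f"target = {target:.2f}, best player = {max(scores, key=scores.get)}")
```

Repeating such rounds under different `(low, high)` settings is the kind of parameter variation the dynamic scoring scheme is intended to accommodate.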

Experimental Findings

The paper presents a thorough comparative analysis of twelve LLMs from six model families, including GPT-3.5, GPT-4, Gemini, LLaMA-3.1, Mixtral, and Qwen-2, through $\gamma$-Bench. Key experimental findings include:

  • Performance Rankings: Across the twelve evaluated models, Gemini-1.5-Pro ranks first with a score of $68.1$ out of $100$, followed by LLaMA-3.1-70B ($64.5$) and Mixtral-8x22B ($61.4$); within the GPT family, GPT-4 clearly outperforms its predecessors, showcasing notable advancements in LLMs' decision-making abilities.
  • Robustness vs. Generalizability: While models like GPT-3.5 demonstrate substantial robustness in their strategic implementations, they exhibit limited generalizability across diverse game setups.
  • Version-wise Improvement: Sequential versions of GPT-3.5 show a progressive enhancement in intelligence and decision-making capability, illustrating the rapid evolution of LLMs.

Theoretical and Practical Implications

The paper makes significant contributions both theoretically and practically. Theoretically, it extends the evaluation of LLMs into the field of Game Theory, offering a new perspective on assessing artificial intelligence. Practically, the findings shed light on the strengths and limitations of current LLMs in complex decision-making scenarios, indicating areas for further enhancement. Moreover, the improvement strategies identified, such as Chain-of-Thought prompting, suggest actionable paths to improve LLMs' performance; a minimal prompt sketch follows.
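
As an illustration of how such a strategy might be layered onto a game interaction, the sketch below appends a zero-shot chain-of-thought cue to a per-round game prompt. The template wording and the `build_game_prompt` helper are assumptions for exposition; the paper's actual prompt templates live in the GAMABench repository.

```python
def build_game_prompt(game_rules: str, history: str, use_cot: bool = True) -> str:
    """Assemble a hypothetical per-round prompt for an LLM player."""
    prompt = (
        "You are one of 10 players in the following game.\n"
        f"Rules: {game_rules}\n"
        f"Results of previous rounds: {history}\n"
        "Choose your action for this round and reply with a single number."
    )
    if use_cot:
        # Zero-shot chain-of-thought cue: ask the model to reason explicitly
        # before committing to its final action.
        prompt += "\nLet's think step by step, then state your final choice."
    return prompt
```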

Future Directions

Looking ahead, the paper posits several avenues for future research:

  • Expanding the Framework: Incorporating more diverse and complex games could further deepen the understanding of LLMs' decision-making capabilities.
  • Cross-model Evaluations: Comparative studies across a broader range of models could unearth more insights into the generalizable aspects of LLM intelligence.
  • Enhancement Strategies: Exploring additional strategies for improving LLMs' generalizability and robustness in strategic decision-making remains a promising research domain.

In summary, this examination of LLMs' decision-making in multi-agent environments, through the lens of Game Theory, unveils critical insights into the capabilities and limitations of current models. It not only benchmarks their performance but also paves the way for future enhancements, promising a trajectory of rapid advancement in LLM intelligence and its applicability in complex decision-making scenarios.

Authors (10)
  1. Jen-tse Huang
  2. Eric John Li
  3. Man Ho Lam
  4. Tian Liang
  5. Wenxuan Wang
  6. Youliang Yuan
  7. Wenxiang Jiao
  8. Xing Wang
  9. Zhaopeng Tu
  10. Michael R. Lyu