
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments (2410.05254v1)

Published 7 Oct 2024 in cs.CL, cs.AI, cs.CY, cs.GT, and cs.LG

Abstract: LLMs show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? Can they mimic human behavior? Do they tend to reach an efficient and fair outcome? What is the role of natural language in the strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. While the ML community has been exploring the potential of LLMs in such multi-agent setups, varying assumptions, design choices and evaluation criteria across studies make it difficult to draw robust and meaningful conclusions. To address this, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents' performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents to human players in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents.

Overview of "GLEE: A Unified Framework and Benchmark for Language-based Economic Environments"

The paper introduces GLEE, a comprehensive framework designed to evaluate LLMs in language-based economic settings. This research addresses the increasing intersection of LLMs, economics, and multi-agent systems, presenting a standardized benchmark for assessing LLM behavior in economic interactions.

Main Contributions

The authors propose a structured framework for modeling interactions in three economic game families: bargaining, negotiation, and persuasion. Each game type is formally defined, with strategic behavior shaped by parameters such as game horizon, information structure, and communication form. The framework supports both LLM-LLM and human-LLM interactions, generating a significant dataset for research.
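The core interaction pattern described above — two players exchanging messages in sequence until agreement or a horizon is reached — can be sketched as a minimal game loop. The `Agent` class and its `act` method below are illustrative assumptions for exposition, not the GLEE API; a real agent would query an LLM with the game history as its prompt context.

```python
class Agent:
    """Toy agent: accepts the first offer it sees, otherwise proposes a split."""
    def __init__(self, name):
        self.name = name

    def act(self, history):
        # A real agent would condition an LLM on the full message history here.
        if history and history[-1].startswith("offer"):
            return "accept"
        return "offer 50/50"

def play_game(agent_a, agent_b, max_rounds=10):
    """Alternate turns until one agent accepts or the game horizon is reached."""
    history = []
    players = [agent_a, agent_b]
    for round_idx in range(max_rounds):
        mover = players[round_idx % 2]
        message = mover.act(history)
        history.append(message)
        if message == "accept":
            break
    return history

history = play_game(Agent("Alice"), Agent("Bob"))
print(history)  # ['offer 50/50', 'accept']
```

The loop structure (alternating movers, a shared message history, a hard horizon) is what makes the games "two-player, sequential, and language-based" in the paper's sense.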

Key elements of the framework include:

  1. Game Families and Parametrization: The bargaining, negotiation, and persuasion games are grounded in the economic literature and parameterized by quantities such as discount factors, subjective valuations, and communication form.
  2. Dataset and Methodology: Extensive data was collected from 954K LLM-LLM games across diverse configurations using models such as Qwen-2, Gemini, and Llama. Human interaction data was also gathered through a purpose-built interface.
  3. Evaluation Metrics: The paper introduces metrics such as self-gain, efficiency, and fairness to capture both individual agent performance and overall game outcomes. These metrics provide insights into the economic rationality and strategic effectiveness of LLMs.
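The three metrics above can be given a concrete, hedged sketch. The formulas below are common operationalizations assumed for illustration; the paper's exact definitions may differ (e.g., in how payoffs are normalized).

```python
def self_gain(payoff, min_payoff, max_payoff):
    """Normalize one agent's payoff to [0, 1] given its attainable range."""
    return (payoff - min_payoff) / (max_payoff - min_payoff)

def efficiency(payoffs, max_total):
    """Fraction of the maximum joint surplus that the agents realized."""
    return sum(payoffs) / max_total

def fairness(payoffs):
    """Ratio of the smaller to the larger payoff (1.0 = perfectly equal)."""
    return min(payoffs) / max(payoffs)

payoffs = (60.0, 40.0)
print(self_gain(60.0, 0.0, 100.0))            # 0.6
print(efficiency(payoffs, max_total=100.0))   # 1.0
print(round(fairness(payoffs), 4))            # 0.6667
```

The split between individual (self-gain) and collective (efficiency, fairness) measures is what lets the benchmark separate "playing well" from "producing good outcomes."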

Numerical Results and Analysis

Significant results highlighted include:

  • Performance Across Metrics: The authors found that LLMs generally outperform humans in negotiation scenarios, while humans excel in certain bargaining contexts. Efficiency and fairness varied depending on game configuration and agent roles.
  • Impact of Game Parameters: The analysis shows how changes in game parameters, such as textual versus structured messages, affect outcomes. For instance, textual communication improves efficiency in negotiation settings.
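The parameters whose effects are analyzed above can be sketched as a configuration object. The field names and the geometric-discounting helper are illustrative assumptions, not GLEE's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameConfig:
    family: str             # "bargaining", "negotiation", or "persuasion"
    horizon: int            # maximum number of rounds
    discount_factor: float  # per-round decay of the surplus being divided
    message_form: str       # "text" (free-form) or "structured" (offers only)

def discounted_value(config, round_idx, pie=100.0):
    """Value of the surplus at a given round under geometric discounting."""
    return pie * config.discount_factor ** round_idx

cfg = GameConfig("bargaining", horizon=10, discount_factor=0.9, message_form="text")
print(round(discounted_value(cfg, 2), 6))  # 81.0
```

Sweeping a grid over such configurations is one way a benchmark can attribute outcome differences (e.g., efficiency gains from textual messages) to individual parameters.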

Implications and Future Directions

This work has implications for both theoretical and practical developments in AI:

  • Economic Interaction Understanding: GLEE offers a window into how LLMs can simulate economic behavior, aiding the development of agents capable of rational decision-making in complex environments.
  • Benchmarking and Standardization: This framework provides a standardized benchmark, facilitating comparative analysis across studies and enabling generalization of LLM capabilities and limitations.
  • Data-Driven Insights: The extensive dataset allows for rich analysis, potentially improving economic models and LLM training methods.

Future research is encouraged to expand GLEE's applicability, possibly incorporating more complex multi-agent dynamics or adapting to other economic paradigms.

In conclusion, this paper contributes a robust and flexible tool for understanding LLM performance in economic settings, offering a solid foundation for advancing AI in real-world applications.

Authors (6)
  1. Eilam Shapira
  2. Omer Madmon
  3. Itamar Reinman
  4. Samuel Joseph Amouyal
  5. Roi Reichart
  6. Moshe Tennenholtz