Economics Arena for Large Language Models (2401.01735v1)
Abstract: LLMs have been widely used as the backbones of general-purpose agents, and the economics literature suggests that LLMs are capable of playing various types of economic games. Building on these works, and to overcome the limitations of evaluating LLMs with static benchmarks, we propose competitive games as an evaluation for LLMs that incorporates multiple players and a dynamic environment. By varying the game history revealed to the LLM-based players, we find that most LLMs are rational in that they play strategies that increase their payoffs, but not as rational as Nash Equilibria (NEs) predict. Moreover, when game history is available, certain LLMs, such as GPT-4, converge faster to the NE strategies, which suggests a higher level of rationality than other models. Certain LLMs also win more often when game history is available, and we argue that this winning rate reflects their ability to reason about the strategies of other players. Across all our experiments, we further observe that the ability to strictly follow game rules described in natural language varies among the LLMs we tested. In this work, we provide an economics arena for the LLM research community as a dynamic simulation for testing these abilities of LLMs: rationality, strategic reasoning, and instruction following.
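As a concrete illustration of the protocol the abstract describes, below is a minimal sketch of one such competitive game: a repeated "guess 2/3 of the average" game (the classic guessing game of Nagel, 1995), whose unique NE is for every player to guess 0. The `query_llm` stub, the prompt wording, and the number parsing are assumptions made here for illustration only, not the paper's actual implementation.

```python
import random
import re
import statistics

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call.

    Returns a random guess so the sketch runs end to end; replace the body
    with a call to your model client of choice.
    """
    return str(random.uniform(0, 100))

def parse_number(reply: str, lo: float = 0.0, hi: float = 100.0):
    """Extract the first number in [lo, hi] from the reply.

    Returns None when the model broke the rules (no number, or out of
    range), which the arena would count as an instruction-following failure.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if lo <= value <= hi else None

def play_guessing_game(models, rounds: int, reveal_history: bool):
    """Repeated 'guess 2/3 of the average' game.

    The unique NE is every player guessing 0, so the mean guess per round
    is a simple proxy for convergence toward the NE strategy.
    """
    history = []  # one dict of {model: guess} per round
    for round_idx in range(rounds):
        guesses = {}
        for model in models:
            prompt = (
                "You are playing a game against the other players. Each "
                "player picks a number in [0, 100]; whoever is closest to "
                "2/3 of the average of all numbers wins the round. Reply "
                "with one number."
            )
            if reveal_history and history:
                prompt += f" Guesses in past rounds: {history}"
            guess = parse_number(query_llm(model, prompt))
            # Treat a rule violation as a default midpoint guess here; the
            # real arena could instead log it and exclude the player.
            guesses[model] = guess if guess is not None else 50.0
        history.append(guesses)
        print(f"round {round_idx}: mean guess = "
              f"{statistics.mean(guesses.values()):.2f}")
    return history
```

Comparing runs with `reveal_history=True` against `reveal_history=False` mirrors the paper's manipulation: convergence of guesses toward 0 probes rationality, winning rounds against other models probes strategic reasoning, and the rate of unparseable replies probes instruction following.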
Authors: Shangmin Guo, Haoran Bu, Haochuan Wang, Yi Ren, Dianbo Sui, Yuming Shang, Siting Lu