GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents (2406.06613v2)

Published 7 Jun 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of LLMs being used in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of LLM agents. We focus on 9 different game environments, each covering at least one axis of key reasoning skill identified in strategy games, and select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst GPT-4 performs worse than random action. CoT and RAP both improve scores, but not to levels comparable with human performance.
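
As a rough illustration of the CoT scaffolding the abstract refers to, the sketch below wraps a step-by-step-prompted LLM agent around a turn-based game environment. The `Environment` interface, prompt wording, and answer parsing are hypothetical stand-ins for illustration only, not the GameBench API.

```python
# Minimal sketch of a CoT-scaffolded LLM agent playing a turn-based game.
# The Environment interface and prompt format below are hypothetical,
# not the actual GameBench API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Environment:
    """Hypothetical turn-based game: exposes an observation and legal moves."""
    observe: Callable[[], str]               # natural-language state description
    legal_actions: Callable[[], List[str]]   # currently available moves
    step: Callable[[str], bool]              # apply an action; returns True when the game is over


def cot_agent(env: Environment, llm: Callable[[str], str]) -> None:
    """Play one game, asking the model to reason step by step before acting."""
    done = False
    while not done:
        actions = env.legal_actions()
        prompt = (
            f"Game state:\n{env.observe()}\n\n"
            f"Legal actions: {actions}\n\n"
            "Think step by step about the best strategy, then end your reply "
            "with a single line of the form 'ACTION: <one legal action>'."
        )
        reply = llm(prompt)
        # Fall back to the first legal action if the reply cannot be parsed.
        chosen = actions[0]
        for line in reply.splitlines():
            if line.startswith("ACTION:"):
                candidate = line.removeprefix("ACTION:").strip()
                if candidate in actions:
                    chosen = candidate
        done = env.step(chosen)
```

A RAP-style agent would replace this single prompt-then-act loop with an explicit search over simulated future states, using the model both to propose actions and to evaluate the resulting positions.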

Authors (9)
  1. Anthony Costarelli (2 papers)
  2. Mat Allen (3 papers)
  3. Roman Hauksson (2 papers)
  4. Grace Sodunke (2 papers)
  5. Suhas Hariharan (4 papers)
  6. Carlson Cheng (1 paper)
  7. Wenjie Li (183 papers)
  8. Arjun Yadav (1 paper)
  9. Joshua Clymer (10 papers)
Citations (4)