Measuring General Intelligence with Generated Games (2505.07215v1)
Abstract: We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in LLMs. Unlike most static benchmarks, gg-bench is a data-generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using an LLM to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate LLMs by their winrate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.
- Vivek Verma
- David Huang
- William Chen
- Dan Klein
- Nicholas Tomlin
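
The evaluation protocol described in the abstract (prompt the LLM with the game description, current board state, and list of valid moves, then play the chosen move against a self-play-trained RL agent) can be summarized with a minimal sketch. This is not the authors' released evaluation code: the generated-environment interface (`reset`, `step`, `render`, `valid_moves`), the `query_llm` wrapper, and the `rl_agent.act` policy interface are all hypothetical stand-ins introduced for illustration, assuming a Gymnasium-style API.

```python
# Minimal sketch of a gg-bench-style evaluation loop (hypothetical interfaces,
# not the released code). Assumes a Gymnasium-style 5-tuple step API.

def play_episode(env, game_description, rl_agent, query_llm, llm_plays_first=True):
    """Pit an LLM against a self-play-trained RL agent on one generated game.

    Returns True if the LLM wins the episode.
    """
    state, _ = env.reset()
    llm_turn = llm_plays_first
    while True:
        valid_moves = env.valid_moves()  # assumed helper on the generated environment
        if llm_turn:
            prompt = (
                f"{game_description}\n\n"
                f"Current board state:\n{env.render()}\n\n"
                f"Valid moves: {valid_moves}\n"
                "Reply with the move you wish to take."
            )
            move = query_llm(prompt, valid_moves)      # hypothetical LLM call + parsing
        else:
            move = rl_agent.act(state, valid_moves)    # assumed RL policy interface
        state, reward, terminated, truncated, _ = env.step(move)
        if terminated or truncated:
            # Assume a positive reward means the player who just moved won.
            return llm_turn if reward > 0 else not llm_turn
        llm_turn = not llm_turn


def winrate(make_env, game_description, rl_agent, query_llm, n_games=100):
    """Average LLM winrate over n_games, alternating which side moves first."""
    wins = sum(
        play_episode(make_env(), game_description, rl_agent, query_llm,
                     llm_plays_first=bool(i % 2))
        for i in range(n_games)
    )
    return wins / n_games
```

Averaging such per-game winrates over the collection of generated games would yield benchmark-level numbers like the 7-9% and 31-36% figures reported in the abstract.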