Measuring General Intelligence with Generated Games (2505.07215v1)

Published 12 May 2025 in cs.AI

Abstract: We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in LLMs. Unlike most static benchmarks, gg-bench is a data generating process where new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a LLM to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate LLMs by their winrate against these RL agents by prompting models with the game description, current board state, and a list of valid moves, after which models output the moves they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve winrates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini and DeepSeek-R1 achieve average winrates of 31-36%. We release the generated games, data generation process, and evaluation code in order to support future modeling work and expansion of our benchmark.

Authors (5)
  1. Vivek Verma (4 papers)
  2. David Huang (14 papers)
  3. William Chen (49 papers)
  4. Dan Klein (99 papers)
  5. Nicholas Tomlin (10 papers)

Summary

Measuring General Intelligence with Generated Games: An Overview

The paper presents gg-bench, a benchmarking framework for evaluating general intelligence in LLMs through generated game environments. The framework departs from static benchmarks by acting as a data generating process that can produce new evaluation instances at will, using LLMs not only as the systems being evaluated but also as generators of games that, once paired with self-play-trained RL opponents, remain difficult for the models themselves. The benchmark is designed to assess the adaptability and reasoning capabilities of LLMs in settings they have not encountered before.

Approach and Methodology

gg-bench is created via three primary steps (a code sketch of a generated environment follows the list):

  1. LLMs generate natural language descriptions of unique two-player games.
  2. The games are implemented as Gym environments using LLM-generated code.
  3. Reinforcement learning (RL) agents are trained to play these games through self-play.
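
To make step 2 concrete, the following is a minimal sketch of the shape a generated environment might take, assuming the standard Gymnasium API. The class name, observation and action spaces, placeholder win condition, and the valid_moves helper are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of an LLM-generated two-player game environment (step 2 above).
# Assumes the standard Gymnasium API; names, spaces, and rules are illustrative.
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class GeneratedGameEnv(gym.Env):
    """Skeleton of a turn-based, two-player game implemented from a
    natural-language description; the concrete rules would be written
    by the same LLM that produced the description."""

    def __init__(self, num_cells: int = 9):
        super().__init__()
        self.num_cells = num_cells
        self.action_space = spaces.Discrete(num_cells)
        self.observation_space = spaces.Box(-1, 1, shape=(num_cells,), dtype=np.int8)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.board = np.zeros(self.num_cells, dtype=np.int8)
        self.current_player = 1  # players are +1 and -1
        return self.board.copy(), {}

    def valid_moves(self):
        # Legal moves in the current state; the evaluation prompt lists these.
        return [i for i in range(self.num_cells) if self.board[i] == 0]

    def step(self, action):
        self.board[action] = self.current_player
        won = self._is_win(self.current_player)      # game-specific rule
        drawn = not won and not self.valid_moves()
        reward = 1.0 if won else 0.0                 # reward from the mover's perspective
        self.current_player *= -1
        return self.board.copy(), reward, won or drawn, False, {}

    def _is_win(self, player) -> bool:
        # Placeholder win condition; the real check is generated per game.
        return bool(np.sum(self.board == player) >= 5)
```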

To evaluate an LLM, the paper employs a straightforward measure: its winrate against the trained RL agents when the model is prompted with the game description, the current board state, and the list of valid moves, and asked to output the move it wishes to take.
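
As a sketch of this evaluation loop, the snippet below plays a series of games between an LLM and a trained RL agent, alternating turns and counting a win when the LLM makes the winning move. The helpers query_llm and rl_agent.select_move, as well as the convention of treating an invalid move as a loss, are assumptions for illustration; the released evaluation code defines the actual protocol.

```python
# Sketch of the winrate evaluation: an LLM and a trained RL agent alternate
# moves in a generated environment. Helper names are illustrative assumptions.
def play_one_game(env, game_description, query_llm, rl_agent, llm_moves_first=True):
    state, _ = env.reset()
    llm_to_move = llm_moves_first
    while True:
        moves = env.valid_moves()
        if llm_to_move:
            prompt = (
                f"Game rules:\n{game_description}\n\n"
                f"Current board state:\n{state}\n\n"
                f"Valid moves: {moves}\n"
                "Output the move you wish to take."
            )
            action = query_llm(prompt)   # must parse the model's reply into one of `moves`
            if action not in moves:      # one possible convention: invalid move counts as a loss
                return False
        else:
            action = rl_agent.select_move(state, moves)
        state, reward, terminated, _, _ = env.step(action)
        if terminated:
            # reward > 0 means the player who just moved won; a draw is not a win.
            return llm_to_move and reward > 0
        llm_to_move = not llm_to_move


def estimate_winrate(env, game_description, query_llm, rl_agent, n_games=100):
    wins = sum(
        play_one_game(env, game_description, query_llm, rl_agent)
        for _ in range(n_games)
    )
    return wins / n_games
```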

Key Findings and Numerical Results

gg-bench presents a formidable challenge to contemporary LLMs. The paper reports that state-of-the-art models such as GPT-4o and Claude 3.7 Sonnet achieve winrates of only 7-9% using in-context learning, whereas reasoning models such as o1, o3-mini, and DeepSeek-R1 reach substantially higher average winrates of 31-36%. These findings underscore the importance of structured decision-making and long-horizon planning, capabilities that reasoning models emphasize.

Implications and Future Directions

The dynamic nature of gg-bench promises several advantages. It scales: as models improve, new and harder games can be synthesized, preventing benchmark saturation. Additionally, because the benchmark is synthetically generated, the evaluation remains controllable and can be tailored to probe specific model capabilities. Moreover, because the games themselves are LLM-generated and novel, gg-bench reduces the risk that models succeed through memorization of training data, and instead tests genuine adaptability.

The paper demonstrates the potential of self-generated tasks to avoid the pitfalls of static benchmarks, which may not reflect genuine domain-independent capabilities. While this approach cannot cover every aspect of general intelligence, such as social reasoning, it offers a scalable proxy for evaluating model performance in novel and diverse settings.

As AI research progresses, the methodology put forth in this paper points toward benchmarks that can scale and adapt alongside the models they evaluate, providing a framework for assessing progress toward artificial general intelligence.
