
gg-bench: Data-Generating Reasoning Benchmark

Updated 19 November 2025
  • gg-bench is a data-generating benchmark that evaluates language model reasoning via procedurally generated two-player games with automated rulebooks and environments.
  • It employs a three-stage LLM-driven pipeline to generate game descriptions, code, and competitive RL agents, ensuring scalable and diverse evaluation.
  • Win rate metrics demonstrate that reasoning-tuned models significantly outperform non-reasoning LLMs, underscoring the benchmark's discriminative power.

gg-bench is a data-generating benchmark designed to measure general reasoning capabilities in LLMs via the medium of novel, procedurally generated two-player games. Unlike static evaluation suites, gg-bench does not constitute a finite dataset but rather a stochastic pipeline, where new game environments, along with their implementations and competitive agents, are created on-demand through LLMs and reinforcement learning. Evaluation is conducted by measuring the win rate of LLMs against robust RL agents trained on the generated game suite, providing a stringent and scalable test of systematic reasoning and generalization in unfamiliar, formal domains (Verma et al., 12 May 2025).

1. Benchmark Design and Data-Generating Process

At its core, gg-bench defines a randomized game-generation pipeline. The process is formalized as a function

G : \Omega \times \{\theta_{\mathrm{LLM}}, \theta_{\mathrm{RL}}\} \to (d, e, \pi_{\mathrm{RL}})

where $\Omega$ denotes the space of random seeds, $\theta_{\mathrm{LLM}}$ the LLM parameters used for game and code creation, and $\theta_{\mathrm{RL}}$ the hyperparameters for policy learning. For each sampled seed $\omega \in \Omega$:

  • $d(\omega; \theta_{\mathrm{LLM}})$: A natural-language description ("rulebook") of a novel two-player, turn-based, draw-free game is generated by prompting an LLM.
  • $e(d; \theta_{\mathrm{LLM}})$: The LLM writes Python code implementing the game as an OpenAI Gym environment with a fixed interface: CustomEnv(gym.Env) supporting step, reset, render, and valid_moves methods.
  • $\pi_{\mathrm{RL}} = \mathrm{TrainSelfPlay}(e; \theta_{\mathrm{RL}})$: Using PPO, a self-play RL policy is trained to near-optimality within the generated environment.

Each tuple $(d, e, \pi_{\mathrm{RL}})$ is a "game instance" in gg-bench. This process ensures that new, diverse games can be synthesized in unlimited quantity, each equipped with a robust automated adversary for reliable model assessment (Verma et al., 12 May 2025).
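For concreteness, the interface contract in the second stage (a CustomEnv(gym.Env) exposing step, reset, render, and valid_moves) can be pictured with the minimal skeleton below. This is not code from the benchmark; the board representation, reward values, and win condition are placeholder assumptions, and only the method names reflect what the pipeline actually requires.

```python
import gym
import numpy as np
from gym import spaces


class CustomEnv(gym.Env):
    """Hypothetical skeleton of a gg-bench-style generated environment.

    Only the interface (step, reset, render, valid_moves) is fixed by the
    benchmark; the state, rewards, and win condition below are placeholders.
    """

    def __init__(self):
        super().__init__()
        # Placeholder sizes; each generated game defines its own bounds.
        self.action_space = spaces.Discrete(25)
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(25,), dtype=np.float32)
        self.reset()

    def reset(self):
        self.board = np.zeros(25, dtype=np.float32)
        self.current_player = 1
        return self.board.copy()

    def step(self, action):
        # Illegal moves are penalized and terminate the episode.
        if action not in self.valid_moves():
            return self.board.copy(), -10.0, True, {"invalid_move": True}
        self.board[action] = self.current_player
        done = self._is_win() or len(self.valid_moves()) == 0
        reward = 1.0 if self._is_win() else 0.0
        self.current_player *= -1
        return self.board.copy(), reward, done, {}

    def render(self):
        return str(self.board.reshape(5, 5))

    def valid_moves(self):
        return [i for i in range(25) if self.board[i] == 0]

    def _is_win(self):
        # Game-specific win condition; trivial placeholder here.
        return False
```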

2. Game Properties, Sampling, and Complexity Constraints

gg-bench games are constrained to two-player, fully observable, zero-sum, discrete-action, bounded-horizon settings. Random seeds sampled from $\Omega$ yield diverse game designs through unconstrained LLM-based generation, provided that generated games:

  • Terminate within a maximum of $H_{\max} = 100$ moves (episode horizon).
  • Possess a finite action space with an empirical upper bound of $A_{\max} \approx 2{,}500$ distinct actions per state.

The total state-space cardinality for a given game $g$ is thus bounded by

|S_g| \leq \sum_{t=0}^{H_{\max}} |A_g|^t = O(A_{\max}^{H_{\max}})

The horizon cap guarantees that episodes always end via timeout if they have not otherwise terminated. This hard cap on both horizon and action cardinality ensures computational tractability and consistent evaluation across highly varied games (Verma et al., 12 May 2025).
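One plausible way to realize the timeout guarantee is a thin wrapper around any generated environment, sketched below under the classic Gym step API; the wrapper name and the choice to flag the timeout in info are assumptions, not part of the published code.

```python
import gym


class HorizonCap(gym.Wrapper):
    """Illustrative wrapper that truncates any episode after h_max moves."""

    def __init__(self, env, h_max=100):
        super().__init__(env)
        self.h_max = h_max
        self.t = 0

    def reset(self, **kwargs):
        self.t = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        if self.t >= self.h_max and not done:
            done = True
            info["timeout"] = True  # episode ended via the hard horizon cap
        return obs, reward, done, info
```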

3. Evaluation Methodology and Metrics

Model evaluation follows a two-agent competitive protocol:

  • The rulebook $d$ and the code-derived action-index mapping are provided as a system prompt.
  • On each turn, the LLM is prompted with the current rendered board state and a list of valid action indices, and must select an action by index. Outputs not corresponding to a legal action are corrected by a single re-prompt.
  • Matches are played against the self-play RL policy $\pi_{\mathrm{RL}}$ over 126 sampled games, with 30 matches per game instance.

The principal metric is the empirical win rate:

W(\pi, \pi_{\mathrm{RL}}) = \mathbb{E}_{g \sim G}\left[\mathbf{1}\{\pi \text{ beats } \pi_{\mathrm{RL}} \text{ in } g\}\right]

Statistical reporting uses the mean and 95% confidence intervals over all matches. Results show that non-reasoning LLMs (LLaMA-3.3-70B, GPT-4o, Claude 3.7 Sonnet) achieve win rates of 7–9%, while reasoning-tuned models (o3-mini, DeepSeek-R1, o1) reach 31–36% (Verma et al., 12 May 2025).

Model               Win Rate (%)   95% CI
LLaMA-3.3-70B       7.42           ±2.78
GPT-4o-mini         7.64           ±2.26
GPT-4o              8.94           ±2.77
Claude 3.7 Sonnet   9.53           ±3.05
o3-mini             31.08          ±5.73
DeepSeek-R1         32.50          ±5.14
o1                  36.28          ±5.95

The consistently low scores of non-reasoning, in-context LLMs compared to models explicitly tuned for reasoning attest to the benchmark's difficulty and discriminative power.
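As a small illustration of the metric aggregation, the sketch below computes a mean win rate and a 95% confidence interval over a flat list of match outcomes. The normal-approximation interval and the synthetic data are assumptions; the paper does not spell out the exact interval construction.

```python
import numpy as np


def win_rate_with_ci(outcomes, z=1.96):
    """Mean win rate (%) and half-width of a 95% CI (normal approximation).

    `outcomes` is a flat array of 0/1 match results, e.g. 126 games x 30
    matches each, with 1 indicating the LLM beat the RL agent.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    p = outcomes.mean()
    half_width = z * np.sqrt(p * (1 - p) / len(outcomes))
    return 100 * p, 100 * half_width


# Example with synthetic data: 126 games x 30 matches at roughly a 9% win rate.
rng = np.random.default_rng(0)
results = rng.binomial(1, 0.09, size=126 * 30)
mean, ci = win_rate_with_ci(results)
print(f"win rate = {mean:.2f}% +/- {ci:.2f}%")
```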

4. Generation Pipeline and Implementation Details

The architecture of gg-bench encompasses three critical LLM-driven transformation stages:

  1. Game description → rulebook: The LLM is instructed to invent a two-player, turn-based console game (well-known games such as Go or Chess are barred, and draws are disallowed), detailing all mechanics, objectives, components, scoring rules, and sample play-by-play sequences.
  2. Rulebook → Gym code: The model constructs a compliant Python Gym environment based on the rulebook, following a canonical interface template (action space, state space, step logic, rendering, legal action enumeration).
  3. Action-index assignment: The model provides an explicit mapping from integer action indices to game moves, used for deterministic inference-time mapping.

RL agent training employs PPO with a clipped surrogate loss and generalized advantage estimation (GAE). Default hyperparameters include: learning rate $3 \times 10^{-4}$, $\gamma = 0.99$, $\lambda = 0.95$, clip ratio 0.2, batch size 64, rollout length 2048, $10^6$ total timesteps, and $\varepsilon$-greedy exploration annealed from 1.0 to 0.1. For inference, each state is evaluated using 100 self-play MCTS (Monte Carlo Tree Search) rollouts, with actions chosen by maximal visit count.
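These hyperparameters map directly onto a standard PPO implementation. The sketch below uses stable-baselines3 purely for illustration; the paper does not state which implementation was used, and the self-play opponent wrapper and the $\varepsilon$-greedy schedule would require additional custom code on top of this.

```python
from stable_baselines3 import PPO


def train_self_play(env, total_timesteps=1_000_000):
    """Train a PPO agent on one generated environment (illustrative only).

    `env` is assumed to be a generated CustomEnv wrapped so that the
    opponent's turns are played by a frozen snapshot of the learner
    (self-play); that wrapper and the epsilon-greedy exploration schedule
    from 1.0 to 0.1 are not shown here.
    """
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=3e-4,   # stated learning rate
        gamma=0.99,           # discount factor
        gae_lambda=0.95,      # GAE lambda
        clip_range=0.2,       # PPO clipping ratio
        batch_size=64,
        n_steps=2048,         # rollout length per policy update
        verbose=0,
    )
    model.learn(total_timesteps=total_timesteps)
    return model
```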

LLM agents are evaluated using system and per-turn prompts. Invalid move indices trigger a single corrective re-prompt before the game is scored.
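A minimal sketch of this per-turn protocol, assuming a hypothetical query_llm helper and a fallback to the first legal action when the single re-prompt also fails (the paper does not specify the fallback), might look as follows.

```python
def llm_select_action(query_llm, board_text, valid_indices):
    """Ask the LLM for an action index, allowing one corrective re-prompt.

    `query_llm(prompt) -> str` is a hypothetical helper that returns the
    model's raw answer; prompt wording and parsing are simplified.
    """
    prompt = (
        f"Current board:\n{board_text}\n"
        f"Valid action indices: {valid_indices}\n"
        "Reply with a single valid index."
    )
    for _ in range(2):  # initial prompt plus one corrective re-prompt
        answer = query_llm(prompt)
        try:
            action = int(answer.strip())
        except ValueError:
            action = None
        if action in valid_indices:
            return action
        prompt = (
            f"'{answer}' is not a valid action index. "
            f"Choose one of: {valid_indices}"
        )
    # If the single re-prompt also fails, fall back to the first legal
    # action so the match can still be scored (an assumption).
    return valid_indices[0]
```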

5. Scope, Limitations, and Future Extensions

Current gg-bench coverage is restricted to two-player, zero-sum, fully observed, discrete-action, bounded-horizon games. Social, cooperative, or partially observed games remain out of scope. Some generated environments may encode small, hard-coded constants; extending beyond these may require careful game-specific code review.

Future directions explicitly articulated include:

  • Support for multi-player and non-zero-sum games.
  • Incorporation of hidden information, stochasticity, or complex multi-phase/hierarchical action spaces.
  • Tightening or relaxing environment complexity by generator parameterization or by employing stronger LLMs and RL agents.
  • Direct arena-style LLM versus LLM play to mitigate RL agent training overhead and to benchmark relative LLM agent capability head-to-head.
  • Adjusting difficulty by varying RL checkpoint strength or by curated selection for specific win-rate targets.

A plausible implication is that gg-bench's procedural nature confers resistance to data contamination and supports open-ended expansion as both the environment generator (LLM) and the baseline RL agents improve.

6. Significance and Comparative Positioning

gg-bench addresses a notable gap in current evaluation methodologies for LLM reasoning: the lack of scalable, procedurally generated, general-reasoning benchmarks with verifiable solution pathways in complex, novel environments. Benchmarking via competitive games with rigorously trained RL agents provides an adversarial, objective framework largely immune to overfitting and contamination concerns, as the benchmark can be entirely resampled.

This contrasts with benchmark frameworks such as GGBench for geometric reasoning (Wei et al., 14 Nov 2025), which focus on generative symbolic reasoning in spatial tasks but do not provide an open-ended, generative adversarial testbed. gg-bench's architecture offers unique potential for cross-domain transfer, curriculum development, and dynamic difficulty adjustment relative to static or single-domain alternatives.

gg-bench thereby constitutes a principal instrument for evaluating general reasoning, with extensibility, robustness, and sensitivity to architectural and algorithmic advances in LLM research.
