gg-bench: Data-Generating Reasoning Benchmark
- gg-bench is a data-generating benchmark that evaluates language model reasoning via procedurally generated two-player games with automated rulebooks and environments.
- It employs a three-stage LLM-driven pipeline to generate game descriptions, code, and competitive RL agents, ensuring scalable and diverse evaluation.
- Win rate metrics demonstrate that reasoning-tuned models significantly outperform non-reasoning LLMs, underscoring the benchmark's discriminative power.
gg-bench is a data-generating benchmark designed to measure general reasoning capabilities in LLMs through novel, procedurally generated two-player games. Unlike static evaluation suites, gg-bench is not a finite dataset but a stochastic pipeline in which new game environments, along with their implementations and competitive agents, are created on demand through LLMs and reinforcement learning. Evaluation measures the win rate of LLMs against robust RL agents trained on the generated game suite, providing a stringent and scalable test of systematic reasoning and generalization in unfamiliar, formal domains (Verma et al., 12 May 2025).
1. Benchmark Design and Data-Generating Process
At its core, gg-bench defines a randomized game-generation pipeline. The process is formalized as a function

$$G : \mathcal{S} \times \Theta_{\mathrm{LLM}} \times \Theta_{\mathrm{RL}} \to \mathcal{G},$$

where $\mathcal{S}$ denotes the space of random seeds, $\Theta_{\mathrm{LLM}}$ the LLM parameters used for game and code creation, and $\Theta_{\mathrm{RL}}$ the hyperparameters for policy learning. For each sampled seed $s \in \mathcal{S}$:
- Rulebook $d_s$: A natural-language description ("rulebook") of a novel two-player, turn-based, zero-draw game is generated by prompting an LLM.
- Environment $c_s$: The LLM writes Python code implementing the game as an OpenAI Gym environment with a specific interface: `CustomEnv(gym.Env)` supporting `step`, `reset`, `render`, and `valid_moves` methods.
- Policy $\pi_s$: Using PPO, a self-play RL policy is trained to near-optimality within the generated environment.

Each tuple $(d_s, c_s, \pi_s)$ is a "game instance" in gg-bench. This process ensures that new, diverse games can be synthesized in unlimited quantity, each equipped with a robust automated adversary for reliable model assessment (Verma et al., 12 May 2025).
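For concreteness, a minimal sketch of the interface contract that generated environments must satisfy is given below. The toy counting game, its state encoding, and the reward convention are illustrative assumptions; only the `CustomEnv(gym.Env)` class and its `step`, `reset`, `render`, and `valid_moves` methods are mandated by the benchmark.

```python
import gym
import numpy as np
from gym import spaces

class CustomEnv(gym.Env):
    """Toy stand-in for a generated game: players alternately add 1-3 to a shared
    total; the player who brings the total to exactly 10 wins (no draws)."""

    def __init__(self):
        super().__init__()
        self.action_space = spaces.Discrete(3)                 # index a -> add a + 1
        self.observation_space = spaces.Box(0, 10, shape=(2,), dtype=np.int64)
        self.reset()

    def reset(self):
        self.total, self.current_player = 0, 0
        return self._obs()

    def _obs(self):
        return np.array([self.total, self.current_player], dtype=np.int64)

    def step(self, action):
        # Classic Gym API: (observation, reward, done, info).
        self.total += action + 1
        if self.total == 10:                                   # the mover wins
            return self._obs(), 1.0, True, {}
        self.current_player = 1 - self.current_player
        return self._obs(), 0.0, False, {}

    def render(self, mode="human"):
        print(f"Total: {self.total} | Player {self.current_player} to move")

    def valid_moves(self):
        # Action indices that are legal in the current state.
        return [a for a in range(3) if self.total + a + 1 <= 10]
```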
2. Game Properties, Sampling, and Complexity Constraints
gg-bench games are constrained to two-player, fully observable, zero-sum, discrete-action, bounded-horizon settings. Random seeds sampled from $\mathcal{S}$ yield diverse game designs through unconstrained LLM-based generation, provided that generated games:
- Terminate within a maximum of $T_{\max}$ moves (episode horizon).
- Possess a finite action space with an empirical upper bound of $A_{\max}$ distinct actions per state.

The total state-space cardinality for a given game is thus bounded on the order of $A_{\max}^{T_{\max}}$, and the horizon cap ensures episodes always end via timeout if not previously terminated. This hard cap on both horizon and action cardinality ensures computational tractability and consistent evaluation across highly varied games (Verma et al., 12 May 2025).
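As one way to make the bounded-horizon guarantee concrete, the sketch below wraps an arbitrary generated environment and forces termination after a fixed number of moves. The `T_MAX` constant and the classic Gym step signature are assumptions; generated environments may instead enforce the cap internally.

```python
import gym

T_MAX = 100  # hypothetical horizon; the benchmark imposes an analogous fixed bound

class BoundedHorizon(gym.Wrapper):
    """Terminate any episode after T_MAX moves so evaluation always ends via
    timeout even if the underlying generated game never reaches a terminal state."""

    def reset(self, **kwargs):
        self.moves = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.moves += 1
        if self.moves >= T_MAX and not done:
            done, info = True, {**info, "timeout": True}
        return obs, reward, done, info
```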
3. Evaluation Methodology and Metrics
Model evaluation follows a two-agent competitive protocol:
- The rulebook and the code-derived action-index mapping are provided as a system prompt.
- On each turn, the LLM is prompted with the current rendered board state and a list of valid action indices, and must select an action by index. Outputs not corresponding to a legal action are corrected by a single re-prompt.
- Matches are played against the self-play RL policy over $126$ sampled games, with $30$ matches per game instance.
The principal metric is the empirical win rate,

$$\mathrm{WR} = \frac{\#\{\text{matches won by the LLM}\}}{\#\{\text{matches played}\}},$$

reported as the mean with 95% confidence intervals over all matches. Results show non-reasoning LLMs (LLaMA-3.3-70B, GPT-4o, Claude 3.7 Sonnet) achieve win rates of roughly 7–10%, while reasoning-tuned models (o3-mini, DeepSeek-R1, o1) reach 31–36% (Verma et al., 12 May 2025).
| Model | Win Rate (%) | 95% CI |
|---|---|---|
| LLaMA-3.3-70B | 7.42 | ±2.78 |
| GPT-4o-mini | 7.64 | ±2.26 |
| GPT-4o | 8.94 | ±2.77 |
| Claude 3.7 Sonnet | 9.53 | ±3.05 |
| o3-mini | 31.08 | ±5.73 |
| DeepSeek-R1 | 32.50 | ±5.14 |
| o1 | 36.28 | ±5.95 |
The consistently low scores of non-reasoning, in-context LLMs compared to models explicitly tuned for reasoning attest to the benchmark's difficulty and discriminative power.
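The reported intervals can be reproduced from raw per-match outcomes with a normal-approximation binomial confidence interval; the helper below is a sketch under that assumption, not the authors' exact statistics code.

```python
import math

def win_rate_with_ci(outcomes, z=1.96):
    """outcomes: iterable of 1 (LLM win) / 0 (LLM loss).
    Returns (win rate in %, half-width of the 95% CI in %)."""
    outcomes = list(outcomes)
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)   # normal approximation to the binomial
    return 100 * p, 100 * half_width

# Example: 126 games x 30 matches = 3780 binary outcomes per evaluated model.
```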
4. Generation Pipeline and Implementation Details
The architecture of gg-bench encompasses three critical LLM-driven transformation stages:
- Game idea → rulebook: The LLM is instructed to invent a two-player, turn-based console game (excluding well-known games such as Go or Chess, and disallowing draws), detailing all mechanics, objectives, components, scoring rules, and sample play-by-play sequences.
- Rulebook → Gym code: The model constructs a compliant Python Gym environment based on the rulebook, following a canonical interface template (action space, state space, step logic, rendering, legal action enumeration).
- Action-index assignment: The model provides an explicit mapping from integer action indices to game moves, used for deterministic inference-time mapping.
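The three stages compose as in the sketch below, where `call_llm` is a hypothetical text-in/text-out client and the prompts paraphrase, rather than reproduce, the benchmark's templates.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; substitute any chat-completion API."""
    raise NotImplementedError

def generate_game_instance(seed: int) -> dict:
    # Stage 1: invent a novel two-player, turn-based, draw-free console game.
    rulebook = call_llm(
        f"(seed {seed}) Invent a new two-player, turn-based console game with no draws. "
        "Avoid well-known games such as Chess or Go. Describe components, objective, "
        "turn mechanics, scoring, and a sample play-by-play."
    )
    # Stage 2: implement the rulebook as a Gym environment with the required interface.
    env_code = call_llm(
        "Implement this game as a Python class CustomEnv(gym.Env) exposing step, reset, "
        f"render, and valid_moves:\n{rulebook}"
    )
    # Stage 3: fix a deterministic mapping from integer action indices to game moves.
    action_map = json.loads(call_llm(
        "Output a JSON object mapping every action index of this environment to the "
        f"move it denotes:\n{env_code}"
    ))
    return {"rulebook": rulebook, "env_code": env_code, "action_map": action_map}
```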
RL agent training employs PPO with the clipped surrogate loss and generalized advantage estimation (GAE). Default hyperparameters include standard settings for the learning rate, discount factor $\gamma$, and GAE coefficient $\lambda$, along with a clip ratio of 0.2, a batch size of 64, rollouts of 2048 steps, $10^6$ training timesteps, and $\epsilon$-greedy exploration annealed from 1.0 to 0.1. For inference, each state is evaluated using 100 self-play MCTS (Monte Carlo Tree Search) rollouts, with actions chosen by maximal visit count.
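A minimal training sketch using stable-baselines3's PPO is shown below; the library choice is an assumption, and the self-play opponent wrapper, the $\epsilon$-greedy schedule, and the MCTS inference step are omitted. The stated batch size, rollout length, clip ratio, and timestep budget are carried over.

```python
from stable_baselines3 import PPO

def train_agent(env):
    # PPO with clipped surrogate loss and GAE (both built into this implementation).
    # Self-play opponent wrapping, epsilon-greedy exploration, and MCTS-based
    # inference are not shown in this sketch.
    model = PPO(
        "MlpPolicy", env,
        n_steps=2048,        # rollout length
        batch_size=64,
        clip_range=0.2,
    )
    model.learn(total_timesteps=1_000_000)
    return model
```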
LLM agents are evaluated using system and per-turn prompts. Invalid move indices trigger a single corrective re-prompt before the game is scored.
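The per-match protocol can be sketched as follows; `query_llm` and `rl_agent_move` are hypothetical callables returning an action index, and the terminal-reward convention (the mover who ends the game with positive reward wins) is an assumption about the generated environments.

```python
def play_match(env, rulebook, action_map, query_llm, rl_agent_move, llm_first=True):
    """Play one LLM-vs-RL-agent match and return 'win' or 'loss' from the LLM's side."""
    system_prompt = f"Rules:\n{rulebook}\nAction index mapping:\n{action_map}"
    obs, done, llm_turn = env.reset(), False, llm_first
    while not done:
        legal = env.valid_moves()
        if llm_turn:
            board = env.render(mode="ansi")            # assumes a text-returning render mode
            action = query_llm(system_prompt, board, legal)
            if action not in legal:                    # single corrective re-prompt
                action = query_llm(system_prompt, board, legal, retry=True)
            if action not in legal:
                return "loss"                          # still invalid: scored against the LLM
        else:
            action = rl_agent_move(obs, legal)
        obs, reward, done, _ = env.step(action)
        if done:
            return "win" if (llm_turn and reward > 0) else "loss"
        llm_turn = not llm_turn
    return "loss"
```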
5. Scope, Limitations, and Future Extensions
Current gg-bench coverage is restricted to two-player, zero-sum, fully observed, discrete-action, bounded-horizon games. Social, cooperative, or partially observed games remain out of scope. Some generated environments may encode small, hard-coded constants; extending beyond these may require careful game-specific code review.
Future directions explicitly articulated include:
- Support for multi-player and non-zero-sum games.
- Incorporation of hidden information, stochasticity, or complex multi-phase/hierarchical action spaces.
- Tightening or relaxing environment complexity by generator parameterization or by employing stronger LLMs and RL agents.
- Direct arena-style LLM versus LLM play to mitigate RL agent training overhead and to benchmark relative LLM agent capability head-to-head.
- Adjusting difficulty by varying RL checkpoint strength or by curated selection for specific win-rate targets.
A plausible implication is that gg-bench's procedural nature confers resistance to data contamination and supports open-ended expansion as both the environment generator (LLM) and the baseline RL agents improve.
6. Significance and Comparative Positioning
gg-bench addresses a notable gap in current evaluation methodologies for LLM reasoning: the lack of scalable, procedurally generated, general-reasoning benchmarks with verifiable solution pathways in complex, novel environments. Benchmarking via competitive games with rigorously trained RL agents provides an adversarial, objective framework largely immune to overfitting and contamination concerns, as the benchmark can be entirely resampled.
This contrasts with benchmark frameworks such as GGBench for geometric reasoning (Wei et al., 14 Nov 2025), which focuses on symbolic generative reasoning in spatial tasks but does not provide an open-ended, generative adversarial testbed. gg-bench's architecture offers unique potential for cross-domain transfer, curriculum development, and dynamic difficulty adjustment relative to static or single-domain alternatives.
gg-bench thereby constitutes a principal instrument for evaluating general reasoning, with extensibility, robustness, and sensitivity to architectural and algorithmic advances in LLM research.