ReasoningGym: Evaluating LLM Reasoning

Updated 3 May 2026
  • ReasoningGym is a formal paradigm and open-source platform that procedurally generates multi-step reasoning tasks with verifiable reward signals to assess LLM capabilities.
  • It employs a robust architecture with a generator, verifier, and RL interface, enabling dynamic difficulty control and exact solution checking.
  • Its diverse applications—from algebra and logic to spatial puzzles and multi-agent games—offer a scalable framework for curriculum-driven model improvement.

ReasoningGym is a formal paradigm and open-source platform for evaluating and training the reasoning capabilities of LLMs and related agents through procedurally generated, multi-step environments with verifiable reward signals. Emphasizing algorithmic correctness, large solution spaces, and parametric difficulty, it moves beyond static QA benchmarks by embedding tasks in an interactive, reinforcement learning (RL)–compatible framework. Across core instantiations—such as the original Reasoning Gym suite, Multilingual Reasoning Gym, and knowledge-orthogonal extensions—ReasoningGym establishes a rigorous foundation for both assessment and curriculum-driven improvement of model reasoning in domains including algebra, logic, games, spatial puzzles, and multi-agent scenarios.

1. Foundations and Architecture

At its core, ReasoningGym structures each task as a tuple of (generator, verifier, RL interface), designed for seamless procedural instance creation and exact solution checking (Stojanovski et al., 30 May 2025). The platform’s design principles are:

  • Algorithmic Verifiability: Every problem comes with a deterministic ground-truth checker, ensuring reproducible, unambiguous rewards without human grading.
  • Large Solution Spaces: Environments are configured to support a diverse array of valid policies and intermediate reasoning traces, mitigating spurious reward hacking or memorization.
  • Parametric Difficulty Control: Difficulty and structure are tunable via explicit configuration, enabling both fine-grained curriculum learning and out-of-distribution evaluation.

A typical environment includes:

  • DataGenerator: Produces unique problem instances according to difficulty and other hyperparameters, outputting both the prompt (typically in natural language) and a metadata-encoded ground-truth solution.
  • Verifier: Receives the problem parameters and a candidate solution from the agent, applying fast, task-specific logic (e.g., integer comparison, graph traversal) to return a Boolean or scalar reward.
  • RL-API: Exposes an interface akin to OpenAI Gym, with reset() and step(action) methods for use with standard RL libraries.

A sample pseudocode interaction:

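from reasoning_gym import make_env

# Build a Mini Sudoku environment; the config bounds the number of empty cells.
env = make_env("mini_sudoku", config={"min_empty": 8, "max_empty": 12})
obs = env.reset()  # initial observation: the puzzle prompt
done = False
while not done:
    s = agent.generate(obs)  # `agent` is a placeholder for any policy with a generate() method
    obs, reward, done, info = env.step(s)  # the verifier scores the candidate answer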
(Stojanovski et al., 30 May 2025)
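
To make the (generator, verifier, RL interface) decomposition concrete, the following is a minimal, self-contained sketch of a single-turn environment for a toy addition task. It does not use the reasoning_gym package; the names `AdditionConfig`, `generate`, `verify`, and `SingleTurnEnv` are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class AdditionConfig:
    """Difficulty knobs for the toy task (hypothetical names)."""
    num_terms: int = 2
    max_value: int = 9


def generate(seed: int, cfg: AdditionConfig) -> dict:
    """Generator: build one instance and store the ground truth in metadata."""
    rng = random.Random(seed)
    terms = [rng.randint(0, cfg.max_value) for _ in range(cfg.num_terms)]
    prompt = "Compute: " + " + ".join(map(str, terms))
    return {"prompt": prompt, "metadata": {"answer": sum(terms)}}


def verify(instance: dict, candidate: str) -> float:
    """Verifier: exact integer comparison, 1.0 if correct and 0.0 otherwise."""
    try:
        return float(int(candidate.strip()) == instance["metadata"]["answer"])
    except ValueError:
        return 0.0  # unparseable answers earn no reward


class SingleTurnEnv:
    """Gym-style wrapper: one prompt, one answer, then the episode ends."""

    def __init__(self, cfg: AdditionConfig, seed: int = 0):
        self.cfg = cfg
        self.seed = seed
        self.instance = None

    def reset(self) -> str:
        self.instance = generate(self.seed, self.cfg)
        self.seed += 1  # the next reset yields a fresh instance
        return self.instance["prompt"]

    def step(self, action: str):
        reward = verify(self.instance, action)
        info = {"answer": self.instance["metadata"]["answer"]}
        return "", reward, True, info  # (observation, reward, done, info)


env = SingleTurnEnv(AdditionConfig(num_terms=3, max_value=20), seed=42)
prompt = env.reset()
# An oracle "agent" that parses the prompt and adds the terms.
oracle_answer = str(sum(int(t) for t in prompt.split(": ")[1].split(" + ")))
_, reward, done, _ = env.step(oracle_answer)
```

Keeping the ground truth in instance metadata means the verifier never has to re-solve the problem, so reward computation stays cheap.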

2. Task Domains, Instance Generation, and Verifier Design

The ReasoningGym suite covers 100+ environments spanning:

  • Algebra (e.g., equation solving, polynomial manipulation)
  • Algorithms (sorting, dynamic programming)
  • Arithmetic and number theory
  • Cognition/ARC-like pattern induction
  • Games and puzzles (e.g., mini Sudoku, N-Queens, Rubik’s Cube, Sokoban)
  • Geometry, graph theory, logic, code interpretation

All environments use parametric procedural generation: complexity is controlled via hyperparameters (e.g., size of input, degree, number of steps), and problem instances are sampled with a fixed RNG seed per experiment for reproducibility. The verifier always computes or knows the correct answer and applies the appropriate equivalence/acceptance logic.
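
As an illustration of seeded, parametric generation, the sketch below builds a toy sorting task whose difficulty is set by list length. The function names and the `length` and `max_value` parameters are assumptions made for this example, not the suite's actual configuration keys.

```python
import random


def make_sorting_instance(seed: int, length: int, max_value: int = 99) -> dict:
    """Sample one instance from a local RNG so no global state leaks in."""
    rng = random.Random(seed)
    values = [rng.randint(0, max_value) for _ in range(length)]
    return {
        "prompt": f"Sort in ascending order: {values}",
        "answer": sorted(values),  # the verifier's ground truth
        "params": {"seed": seed, "length": length},
    }


def make_dataset(base_seed: int, size: int, length: int) -> list:
    """Derive one seed per instance from a single experiment-level seed."""
    return [make_sorting_instance(base_seed + i, length) for i in range(size)]


easy = make_dataset(base_seed=0, size=3, length=4)
hard = make_dataset(base_seed=0, size=3, length=12)

# Regenerating with the same seed and parameters gives identical instances.
reproduced = make_dataset(base_seed=0, size=3, length=4)
```

Because each instance depends only on its seed and hyperparameters, an evaluation set can be shared as a short configuration rather than as a data file.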

Verifiers are implemented as task-specific Python routines—ranging from exact matching to graph isomorphism checking or tolerance-based comparison for geometric quantities. Reward functions are typically binary (1 for correctness, 0 otherwise), but composite rewards (e.g., accuracy plus formatting adherence) and per-step process incentives are supported:

$$r_{total} = r_{acc} + \alpha \cdot r_{fmt}$$

(Stojanovski et al., 30 May 2025, Dobler et al., 11 Mar 2026)
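
A composite reward of this form can be sketched as follows. The `<answer>...</answer>` tag convention, the tolerance, and the default weight `alpha=0.1` are assumptions chosen for illustration; the cited papers may use different formats and weights.

```python
import math
import re

# Hypothetical output format: the final answer is wrapped in <answer> tags.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def accuracy_reward(response: str, target: float, tol: float = 1e-6) -> float:
    """r_acc: tolerance-based numeric comparison against the ground truth."""
    match = ANSWER_RE.search(response)
    text = match.group(1) if match else response  # fall back to the raw text
    try:
        value = float(text.strip())
    except ValueError:
        return 0.0
    return float(math.isclose(value, target, abs_tol=tol))


def format_reward(response: str) -> float:
    """r_fmt: 1.0 if the response follows the assumed tag format."""
    return float(ANSWER_RE.search(response) is not None)


def total_reward(response: str, target: float, alpha: float = 0.1) -> float:
    """r_total = r_acc + alpha * r_fmt."""
    return accuracy_reward(response, target) + alpha * format_reward(response)


well_formatted = total_reward("<answer>3.5</answer>", 3.5)  # correct and tagged
untagged = total_reward("3.5", 3.5)                         # correct, no tags
wrong = total_reward("<answer>4</answer>", 3.5)             # tagged but wrong
```

Keeping the weight small relative to the accuracy term means formatting alone cannot outscore a correct answer.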

3. Evaluation Protocols and Metrics

Evaluation in ReasoningGym environments is multi-faceted:

  • Task accuracy: Fraction of correct final answers across sampled (or exhaustive) instance sets at given difficulty.
  • Process metrics: Where available, per-step correctness (e.g., intermediate moves in puzzles or game states).
  • Faithfulness and verifiability: Notably, recent work introduces the Causal Importance of Reasoning (CIR) and Sufficiency of Reasoning (SR) metrics (Yu et al., 23 Apr 2026):

    • CIR quantifies the cumulative impact of the model’s reasoning tokens on the output answer distribution via Jensen-Shannon divergence:

    $$\mathrm{CIR} = \frac{1}{T} \sum_{k=1}^{T} \mathrm{JS}\big(\mathrm{Bern}(p_k) \parallel \mathrm{Bern}(p_T)\big)$$

    where $p_k$ is the model's answer probability given the first $k$ tokens of the reasoning trace.

    • SR tests whether an external verifier can determine the answer from the reasoning trace alone, by comparing decoded answers with and without the original question context.
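
The CIR definition above can be evaluated directly once the per-prefix answer probabilities are available. The sketch below assumes they are given as a plain list `probs` of length T and uses natural logarithms; the source does not state the logarithm base, so that choice is an assumption.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL(Bern(p) || Bern(q)) in nats, with the convention 0 * log(0/x) = 0."""
    total = 0.0
    for a, b in ((p, q), (1.0 - p, 1.0 - q)):
        if a > 0.0:
            total += a * math.log(a / b)
    return total


def bernoulli_js(p: float, q: float) -> float:
    """Jensen-Shannon divergence between Bern(p) and Bern(q)."""
    m = 0.5 * (p + q)  # the mixture is nonzero wherever p or q is nonzero
    return 0.5 * bernoulli_kl(p, m) + 0.5 * bernoulli_kl(q, m)


def cir(probs: list) -> float:
    """Average JS divergence between each prefix probability p_k and p_T."""
    p_final = probs[-1]
    return sum(bernoulli_js(p_k, p_final) for p_k in probs) / len(probs)


# If the answer probability never moves, the reasoning tokens had no
# measurable influence on the answer and CIR is zero.
flat_trace = cir([0.7, 0.7, 0.7])
# A trace whose probability moves toward the final value gives a positive CIR.
rising_trace = cir([0.1, 0.4, 0.9])
```

Under this definition, a value near zero is consistent with post-hoc rationalization, where the answer is already fixed before the reasoning is produced.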

Per-domain, per-difficulty, and cross-lingual accuracy gaps are tracked, as well as calibration against external benchmarks (MATH, Big-Bench Hard, MMLU-Pro Math, etc.).
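
Grouped accuracy of the kind described here takes only a few lines to compute. The records below are made-up toy data used to exercise the function, not results from any paper, and the field names `difficulty`, `lang`, and `correct` are assumptions.

```python
from collections import defaultdict


def accuracy_by_group(records: list, key: str) -> dict:
    """Fraction of correct answers, grouped by the given record field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += 1
        hits[rec[key]] += int(rec["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy verifier outcomes (illustrative only).
records = [
    {"difficulty": "easy", "lang": "en", "correct": True},
    {"difficulty": "easy", "lang": "en", "correct": True},
    {"difficulty": "easy", "lang": "zh", "correct": True},
    {"difficulty": "easy", "lang": "zh", "correct": False},
    {"difficulty": "hard", "lang": "en", "correct": True},
    {"difficulty": "hard", "lang": "en", "correct": False},
    {"difficulty": "hard", "lang": "zh", "correct": False},
    {"difficulty": "hard", "lang": "zh", "correct": False},
]

by_difficulty = accuracy_by_group(records, "difficulty")
by_language = accuracy_by_group(records, "lang")
cross_lingual_gap = by_language["en"] - by_language["zh"]
```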

4. RL-Driven Learning and Reward Schemes

ReasoningGym is built for RL with verifiable rewards (RLVR). In the sample interaction loop, G denotes the instance generator, T the prompt template, and V the verifier:

θ ← G(seed, difficulty)
prompt ← T(θ)
agent_output ŷ ← policy(prompt)
correctness ← V(θ, ŷ)
reward ← function(correctness, ŷ)
(Dobler et al., 11 Mar 2026, Stojanovski et al., 30 May 2025)
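
The loop can be written as runnable Python for a toy "spell backwards" task. The functions `G`, `T`, and `V` mirror the symbols in the pseudocode; their bodies, the binary `reward_fn`, and the `oracle_policy` stand-in for a model are assumptions made for this sketch.

```python
import random
import string


def G(seed: int, difficulty: int) -> dict:
    """Generator: sample task parameters; difficulty is the word length."""
    rng = random.Random(seed)
    word = "".join(rng.choice(string.ascii_lowercase) for _ in range(difficulty))
    return {"word": word}


def T(theta: dict) -> str:
    """Template: render the parameters as a natural-language prompt."""
    return f"Spell '{theta['word']}' backwards."


def V(theta: dict, y_hat: str) -> bool:
    """Verifier: exact match against the reversed word."""
    return y_hat.strip() == theta["word"][::-1]


def reward_fn(correct: bool, y_hat: str) -> float:
    """Binary outcome reward; y_hat is available for format or length terms."""
    return 1.0 if correct else 0.0


def oracle_policy(prompt: str) -> str:
    """Stand-in for a model: extract the quoted word and reverse it."""
    return prompt.split("'")[1][::-1]


def rollout(policy, seed: int, difficulty: int) -> float:
    theta = G(seed, difficulty)
    prompt = T(theta)
    y_hat = policy(prompt)
    correct = V(theta, y_hat)
    return reward_fn(correct, y_hat)


oracle_reward = rollout(oracle_policy, seed=0, difficulty=5)
wrong_reward = rollout(lambda prompt: "zzz", seed=0, difficulty=5)
```

In an actual RLVR setup the scalar returned by `rollout` would feed a policy-gradient update; that step is omitted here.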

Outcome-based RL improves task accuracy, but—in the absence of targeted reward shaping—may not promote faithful or verifiable reasoning traces (Yu et al., 23 Apr 2026). Augmenting outcome rewards with CIR and/or SR feedback, or seeding the agent with a small set of expert reasoning chains (as in SFT), has been shown to jointly preserve accuracy and promote interpretability and causal engagement of reasoning steps.

Complex reward functions, including per-step process feedback, partial credit for subgoals, or penalties for unnecessary hint usage, are supported. This flexibility is critical for fine-tuning chain-of-thought generation (Dobler et al., 11 Mar 2026, Yu et al., 23 Apr 2026).

5. Multilingual, Multimodal, and Game-Based Extensions

ReasoningGym has been generalized along several axes:

  • Multilingual Reasoning Gym (Dobler et al., 11 Mar 2026): Template-driven translation of 94 tasks into 14 languages, with procedural instance alignment (same seed/parameters, language-specific prompts), enables systematic cross-lingual evaluation. All tasks are managed in parallel, with native validation in 10 languages. Accuracy declines outside English even for top models (e.g., Qwen3-14B shows 54.2% in English vs. 48.3% in Chinese at "easy" difficulty; 36.2% vs. 33.6% at "hard").
  • Spatial-Gym (Kaesberg et al., 10 Apr 2026): Focuses on spatial constraint reasoning (2D pathfinding puzzles) as explicit MDPs, supporting stepwise and backtracking interactions. Human solve rates are 98%; the best LLM, GPT-OSS 120B, achieves only 16%. Adding vision backbones or image input degrades performance further.
  • Knowledge-Orthogonal Extensions (KORGym) (Shi et al., 20 May 2025): Introduces 51 games parameterized along mathematical, puzzle, spatial/geometric, strategic, control-interactive, and multimodal axes. Each is an MDP with standardized RL APIs. Evaluations show that closed-source "thinking" models outperform open-source variants, and that text-based prompts outperform visual ones for LLMs without strong vision encoders.
  • Gamified Formal Reasoning Engines (Walter et al., 2021): Applications such as loop-invariant discovery reframe formal verification as collaborative or competitive interactive games, where procedural and structural feedback guide both human and agent participants through proof discovery with verifiable, logic-based rewards.
  • Modeling Complex Game Environments (Świechowski et al., 22 Feb 2026, Mishra et al., 11 Jun 2025): Incorporation of forward-simulation games, symbolic logic challenges, and multi-agent settings (TTT-Bench, General Game Playing, GameArena) exposes reasoning gaps—especially in multistep planning, abstraction, and strategic foresight—despite strong single-step or mathematical reasoning performance.
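
The procedural instance alignment used for multilingual evaluation (same seed and parameters, language-specific prompts) can be sketched as below. The three templates and the multiplication task are hypothetical examples, not taken from the Multilingual Reasoning Gym itself.

```python
import random

# Hypothetical per-language templates for one task.
TEMPLATES = {
    "en": "What is {a} times {b}?",
    "de": "Was ist {a} mal {b}?",
    "es": "¿Cuánto es {a} por {b}?",
}


def generate_aligned(seed: int, lang: str, max_value: int = 12) -> dict:
    """Sample parameters from the seed alone, then render in one language."""
    rng = random.Random(seed)  # the RNG never sees the language
    a = rng.randint(2, max_value)
    b = rng.randint(2, max_value)
    return {
        "lang": lang,
        "prompt": TEMPLATES[lang].format(a=a, b=b),
        "answer": a * b,
    }


# One underlying instance rendered in every language.
parallel = {lang: generate_aligned(seed=7, lang=lang) for lang in TEMPLATES}
answers = {inst["answer"] for inst in parallel.values()}
prompts = {inst["prompt"] for inst in parallel.values()}
```

Since the underlying instance is identical across languages, any accuracy difference can be attributed to the prompt language rather than to sampling noise in the problems.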

6. Empirical Findings and Model Behavior

  • Task performance is highest for models explicitly trained on verifiable-reward environments. Zero-shot accuracy on easy tasks can reach 60-65% for specialized models, but falls precipitously with increased difficulty or when modalities or languages change (Stojanovski et al., 30 May 2025, Dobler et al., 11 Mar 2026).
  • RLVR robustly boosts accuracy; however, unless coupled with auxiliary process-reward shaping (CIR/SR, SFT with expert traces), it does not guarantee reasoning traces are used or verifiable (Yu et al., 23 Apr 2026).
  • Cross-lingual and cross-domain transfer is quantifiable: RLVR on algorithmic tasks improves out-of-domain accuracy (e.g., +29.1% on algebra, +22.3% on geometry).
  • Curriculum learning—advancing difficulty adaptively as accuracy passes thresholds—yields gains over static training schedules (+40.7% on spelling backwards; +13.3% on Mini Sudoku at increased emptiness).
  • Failure modes include post-hoc rationalization (collapsed CIR/SR with high accuracy), overlong but uninformative chains-of-thought, reliance on pattern matching rather than symbolic manipulation, and domain-specific errors (e.g., hallucinated GDL rules, illegal actions in formal games).
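
The adaptive curriculum described in the list above (raise difficulty once accuracy passes a threshold) can be sketched with a sliding window of recent outcomes. The class name, the 0.8 threshold, and the window size are assumptions for illustration; the cited work may use a different advancement rule.

```python
from collections import deque


class ThresholdCurriculum:
    """Advance to the next difficulty level when windowed accuracy is high."""

    def __init__(self, levels, threshold: float = 0.8, window: int = 20):
        self.levels = list(levels)
        self.threshold = threshold
        self.index = 0
        self.recent = deque(maxlen=window)  # most recent outcomes only

    @property
    def difficulty(self):
        return self.levels[self.index]

    def record(self, correct: bool) -> None:
        """Log one verifier outcome and advance if the threshold is met."""
        self.recent.append(1.0 if correct else 0.0)
        window_full = len(self.recent) == self.recent.maxlen
        accuracy = sum(self.recent) / len(self.recent)
        can_advance = self.index < len(self.levels) - 1
        if window_full and accuracy >= self.threshold and can_advance:
            self.index += 1
            self.recent.clear()  # measure accuracy afresh at the new level


# Levels could be, for example, the number of empty cells in Mini Sudoku.
curriculum = ThresholdCurriculum(levels=[4, 6, 8], threshold=0.8, window=5)
for _ in range(5):
    curriculum.record(True)
after_first_stage = curriculum.difficulty  # advanced from 4 to 6
for _ in range(5):
    curriculum.record(False)
after_failures = curriculum.difficulty     # stays at 6
```

Clearing the window after each advance prevents successes at an easier level from counting toward the next promotion.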

7. Broader Implications and Open Directions

The ReasoningGym paradigm enables robust, contamination-resistant, and fine-grained assessment of model reasoning, particularly where chain-of-thought or modular policies are desired. Procedural, verifiable, multi-step testbeds avoid the data leakage and overfitting that affect static benchmarks. Research shows that metrics such as CIR and SR are essential complements to standard accuracy for tracking progress in model transparency and verifiability (Yu et al., 23 Apr 2026).

By making all tasks open-source, fully procedural, and compatible with standard RL infrastructure, ReasoningGym and its descendants provide the infrastructure for systematic, scalable advances in machine reasoning research, human–AI collaborative problem solving, and critical evaluation of reasoning trace faithfulness and verifiability (Stojanovski et al., 30 May 2025, Yu et al., 23 Apr 2026, Dobler et al., 11 Mar 2026).
