
Reasoning Gym: Procedural RL Benchmark

Updated 30 June 2025
  • Reasoning Gym is an open-source library that generates unlimited, verifiable reasoning tasks to train and evaluate reinforcement learning models.
  • It pairs dynamically generated tasks with algorithmic verifiers, ensuring reproducible, objective evaluations in domains such as algebra, logic, and games.
  • Its modular design and curriculum learning options enable continuous skill improvement and cross-domain research in scalable model training.

Reasoning Gym (RG) is a library of procedurally generated reasoning environments designed for reinforcement learning with verifiable rewards. It provides a scalable, modular infrastructure for evaluating and training reasoning models—such as LLMs—across a wide variety of domains. The system uniquely enables infinite data generation with controllable complexity and offers algorithmic verification of solutions, supporting both rigorous benchmarking and scalable RL-based model improvement.

1. Scope and Foundational Innovations

Reasoning Gym addresses limitations of fixed reasoning datasets by procedurally generating an unlimited variety of tasks, each paired with a programmatic verifier for objective scoring. The library provides over 100 generator/verifier pairs spanning domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and a spectrum of games. This structure enables:

  • Continuous and curriculum learning without risk of data memorization.
  • Fine-grained control over task presentation and complexity.
  • Completely automated, reproducible evaluation pipelines.

The integration of algorithmically verifiable rewards—where every answer can be checked by code for correctness—facilitates RL-driven training at scale, eliminating ambiguity in supervision and evaluation.
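To make this concrete, the following minimal sketch generates a handful of tasks and verifies their reference answers programmatically. It assumes the reasoning_gym package's documented create_dataset entry point and per-dataset score_answer method; the dataset name and exact signatures are assumptions and may differ from the installed version.

```python
# Sketch: procedural generation paired with algorithmic verification.
# Assumes the reasoning-gym package; names and signatures may differ.
import reasoning_gym

# Draw five fresh task instances; the seed makes the draw reproducible.
data = reasoning_gym.create_dataset("complex_arithmetic", size=5, seed=42)

for entry in data:
    print(entry["question"])
    # Answers are checked by code rather than string heuristics:
    # the ground-truth answer should receive the maximum score.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```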

2. Domains, Task Generators, and Verifiers

RG’s architecture is built on pairs of generators (which sample unique task instances via parameterized algorithms) and verifiers (which provide binary or graded correctness signals).

Primary domains and representative task types:

| Category | Description | Example Tasks |
|---|---|---|
| Algebra | Symbolic manipulation, variable solving | polynomial_equations, integration |
| Algorithms | Stepwise computational procedures | base_conversion, spell_backward |
| Arithmetic | Numeric operations and puzzles | fraction_simplification, cryptarithm |
| Cognition + ARC | Pattern/analogy, visual-matrix reasoning | arc_1d, figlet_font |
| Code | Simple program interpretation | bf, codeio |
| Games | Logic puzzles, constraint reasoning | sudoku, rubiks_cube, n_queens |
| Geometry | Spatial/coordinate logic | advanced_geometry, simple_geometry |
| Graphs | Structural traversal, search/shortest path | maze, shortest_path |
| Induction | Sequence completion and regularities | modulo_grid, number_sequence |
| Logic | Deduction and proof | knights_knaves, syllogism |

For each domain, procedural generation controls attributes such as size, structure, and difficulty. For example, in complex_arithmetic, the generator can randomize the ranges of real and imaginary parts and the set of allowed operations; the paired verifier then checks answer validity, e.g., the result of a complex subtraction or a completed Sudoku grid.

Example Generator and Verifier (excerpt)

# Parameter ranges sampled by the complex_arithmetic generator
min_real = -10; max_real = 10
min_imag = -10; max_imag = 10
operations = ('+', '-', '*', '/')
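The excerpt above shows only the sampled parameter ranges. Below is a self-contained sketch of what a generator/verifier pair of this kind does; it is an illustration written for this summary, not the library's actual implementation, and all function names are hypothetical.

```python
import random

# Hypothetical generator: sample two complex operands and an operation
# within the configured ranges, and record the exact reference answer.
def generate_task(rng: random.Random) -> dict:
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    a = complex(rng.uniform(-10, 10), rng.uniform(-10, 10))
    b = complex(rng.uniform(-10, 10), rng.uniform(-10, 10))
    op = rng.choice(sorted(ops))
    return {"question": f"Compute ({a}) {op} ({b}).",
            "answer": ops[op](a, b)}

# Hypothetical verifier: binary reward with a small numeric tolerance.
def verify(submitted: complex, reference: complex, tol: float = 1e-6) -> float:
    return 1.0 if abs(submitted - reference) < tol else 0.0

task = generate_task(random.Random(0))
assert verify(task["answer"], task["answer"]) == 1.0
```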

3. Procedural Generation and Curricular Dynamics

Reasoning Gym supports sampling novel tasks on-demand, parameterized by difficulty and structure.

  • Difficulty parameters: e.g., degree of polynomial, board size, or rule complexity.
  • Structural parameters: e.g., number of nodes in a graph, n for n-queens puzzles.
  • Stylistic parameters: e.g., presentation format varied to ensure robust generalization.

This enables curriculum and capability progression studies by increasing challenge as models advance, and allows continuous adaptation to probe emerging model strengths and limitations.

Key procedural features:

  • No ceiling on dataset size.
  • Unrepeated tasks in every epoch, precluding memorization.
  • Adjustable, domain-spanning task design for targeted evaluation.
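A toy curriculum loop, sketched below, shows how such parameters let difficulty escalate as a model improves. Only create_dataset is assumed from the library; the dataset name, the difficulty keys, and the evaluate_model helper are illustrative placeholders.

```python
import random
import reasoning_gym

def evaluate_model(dataset) -> float:
    # Placeholder: run the model on `dataset`, score completions with
    # dataset.score_answer(...), and return mean accuracy. Stubbed here.
    return random.random()

# Illustrative difficulty knobs; real parameter names depend on the dataset.
difficulty = {"min_terms": 2, "max_terms": 3}
for step in range(5):
    data = reasoning_gym.create_dataset("chain_sum", size=64, seed=step, **difficulty)
    if evaluate_model(data) >= 0.70:    # success threshold, as in the Section 5 config
        difficulty["max_terms"] += 1    # longer arithmetic chains next round
```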

4. Evaluation Methodology and Impact

RG supports both static evaluation and RL-based learning paradigms with verifiable, automated scoring:

  • Reward structure: Most tasks are binary-scored by verifiers (correct/incorrect), with auxiliary reward metrics (such as output formatting) allowed in complex domains.

R = \text{accuracy} + 0.2 \cdot \text{format\_reward}

  • Training and curriculum: YAML-like configs govern dataset weighting, parameter intervals, curriculum scheduling (e.g., auto-increase of difficulty), and batching.
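As a concrete reading of the reward formula above, a per-sample reward can be computed as sketched below; this helper is illustrative rather than part of the library.

```python
def composite_reward(accuracy: float, format_ok: bool, scaling: float = 0.2) -> float:
    """R = accuracy + 0.2 * format_reward: `accuracy` is the verifier's score
    (typically 0 or 1) and the format term rewards a well-structured answer."""
    return accuracy + scaling * (1.0 if format_ok else 0.0)

print(composite_reward(1.0, True))   # 1.2: correct answer, correct format
print(composite_reward(0.0, True))   # 0.2: wrong answer, but parsable output
```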

Empirical results:

  • State-of-the-art reasoning models (o3-mini, DeepSeek-R1) outperform generic LLMs by >20% absolute accuracy on hard RG tasks; even these models struggle to reach 50% on visual/spatial games, indicating high challenge and novelty.
  • As task parameters increase in complexity, performance drops sharply (e.g., up to −71.9% for code tasks).
  • RL on RG (intra-domain and cross-domain) yields substantial accuracy gains—for example, algorithm-task RL improves held-out Algebra tasks by up to 12%.
  • Cross-domain RL-trained models show transfer: e.g., training on Algorithms improves Algebra (a 29% gain), and training on Games improves Cognition tasks.
  • Curriculum RL leads to faster, more robust skill acquisition than randomly mixed training.

5. Technical Implementation

Each RG task consists of:

  • Configuration: Ranges for sampling, e.g., minimum and maximum Sudoku grid size.
  • Generator code: Deterministically produces a new unique sample each call.
  • Verifier logic: Returns correctness flag for a submitted solution.

Training config example:

datasets:
  base_conversion:
    weight: 1
  spell_backward:
    weight: 1
    config:
      min_word_len: 3
      max_word_len: 10
curriculum:
  enabled: True
  schedule:
    automatic: True
    update_steps: 30
    success_threshold: 0.70
reward:
  use_accuracy: True
  secondary_rewards:
    - name: format
      scaling_factor: 0.2
Reward logic is always grounded in explicit verifier checks:

\texttt{Verifier}(S) = \begin{cases} 1 & \text{if } S = S^* \\ 0 & \text{otherwise} \end{cases}

where S^* denotes the reference solution.
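To illustrate how a config of this shape could drive data generation, the sketch below parses it with PyYAML and samples tasks according to the dataset weights. This loader is a hypothetical illustration, not the library's training harness; only create_dataset and the iteration behavior shown in the earlier sketch are assumed from Reasoning Gym, and the way per-dataset config overrides are forwarded is an assumption.

```python
import random
import yaml                      # PyYAML
import reasoning_gym

with open("train_config.yaml") as f:     # the YAML shown above, saved to disk
    cfg = yaml.safe_load(f)

names = list(cfg["datasets"])
weights = [cfg["datasets"][n].get("weight", 1) for n in names]

# One dataset per entry, forwarding any per-dataset config overrides
# (e.g., min_word_len/max_word_len for spell_backward).
datasets = {
    n: reasoning_gym.create_dataset(n, size=1000, seed=0,
                                    **cfg["datasets"][n].get("config", {}))
    for n in names
}
iters = {n: iter(datasets[n]) for n in names}

# Weighted sampling across datasets, as the `weight` fields specify.
for _ in range(4):
    name = random.choices(names, weights=weights, k=1)[0]
    entry = next(iters[name])
    print(name, "->", entry["question"][:60])
```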

6. Implications and Future Research Directions

The design of Reasoning Gym has significant implications for the field:

  • Open-ended RLVR evaluation and training: Enables objective benchmarking and continuous curriculum training without reliance on scraped or pre-crafted internet corpora.
  • Fine-grained diagnostic utility: Researchers can probe specific subskills and capability boundaries, adapting data generation to highlight emergent behaviors or failure modes.
  • Scalable, transferable RL: Training on RG tasks yields gains on established math (e.g., MATH, GSM8K) benchmarks, supporting the utility of verifiable, procedurally generated reasoning data.
  • Catalyst for new RL and continual learning research: RG’s flexible infrastructure supports research on model merging, replay, lifelong learning, and robust RL reward design.
  • Planned extensions: The roadmap includes multi-turn and multimodal task support (e.g., vision-language games), non-stationary evaluation streams, human feedback integration, and harder, more open-ended creative domains.

7. Summary Table: RG Features

| Aspect | Details |
|---|---|
| Procedural Generation | Infinite, unique tasks via parameterized generators |
| Verifiable Rewards | Each task paired with an algorithmic correctness checker |
| Domain Breadth | Algebra, arithmetic, logic, computation, games, etc. |
| Curriculum Support | Difficulty/structure adjustment for progressive training |
| RL & Evaluation | Supports RL, zero-shot eval, cross-domain/curriculum studies |
| Open-source Infrastructure | Training, configs, and eval fully reproducible |

Reasoning Gym constitutes an open-ended, modular, and algorithmically robust environment for both training and evaluating reasoning models with reinforcement learning. Its procedural foundation enables scalable benchmarking, mitigates memorization, and supports continual advances in the evaluation and improvement of general reasoning agents.