
Reasoning Gym: Procedural RL Benchmark

Updated 30 June 2025
  • Reasoning Gym is an open-source library that generates unlimited, verifiable reasoning tasks to train and evaluate reinforcement learning models.
  • It pairs dynamically generated tasks with algorithmic verifiers, ensuring reproducible, objective evaluations in domains such as algebra, logic, and games.
  • Its modular design and curriculum learning options enable continuous skill improvement and cross-domain research in scalable model training.

Reasoning Gym (RG) is a library of procedurally generated reasoning environments designed for reinforcement learning with verifiable rewards. It provides a scalable, modular infrastructure for evaluating and training reasoning models—such as LLMs—across a wide variety of domains. The system uniquely enables infinite data generation with controllable complexity and offers algorithmic verification of solutions, supporting both rigorous benchmarking and scalable RL-based model improvement.

1. Scope and Foundational Innovations

Reasoning Gym addresses limitations of fixed reasoning datasets by procedurally generating an unlimited variety of tasks, each paired with a programmatic verifier for objective scoring. The library provides over 100 generator/verifier pairs spanning domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and a spectrum of games. This structure enables:

  • Continuous and curriculum learning without risk of data memorization.
  • Fine-grained control over task presentation and complexity.
  • Completely automated, reproducible evaluation pipelines.

The integration of algorithmically verifiable rewards—where every answer can be checked by code for correctness—facilitates RL-driven training at scale, eliminating ambiguity in supervision and evaluation.
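To make this concrete, the following minimal sketch generates a handful of tasks and verifies their reference answers programmatically. It assumes the reasoning_gym package's documented create_dataset entry point and per-dataset score_answer method; the dataset name and exact signatures are assumptions and may differ from the installed version.

```python
# Sketch: procedural generation paired with algorithmic verification.
# Assumes the reasoning-gym package; names and signatures may differ.
import reasoning_gym

# Draw five fresh task instances; the seed makes the draw reproducible.
data = reasoning_gym.create_dataset("complex_arithmetic", size=5, seed=42)

for entry in data:
    print(entry["question"])
    # Answers are checked by code rather than string heuristics:
    # the ground-truth answer should receive the maximum score.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```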

2. Domains, Task Generators, and Verifiers

RG’s architecture is built on pairs of generators (which sample unique task instances via parameterized algorithms) and verifiers (which provide binary or graded correctness signals).

Primary domains and representative task types:

| Category | Description | Example Tasks |
|---|---|---|
| Algebra | Symbolic manipulation, variable solving | polynomial_equations, integration |
| Algorithms | Stepwise computational procedures | base_conversion, spell_backward |
| Arithmetic | Numeric operations and puzzles | fraction_simplification, cryptarithm |
| Cognition + ARC | Pattern/analogy, visual-matrix reasoning | arc_1d, figlet_font |
| Code | Simple program interpretation | bf, codeio |
| Games | Logic puzzles, constraint reasoning | sudoku, rubiks_cube, n_queens |
| Geometry | Spatial/coordinate logic | advanced_geometry, simple_geometry |
| Graphs | Structural traversal, search/shortest path | maze, shortest_path |
| Induction | Sequence completion and regularities | modulo_grid, number_sequence |
| Logic | Deduction and proof | knights_knaves, syllogism |

For each domain, procedural generation controls attributes such as size, structure, and difficulty. For example, in complex_arithmetic, the generator can randomize the ranges of real and imaginary parts and the set of allowed operations; the paired verifier then checks answer validity, e.g., the result of a complex subtraction or a completed Sudoku grid.

Example Generator and Verifier (excerpt)

# Parameter ranges sampled by the complex_arithmetic generator
min_real = -10; max_real = 10
min_imag = -10; max_imag = 10
operations = ('+', '-', '*', '/')
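The excerpt above shows only the sampled parameter ranges. Below is a self-contained sketch of what a generator/verifier pair of this kind does; it is an illustration written for this summary, not the library's actual implementation, and all function names are hypothetical.

```python
import random

# Hypothetical generator: sample two complex operands and an operation
# within the configured ranges, and record the exact reference answer.
def generate_task(rng: random.Random) -> dict:
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
           "*": lambda a, b: a * b, "/": lambda a, b: a / b}
    a = complex(rng.uniform(-10, 10), rng.uniform(-10, 10))
    b = complex(rng.uniform(-10, 10), rng.uniform(-10, 10))
    op = rng.choice(sorted(ops))
    return {"question": f"Compute ({a}) {op} ({b}).",
            "answer": ops[op](a, b)}

# Hypothetical verifier: binary reward with a small numeric tolerance.
def verify(submitted: complex, reference: complex, tol: float = 1e-6) -> float:
    return 1.0 if abs(submitted - reference) < tol else 0.0

task = generate_task(random.Random(0))
assert verify(task["answer"], task["answer"]) == 1.0
```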

3. Procedural Generation and Curricular Dynamics

Reasoning Gym supports sampling novel tasks on-demand, parameterized by difficulty and structure.

  • Difficulty parameters: e.g., degree of polynomial, board size, or rule complexity.
  • Structural parameters: e.g., number of nodes in a graph, n for n-queens puzzles.
  • Stylistic parameters: e.g., presentation format varied to ensure robust generalization.

This enables curriculum and capability progression studies by increasing challenge as models advance, and allows continuous adaptation to probe emerging model strengths and limitations.

Key procedural features:

  • No ceiling on dataset size.
  • Unrepeated tasks in every epoch, precluding memorization.
  • Adjustable, domain-spanning task design for targeted evaluation.
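A toy curriculum loop, sketched below, shows how such parameters let difficulty escalate as a model improves. Only create_dataset is assumed from the library; the dataset name, the difficulty keys, and the evaluate_model helper are illustrative placeholders.

```python
import random
import reasoning_gym

def evaluate_model(dataset) -> float:
    # Placeholder: run the model on `dataset`, score completions with
    # dataset.score_answer(...), and return mean accuracy. Stubbed here.
    return random.random()

# Illustrative difficulty knobs; real parameter names depend on the dataset.
difficulty = {"min_terms": 2, "max_terms": 3}
for step in range(5):
    data = reasoning_gym.create_dataset("chain_sum", size=64, seed=step, **difficulty)
    if evaluate_model(data) >= 0.70:    # success threshold, as in the Section 5 config
        difficulty["max_terms"] += 1    # longer arithmetic chains next round
```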

4. Evaluation Methodology and Impact

RG supports both static evaluation and RL-based learning paradigms with verifiable, automated scoring:

  • Reward structure: Most tasks are binary-scored by verifiers (correct/incorrect), with auxiliary reward metrics (such as output formatting) allowed in complex domains.

R = \text{accuracy} + 0.2 \cdot \text{format\_reward}

  • Training and curriculum: YAML-like configs govern dataset weighting, parameter intervals, curriculum scheduling (e.g., auto-increase of difficulty), and batching.
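As a concrete reading of the reward formula above, a per-sample reward can be computed as sketched below; this helper is illustrative rather than part of the library.

```python
def composite_reward(accuracy: float, format_ok: bool, scaling: float = 0.2) -> float:
    """R = accuracy + 0.2 * format_reward: `accuracy` is the verifier's score
    (typically 0 or 1) and the format term rewards a well-structured answer."""
    return accuracy + scaling * (1.0 if format_ok else 0.0)

print(composite_reward(1.0, True))   # 1.2: correct answer, correct format
print(composite_reward(0.0, True))   # 0.2: wrong answer, but parsable output
```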

Empirical results:

  • State-of-the-art reasoning models (o3-mini, DeepSeek-R1) outperform generic LLMs by >20% absolute accuracy on hard RG tasks; even these models struggle to reach 50% on visual/spatial games, indicating high challenge and novelty.
  • As task parameters increase in complexity, performance drops sharply (e.g., up to −71.9% for code tasks).
  • RL on RG (intra-domain and cross-domain) yields substantial accuracy gains—for example, algorithm-task RL improves held-out Algebra tasks by up to 12%.
  • Cross-domain RL-trained models show transfer: e.g., training on Algorithms improves Algebra (a 29% gain), and training on Games improves Cognition tasks.
  • Curriculum RL leads to faster, more robust skill acquisition than randomly mixed training.

5. Technical Implementation

Each RG task consists of:

  • Configuration: Ranges for sampling, e.g., minimum and maximum Sudoku grid size.
  • Generator code: Deterministically produces a new unique sample each call.
  • Verifier logic: Returns correctness flag for a submitted solution.

Training config example:

datasets:
  base_conversion:
    weight: 1
  spell_backward:
    weight: 1
    config:
      min_word_len: 3
      max_word_len: 10
curriculum:
  enabled: True
  schedule:
    automatic: True
    update_steps: 30
    success_threshold: 0.70
reward:
  use_accuracy: True
  secondary_rewards:
    - name: format
      scaling_factor: 0.2
Reward logic is always grounded in explicit verifier checks:

\texttt{Verifier}(S) = \begin{cases} 1 & \text{if } S = S^* \\ 0 & \text{otherwise} \end{cases}

where S^* denotes the reference solution.
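To illustrate how a config of this shape could drive data generation, the sketch below parses it with PyYAML and samples tasks according to the dataset weights. This loader is a hypothetical illustration, not the library's training harness; only create_dataset and the iteration behavior shown in the earlier sketch are assumed from Reasoning Gym, and the way per-dataset config overrides are forwarded is an assumption.

```python
import random
import yaml                      # PyYAML
import reasoning_gym

with open("train_config.yaml") as f:     # the YAML shown above, saved to disk
    cfg = yaml.safe_load(f)

names = list(cfg["datasets"])
weights = [cfg["datasets"][n].get("weight", 1) for n in names]

# One dataset per entry, forwarding any per-dataset config overrides
# (e.g., min_word_len/max_word_len for spell_backward).
datasets = {
    n: reasoning_gym.create_dataset(n, size=1000, seed=0,
                                    **cfg["datasets"][n].get("config", {}))
    for n in names
}
iters = {n: iter(datasets[n]) for n in names}

# Weighted sampling across datasets, as the `weight` fields specify.
for _ in range(4):
    name = random.choices(names, weights=weights, k=1)[0]
    entry = next(iters[name])
    print(name, "->", entry["question"][:60])
```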

6. Implications and Future Research Directions

The design of Reasoning Gym has significant implications for the field:

  • Open-ended RLVR evaluation and training: Enables objective benchmarking and continuous curriculum training without reliance on scraped or pre-crafted internet corpora.
  • Fine-grained diagnostic utility: Researchers can probe specific subskills and capability boundaries, adapting data generation to highlight emergent behaviors or failure modes.
  • Scalable, transferable RL: Training on RG tasks yields gains on established math (e.g., MATH, GSM8K) benchmarks, supporting the utility of verifiable, procedurally generated reasoning data.
  • Catalyst for new RL and continual learning research: RG’s flexible infrastructure supports research on model merging, replay, lifelong learning, and robust RL reward design.
  • Planned extensions: The roadmap includes multi-turn and multimodal task support (e.g., vision-language games), non-stationary evaluation streams, human feedback integration, and harder, more open-ended creative domains.

7. Summary Table: RG Features

| Aspect | Details |
|---|---|
| Procedural Generation | Infinite, unique tasks via parameterized generators |
| Verifiable Rewards | Each task paired with an algorithmic correctness checker |
| Domain Breadth | Algebra, arithmetic, logic, computation, games, etc. |
| Curriculum Support | Difficulty/structure adjustment for progressive training |
| RL & Evaluation | Supports RL, zero-shot eval, cross-domain/curriculum studies |
| Open-source Infrastructure | Training, configs, and eval fully reproducible |

Reasoning Gym constitutes an open-ended, modular, and algorithmically robust environment for both training and evaluating reasoning models with reinforcement learning. Its procedural foundation enables scalable benchmarking, mitigates memorization, and supports continual advances in the evaluation and improvement of general reasoning agents.