Reasoning Gym: Procedural RL Benchmark
- Reasoning Gym is an open-source library that generates unlimited, verifiable reasoning tasks for training and evaluating reasoning models with reinforcement learning.
- It pairs dynamically generated tasks with algorithmic verifiers, ensuring reproducible, objective evaluations in domains such as algebra, logic, and games.
- Its modular design and curriculum learning options enable continuous skill improvement and cross-domain research in scalable model training.
Reasoning Gym (RG) is a library of procedurally generated reasoning environments designed for reinforcement learning with verifiable rewards. It provides a scalable, modular infrastructure for evaluating and training reasoning models, such as LLMs, across a wide variety of domains. The system enables effectively unlimited data generation with controllable complexity and offers algorithmic verification of solutions, supporting both rigorous benchmarking and scalable RL-based model improvement.
1. Scope and Foundational Innovations
Reasoning Gym addresses the limitations of fixed reasoning datasets by procedurally generating an unlimited variety of tasks, each paired with a programmatic verifier for objective scoring. The library ships over 100 task generators and verifiers, spanning domains such as algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and games. This structure enables:
- Continuous and curriculum learning without risk of data memorization.
- Fine-grained control over task presentation and complexity.
- Completely automated, reproducible evaluation pipelines.
Because every answer can be checked by code for correctness, algorithmically verifiable rewards facilitate RL-driven training at scale, eliminating ambiguity in supervision and evaluation.
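In practice, this pairing is exposed through a small API. The sketch below follows the usage shown in the project README (create a named dataset, read question/answer pairs, and score submissions with the dataset's own verifier); names should be checked against the installed version:

```python
import reasoning_gym

# Create 10 procedurally generated, seeded instances of one task type.
data = reasoning_gym.create_dataset('leg_counting', size=10, seed=42)
for i, x in enumerate(data):
    print(f'{i}: q="{x["question"]}", a="{x["answer"]}"')
    # Algorithmic verification: the dataset scores a submitted answer.
    assert data.score_answer(answer=x['answer'], entry=x) == 1.0
```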
2. Domains, Task Generators, and Verifiers
RG’s architecture is built on pairs of generators (which sample unique task instances via parameterized algorithms) and verifiers (which provide binary or graded correctness signals).
Primary domains and representative task types:
| Category | Description | Example Tasks |
|---|---|---|
| Algebra | Symbolic manipulation, variable solving | `polynomial_equations`, `integration` |
| Algorithms | Step-by-step computational procedures | `base_conversion`, `spell_backward` |
| Arithmetic | Numeric operations and puzzles | `fraction_simplification`, `cryptarithm` |
| Cognition + ARC | Pattern/analogy, visual-matrix reasoning | `arc_1d`, `figlet_font` |
| Code | Simple program interpretation | `bf`, `codeio` |
| Games | Logic puzzles, constraint reasoning | `sudoku`, `rubiks_cube`, `n_queens` |
| Geometry | Spatial/coordinate logic | `advanced_geometry`, `simple_geometry` |
| Graphs | Structural traversal, search/shortest path | `maze`, `shortest_path` |
| Induction | Sequence completion and regularities | `modulo_grid`, `number_sequence` |
| Logic | Deduction and proof | `knights_knaves`, `syllogism` |
For each domain, procedural generation controls attributes such as size, structure, and difficulty. In complex_arithmetic, for example, the generator can randomize the ranges of the real and imaginary parts and the set of permitted operations. Each task's verifier then checks answer validity, e.g., the result of a complex subtraction or a completed Sudoku grid.
Example generator configuration (excerpt):

```python
# Sampling ranges for the complex_arithmetic generator
min_real = -10; max_real = 10
min_imag = -10; max_imag = 10
operations = ('+', '-', '*', '/')
```
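As a concrete illustration of the verifier side, the sketch below scores a submitted complex-number answer against the expected value. The function name, parsing, and tolerance are illustrative assumptions, not RG's actual implementation:

```python
# Illustrative verifier sketch for a complex_arithmetic task.
# Name, parsing, and tolerance are assumptions, not RG's actual code.
def verify_complex_answer(submitted: str, expected: complex, tol: float = 1e-6) -> bool:
    """Return True if a submission like '3.0 - 2.0i' matches the expected value."""
    try:
        # Python's complex() expects 'j'; accept the mathematical 'i' as well.
        value = complex(submitted.replace(' ', '').replace('i', 'j'))
    except ValueError:
        return False  # unparseable submissions score as incorrect
    return abs(value - expected) <= tol
```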
3. Procedural Generation and Curricular Dynamics
Reasoning Gym supports sampling novel tasks on-demand, parameterized by difficulty and structure.
- Difficulty parameters: e.g., degree of polynomial, board size, or rule complexity.
- Structural parameters: e.g., number of nodes in a graph, n for n-queens puzzles.
- Stylistic parameters: e.g., presentation format is varied to encourage robust generalization.
This enables curriculum and capability progression studies by increasing challenge as models advance, and allows continuous adaptation to probe emerging model strengths and limitations.
Key procedural features:
- No ceiling on dataset size.
- Unrepeated tasks in every epoch, precluding memorization.
- Adjustable, domain-spanning task design for targeted evaluation.
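To make these knobs concrete, the sketch below samples one generator at rising structural complexity. It assumes config fields are forwarded as keyword arguments to `create_dataset`; the `chain_sum` parameter names follow its generator config but should be verified against the installed version:

```python
import reasoning_gym

# Hypothetical difficulty ladder: one generator sampled at increasing
# structural complexity. Assumes config fields (min_terms/max_terms,
# min_digits/max_digits) are forwarded as keyword arguments.
for stage, (max_terms, max_digits) in enumerate([(3, 1), (5, 2), (8, 3)]):
    data = reasoning_gym.create_dataset(
        'chain_sum', size=100, seed=stage,
        min_terms=2, max_terms=max_terms,
        min_digits=1, max_digits=max_digits,
    )
    print(f'stage {stage}: {data[0]["question"]}')
```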
4. Evaluation Methodology and Impact
RG supports both static evaluation and RL-based learning paradigms with verifiable, automated scoring:
- Reward structure: Most tasks are binary-scored by verifiers (correct/incorrect), with auxiliary reward metrics (such as output formatting) allowed in complex domains.
- Training and curriculum: YAML configs govern dataset weighting, parameter intervals, curriculum scheduling (e.g., automatic difficulty increases), and batching.
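A minimal sketch of how these two mechanisms might compose, with hypothetical function names (RG's trainer internals may differ):

```python
# Hypothetical composite reward and curriculum trigger; names are illustrative.
def composite_reward(correct: bool, well_formatted: bool,
                     format_scaling: float = 0.2) -> float:
    """Binary accuracy from the verifier plus a scaled formatting bonus."""
    return (1.0 if correct else 0.0) + (format_scaling if well_formatted else 0.0)

def should_increase_difficulty(recent_accuracy: float,
                               success_threshold: float = 0.70) -> bool:
    """Advance the curriculum once accuracy over the evaluation window
    (e.g., the last 30 update steps) clears the configured threshold."""
    return recent_accuracy >= success_threshold
```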
Empirical results:
- State-of-the-art reasoning models (o3-mini, DeepSeek-R1) outperform generic LLMs by >20% absolute accuracy on hard RG tasks; even these models struggle to reach 50% on visual/spatial games, indicating high challenge and novelty.
- As task parameters increase in complexity, performance drops sharply (e.g., up to −71.9% for code tasks).
- RL on RG (intra-domain and cross-domain) yields substantial accuracy gains—for example, algorithm-task RL improves held-out Algebra tasks by up to 12%.
- Cross-domain RL-trained models show transfer: e.g., Algorithms → Algebra (29% gain), Games → Cognition.
- Curriculum RL leads to faster, more robust skill acquisition than randomly mixed training.
5. Technical Implementation
Each RG task consists of:
- Configuration: Ranges for sampling, e.g., minimum and maximum Sudoku grid size.
- Generator code: Deterministically produces a new unique sample each call.
- Verifier logic: Returns correctness flag for a submitted solution.
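A minimal, self-contained sketch of this config/generator/verifier triple; class and method names are illustrative and much simpler than RG's actual base classes:

```python
import random
from dataclasses import dataclass

# Illustrative config/generator/verifier triple; names are not RG's actual API.
@dataclass
class SumTaskConfig:
    min_terms: int = 2
    max_terms: int = 5
    seed: int = 42

class SumTaskDataset:
    def __init__(self, config: SumTaskConfig):
        self.config = config

    def __getitem__(self, idx: int) -> dict:
        # Seeded per-index RNG: the same (seed, idx) always yields the same task.
        rng = random.Random(self.config.seed + idx)
        n_terms = rng.randint(self.config.min_terms, self.config.max_terms)
        terms = [rng.randint(0, 9) for _ in range(n_terms)]
        return {'question': ' + '.join(map(str, terms)) + ' = ?',
                'answer': str(sum(terms))}

    def score_answer(self, answer: str, entry: dict) -> float:
        # Verifier: exact string match yields a binary reward.
        return 1.0 if answer.strip() == entry['answer'] else 0.0
```

For instance, `SumTaskDataset(SumTaskConfig())[0]` always yields the same addition task for a given seed, while changing the seed or index produces a fresh one.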
Training config example:
```yaml
datasets:
  base_conversion:
    weight: 1
  spell_backward:
    weight: 1
    config:
      min_word_len: 3
      max_word_len: 10
curriculum:
  enabled: True
  schedule:
    automatic: True
    update_steps: 30
    success_threshold: 0.70
reward:
  use_accuracy: True
  secondary_rewards:
    - name: format
      scaling_factor: 0.2
```
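In this example, spell_backward draws words of 3 to 10 characters, the curriculum automatically raises difficulty whenever the success rate evaluated every 30 update steps reaches 0.70, and the final reward adds a format bonus scaled by 0.2 to the binary accuracy signal.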
6. Implications and Future Research Directions
The design of Reasoning Gym has significant implications for the field:
- Open-ended RLVR evaluation and training: Enables objective benchmarking and continuous curriculum training without reliance on scraped or pre-crafted internet corpora.
- Fine-grained diagnostic utility: Researchers can probe specific subskills and capability boundaries, adapting data generation to highlight emergent behaviors or failure modes.
- Scalable, transferable RL: Training on RG tasks yields gains on established math benchmarks (e.g., MATH, GSM8K), supporting the utility of verifiable, procedurally generated reasoning data.
- Catalyst for new RL and continual learning research: RG’s flexible infrastructure supports research on model merging, replay, lifelong learning, and robust RL reward design.
- Planned extensions: The roadmap includes multi-turn and multimodal task support (e.g., vision-language games), non-stationary evaluation streams, human feedback integration, and harder, more open-ended creative domains.
7. Summary Table: RG Features
| Aspect | Details |
|---|---|
| Procedural Generation | Infinite, unique tasks via parameterized generators |
| Verifiable Rewards | Each task paired with an algorithmic correctness checker |
| Domain Breadth | Algebra, arithmetic, logic, computation, games, etc. |
| Curriculum Support | Difficulty/structure adjustment for progressive training |
| RL & Evaluation | Supports RL training, zero-shot evaluation, cross-domain/curriculum studies |
| Open-source Infrastructure | Training code, configs, and evaluation fully reproducible |
Reasoning Gym constitutes an open-ended, modular, and algorithmically robust environment for both training and evaluating reasoning models with reinforcement learning. Its procedural foundation enables scalable benchmarking, mitigates memorization, and supports continual advances in the evaluation and improvement of general reasoning agents.