REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards (2505.24760v1)
Abstract: We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
Summary
- The paper introduces an open-source library with over 100 procedurally-generated tasks that enable automatic reward verification for scalable RL training.
- Experiments reveal significant zero-shot performance gaps and a 'Difficulty Cliff' phenomenon, highlighting challenges for current LLMs on complex reasoning tasks.
- RLVR training on these tasks demonstrates robust intra- and cross-domain skill transfer, leading to notable improvements on established benchmarks.
Reasoning Gym (RG) (2505.24760) is introduced as a comprehensive library of procedurally generated reasoning environments specifically designed to address the data scarcity bottleneck for training LLMs with Reinforcement Learning with Verifiable Rewards (RLVR). Unlike traditional reasoning datasets which are fixed, RG provides over 100 algorithmically verifiable tasks across diverse domains like algebra, arithmetic, geometry, algorithms, logic, cognition, games, graphs, and induction. These tasks can generate virtually infinite training data instances with adjustable complexity, enabling continuous evaluation and dynamic curriculum learning.
The core design principles of RG are:
- Algorithmic Verifiability: Every task allows for automatic, unambiguous reward calculation, which is crucial for scalable RLVR training.
- Large Solution Spaces: Tasks are designed to encourage generalizable strategies rather than memorization or overfitting to specific instances.
- Parametric Difficulty Control: Each task includes configurable parameters that systematically control characteristics such as size, constraints, and depth, allowing for fine-grained difficulty adjustment. These parameters fall into Difficulty, Structural, and Stylistic categories. For example, difficulty parameters might control the number of nodes in graph problems or polynomial degrees in algebra, as in the usage sketch below.
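As a concrete illustration of algorithmic verifiability and parametric difficulty control, here is a minimal usage sketch in the style of the library's README. The `create_dataset`/`score_answer` calls, and passing `min_n`/`max_n` as keyword arguments, are assumptions to verify against the open-source repository rather than a definitive API reference.

```python
import reasoning_gym  # open-source RG library

# Procedurally generate task instances; size/seed follow README-style usage,
# and passing the task's difficulty parameters (min_n/max_n for spiral_matrix)
# as keyword arguments is an assumption to check against the repository.
data = reasoning_gym.create_dataset("spiral_matrix", size=5, seed=42, min_n=2, max_n=4)

for entry in data:
    print(entry["question"])  # natural-language prompt
    print(entry["answer"])    # reference answer
    # Algorithmic verifiability: any candidate answer can be scored
    # automatically, which is what makes the signal usable as an RLVR reward.
    assert data.score_answer(answer=entry["answer"], entry=entry) == 1.0
```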
RG includes tasks such as `complex_arithmetic` and `prime_factorization` in mathematics, `spiral_matrix` and `string_manipulation` in algorithms, `arc_1d` and `figlet_font` for pattern recognition, `mini_sudoku` and `rubiks_cube` in games, `shortest_path` for graphs, and `knights_knaves` for logic. Detailed examples of these tasks, including their parameters and input/output formats, are provided in the paper's appendix. For instance, the `spiral_matrix` task takes a matrix as input and requires its elements to be output in spiral order, with parameters like `min_n` and `max_n` controlling matrix size.
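For concreteness, here is an illustrative reference solution to the `spiral_matrix` task itself (our own code, not the library's generator or verifier): reading a matrix clockwise from the outside in.

```python
from typing import List

def spiral_order(matrix: List[List[int]]) -> List[int]:
    """Return the elements of a rectangular matrix in clockwise spiral order."""
    result = []
    if not matrix or not matrix[0]:
        return result
    top, bottom = 0, len(matrix) - 1
    left, right = 0, len(matrix[0]) - 1
    while top <= bottom and left <= right:
        result.extend(matrix[top][col] for col in range(left, right + 1))        # top row
        result.extend(matrix[row][right] for row in range(top + 1, bottom + 1))  # right column
        if top < bottom and left < right:
            result.extend(matrix[bottom][col] for col in range(right - 1, left - 1, -1))  # bottom row
            result.extend(matrix[row][left] for row in range(bottom - 1, top, -1))        # left column
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return result

# A 3x3 instance: the expected spiral order is 1 2 3 6 9 8 7 4 5.
assert spiral_order([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) == [1, 2, 3, 6, 9, 8, 7, 4, 5]
```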
The authors conducted experiments evaluating frontier LLMs' zero-shot performance on RG tasks and using RG for RLVR training.
Zero-Shot Performance:
Evaluation of state-of-the-art LLMs on RG reveals significant challenges, particularly on harder configurations.
- Reasoning-optimized models (like o3-mini, DeepSeek-R1) consistently outperform general-purpose models (like Llama 4 Maverick, Claude 3.5 Sonnet), showing a substantial performance gap (around 22%).
- A "Difficulty Cliff" phenomenon is observed, where performance drops dramatically as task difficulty increases (Figure 4). This is most severe in domains like code generation (-62%), algorithms (-28%), and graphs (-30%).
- Tasks requiring visual-spatial reasoning represented in text (cognition, games) are particularly challenging for all models (Figure 5).
Skill Transfer and Generalization (RLVR Training):
RLVR training on RG tasks demonstrated both intra-domain and cross-domain skill transfer; a sketch of how verifier scores can be wired into GRPO rewards follows the list below.
- Intra-Domain Transfer: Training a model (Qwen2.5-3B-Instruct) using GRPO on a composite of tasks within a reasoning category (e.g., Algorithmic) improves performance on held-out tasks from the same category (Table 1, Figure 6). This was observed across Algebra, Algorithmic, Arithmetic, Cognition, and Games categories, with significant gains even in categories where the baseline model had zero initial performance (e.g., Games: +3.3%).
- Cross-Domain Transfer: Training on tasks from one category improved performance on tasks from different categories (Table 2, Figure 7). For example, models trained on algorithmic tasks showed notable gains in Algebra (+29.1%) and Geometry (+22.3%). Logic training enhanced performance in Cognition (+13.3%) and Graphs (+9.1%). Games training improved performance on Algebra (+21.8%) and Cognition (+13.1%).
- External Benchmark Transfer: Training models (Llama-3.2-3B-Instruct and Qwen2.5-3B-Instruct) on a composite of RG mathematical tasks resulted in improved performance on established benchmarks like GSM8K and MATH (Table 3). Qwen2.5-3B-Instruct showed a particularly strong improvement (+9.7%) on the MATH benchmark.
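The paper runs GRPO through the verl stack for these experiments; as a minimal, framework-agnostic sketch of the verifiable-reward wiring (assuming the `score_answer` interface shown earlier, and using an `Answer:` extraction convention that is our simplification, not the paper's parser), a reward function can simply wrap an RG dataset's verifier:

```python
def make_reward_fn(dataset):
    """Wrap an RG dataset's verifier as a scalar reward for RLVR training.

    Assumes the score_answer interface sketched earlier; the 'Answer:'
    extraction below is our own convention, not the paper's exact logic.
    """
    def reward_fn(completion: str, entry: dict) -> float:
        candidate = completion.rsplit("Answer:", 1)[-1].strip()
        return float(dataset.score_answer(answer=candidate, entry=entry))
    return reward_fn

# Usage sketch with the dataset from the earlier snippet:
# reward = make_reward_fn(data)(model_output, entry)
```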
Curriculum RLVR:
Implementing a simple curriculum (gradually increasing word length) for the `spell_backward` task showed that curriculum learning accelerates training and leads to better final performance compared to fixed-difficulty training (Figure 8).
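The paper describes the curriculum only as gradually increasing word length; the following is a hypothetical controller (class name, window size, and threshold are ours, not the paper's) illustrating how such a schedule could be driven by the recent verified success rate.

```python
from collections import deque

class LengthCurriculum:
    """Hypothetical controller: raise the word-length cap for a
    spell_backward-style task once recent success clears a threshold."""

    def __init__(self, start_len: int = 3, max_len: int = 12,
                 window: int = 200, threshold: float = 0.8):
        self.max_word_len = start_len   # current difficulty parameter
        self.cap = max_len
        self.rewards = deque(maxlen=window)
        self.threshold = threshold

    def record(self, reward: float) -> None:
        """Record a verifier reward (1.0 = correct) and advance difficulty if warranted."""
        self.rewards.append(reward)
        window_full = len(self.rewards) == self.rewards.maxlen
        if window_full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.max_word_len < self.cap:
                self.max_word_len += 1
                self.rewards.clear()  # re-measure success at the new difficulty

# During training, sample spell_backward instances with
# max_word_len = curriculum.max_word_len and call curriculum.record(reward)
# after each verified rollout.
```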
Implementation Details:
The RG library is open-source and available on GitHub. It provides task generators, verifiers, and training infrastructure. The appendix includes detailed configuration parameters for generating easy and hard versions of many tasks, giving practitioners concrete values to replicate experiments or build their own curricula. A sample training configuration using the `verl` library with RG-specific settings is also provided in the appendix (Section A.4). This config shows how to specify the desired datasets, their weights in the training distribution, curriculum settings, reward configurations, and standard RL training parameters.
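The appendix config itself is not reproduced here; the snippet below is a hypothetical illustration, with field names of our own rather than verl's or the appendix's, of the kinds of settings it covers: dataset mixture weights, curriculum, reward options, and standard RL hyperparameters.

```python
# Hypothetical sketch of an RG training config; the field names are ours,
# not the actual keys used by verl or the paper's appendix (Section A.4).
rg_training_config = {
    "datasets": [
        {"name": "spiral_matrix", "weight": 0.5, "config": {"min_n": 2, "max_n": 6}},
        {"name": "string_manipulation", "weight": 0.5},
    ],
    "curriculum": {"enabled": True, "signal": "success_rate", "threshold": 0.8},
    "reward": {"use_verifier_score": True},
    "rl": {"algorithm": "grpo", "rollouts_per_prompt": 8, "learning_rate": 1e-6},
}
```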
Limitations and Future Work:
The paper acknowledges limitations, including the difficulty of creating procedural generators for tasks requiring extensive domain knowledge or creativity, potential gaps in current verification functions compared to human judgment, and the current focus on single-turn, text-based reasoning. Future work could explore multi-turn and multimodal reasoning, investigate continual learning settings to address catastrophic forgetting, and develop better procedural generators and verification mechanisms.
In conclusion, RG offers a scalable, verifiable, and diverse environment for training and evaluating reasoning capabilities in LLMs using RLVR. Its procedural generation capability addresses key limitations of fixed datasets and facilitates research into curriculum learning and skill transfer across different reasoning domains. Practitioners can leverage the open-source library and provided configurations to train and benchmark reasoning models on a wide array of challenges.