
RLVE-Gym: Adaptive RL for Language Models

Updated 13 November 2025
  • RLVE-Gym is a suite of procedurally generated, verifiable environments that dynamically adjust task difficulty for RL fine-tuning of language models.
  • It employs adaptive difficulty scheduling and deterministic verifiers to generate high-information training signals across diverse reasoning tasks.
  • Empirical results show significant improvements in generalization and compute efficiency compared to static RL data regimes.

RLVE-Gym is a large-scale suite of verifiable, procedurally generated environments engineered to support and scale reinforcement learning (RL) fine-tuning of language models (LMs) on structured reasoning tasks. Each RLVE-Gym environment algorithmically verifies agent outputs and adaptively regulates problem difficulty according to the policy's learning progress. In contrast to static-corpus approaches, the framework enforces curriculum adaptation and sustains high-information learning signals across extended training, delineating a new regime for RL with LMs.

1. Motivation and Theoretical Underpinnings

Traditional RL fine-tuning of LMs on static datasets exhibits two critical deficiencies: after initial progress, easy examples saturate and yield no gradient information due to uniformly high reward, while hard examples beyond the agent's current capacity produce near-uniformly low reward, also failing to yield meaningful updates. RLVE-Gym operationalizes the RL with Adaptive Verifiable Environments (RLVE) paradigm by substituting a fixed distribution with hundreds of distinct, verifiable environments; each can generate families of tasks parametrized by a difficulty variable $d$. At every training step, the environment dynamically selects tasks "on the boundary" of the model's current ability. Verifiers, implemented as code, produce rewards $R_p(o) \in [-1, 1]$ by checking the agent's output $o$ against canonical solutions or constraints.
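To make the saturation failure concrete, the following minimal Python sketch (assuming a group-relative, GRPO/DAPO-style advantage baseline, which the surrounding text does not spell out) shows why a prompt whose rollouts all receive the same verifier reward contributes no policy-gradient signal, while mixed outcomes near the competence boundary do:

```python
import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each rollout relative to the group-mean reward.
    Illustrative only: GRPO/DAPO-style estimators use a per-prompt baseline
    of this kind; the exact normalization used with RLVE may differ."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Saturated (too-easy) prompt: every rollout is verified correct.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))      # [0. 0. 0. 0.] -> no gradient signal

# Too-hard prompt: every rollout fails.
print(group_relative_advantages([-1.0, -1.0, -1.0, -1.0]))  # [0. 0. 0. 0.] -> no gradient signal

# Frontier-difficulty prompt: mixed outcomes carry information.
print(group_relative_advantages([1.0, -1.0, 1.0, -1.0]))    # [ 1. -1.  1. -1.]
```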

2. Environment Construction and Problem Generation

The suite comprises 400 manually engineered environments organized around six primary families:

  • Programming-contest problems (e.g., evaluating permutation properties under sorting)
  • Symbolic mathematics (e.g., computing integrals)
  • Optimization (e.g., minimizing polynomial functions)
  • Algorithmic tasks (e.g., sorting arrays)
  • Logical and combinatorial puzzles (e.g., Sudoku)
  • NP-complete structures (e.g., Hamiltonian path finding in graphs)

Each environment $E = (I, P, R)$ is defined by an input template $I$, a generator $P_d$ producing problem instances as a function of difficulty $d$, and a deterministic verifier $R_p(o)$ that assigns scalar rewards. The environments are constructed so that solving a problem of difficulty $d+1$ strictly subsumes the ability to solve all problems of lower difficulty, which allows for curriculum progression. Verification is typically achieved via constraint checking or by running a canonical solver, mapping the degree of solution correctness to a real-valued reward, e.g., via $(\text{correct}/\text{total})^k$.
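As a concrete illustration of the $E = (I, P, R)$ decomposition, the sketch below implements a toy sorting environment in Python. The class and method names are ours rather than the RLVE-Gym API, and the rescaling of the $(\text{correct}/\text{total})^k$ partial credit into $[-1, 1]$ is an assumption about how the reward shaping and reward range combine:

```python
import random
from dataclasses import dataclass

@dataclass
class SortingEnv:
    """Toy E = (I, P, R) environment: difficulty d controls problem size,
    and a deterministic verifier maps partial correctness into [-1, 1].
    Names and details are illustrative, not the RLVE-Gym interface."""
    k: int = 2  # exponent in the (correct/total)**k partial-credit shaping

    def generate(self, d: int, rng: random.Random):
        """P_d: sample a problem instance whose size and value range grow with d."""
        values = [rng.randint(-10 * d, 10 * d) for _ in range(2 * d + 3)]
        prompt = f"Sort the following list in ascending order: {values}"
        return prompt, values  # (input template I filled in, hidden instance)

    def verify(self, values, output) -> float:
        """R_p(o): deterministic check of the agent's output against the
        canonical solution, with graded partial credit."""
        target = sorted(values)
        if len(output) != len(target):
            return -1.0
        correct = sum(o == t for o, t in zip(output, target))
        frac = (correct / len(target)) ** self.k
        return 2.0 * frac - 1.0  # rescale [0, 1] credit into [-1, 1] (assumption)

env, rng = SortingEnv(), random.Random(0)
prompt, values = env.generate(d=2, rng=rng)
print(prompt)
print(env.verify(values, sorted(values)))  # canonical answer -> reward 1.0
```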

3. Adaptive Difficulty Scheduling

A core mechanism implemented by RLVE-Gym is adaptive adjustment of problem difficulty for each individual environment. For every environment $i$, RLVE-Gym maintains a difficulty window $[\ell^{(i)}_\pi, h^{(i)}_\pi]$ specific to the active policy $\pi$. The upper bound $h^{(i)}_\pi$ increments after the policy attains accuracy $\geq \tau_{\text{acc}}$ on problems at difficulty $h^{(i)}_\pi$ across $\tau_{\text{num}}$ rollouts. The window's width is bounded above by $d_\Delta$ to ensure ongoing exposure to tasks near the current competence frontier:

$$\ell^{(i)}_\pi \leftarrow \max\!\left(\ell^{(i)}_\pi,\; h^{(i)}_\pi - d_{\Delta} + 1\right)$$

At each RL rollout, the system performs:

  • Uniformly sample an environment $i$.
  • Sample $d$ uniformly over $[\ell^{(i)}_\pi, h^{(i)}_\pi]$.
  • Generate $p \sim P^{(i)}_d$, then present $I_p$ to the agent.

The effective sampling distribution at time $t$ over environment $i$ and difficulty $d$, with $n$ environments in total, is:

$$p_t(i, d) = \begin{cases} \dfrac{1}{n} \cdot \dfrac{1}{h^{(i)}_\pi - \ell^{(i)}_\pi + 1} & d \in [\ell^{(i)}_\pi, h^{(i)}_\pi] \\ 0 & \text{otherwise} \end{cases}$$

Recommended hyperparameters are $\tau_{\text{acc}} = 0.9$, $\tau_{\text{num}} = 8 \times (\text{rollouts per prompt})$, and $d_\Delta = 4$.
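A minimal Python sketch of this scheduler is given below. It follows the window-update and sampling rules stated above; details such as the frontier rollout buffer, its reset on each increment, and the default $\tau_{\text{num}} = 128$ (i.e., $8 \times 16$ rollouts per prompt) are our assumptions rather than the reference implementation:

```python
import random
from collections import deque

class DifficultyWindow:
    """Per-environment difficulty window [lo, hi] under the rules above.
    Buffer handling and defaults are assumptions, not RLVE-Gym's exact code."""

    def __init__(self, tau_acc=0.9, tau_num=128, d_delta=4):
        self.lo, self.hi = 1, 1
        self.tau_acc, self.tau_num, self.d_delta = tau_acc, tau_num, d_delta
        self.frontier_outcomes = deque(maxlen=tau_num)  # 1/0 outcomes at d == hi

    def sample_difficulty(self, rng: random.Random) -> int:
        # d ~ Uniform[lo, hi]
        return rng.randint(self.lo, self.hi)

    def record(self, d: int, solved: bool):
        """Track accuracy at the frontier difficulty and advance the window."""
        if d != self.hi:
            return
        self.frontier_outcomes.append(1 if solved else 0)
        if (len(self.frontier_outcomes) == self.tau_num
                and sum(self.frontier_outcomes) / self.tau_num >= self.tau_acc):
            self.hi += 1
            self.lo = max(self.lo, self.hi - self.d_delta + 1)  # cap width at d_delta
            self.frontier_outcomes.clear()

def sample_task(windows, rng: random.Random):
    """p_t(i, d): pick an environment uniformly, then a difficulty uniformly
    from that environment's current window."""
    i = rng.randrange(len(windows))
    return i, windows[i].sample_difficulty(rng)
```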

4. Reward Verification and Batch Training Regimen

For every agent-proposed output $o$, the environment's verifier $R_p$ deterministically checks solution quality, e.g., via code execution for algorithms, constraint satisfaction for puzzles, or comparison against canonical solutions for math. Scalar rewards lie in the $[-1, 1]$ interval, with partial credit for near-correct outputs when appropriate. Training employs DAPO, a PPO-inspired optimizer with "dynamic sampling": batches are oversampled (~384 prompts to fill a batch of 128) and pruned by discarding prompts on which all 16 rollouts yield identical reward. Each batch involves 16 rollouts per prompt, ensuring exposure to a variety of difficulty levels and problem types. Joint training uniformly interleaves the 400 environments, adaptively sampling difficulty for each.
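The pruning step of dynamic sampling can be sketched as follows (a simplified illustration, not DAPO's reference code; `collect_rollouts` in the usage comment is a hypothetical helper):

```python
def prune_uninformative(prompt_rollout_rewards, batch_size=128):
    """Keep only prompts whose rollout rewards are not all identical, then
    truncate to the target batch size. prompt_rollout_rewards is a list of
    (prompt_id, [reward per rollout]) pairs."""
    informative = [
        (pid, rewards)
        for pid, rewards in prompt_rollout_rewards
        if len(set(rewards)) > 1  # identical rewards -> zero group-relative advantage
    ]
    return informative[:batch_size]

# Usage (collect_rollouts is hypothetical): oversample ~384 prompts with
# 16 rollouts each, then keep up to 128 informative ones.
# batch = prune_uninformative(collect_rollouts(n_prompts=384, n_rollouts=16))
```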

5. Empirical Outcomes and Scaling Law Observations

The empirical evaluation of RLVE-Gym encompasses multiple regimes, notably:

  • Data-saturation: Continuing RL fine-tuning of ProRL-1.5B-v2 on its static dataset for 3,600 H100-GPU hours yields a 0.49% average gain across six reasoning evaluation benchmarks, while RLVE-Gym joint training across 400 adaptive environments delivers a 3.37% average improvement using only ~1,100 H100-GPU hours.
  • Environment scaling: Systematic ablations over nested subsets of environments (1, 4, 16, 256) demonstrate that adding environments strictly improves held-out OOD performance for all pretraining/fine-tuning initializations.
  • Difficulty adaptation: Adaptive difficulty scheduling maintains effective prompt ratios (the fraction of sampled prompts whose rollouts contribute a learning signal) of ~50–80%, vastly outperforming fixed difficulty intervals, even oracle-chosen static ranges, since static regimes rapidly saturate the easy subset and stall.
  • Compute-constrained: Under matched compute and starting from a base SFT-only LM, RLVE-Gym outperforms static RL datasets (e.g., DeepMath-103K) by ~2% OOD on six benchmarks.

These patterns suggest that RLVE-Gym achieves not only higher generalization but also superior compute efficiency through the continual supply of high-information, unsaturated training signals.

6. Environment Engineering and Practical Extension

Effective use of RLVE-Gym requires deliberate environment engineering, analogous to prompt or feature engineering in discriminative settings. Recommended practices include:

  • Preference for tasks with solving/verification asymmetry (e.g., Sudoku, symbolic integration, NP-complete puzzles) to harness efficient reward computation and scalable curriculum.
  • Structuring difficulty progression so that instances at difficulty $d+1$ strictly generalize those at difficulty $d$.
  • Manual vetting of environments remains advisable to guarantee well-formedness and verifiability; automated generation via LMs is plausible but carries the danger of ungrammatical or unverifiable environments.
  • For extension to non-verifiable or subjective tasks, researchers may consider integrating learned reward models or human feedback, though adaptive curriculum construction in such domains is an open question.

The practical guideline is to monitor the effective prompt ratio and tune $(\tau_{\text{acc}}, \tau_{\text{num}}, d_{\Delta})$ to preserve the regime where rewards provide strong policy gradients.
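One way to track that regime during training is sketched below; treating the effective prompt ratio as the share of sampled prompts whose rollouts are not uniformly rewarded is our reading of the metric, not a definition taken from the source:

```python
def effective_prompt_ratio(per_prompt_rewards):
    """Share of sampled prompts whose rollouts do not all receive the same
    reward, i.e., prompts still contributing a learning signal.
    per_prompt_rewards: list of per-prompt reward lists."""
    if not per_prompt_rewards:
        return 0.0
    informative = sum(1 for rewards in per_prompt_rewards if len(set(rewards)) > 1)
    return informative / len(per_prompt_rewards)

# If the ratio drifts well below the ~50-80% band reported above, that is a
# signal to re-tune (tau_acc, tau_num, d_delta) as described in the text.
```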

7. Significance and Prospects

RLVE-Gym establishes a reproducible paradigm for scaling up RL-driven reasoning in LMs by providing algorithmically verified, adaptively leveled, and procedurally generated environment diversity. The marked empirical performance improvements over static RL data regimes—both in terms of absolute generalization gains and throughput per compute—underscore the utility of the RLVE-Gym approach for advancing generalizable symbolic and algorithmic reasoning. A plausible implication is that further increases in the diversity, verification robustness, or automatic extension of environments could catalyze additional progress, especially if strategies emerge for curriculum adaptation in environments where algorithmic verification is impossible or computationally expensive.
