ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
(2505.24864v1)
Published 30 May 2025 in cs.CL and cs.AI
Abstract: Recent advances in reasoning-centric LLMs have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlate strongly with the base model's task competence and training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in LLMs and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
This paper introduces ProRL (Prolonged Reinforcement Learning) (Liu et al., 30 May 2025), a training methodology designed to expand the reasoning capabilities of LLMs beyond what is latent in the base model. The authors challenge the notion that Reinforcement Learning (RL) merely amplifies high-reward outputs already present in a base model's distribution. They demonstrate that with sufficient training duration and appropriate techniques, RL can enable models to discover novel reasoning strategies.
The core contributions of the paper are:
ProRL Methodology: A novel training recipe combining:
KL Divergence Control: A KL penalty term $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ is added to the GRPO objective to maintain entropy and prevent the online policy $\pi_\theta$ from drifting too far from a reference policy $\pi_{\text{ref}}$. This helps stabilize learning. The overall objective is $\mathcal{L}_{\text{KL-RL}}(\theta) = \mathcal{L}_{\text{GRPO}}(\theta) - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$.
Reference Policy Resetting: Periodically, the reference policy $\pi_{\text{ref}}$ is hard-reset to a recent snapshot of the online policy $\pi_\theta$, and optimizer states are reinitialized. This prevents the KL term from overly dominating the loss and allows for continued improvement.
Diverse Task Suite: Training on a wide range of tasks (math, code, STEM, logic puzzles, instruction following) to promote generalization.
Group Relative Policy Optimization (GRPO): Used as the core RL algorithm, which optimizes $\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[\min\left(r_\theta(\tau) A(\tau),\ \text{clip}(r_\theta(\tau), 1 - \epsilon, 1 + \epsilon) A(\tau)\right)\right]$, where $r_\theta(\tau)$ is the probability ratio between the current and old policies and $A(\tau)$ is an advantage estimated from group scores without a value model.
DAPO Enhancements: Incorporates techniques from Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) (Yu et al., 18 Mar 2025), such as decoupled clipping bounds ($\epsilon_{\text{low}}, \epsilon_{\text{high}}$) and dynamic sampling (filtering out prompts that are too easy or too hard).
Nemotron-Research-Reasoning-Qwen-1.5B: A 1.5B parameter reasoning model trained using ProRL, starting from DeepSeek-R1-Distill-Qwen-1.5B. It was trained on a diverse dataset of 136K problems.
Empirical Evidence: The paper shows that:
ProRL-trained models consistently outperform base models, even in scenarios where the base model fails entirely (pass@k of 0 for all values of k tested).
Reasoning improvements scale with training duration (over 2000 steps).
The model generates novel solutions, as measured by a higher Creativity Index.
The Nemotron-Research-Reasoning-Qwen-1.5B model significantly surpasses its base model and even matches or outperforms the larger DeepSeek-R1-Distill-Qwen-7B on several benchmarks.
ProRL Methodology Details
To combat entropy collapse during prolonged RL, ProRL employs several strategies:
High Rollout Temperature: Encourages initial exploration (e.g., temperature of 1.2).
Decoupled Clipping (from DAPO): Uses separate lower ($1-\epsilon_{\text{low}}$) and upper ($1+\epsilon_{\text{high}}$) clipping bounds, with $\epsilon_{\text{high}} > \epsilon_{\text{low}}$ to help lift the probabilities of unlikely tokens.
Dynamic Sampling (from DAPO): Filters prompts where the model consistently succeeds or fails, focusing on intermediate difficulty examples.
KL Regularization: The KL penalty $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ is crucial for stability and for sustaining entropy, especially when starting from a well-initialized Chain-of-Thought (CoT) capable model (a minimal loss sketch follows this list).
Reference Policy Reset: When validation performance stagnates or the KL term dominates the loss, $\pi_{\text{ref}}$ is reset to a recent snapshot of $\pi_\theta$, and the optimizer state is reinitialized. This is a key enabler for prolonged training.
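The pieces above can be combined into a single loss. Below is a minimal sketch, not the authors' implementation: it assumes one summed log-probability per sampled response, a simple KL estimator, and an illustrative beta coefficient; verl's actual GRPO loss is computed per token and may use a different KL estimator.

```python
import torch

def grpo_kl_loss(logp_new, logp_old, logp_ref, rewards,
                 eps_low=0.2, eps_high=0.4, beta=0.001):
    """Sketch of a GRPO loss with decoupled clipping and a KL penalty.

    logp_new, logp_old, logp_ref: summed log-probs of each sampled response
        under the current, rollout, and reference policies, shape (G,).
    rewards: scalar reward per response in the rollout group, shape (G,).
    """
    # Group-relative advantage: standardize rewards within the rollout group,
    # removing the need for a learned value model.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped surrogate with decoupled bounds (eps_high > eps_low, per DAPO).
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv,
    )

    # Simple sample-based KL estimate against the reference policy; the
    # coefficient beta and the estimator itself are assumptions here.
    kl = (logp_new - logp_ref).mean()

    # Maximize the surrogate, penalize divergence from the reference policy.
    return -surrogate.mean() + beta * kl
```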
Training Nemotron-Research-Reasoning-Qwen-1.5B
Base Model: DeepSeek-R1-Distill-Qwen-1.5B.
Training Data: A diverse dataset of 136K examples from five domains:
Math: 40k problems from DeepScaleR Dataset, binary reward via math-verify.
Code: 24k problems from Eurus-2-RL Dataset, continuous reward based on test cases passed.
STEM: 25k problems from SCP-116K (Lu et al., 26 Jan 2025), filtered using GPT-4o, binary reward.
Logical Puzzles: 37k synthetic samples from Reasoning Gym (96 tasks), continuous reward from task-specific verifiers.
Instruction Following: 10k synthetic prompts from Llama-Nemotron, continuous reward.
Training Setup:
Framework: verl (Sheng et al., 2025).
Optimizer: AdamW with learning rate $2 \times 10^{-6}$.
Rollout: n=16 responses per prompt (sometimes increased to 32), context window initially 8096 (later 16k), sampling temperature 1.2.
Batch Size: 256 (mini-batch 64, 4 gradient updates per rollout).
Hardware: 4 nodes of 8 x NVIDIA H100-80GB GPUs, ~16k GPU hours.
Context Window: Mostly 8k tokens, increased to 16k in the final ~200 steps.
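For illustration only, the rollout settings above map onto a standalone sampling sketch using vLLM; ProRL itself was trained with the verl framework, whose configuration schema differs, and the prompt here is hypothetical.

```python
from vllm import LLM, SamplingParams

# Illustrative rollout configuration only (not the verl config used in the paper).
llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

sampling = SamplingParams(
    n=16,             # responses per prompt (occasionally raised to 32)
    temperature=1.2,  # high temperature to encourage exploration
    max_tokens=8192,  # ~8k response budget early on; extended to 16k late in training
)

# Hypothetical prompt; real training prompts come from the 136K-problem mix.
outputs = llm.generate(["Solve: ... Let's think step by step."], sampling)
```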
Training Dynamics: Training proceeded in multiple stages ("runs"), with reference-model and optimizer resets performed when validation performance (monitored on a blended validation set) stagnated. Hyperparameters, the data mix, and reward shaping (e.g., penalizing non-terminating responses) were adjusted between runs.
Evaluation Results: The 1.5B ProRL model often matched or surpassed the 7B DeepSeek-R1-Distill-Qwen model, and outperformed domain-specialized models such as DeepScaleR-1.5B (+4.6% in math) and DeepCoder-1.5B (+6.5% in code).
Analysis: Eliciting New Reasoning Patterns
"The Weaker the Start, the Stronger the Gain": A strong negative correlation was found between the base model's pass@128 on a task and the improvement gained from ProRL. RL expands reasoning boundaries most effectively where the base model initially struggles. Tasks where the base model already performed well showed minimal gains in reasoning breadth (pass@128) and sometimes even regression, suggesting RL primarily sharpened existing preferences. These high-performing base model tasks often had lower creativity indices, indicating more overlap with pretraining data.
Reasoning Boundary Evolution (Pass@k curves):
Diminish: Some tasks (especially math) showed improved pass@1 but decreased pass@128, aligning with prior work suggesting that RL narrows the output distribution.
Plateau: Gains in both pass@1 and pass@128 were achieved early in RL, with little further improvement from prolonged training.
Sustained Gains: Some tasks (especially complex coding) showed continued improvement in both pass@1 and pass@128 with prolonged RL, indicating an expansion of reasoning boundaries.
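For context, pass@k curves of this kind are commonly computed with the unbiased estimator of Chen et al. (2021); the paper does not spell out its estimator, so the implementation below is an assumption.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples with c correct (Chen et al., 2021).

    pass@k = 1 - C(n - c, k) / C(n, k): the probability that at least one of k
    samples drawn without replacement from the n generations is correct.
    """
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 128 samples, 3 of them correct, evaluated at k = 16.
print(pass_at_k(128, 3, 16))
```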
Out-of-Distribution (OOD) Generalization:
Boxnet Task: Base model had 0% success. ProRL model achieved significant success, and prolonged training amplified these gains.
Increased Task Difficulty (Graph Color): Trained on 10-node graphs, tested on larger graphs. ProRL model maintained higher accuracy on more complex, unseen instances compared to base and intermediate RL models.
Pass@1 Distribution Shifts: ProRL led to significant rightward shifts in pass@1 distributions for tasks like Codeforces and novel reasoning challenges (e.g., family_relationships), where base model accuracy was near zero.
Conclusion
ProRL demonstrates that extended and stable RL training can develop novel reasoning patterns beyond a base model's initial capabilities. The combination of KL divergence penalties, periodic reference policy resets, and diverse task training is crucial. The findings suggest that RL is particularly effective for tasks where the base model is weak, and can lead to generalization to OOD tasks and more complex problems. This challenges previous assumptions about RL's limitations in expanding reasoning boundaries and highlights the importance of sufficient training compute and appropriate techniques.
Practical Implementation Insights
Starting Point: If possible, begin with a base LLM already fine-tuned for CoT generation; DeepSeek-R1-Distill-Qwen-1.5B was used here.
RL Algorithm: GRPO is a viable option, especially with DAPO modifications: decoupled clipping ($\epsilon_{\text{low}} = 0.2$, $\epsilon_{\text{high}} = 0.4$) and dynamic sampling (a filter sketch follows).
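A minimal sketch of the dynamic-sampling filter for binary rewards, assuming one reward per rollout in a group; the function name and batch structure are ours, not the paper's code.

```python
def keep_prompt(rewards: list[float]) -> bool:
    """Dynamic sampling (DAPO-style) for binary rewards: drop prompts whose
    rollout group is uniformly solved or uniformly failed, since their
    group-relative advantages are all zero and carry no learning signal."""
    n_correct = sum(1 for r in rewards if r > 0)
    return 0 < n_correct < len(rewards)

# Example: a group of 16 rollouts that all fail is filtered out.
print(keep_prompt([0.0] * 16))      # False
print(keep_prompt([1.0, 0.0] * 8))  # True
```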
Stability is Key for Prolonged Training:
The KL divergence penalty ($\mathcal{L}_{\text{KL-RL}}(\theta) = \mathcal{L}_{\text{GRPO}}(\theta) - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$) is critical.
Regularly resetting the reference policy $\pi_{\text{ref}}$ to a recent snapshot of the actor policy $\pi_\theta$ and reinitializing optimizer states is essential to prevent training stagnation or KL dominance, allowing the policy to continue evolving (a schematic loop follows).
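A schematic of how such resets could be wired into a training loop; the stagnation criterion, the patience threshold, and all function names (train_step, validate, make_optimizer) are assumptions for illustration, not the paper's code.

```python
import copy

def prorl_training_loop(policy, ref_policy, make_optimizer, train_step,
                        validate, total_steps, eval_every=50, patience=3):
    """Sketch of prolonged RL with periodic reference-policy resets."""
    optimizer = make_optimizer(policy)
    best_val, stale = float("-inf"), 0

    for step in range(1, total_steps + 1):
        train_step(policy, ref_policy, optimizer)   # one GRPO + KL update

        if step % eval_every == 0:
            val = validate(policy)                  # blended validation set
            if val > best_val:
                best_val, stale = val, 0
            else:
                stale += 1
            # When validation stagnates (or the KL term dominates the loss),
            # hard-reset the reference policy to the current policy and
            # reinitialize the optimizer state so training can keep improving.
            if stale >= patience:
                ref_policy = copy.deepcopy(policy)
                optimizer = make_optimizer(policy)
                stale = 0
    return policy
```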
Dataset Diversity: A broad mix of tasks (math, code, logic, STEM, instruction following) with verifiable rewards is important for generalization.
Math: prompt the model to think step by step and output the final answer in \boxed{...}; binary reward.
Code: require code enclosed in triple backticks; continuous reward (fraction of test cases passed).
Logic Puzzles: require the answer enclosed in <answer> </answer> tags; continuous reward (a parsing sketch for these formats follows).
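A small sketch of how these output formats might be located before scoring; the regexes are ours, and real verifiers (e.g., math-verify) perform far more normalization than shown here.

```python
import re

def extract_boxed(text: str) -> str | None:
    """Last \\boxed{...} answer for math prompts (scored with a binary reward)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def extract_code(text: str) -> str | None:
    """Last triple-backtick block for code prompts (scored by unit tests passed)."""
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", text, flags=re.DOTALL)
    return blocks[-1] if blocks else None

def extract_answer_tag(text: str) -> str | None:
    """Content of the final <answer>...</answer> span for logic puzzles."""
    spans = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return spans[-1].strip() if spans else None
```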
Monitoring: A blended validation set mirroring evaluation benchmarks helps track progress and decide when to reset the reference policy or adjust hyperparameters.
Rollout Configuration:
Sample multiple responses per prompt (e.g., n=16 or n=32).
Use a relatively high sampling temperature during rollouts (e.g., 1.2) to encourage exploration.
Computational Cost: Prolonged RL is computationally intensive (16k H100 GPU-hours for a 1.5B model).
Iterative Refinement: The training process may involve multiple stages ("runs") in which hyperparameters, the data mix, or reward shaping strategies are adjusted based on observed dynamics, for example by penalizing responses that fail to terminate.
Context Length Management: Training can start with a shorter context window (e.g., 8k tokens) and be extended later (e.g., 16k tokens) if needed, with the model adapting quickly. This can save computation in earlier stages.
This paper provides a strong case and a practical recipe for using prolonged RL to genuinely improve LLM reasoning, offering a counterpoint to studies suggesting RL primarily refines existing capabilities. The techniques for stabilizing long RL runs are particularly valuable for practitioners.