- The paper demonstrates that prolonged RL training enhances LLM reasoning by leveraging diverse tasks with verifiable rewards.
- It employs advanced GRPO techniques, including decoupled clipping and controlled KL regularization, to balance exploration and exploitation.
- Experiments reveal performance gains of up to 54.8% on logic puzzles, highlighting robust scalability and improved generalization.
Essay: "Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training"
Introduction
This paper examines how prolonged reinforcement learning (RL) training can strengthen the reasoning capabilities of LLMs, aiming to unlock diverse reasoning across domains such as mathematics, coding, and logic puzzles. The authors introduce an RL framework built on Group Relative Policy Optimization (GRPO) with several modifications, and emphasize the importance of verifiable reward signals and diverse training data. The paper reports significant performance improvements over strong existing baselines, positioning the proposed framework as a robust approach for reasoning-intensive tasks.
Figure 1: Performance comparison between DeepSeek-R1-Distill-Qwen-1.5B and Nemotron-Research-Reasoning-Qwen-1.5B across multiple benchmarks.
Diverse Training Data
The core of the training methodology involves leveraging a wide range of tasks with clear, verifiable rewards. This data diversity ensures that the model learns robust decision-making strategies across disparate environments, enhancing generalization. Key domains include mathematics, coding, STEM, logical puzzles, and instruction following. Each domain provides distinct correctness criteria, which are critical for evaluating and improving the reasoning ability of the model.
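To make the notion of a verifiable reward concrete, the sketch below shows what a domain-dispatching binary reward check might look like. The function name `verifiable_reward` and the per-domain checks are illustrative assumptions, not the paper's actual graders, which rely on more elaborate checkers (e.g., unit-test execution for code).

```python
import re

def verifiable_reward(domain: str, response: str, reference: str) -> float:
    """Illustrative binary reward based on a domain-specific correctness check."""
    if domain == "math":
        # Compare the final numeric answer extracted from the response
        # against the reference answer.
        match = re.search(r"(-?\d+(?:\.\d+)?)\s*$", response.strip())
        return 1.0 if match and match.group(1) == reference.strip() else 0.0
    if domain == "logic_puzzle":
        # Exact match against the canonical solution string.
        return 1.0 if response.strip().lower() == reference.strip().lower() else 0.0
    if domain == "instruction_following":
        # Check that a required constraint token from the prompt appears.
        return 1.0 if reference.lower() in response.lower() else 0.0
    # Coding and STEM tasks would dispatch to sandboxed test execution instead.
    raise ValueError(f"unsupported domain: {domain}")
```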
Reinforcement Learning Framework
The paper builds on Group Relative Policy Optimization (GRPO) and introduces several enhancements inspired by DAPO, such as decoupled clipping and dynamic sampling. These techniques help sustain exploration by limiting entropy collapse and by filtering out prompts whose rollouts carry no learning signal. Another key technique is controlled KL regularization, used to balance exploration against exploitation. The paper also proposes periodic reference policy resets to mitigate performance plateaus, enabling continued improvement throughout prolonged training.
Figure 2: KL divergence.
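As a rough illustration of how these pieces fit together, the following sketch implements a GRPO-style loss with group-normalized advantages, decoupled (asymmetric) clipping bounds, and a KL penalty toward a reference policy. The hyperparameter values and the naive KL estimate are assumptions made for readability, not the paper's exact settings.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, group_ids,
              eps_low=0.2, eps_high=0.4, kl_coef=1e-3):
    """Sketch of a GRPO objective with decoupled clipping and KL regularization.

    logp_new / logp_old / logp_ref: per-completion log-probabilities.
    rewards: scalar verifiable reward for each sampled completion.
    group_ids: prompt-group index of each completion; advantages are
    normalized within each group rather than by a learned value function.
    """
    # Group-relative advantages: standardize rewards within each prompt group.
    adv = torch.zeros_like(rewards)
    for g in group_ids.unique():
        mask = group_ids == g
        r = rewards[mask]
        adv[mask] = (r - r.mean()) / (r.std() + 1e-6)

    ratio = torch.exp(logp_new - logp_old)
    # Decoupled clipping: a looser upper bound (eps_high > eps_low) leaves
    # more room to boost low-probability tokens, helping preserve entropy.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()

    # Controlled KL regularization toward a (periodically reset) reference policy.
    kl_penalty = (logp_new - logp_ref).mean()
    return policy_loss + kl_coef * kl_penalty
```

Keeping the KL coefficient small preserves room for exploration, while the periodic reference reset discussed above prevents the penalty from anchoring the policy to a stale snapshot.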
Implementation Details
The implementation involves a sandboxed reward server architecture to handle the complexity and security requirements of diverse tasks. This architecture facilitates tailored execution environments without affecting the core training processes. By distributing reward evaluations across clusters, the paper ensures high throughput and efficient management of computational resources, supporting the scalable deployment of reinforcement learning strategies.
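The decoupling idea can be sketched minimally as follows, assuming a hypothetical `run_in_sandbox` helper: candidate code and its tests run in separate worker processes with a timeout, and rewards are gathered in parallel, outside the training loop. A production deployment would add container-level isolation, resource limits, and network restrictions on top of this.

```python
import sys
import subprocess
import concurrent.futures

def run_in_sandbox(code: str, test: str, timeout: float = 5.0) -> float:
    """Execute a candidate solution plus its unit tests in a child process."""
    program = code + "\n\n" + test
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating or too-slow solutions earn no reward

def batch_rewards(samples, max_workers: int = 32):
    """Fan reward evaluations out across worker processes for throughput."""
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_in_sandbox, s["code"], s["test"]) for s in samples]
        return [f.result() for f in futures]
```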
Experiment Results
Numerical results show significant improvements across several reasoning tasks, with gains of up to 54.8% on logic puzzles, 14.7% on mathematics, and 13.9% on coding tasks compared with existing baselines. Evaluation on a broad validation set shows that the proposed model, Nemotron-Research-Reasoning-Qwen-1.5B, not only outperforms its predecessors but also remains competitive with domain-specialized models such as DeepScaleR-1.5B and DeepCoder-1.5B.

Figure 3: Comparison with DeepScaleR-1.5B.
Ablation Studies
The ablation studies examine the impact of individual training components and configurations. The analysis of sampling temperature shows that higher temperatures promote exploration and help sustain learning over extended training. Proactively resetting the reference policy emerges as a critical factor for maintaining stability and preventing stagnation.
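A minimal sketch of a periodic reference reset follows, assuming a fixed step interval; the paper ties resets to observed plateaus rather than a hard-coded schedule, so `reset_interval` and the optional optimizer-state reset are illustrative choices, not the authors' exact procedure.

```python
def maybe_reset_reference(policy, ref_policy, optimizer, step, reset_interval=200):
    """Snapshot the current policy as the new KL reference at a fixed interval."""
    if step > 0 and step % reset_interval == 0:
        # The KL penalty is now measured against this fresher snapshot,
        # relieving the pull toward an outdated policy.
        ref_policy.load_state_dict(policy.state_dict())
        ref_policy.eval()
        optimizer.state.clear()  # optionally drop stale optimizer statistics
    return ref_policy
```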
Conclusion
Through careful design and sustained reinforcement learning, the paper demonstrates substantial gains in the reasoning capabilities of LLMs. By open-sourcing the Nemotron-Research-Reasoning-Qwen-1.5B model, the authors contribute valuable insights and resources to the broader research community, supporting further work on model alignment, optimization, and reasoning under limited computational budgets. These findings underscore the potential of prolonged training strategies to enhance the reasoning abilities of LLMs across diverse domains.