- The paper introduces DAPO, an open-source system designed to scale reinforcement learning for large language models by addressing common issues like entropy collapse, vanishing gradients, token aberrations, and reward noise.
- DAPO integrates four key techniques: decoupled clipping (Clip-Higher) for better exploration, dynamic sampling to maintain gradient signals, token-level policy gradient loss to handle long chain-of-thought responses, and overlong reward shaping to mitigate truncation noise.
- Empirical results show DAPO achieving a score of 50 on the AIME 2024 benchmark with Qwen2.5-32B, surpassing prior methods' score of 47 with 50% fewer training steps, and the open-source release aids reproducibility and future research.
System Overview
The DAPO system, short for Decoupled Clip and Dynamic sAmpling Policy Optimization, is an open-source framework built to scale reinforcement learning for LLMs. Built atop the verl framework, DAPO is engineered to overcome several critical issues encountered in RL training of LLMs, including entropy collapse, vanishing gradient signals, token-level aberrations during chain-of-thought (CoT) generation, and reward noise caused by truncation. The system has been validated on the AIME 2024 benchmark, achieving a score of 50 with the Qwen2.5-32B base model, a significant improvement over the prior state of the art, which reached 47 points while using twice as many training steps.
Four Key Techniques for DAPO
1. Decoupled Clipping (Clip-Higher)
The decoupled clipping mechanism addresses the well-known problem of entropy collapse in RL training. In standard PPO-style implementations, a single symmetric clipping range limits how much probability mass low-probability tokens can gain, which suppresses exploration. DAPO mitigates this by decoupling the lower and upper clipping thresholds and raising the upper one, allowing low-probability tokens to receive a larger probability increase when warranted. This supports more diverse token sampling, which is crucial for sustaining exploration and generation quality on complex tasks.
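As an illustration, here is a minimal PyTorch-style sketch of a clipped surrogate with decoupled thresholds, assuming per-token log-probabilities and advantages are already computed; the function name and the threshold values are illustrative choices, not taken from the DAPO release.

```python
import torch

def clip_higher_surrogate(logprobs_new, logprobs_old, advantages,
                          eps_low=0.2, eps_high=0.28):
    """Illustrative PPO-style surrogate with decoupled clipping thresholds.

    eps_low / eps_high are example values: the asymmetric upper bound
    (eps_high > eps_low) lets low-probability tokens gain more probability
    mass before the update is clipped.
    """
    ratio = torch.exp(logprobs_new - logprobs_old)        # pi_theta / pi_old per token
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Pessimistic (min) surrogate, as in standard PPO; negated for minimization.
    loss = -torch.min(ratio * advantages, clipped * advantages)
    return loss.mean()
```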
2. Dynamic Sampling
Dynamic Sampling counteracts the issue where all sampled responses for a prompt are either entirely correct or entirely incorrect (accuracy 1 or 0), which yields zero advantage and therefore a vanishing gradient. By over-sampling and filtering out such prompts, the system ensures that every training batch is filled with prompts that produce a meaningful gradient signal. Because the filtering is applied continuously during training, the effective batch composition adapts as the policy improves, maintaining a robust gradient signal and enhancing convergence stability.
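A simplified sketch of the filtering idea follows, assuming a hypothetical `generate_group` helper that returns a group of rollouts with binary rewards for a prompt; the names and batching loop are illustrative only.

```python
def fill_batch_with_dynamic_sampling(prompt_iter, generate_group, target_size):
    """Keep sampling prompt groups and discard those whose rollouts are all
    correct or all incorrect, so every retained group yields a non-zero
    advantage signal. Hypothetical helper names; not the released API.
    """
    batch = []
    while len(batch) < target_size:
        prompt = next(prompt_iter)
        group = generate_group(prompt)          # e.g. G rollouts with 0/1 rewards
        rewards = [reward for _, reward in group]
        if 0 < sum(rewards) < len(rewards):     # keep mixed-outcome groups only
            batch.append((prompt, group))
    return batch
```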
3. Token-Level Policy Gradient Loss
In long-CoT scenarios, a sample-level loss averages over each response before averaging over the batch, so tokens in long responses are down-weighted and adverse token-level phenomena can go uncorrected. DAPO instead computes the policy gradient loss at the token level, so every token contributes equally regardless of response length. This fine-grained loss assignment penalizes undesirable patterns within long responses, limiting spurious increases in entropy and preventing the generation of excessively long outputs that degrade downstream reasoning and output quality.
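To make the distinction concrete, the sketch below contrasts sample-level and token-level averaging of per-token losses; the function names are hypothetical, and the tensors are assumed to carry a padding mask.

```python
import torch

def sample_level_loss(token_losses, mask):
    # mask: 1.0 for real tokens, 0.0 for padding; shape [batch, seq_len].
    # Average within each response first, then across the batch:
    # long and short responses carry equal weight, diluting per-token signal.
    per_sample = (token_losses * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return per_sample.mean()

def token_level_loss(token_losses, mask):
    # Average over all tokens in the batch directly:
    # every token contributes equally, so long responses are not down-weighted.
    return (token_losses * mask).sum() / mask.sum().clamp(min=1)
```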
4. Overlong Reward Shaping
Given the truncation inherent in many RL training setups, reward noise from cut-off responses can be a significant barrier to effective training. DAPO incorporates an overlong reward shaping mechanism, either filtering truncated samples out of the loss or applying a length-aware penalty known as Soft Overlong Punishment. This attenuates the noise introduced by truncated responses and steers training dynamics away from highly variable or undesirably elongated outputs.
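A hedged sketch of a length-aware penalty in the spirit of Soft Overlong Punishment is shown below; the parameter names and the exact piecewise shape are illustrative assumptions rather than the released formulation.

```python
def soft_overlong_punishment(length, max_len, buffer_len):
    """Length-aware reward penalty sketch (illustrative parameters).

    No penalty below a soft threshold, a linearly growing penalty inside the
    buffer region, and a full penalty once the hard generation limit is hit.
    """
    soft_limit = max_len - buffer_len
    if length <= soft_limit:
        return 0.0
    if length <= max_len:
        return (soft_limit - length) / buffer_len   # linearly in (-1, 0]
    return -1.0
```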
Performance Evaluation and Reproducibility
Empirically, DAPO reaches 50 points on the AIME 2024 benchmark with the Qwen2.5-32B base model. Notably, this performance is achieved with 50% fewer training steps than the previous state of the art, DeepSeek-R1-Zero-Qwen-32B, which scored 47 points. This result underscores the efficiency gains from the four techniques above. Open-sourcing the training code along with a carefully curated dataset not only facilitates reproducibility but also lays a foundation for future iterations and community-driven research. The availability of the full training framework promotes transparency and enables deep integration and cross-validation by practitioners and researchers in the field.
Implementation Considerations and Deployment Strategies
For organizations or research groups looking to implement DAPO, several key practical considerations should be noted:
- Computational Resources: Given the scale of models like Qwen2.5-32B, significant GPU clusters and distributed computing frameworks are typically required. Ensure compatibility with the verl framework for efficient parallelization.
- Data Curation: The dataset provided as part of the open-source release is pre-processed but additional domain-specific data curation may be necessary depending on application needs.
- Hyperparameter Tuning: The decoupled clipping thresholds, dynamic sampling rates, token-level loss parameters, and the penalty coefficients for overlong responses demand careful tuning; early experiments should include extensive ablation studies (a configuration sketch follows this list).
- Integration with Existing Pipelines: The modular nature of DAPO allows for integration with existing LLM training pipelines. Adaptation to different base models requires re-tuning of reward shaping and sampling parameters.
- Reproducibility Standards: The open-source release adheres to strict reproducibility standards, promoting rigorous comparison. Adopting a similar standard is recommended for downstream research.
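As a concrete starting point for the tuning and integration work above, a hypothetical configuration sketch follows; every key and value is illustrative and should be replaced with settings validated by your own ablations.

```python
# Illustrative configuration sketch; keys and values are hypothetical and
# do not correspond to the released DAPO configuration files.
dapo_config = {
    "clip_eps_low": 0.2,           # lower clipping threshold
    "clip_eps_high": 0.28,         # raised upper threshold (Clip-Higher)
    "rollouts_per_prompt": 16,     # group size used for dynamic sampling
    "train_batch_prompts": 512,    # prompts retained per batch after filtering
    "max_response_len": 16384,     # hard generation limit
    "overlong_buffer_len": 4096,   # soft buffer for Soft Overlong Punishment
}
```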
In summary, the DAPO system represents a comprehensive approach to LLM reinforcement learning at scale. By integrating refined techniques such as decoupled clipping, dynamic sampling, token-level policy gradient loss, and overlong reward shaping, it achieves superior performance on benchmarks while improving training efficiency and stability. The open-source availability of the system dramatically lowers the barrier to entry and fosters further advancements in large-scale RL for LLMs.