CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models (2505.12504v1)

Published 18 May 2025 in cs.LG and cs.AI

Abstract: Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.

Summary

  • The paper presents CPGD, which replaces standard PPO clipping with a log-ratio clipping mechanism and a policy drift constraint to enhance training stability.
  • CPGD mitigates training collapse through log-ratio clipping and a KL-based policy drift regularizer, complemented by a per-token loss decomposition and weighted advantages that guard against gradient explosions and response length collapse.
  • Empirical results show that CPGD improves overall performance by +11.0% over the base model across six benchmarks, with gains of up to +21.8% on the in-domain MMK12 benchmark, while outperforming prior rule-based RL methods.

Rule-based reinforcement learning (RL) has emerged as a promising approach for enhancing the reasoning capabilities of language models (LMs), particularly in verifiable tasks where deterministic rules can define the reward function. Methods like GRPO, REINFORCE++, and RLOO have shown success in this domain. However, a significant challenge remains: training instability. The paper "CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models" (2505.12504) identifies that this instability often arises from the use of importance-sampling ratios in loss functions, especially when combined with asymmetric clipping mechanisms like the one used in standard PPO. Large policy updates and improper clipping, particularly with negative advantages, can lead to gradient explosions dominated by poor samples, causing catastrophic training collapse.
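To make the failure mode concrete, here is the standard PPO-clip surrogate written in generic notation (a reference identity, not the paper's formulation); for negative advantages the min operator selects the unclipped term once the ratio grows large, so nothing bounds the contribution of a bad sample with a runaway ratio:

```latex
% Standard PPO-clip surrogate (generic notation, for reference only), with per-token ratio
% r_t(\theta) = \pi_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) / \pi_{\theta_{\text{old}}}(y_t \mid \mathbf{x}, \mathbf{y}_{<t}):
\mathcal{L}^{\text{PPO}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,A_t,\;
      \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,A_t\Big)\right]
% If A_t < 0 and r_t(\theta) \gg 1 + \epsilon, the min selects the unclipped term
% r_t(\theta) A_t, so the gradient scales with the (unbounded) ratio.
```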

To address this instability, the paper proposes Clipped Policy Gradient Optimization with Policy Drift (CPGD), an algorithm designed to stabilize policy learning for LMs trained with rule-based rewards. CPGD deviates from prior methods by replacing the standard PPO-clip loss with a policy gradient loss, which avoids placing the potentially unstable importance-sampling ratio directly in the loss function. To keep optimization proximal, CPGD incorporates two main mechanisms:

  1. A Clip Mechanism: This mechanism clips the logarithm of the policy ratio, $\ln(\pi_\theta / \pi_{\theta_{\text{old}}})$, rather than the ratio itself, preventing excessively large policy updates when the advantage is positive. Critically, unlike PPO's one-sided clipping, clipping the log ratio provides a more symmetric constraint around $\ln(1) = 0$.
  2. A Policy Drift Constraint: This constraint is based on the forward KL divergence $D_\text{KL}(\pi_{\theta_{\text{old}}}, \pi_{\theta})$ between the old and current policies. It acts as a regularizer, dynamically constraining the magnitude of policy updates and mitigating over-optimization. The paper theoretically shows (Proposition 1) that using the policy ratio, as in PPO, amplifies policy shift compared to a policy gradient approach, further justifying the choice of objective. A schematic sketch of the combined objective follows this list.
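The PyTorch-style sketch below shows how these two mechanisms can be combined into a single surrogate loss. It is an illustrative reconstruction from the description above, not the paper's exact objective (Equation 2 in the paper); `logp_new`, `logp_old`, `advantages`, the clip threshold `eps`, and the drift coefficient `beta` are placeholder names and values.

```python
import torch

def cpgd_style_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    eps: float = 0.2,
                    beta: float = 0.01) -> torch.Tensor:
    """Schematic CPGD-style surrogate: a policy-gradient term with a clipped
    log-ratio plus a forward-KL policy-drift penalty (not the paper's exact form).

    logp_new:   response log-probs under the current policy (requires grad)
    logp_old:   response log-probs under the rollout policy (no grad)
    advantages: rule-based, group-normalized advantages, one per response
    """
    log_ratio = logp_new - logp_old.detach()

    # Clip the *log* of the ratio: a symmetric constraint around ln(1) = 0,
    # unlike PPO's one-sided clipping of the ratio itself.
    clipped_log_ratio = torch.clamp(log_ratio, -eps, eps)

    # Policy-gradient-style term: the importance-sampling ratio never
    # multiplies the loss, so a runaway ratio cannot blow up the gradient.
    pg_loss = -(advantages * clipped_log_ratio).mean()

    # Forward-KL policy drift D_KL(pi_old || pi_theta), estimated on rollout
    # samples (already drawn from pi_old) with a k3-style estimator:
    # k3 = (pi_theta / pi_old - 1) - ln(pi_theta / pi_old).
    ratio = torch.exp(log_ratio)
    drift = (ratio - 1.0 - log_ratio).mean()

    return pg_loss + beta * drift
```

In a training loop this loss would be computed on groups of sampled responses whose rule-based rewards have been converted to advantages, as described in the implementation details below.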

The paper provides empirical evidence of training collapse in existing methods (RLOO, REINFORCE++, and variants of GRPO without specific stability mechanisms) when trained on the MMK12 multimodal math dataset: these methods exhibit unstable policy-ratio dynamics and often suffer degraded performance or outright collapse. In contrast, CPGD demonstrates significantly improved stability, maintaining steady training curves. The ablation studies show that while a plain policy gradient (PG) objective may keep the ratio stable, it can lead to response length collapse, where the model exploits the reward function by generating trivial outputs. This highlights the need for proximal-update mechanisms such as clipping to prevent reward hacking and preserve meaningful reasoning behavior. CPGD, combining the policy gradient objective with both clipping and policy drift, prevents both ratio instability and response length collapse.

For practical implementation, CPGD translates its theoretical formulation into a per-token loss function suitable for autoregressive LMs. The key implementation details include:

  • Per-token Loss Decomposition: The loss is calculated token-wise, leveraging the chain rule and the decomposability of the log-likelihood. The clipping threshold $\epsilon_i$ can be uniform across tokens or scheduled (e.g., a tight-to-loose schedule over token positions).
  • Policy Drift Implementation: The forward-KL policy drift is implemented via the gradient of a modified $k_3$ KL estimator. Crucially, the clipping is applied inside the gradient calculation itself: $\nabla_\theta \ln \pi_\theta(\cdot)$ is weighted by $\min\big(\operatorname{sg}(\pi_{\theta})/\pi_{\theta_{\text{old}}} - 1,\, c\big)$, not to the estimator value. This ensures that when the policy ratio becomes large, the drift term still provides a gradient signal pushing the policy back towards the old policy, whereas clipping the estimator value could yield zero or wrongly directed gradients in extreme cases (sketched in code after this list).
  • Weighted Advantage: CPGD uses a group-based advantage similar to GRPO, $A^\text{CPGD}_{\omega}(\mathbf{x}, \mathbf{y}^{(k)}) = \omega(\mathbf{x})\cdot\big(\mathcal{R}_o(\mathbf{x}, \mathbf{y}^{(k)}) - \operatorname{mean}(\{\mathcal{R}_o(\mathbf{x}, \mathbf{y}^{(k^\prime)})\}_{k^\prime=1}^K)\big)$, where $\omega(\mathbf{x})$ is a per-prompt weighting factor. Different weighting strategies are explored, including equal weight, STD weight ($1/\operatorname{std}$), and a clip-filter-like weight that amplifies gradients for prompts with non-zero reward variance. The ablation study confirms that normalizing rewards by subtracting the group mean is crucial to prevent issues like the "squeezing effect", and that non-uniform weighting strategies generally outperform equal weighting by focusing on more informative samples (sketched in code after this list).
  • Integration: The practical CPGD loss function (Equation 2) is designed to be easily integrated into existing large model training frameworks like OpenRLHF and veRL.
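The following PyTorch-style fragment sketches the policy-drift gradient trick from the second bullet. It is reconstructed from the description, not taken from the released MM-EUREKA code; `logp_new`, `logp_old`, and the clip constant `c` are placeholder names and values.

```python
import torch

def policy_drift_term(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      c: float = 3.0) -> torch.Tensor:
    """Schematic forward-KL policy drift whose gradient is
    min(sg(pi_theta / pi_old) - 1, c) * grad_theta log pi_theta.

    logp_new: per-token log-probs under the current policy (requires grad)
    logp_old: per-token log-probs under the rollout (old) policy (no grad)
    c:        upper clip on the gradient weight (placeholder value)
    """
    # Policy ratio pi_theta / pi_old with the gradient stopped, so it acts only
    # as a weight on grad log pi_theta (as in the k3 estimator's gradient).
    ratio_sg = torch.exp((logp_new - logp_old).detach())

    # Clip the gradient *weight* from above; the estimator value is never clipped.
    weight = torch.clamp(ratio_sg - 1.0, max=c)

    # Because `weight` is detached, the gradient of this surrogate is exactly
    # weight * grad_theta log pi_theta: even when the ratio is very large, the
    # clipped positive weight keeps pushing pi_theta back towards pi_old.
    return (weight * logp_new).mean()
```

In the full loss this term would be scaled by the drift coefficient and added to the clipped policy-gradient term.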
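Similarly, a schematic version of the weighted group advantage from the third bullet, assuming `rewards` holds the K rule-based rewards for one prompt's sampled responses; the amplification factor in the clip-filter-like branch is a placeholder, not a value reported by the paper.

```python
import torch

def weighted_group_advantage(rewards: torch.Tensor,
                             strategy: str = "std",
                             eps: float = 1e-6) -> torch.Tensor:
    """Schematic group-based advantage A = w(x) * (R - mean(R)) for the K
    responses sampled for a single prompt x.

    rewards:  shape (K,), rule-based rewards of the K sampled responses
    strategy: "equal", "std" (1/std weighting), or "clip_filter"
    """
    centered = rewards - rewards.mean()  # subtracting the group mean is essential

    if strategy == "equal":
        weight = 1.0
    elif strategy == "std":
        # Scale by 1/std; eps guards against zero-variance reward groups.
        weight = 1.0 / (rewards.std() + eps)
    elif strategy == "clip_filter":
        # Amplify gradients only for prompts whose rewards actually vary.
        weight = 2.0 if rewards.std() > eps else 0.0  # 2.0 is a placeholder factor
    else:
        raise ValueError(f"unknown weighting strategy: {strategy}")

    return weight * centered
```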

The paper evaluates CPGD on multimodal math reasoning benchmarks using the QwenVL2.5-7B base model. CPGD consistently outperforms existing RL algorithms like GRPO, REINFORCE++, and RLOO, as well as other similar-sized open-source models. Compared to the QwenVL2.5-7B base model, CPGD (STD weight) achieves an overall performance gain of +11.0% across six benchmarks, including significant improvements on in-domain (MMK12: +21.8%) and out-of-distribution benchmarks (MathVista: +8.5%, MathVision: +11.4%).

While the primary objective is to improve RL stability, the paper briefly discusses the role of importance sampling and KL divergence types. It notes that omitting the importance-sampling ratio might be acceptable for single PPO epochs, but for multiple epochs, truncated importance sampling could be reintroduced to correct for off-policy distributions, albeit with potential implementation complexity. CPGD utilizes forward KL divergence for policy drift, which avoids the need for importance sampling (required by reverse KL) and is easily decomposable into per-token terms, simplifying implementation. The authors also touch upon the debate regarding RL's contribution to reasoning, suggesting that RL primarily enhances the model's exploitation ability within the exploration space defined by pretraining and supervised fine-tuning.
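As a supporting detail (a standard chain-rule identity for autoregressive policies, not a formula quoted from the paper), the forward KL over full responses decomposes into per-token terms whose expectation is taken under the old policy, so rollout samples from $\pi_{\theta_{\text{old}}}$ are already in the right distribution and no importance correction is needed:

```latex
% Chain rule of KL divergence for autoregressive policies
D_\text{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid \mathbf{x}) \,\|\, \pi_{\theta}(\cdot \mid \mathbf{x})\big)
  = \mathbb{E}_{\mathbf{y} \sim \pi_{\theta_{\text{old}}}(\cdot \mid \mathbf{x})}\left[
      \sum_{t} D_\text{KL}\big(\pi_{\theta_{\text{old}}}(\cdot \mid \mathbf{x}, \mathbf{y}_{<t})
        \,\big\|\, \pi_{\theta}(\cdot \mid \mathbf{x}, \mathbf{y}_{<t})\big)\right]
% The reverse direction D_KL(pi_theta || pi_theta_old) would instead require
% expectations under the current policy, hence importance sampling.
```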

In conclusion, CPGD offers a robust and stable alternative for rule-based RL in LM post-training. By addressing the inherent instability of importance-sampling ratios through a policy gradient objective combined with clip mechanisms and policy drift, it provides a theoretically grounded and empirically validated method that achieves state-of-the-art performance and training stability on multimodal reasoning tasks.