CPGD: Clipped Policy Gradient with Drift Control
- The paper introduces a novel dual regularization approach combining log-space clipping and a KL divergence penalty to control policy drift.
- Empirical results show over 10% accuracy improvements and robust performance across in-domain and out-of-domain benchmarks.
- CPGD integrates seamlessly with RLHF pipelines, reducing gradient variance and preventing reward exploitation during training.
Clipped Policy Gradient Optimization with Policy Drift (CPGD) is a regularized reinforcement learning (RL) algorithm designed to stabilize policy optimization, particularly for LLMs undergoing rule-based post-training. CPGD combines a clipping mechanism applied to the policy gradient objective with a dynamic policy drift constraint based on KL divergence, mitigating the instability caused by the large policy shifts seen in prior ratio-based RL approaches. This section provides a comprehensive overview of CPGD, its algorithmic components, theoretical foundation, empirical validation, and practical significance.
1. Motivation and Core Principles
CPGD addresses instability in RL for LMs, specifically the risk of training collapse due to large probability ratio updates and improper clipping observed in methods such as GRPO, REINFORCE++, and RLOO (Liu et al., 18 May 2025). Unlike ratio-based PPO or its derivatives—which often suffer from high variance and unbounded policy drift—CPGD reframes the objective using classic policy gradient loss, augmented with a dual regularization scheme. The first component is a clipping mechanism that restricts updates in the log-probability space, while the second is a KL-divergence-based drift constraint that actively penalizes excessive deviations between the current and previous policies.
2. Policy Drift Constraint via KL Divergence
A fundamental element of CPGD is dynamic control of policy drift. This is accomplished by penalizing the KL divergence between the previous policy $\pi_{\theta_{\text{old}}}$ and the current policy $\pi_\theta$,

$$D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big).$$

This term acts as a corrective signal, ensuring that updates remain proximal and preventing gradient explosions that could destabilize training. The overall loss combines the clipped objective with the weighted drift penalty,

$$\mathcal{L}_{\text{CPGD}}(\theta) = \mathcal{L}_{\text{clip}}(\theta) + \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big),$$

where $\beta$ is a tunable coefficient balancing reward maximization and drift containment.
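A minimal PyTorch sketch of how the drift term and the combined loss might be computed at the token level; the tensor names (`logp_new`, `logp_old`) and the choice of a sample-based KL estimator are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def kl_drift(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(pi_old || pi_theta) on tokens sampled from pi_old.

    Uses the non-negative estimator r - 1 - log r with r = pi_theta / pi_old;
    the estimator choice is an illustrative assumption, not the paper's spec.
    """
    log_r = logp_new - logp_old              # log importance ratio per token
    return torch.exp(log_r) - log_r - 1.0    # >= 0, zero when the policies agree


def cpgd_style_loss(clip_loss: torch.Tensor,
                    logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    beta: float = 0.01) -> torch.Tensor:
    """Combine a clipped policy-gradient loss with the weighted drift penalty."""
    return clip_loss + beta * kl_drift(logp_new, logp_old).mean()
```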
3. Clipping in Log-Probability Space
Rather than directly clipping the raw importance-sampling ratio (as in PPO), CPGD applies clipping to the logarithm of the ratio. With $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})$ and advantage estimate $\hat{A}_t$, the clipped token-level objective takes the form

$$\mathcal{L}_{\text{clip}}(\theta) = -\,\mathbb{E}_t\Big[\min\big(\log r_t(\theta)\cdot \hat{A}_t,\ \operatorname{clip}\!\big(\log r_t(\theta),\,-\epsilon,\,\epsilon\big)\cdot \hat{A}_t\big)\Big].$$

Excessive updates (i.e., log-ratios exceeding the bounds $-\epsilon$ or $\epsilon$) are zeroed, relegating update pressure to the policy drift penalty. This log-space clipping avoids the asymmetric and potentially harmful one-sided clipping of traditional PPO.
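A token-level sketch of the log-space clip described above, in the same PyTorch style and under the same naming assumptions as the earlier sketch; the symmetric bound `eps` and the min over unclipped and clipped terms follow the description here rather than the paper's exact notation.

```python
import torch

def clipped_log_ratio_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Surrogate loss that clips the log importance ratio rather than the ratio.

    When log(pi_theta / pi_old) leaves [-eps, eps], the clipped branch carries
    no gradient, leaving the KL drift penalty to govern the update.
    """
    log_ratio = logp_new - logp_old                        # log r_t(theta)
    unclipped = log_ratio * advantages
    clipped = torch.clamp(log_ratio, -eps, eps) * advantages
    # Minimize the negative of the pessimistic (elementwise min) surrogate.
    return -torch.min(unclipped, clipped).mean()
```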
4. Theoretical Guarantees
The paper provides mathematical justification for the CPGD design. A key proposition is that ratio-based losses amplify policy drift, whereas clipping in log-space restricts this effect. Convergence is proven under both scheduled and static drift coefficients $\beta$, and drift-restricted updates (the KL penalty together with log-space clipping) guarantee training stability, even in highly volatile reward landscapes.
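As an illustrative one-step comparison (not the paper's formal proposition), differentiating the two surrogates with $r_t(\theta) = \pi_\theta / \pi_{\theta_{\text{old}}}$ shows where the amplification comes from:

$$\nabla_\theta\big[r_t(\theta)\,\hat{A}_t\big] = r_t(\theta)\,\hat{A}_t\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}), \qquad \nabla_\theta\big[\log r_t(\theta)\,\hat{A}_t\big] = \hat{A}_t\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$

The ratio-weighted gradient is scaled by $r_t(\theta)$, which can grow without bound once the policies separate, while the log-ratio gradient carries no such factor; clipping $\log r_t(\theta)$ additionally removes gradient pressure once it leaves the allowed band.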
5. Empirical Performance and Stability
Extensive experiments on mathematical reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMK12, demonstrate that CPGD yields consistent performance enhancements and improved stability over GRPO, REINFORCE++, and RLOO (Liu et al., 18 May 2025). Notable results include:
- >10% improvement in overall accuracy in both in-domain and OOD settings
- Effective mitigation of training collapse (no reward hacking via trivial responses)
- Robust performance across multiple model sizes
CPGD maintains learning dynamics even with aggressive reward structures and outlier distributions, highlighting the value of bounding both update magnitude and policy drift.
6. Integration and Practical Considerations
CPGD is designed for seamless incorporation into token-level RLHF frameworks, including compatibility with OpenRLHF pipelines (Liu et al., 18 May 2025). Its practical advantages include:
- Reduced gradient variance
- Controlled policy updates preventing reward exploitation
- Tolerance for large batch sizes and asynchronous updates

Potential challenges include tuning the clipping threshold $\epsilon$ and the drift penalty coefficient $\beta$. The paper recommends both static settings and schedule-based tightening-loosening strategies driven by empirical drift statistics, as sketched below.
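A minimal sketch of one such schedule, loosely modeled on adaptive-KL control from the PPO literature; the target drift, tolerance factor, and doubling/halving rule are illustrative assumptions, not the paper's recommended settings.

```python
def update_drift_coefficient(beta: float,
                             observed_kl: float,
                             target_kl: float = 0.01,
                             tolerance: float = 1.5) -> float:
    """Tighten or loosen the drift penalty based on measured policy drift.

    If drift overshoots the target band, raise beta to pull the policy back;
    if it undershoots, lower beta so the clipped objective can move further.
    """
    if observed_kl > tolerance * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / tolerance:
        return beta / 2.0
    return beta
```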
7. Implementation and Code Access
CPGD is made publicly available at https://github.com/ModalMinds/MM-EUREKA, enabling reproducibility and further experimentation. Documentation highlights integration guides, hyperparameter recommendations, and example scripts for LLM post-training.
Summary Table: CPGD Characteristics
| Feature | CPGD Implementation | Comparison to PPO/GRPO |
|---|---|---|
| Regularization | KL drift + log-ratio clipping | Ratio-loss and/or single-sided clip |
| Loss Formula | Min clipping in log-probability | Clipping on ratio |
| Convergence Guarantee | Theoretical proof (Theorem 1) | Empirical or heuristic |
| Reward/Drift Tradeoff | Tuned via drift coefficient $\beta$ | Less explicit |
| Codebase | MM-EUREKA GitHub | Varied, fewer stability controls |
CPGD provides a theoretically justified and empirically validated solution for stable, efficient policy optimization in RL for LLMs, overcoming limitations of prior approaches and balancing reward maximization with robust drift control (Liu et al., 18 May 2025).