CPGD: Clipped Policy Gradient with Drift Control
- The paper introduces a novel dual regularization approach combining log-space clipping and a KL divergence penalty to control policy drift.
- Empirical results show over 10% accuracy improvements and robust performance across in-domain and out-of-domain benchmarks.
- CPGD integrates seamlessly with RLHF pipelines, reducing gradient variance and preventing reward exploitation during training.
Clipped Policy Gradient Optimization with Policy Drift (CPGD) is a regularized reinforcement learning (RL) algorithm designed to stabilize policy optimization, particularly for LLMs undergoing rule-based post-training. CPGD combines a clipping mechanism applied to the policy gradient objective with a dynamic policy drift constraint based on KL divergence, mitigating the instability caused by the large policy shifts seen in prior ratio-based RL approaches. This section provides a comprehensive overview of CPGD, its algorithmic components, theoretical foundation, empirical validation, and practical significance.
1. Motivation and Core Principles
CPGD addresses instability in RL for LMs, specifically the risk of training collapse due to large probability ratio updates and improper clipping observed in methods such as GRPO, REINFORCE++, and RLOO (Liu et al., 18 May 2025). Unlike ratio-based PPO or its derivatives—which often suffer from high variance and unbounded policy drift—CPGD reframes the objective using classic policy gradient loss, augmented with a dual regularization scheme. The first component is a clipping mechanism that restricts updates in the log-probability space, while the second is a KL-divergence-based drift constraint that actively penalizes excessive deviations between the current and previous policies.
2. Policy Drift Constraint via KL Divergence
A fundamental element of CPGD is dynamic control of policy drift. This is accomplished by penalizing the KL divergence between the previous policy $\pi_{\theta_{\text{old}}}$ and the current policy $\pi_\theta$,

$$D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big).$$

This term acts as a corrective signal, ensuring that updates remain proximal and preventing gradient explosions that could destabilize training. The overall loss combines the clipped objective with the weighted drift penalty,

$$\mathcal{L}_{\text{CPGD}}(\theta) = \mathcal{L}_{\text{clip}}(\theta) + \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}} \,\|\, \pi_\theta\big),$$

where $\beta$ is a tunable coefficient balancing reward maximization and drift containment.
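A minimal PyTorch sketch of how the drift term and the combined loss might be computed at the token level; the tensor names (`logp_new`, `logp_old`) and the choice of a sample-based KL estimator are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def kl_drift(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token estimate of KL(pi_old || pi_theta) on tokens sampled from pi_old.

    Uses the non-negative estimator r - 1 - log r with r = pi_theta / pi_old;
    the estimator choice is an illustrative assumption, not the paper's spec.
    """
    log_r = logp_new - logp_old              # log importance ratio per token
    return torch.exp(log_r) - log_r - 1.0    # >= 0, zero when the policies agree


def cpgd_style_loss(clip_loss: torch.Tensor,
                    logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    beta: float = 0.01) -> torch.Tensor:
    """Combine a clipped policy-gradient loss with the weighted drift penalty."""
    return clip_loss + beta * kl_drift(logp_new, logp_old).mean()
```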
3. Clipping in Log-Probability Space
Rather than directly clipping the raw importance-sampling ratio (as in PPO), CPGD applies clipping to the logarithm of the ratio. With $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})$ and advantage estimate $\hat{A}_t$, the clipped token-level objective takes the form

$$\mathcal{L}_{\text{clip}}(\theta) = -\,\mathbb{E}_t\Big[\min\big(\log r_t(\theta)\cdot \hat{A}_t,\ \operatorname{clip}\!\big(\log r_t(\theta),\,-\epsilon,\,\epsilon\big)\cdot \hat{A}_t\big)\Big].$$

Excessive updates (i.e., log-ratios exceeding the bounds $-\epsilon$ or $\epsilon$) are zeroed, relegating update pressure to the policy drift penalty. This log-space clipping avoids the asymmetric and potentially harmful one-sided clipping of traditional PPO.
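A token-level sketch of the log-space clip described above, in the same PyTorch style and under the same naming assumptions as the earlier sketch; the symmetric bound `eps` and the min over unclipped and clipped terms follow the description here rather than the paper's exact notation.

```python
import torch

def clipped_log_ratio_loss(logp_new: torch.Tensor,
                           logp_old: torch.Tensor,
                           advantages: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Surrogate loss that clips the log importance ratio rather than the ratio.

    When log(pi_theta / pi_old) leaves [-eps, eps], the clipped branch carries
    no gradient, leaving the KL drift penalty to govern the update.
    """
    log_ratio = logp_new - logp_old                        # log r_t(theta)
    unclipped = log_ratio * advantages
    clipped = torch.clamp(log_ratio, -eps, eps) * advantages
    # Minimize the negative of the pessimistic (elementwise min) surrogate.
    return -torch.min(unclipped, clipped).mean()
```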
4. Theoretical Guarantees
The paper provides mathematical justification for the CPGD design. A key proposition is that ratio-based losses amplify policy drift, whereas clipping in log-space restricts this effect. Convergence is proven under both scheduled and static drift coefficients $\beta$, and drift-restricted updates (the KL penalty together with log-space clipping) guarantee training stability, even in highly volatile reward landscapes.
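As an illustrative one-step comparison (not the paper's formal proposition), differentiating the two surrogates with $r_t(\theta) = \pi_\theta / \pi_{\theta_{\text{old}}}$ shows where the amplification comes from:

$$\nabla_\theta\big[r_t(\theta)\,\hat{A}_t\big] = r_t(\theta)\,\hat{A}_t\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}), \qquad \nabla_\theta\big[\log r_t(\theta)\,\hat{A}_t\big] = \hat{A}_t\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$

The ratio-weighted gradient is scaled by $r_t(\theta)$, which can grow without bound once the policies separate, while the log-ratio gradient carries no such factor; clipping $\log r_t(\theta)$ additionally removes gradient pressure once it leaves the allowed band.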
5. Empirical Performance and Stability
Extensive experiments on mathematical reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMK12, demonstrate that CPGD yields consistent performance enhancements and improved stability over GRPO, REINFORCE++, and RLOO (Liu et al., 18 May 2025). Notable results include:
- >10% improvement in overall accuracy in both in-domain and OOD settings
- Effective mitigation of training collapse (no reward hacking via trivial responses)
- Robust performance across multiple model sizes
CPGD maintains learning dynamics even with aggressive reward structures and outlier distributions, highlighting the value of bounding both update magnitude and policy drift.
6. Integration and Practical Considerations
CPGD is designed for seamless incorporation into token-level RLHF frameworks, including compatibility with OpenRLHF pipelines (Liu et al., 18 May 2025). Its practical advantages include:
- Reduced gradient variance
- Controlled policy updates preventing reward exploitation
- Tolerance for large batch sizes and asynchronous updates

Potential challenges include tuning the clipping threshold $\epsilon$ and the drift penalty coefficient $\beta$. The paper recommends both static settings and schedule-based tightening-loosening strategies driven by empirical drift statistics, as sketched below.
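A minimal sketch of one such schedule, loosely modeled on adaptive-KL control from the PPO literature; the target drift, tolerance factor, and doubling/halving rule are illustrative assumptions, not the paper's recommended settings.

```python
def update_drift_coefficient(beta: float,
                             observed_kl: float,
                             target_kl: float = 0.01,
                             tolerance: float = 1.5) -> float:
    """Tighten or loosen the drift penalty based on measured policy drift.

    If drift overshoots the target band, raise beta to pull the policy back;
    if it undershoots, lower beta so the clipped objective can move further.
    """
    if observed_kl > tolerance * target_kl:
        return beta * 2.0
    if observed_kl < target_kl / tolerance:
        return beta / 2.0
    return beta
```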
7. Implementation and Code Access
CPGD is made publicly available at https://github.com/ModalMinds/MM-EUREKA, enabling reproducibility and further experimentation. Documentation highlights integration guides, hyperparameter recommendations, and example scripts for LLM post-training.
Summary Table: CPGD Characteristics
| Feature | CPGD Implementation | Comparison to PPO/GRPO |
|---|---|---|
| Regularization | KL drift + log-ratio clipping | Ratio-loss and/or single-sided clip |
| Loss Formula | Min clipping in log-probability | Clipping on ratio |
| Convergence Guarantee | Theoretical proof (Theorem 1) | Empirical or heuristic |
| Reward/Drift Tradeoff | Tuned via drift coefficient $\beta$ | Less explicit |
| Codebase | MM-EUREKA GitHub | Varied, fewer stability controls |
CPGD provides a theoretically justified and empirically validated solution for stable, efficient policy optimization in RL for LLMs, overcoming limitations of prior approaches and balancing reward maximization with robust drift control (Liu et al., 18 May 2025).