Gradient-Preserving Clipping Policy Optimization (GPPO)

Updated 12 August 2025
  • GPPO is an advanced reinforcement learning approach that preserves gradients beyond clipping thresholds to maintain critical exploration signals.
  • It decouples forward computation from gradient propagation, ensuring that both high-entropy actions and negative feedback remain effective during optimization.
  • Empirical results in Klear-Reasoner demonstrate that GPPO improves learning efficiency and yields more robust performance on complex reasoning tasks.

Gradient-Preserving Clipping Policy Optimization (GPPO) is an advanced reinforcement learning (RL) methodology designed to overcome the limitations of hard-clipping strategies in traditional policy gradient methods, notably Proximal Policy Optimization (PPO). In standard RL fine-tuning pipelines, especially for large-scale reasoning models such as Klear-Reasoner (Su et al., 11 Aug 2025), hard clipping of importance sampling ratios truncates gradients outside a trust region, suppressing critical exploration and discarding informative negative feedback. GPPO modifies the backward gradient computation to preserve and cap gradients at the clipping thresholds, thereby enhancing both explorative behavior and the correction of suboptimal actions. This targeted intervention at the token level is realized primarily in LLMs undergoing group-relative RL post-training, and is shown to improve learning efficiency and final task performance.

1. Motivation: Limitations of Traditional Clipping in Policy Gradient Optimization

Hard clipping in standard PPO or GRPO confines the importance sampling ratio, $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$, to a bounded interval $[1-\varepsilon, 1+\varepsilon]$, leading to the clipped surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t\right)\right]$$

When $r_t(\theta)$ falls outside $[1-\varepsilon, 1+\varepsilon]$, the associated gradient is set to zero. This entirely cuts off gradient contributions from high-entropy (exploratory) actions ($r_t(\theta) > 1+\varepsilon$) and negative-advantage samples ($r_t(\theta) < 1-\varepsilon$), thereby:

  • Suppressing critical exploration signals at key decision points (leading to premature policy convergence)
  • Ignoring negative feedback from suboptimal choices, slowing policy correction

In complex reasoning tasks, where exploration at intermediate steps is essential and the distribution of learning signals is highly non-uniform, such aggressive gradient truncation directly impacts both coverage and convergence speed (Su et al., 11 Aug 2025).
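
For illustration, the following minimal PyTorch sketch (not the Klear-Reasoner implementation; the function name, the toy numbers, and $\varepsilon = 0.2$ are assumptions made for this example) computes the standard clipped surrogate and shows that an out-of-bound exploratory token receives zero gradient:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO/GRPO clipped surrogate (negated for minimization).

    When the min() selects the clipped branch, the surviving term is constant
    with respect to theta, so the token contributes zero gradient.
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy exploratory token: ratio ~ 1.5 > 1 + eps, positive advantage.
logp_old = torch.tensor([-1.0])
logp_new = torch.tensor([-0.595], requires_grad=True)             # exp(0.405) ~ 1.5
adv = torch.tensor([1.0])
ppo_clip_loss(logp_new, logp_old, adv).backward()
print(logp_new.grad)                                              # tensor([0.]) -- signal discarded
```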

2. Design of the GPPO Mechanism: Gradient Preservation Beyond Clipping

GPPO addresses these deficiencies by decoupling the forward pass (objective computation) from the backward pass (gradient computation). In Klear-Reasoner, when optimizing the group-relative objective during RL, the forward computation remains numerically identical to PPO due to the application of a stop-gradient operator. During the backward pass, however, rather than zeroing gradients outside the clipping region, GPPO substitutes $r_t(\theta)$ with the nearest clipping bound, ensuring a bounded but nonzero gradient for these tokens.

Formally, the GPPO objective for group-wise token-level optimization is:

$$L^{\text{GPPO}}(\theta) = \mathbb{E}_x\left[\frac{1}{\sum_j T_j}\sum_j\sum_t \min\left(\delta\,\widetilde{V}^{(j)},\ \operatorname{clip}\!\left(\delta,\ \frac{1-\varepsilon_l}{\mathrm{sg}(\delta)}\,\delta,\ \frac{1+\varepsilon_h}{\mathrm{sg}(\delta)}\,\delta\right)\widetilde{V}^{(j)}\right)\right]$$

where:

  • $\delta = r_t^{(j)}(\theta)$ is the token-level importance sampling ratio for group $j$
  • $\widetilde{V}^{(j)}$ is the (possibly normalized) advantage for sequence $j$
  • $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator, freezing the value in the backward graph but not affecting forward computation

The effect is that, during backpropagation, for tokens:

  • with $r_t(\theta) > 1+\varepsilon_h$ and positive advantage, GPPO passes a gradient corresponding to $1+\varepsilon_h$
  • with $r_t(\theta) < 1-\varepsilon_l$ and negative advantage, GPPO passes a gradient corresponding to $1-\varepsilon_l$
  • otherwise, the actual $r_t(\theta)$ is used

Hence, no token is completely masked out, and the learning signal is gently capped at the boundary, not erased.
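
A minimal PyTorch sketch of this decoupling is given below, using `detach()` as the stop-gradient. It is a reconstruction from the formula above rather than the authors' code, and the bound values $\varepsilon_l = 0.2$, $\varepsilon_h = 0.28$ are placeholders:

```python
import torch

def gppo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Gradient-preserving clipped surrogate (illustrative sketch).

    Each bound is written as (bound / sg(ratio)) * ratio, so its forward value
    equals the ordinary clip boundary while the backward pass still carries a
    nonzero, capped gradient for out-of-bound tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                # delta = r_t(theta)
    sg = ratio.detach()                                   # sg(delta): stop-gradient
    lower = (1 - eps_low) / sg * ratio                    # forward value: 1 - eps_low
    upper = (1 + eps_high) / sg * ratio                   # forward value: 1 + eps_high
    clipped = torch.min(torch.max(ratio, lower), upper)   # clip with gradient-carrying bounds
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```

Because `sg(ratio)` equals `ratio` numerically, the forward loss matches the hard-clipped surrogate exactly; only the autograd graph behind the clipping bounds differs.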

3. Addressing Exploration and Negative Feedback: Mechanistic Insights

GPPO specifically resolves two core issues:

  • Preserving Exploration Signals: By capping the gradients of high-entropy/exploratory actions at $1+\varepsilon_h$ instead of zeroing them, GPPO maintains reinforcement signals from tokens that contribute to the exploration of new or rare trajectories. This is crucial for reasoning models that must discover novel solution paths or explore diverse intermediate states.
  • Leveraging Negative Samples: By similarly capping gradients for negative-advantage tokens at $1-\varepsilon_l$ (rather than suppressing them), GPPO ensures the agent continues to learn efficiently from its missteps, particularly accelerating correction of rare or pathological failure cases.

This gradient-capping mechanism enables the policy to adjust both upwards (for improved actions) and downwards (for corrections) even outside the trusted region, without destabilizing updates.
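
As a concrete illustration (taking $\varepsilon_h = 0.28$ purely as an example value, not a reported hyperparameter), consider an exploratory token with $r_t(\theta) = 1.5$ and advantage $\widetilde{V} = 1$. Under hard clipping the surrogate evaluates to $1.28$ and contributes no gradient. Under GPPO the forward value is the same $1.28$, but the backward pass yields a gradient proportional to $(1+\varepsilon_h)\,\nabla_\theta \log \pi_\theta(a_t|s_t)$: the exploratory token still reinforces the policy, with its contribution capped at the boundary factor rather than erased. The symmetric statement holds for a negative-advantage token with $r_t(\theta) < 1-\varepsilon_l$.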

4. Mathematical Formulation and Implementation Details

The composite GPPO gradient with respect to parameters $\theta$ is:

$$\nabla_\theta L^{\text{GPPO}}(\theta) = \mathbb{E}_x\left[\frac{1}{\sum_j T_j}\sum_j\sum_t \mathcal{F}_{j,t}(\theta)\,\phi_\theta(a_{j,t}, s_{j,t})\,\widetilde{V}^{(j)}\right]$$

with

$$\mathcal{F}_{j,t}(\theta) =
\begin{cases}
1-\varepsilon_l & \text{if } \delta < 1-\varepsilon_l \text{ and } \widetilde{V}^{(j)} < 0 \\
1+\varepsilon_h & \text{if } \delta > 1+\varepsilon_h \text{ and } \widetilde{V}^{(j)} > 0 \\
\delta & \text{otherwise}
\end{cases}$$

where $\phi_\theta(a, s)$ denotes the gradient of the policy log-probability, $\nabla_\theta \log \pi_\theta(a|s)$.

The stop-gradient operator ensures that the forward loss is unchanged with respect to the classical PPO/GRPO computation—guaranteeing equivalent trust-region enforcement in value space—while only modifying the gradient flow through out-of-bound tokens during optimization. This implementation is particularly suited for RL post-training of LLMs, as it requires only local modifications in the RL loss definition—effectively a change in how autograd propagates gradients through the clipping computation.
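
The same change in gradient propagation can also be written as a custom autograd node, which makes the modification to the clipping computation explicit. The sketch below is an illustrative assumption (the class name is invented, and it treats the clip in isolation; the outer $\min(\cdot)$ with the unclipped term supplies the advantage-sign conditions of $\mathcal{F}_{j,t}$):

```python
import torch

class GradPreservingClip(torch.autograd.Function):
    """Forward: ordinary clamp of the ratio to [lo, hi].
    Backward: out-of-bound entries are rescaled so that, after the chain rule
    through r = exp(logp_new - logp_old), the effective policy-gradient factor
    becomes the boundary value (lo or hi) instead of zero."""

    @staticmethod
    def forward(ctx, ratio, lo, hi):
        ctx.save_for_backward(ratio)
        ctx.lo, ctx.hi = lo, hi
        return ratio.clamp(lo, hi)

    @staticmethod
    def backward(ctx, grad_out):
        (ratio,) = ctx.saved_tensors
        scale = torch.ones_like(ratio)                             # in-bound: identity gradient
        scale = torch.where(ratio > ctx.hi, ctx.hi / ratio, scale)
        scale = torch.where(ratio < ctx.lo, ctx.lo / ratio, scale)
        return grad_out * scale, None, None                        # no gradients for lo, hi

ratio = torch.tensor([0.5, 1.0, 1.6], requires_grad=True)
GradPreservingClip.apply(ratio, 0.8, 1.28).sum().backward()
print(ratio.grad)  # tensor([1.6000, 1.0000, 0.8000]): lo/r, 1, hi/r rather than 0, 1, 0
```

Multiplying these factors by $\partial r_t/\partial\theta = r_t\,\nabla_\theta \log \pi_\theta$ recovers the capped coefficients $1-\varepsilon_l$ and $1+\varepsilon_h$ of $\mathcal{F}_{j,t}$ above.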

5. Empirical Results in Klear-Reasoner and Impact on Reasoning

Integrating GPPO into Klear-Reasoner's RL fine-tuning led to measurable improvements in both exploration capacity and correction from negative feedback. Notably:

  • The model achieved 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6 (Su et al., 11 Aug 2025).
  • Iterative ablations showed that models trained with GPPO-based RL demonstrated more robust performance on long-chain tasks and were less susceptible to premature convergence or mode collapse.
  • In the supervised fine-tuning phase, the model benefited from high-quality, but lower-coverage, training data. In combination with GPPO RL, this allowed superior exploration (thanks to retained gradients for exploratory actions) and robust correction (due to preserved feedback from negative samples).

This indicates that GPPO is particularly effective when the learning environment is multimodal, reward landscapes are sparse or skewed, and exploration is otherwise prone to suppression under hard clipping.

6. Broader Theoretical and Algorithmic Context

The concept of preserving gradients at or beyond clipping boundaries is aligned with several recent trends in RL optimization:

  • Gradient-preserving or "soft" clipping variants—using smooth or capped gradients rather than binary truncations—are observed to improve sample efficiency and learning stability (Chen et al., 2022, Markowitz et al., 2023).
  • Analytical studies of clipping in nonconvex and stochastic settings have demonstrated that hard-clipped SGD introduces bias and can slow correction in the presence of persistent noise (Koloskova et al., 2023, Li et al., 2023). In GPPO, the controlled capping reduces this bias and accelerates convergence.
  • GPPO's stop-gradient decoupling retains the original trust-region behavior in expected policy divergence, while offering improved gradient coverage. This leads to more effective transfer of supervised fine-tuning to RL, especially on reasoning benchmarks that require discovering rare solution paths or correcting subtle errors.

7. Summary Table: Hard Clipping vs. Gradient-Preserving Clipping

| Feature | Hard Clipping (PPO/GRPO) | Gradient-Preserving (GPPO) |
|---|---|---|
| Gradient outside clip region | Zeroed | Capped at boundary value |
| Exploration signal | Suppressed for high $r_t$ | Retained via nonzero gradient |
| Suboptimal correction | Delayed, gradients masked | Accelerated, preserved feedback |
| Forward loss | Original trust-region surrogate | Identical via stop-gradient |
| Empirical effect (Klear-Reasoner) | Prone to premature convergence | Enhanced reasoning, exploration |

8. Conclusion

Gradient-Preserving Clipping Policy Optimization (GPPO) generalizes classical clipping in RL by preserving, rather than extinguishing, gradient signals outside the trust-region boundaries, using stop-gradient-based capping in the backward pass. This mechanism resolves critical shortcomings of prior approaches—namely, suppression of high-entropy exploratory actions and masking of negative feedback on suboptimal trajectories. Empirical evidence from Klear-Reasoner demonstrates substantial gains in mathematical and programmatic reasoning tasks, confirming that GPPO improves both exploration and learning efficiency without destabilizing training dynamics. GPPO’s theoretical underpinnings and practical realization mark it as a structurally significant advance in reinforcement learning optimization, especially for large-scale, chain-of-thought reasoning models (Su et al., 11 Aug 2025).