Gradient-Preserving Clipping Policy Optimization (GPPO)

Updated 12 August 2025
  • GPPO is an advanced reinforcement learning approach that preserves gradients beyond clipping thresholds to maintain critical exploration signals.
  • It decouples forward computation from gradient propagation, ensuring that both high-entropy actions and negative feedback remain effective during optimization.
  • Empirical results in Klear-Reasoner demonstrate that GPPO improves learning efficiency and yields more robust performance on complex reasoning tasks.

Gradient-Preserving Clipping Policy Optimization (GPPO) is an advanced reinforcement learning (RL) methodology designed to overcome the limitations of hard-clipping strategies in traditional policy gradient methods, notably Proximal Policy Optimization (PPO). In standard RL fine-tuning pipelines, especially for large-scale reasoning models such as Klear-Reasoner (Su et al., 11 Aug 2025), hard clipping of importance sampling ratios truncates gradients outside a trust region, suppressing critical exploration and discarding informative negative feedback. GPPO modifies the backward gradient computation to preserve and cap gradients at the clipping thresholds, thereby enhancing both explorative behavior and the correction of suboptimal actions. This targeted intervention at the token level is realized primarily in LLMs undergoing group-relative RL post-training, and is shown to improve learning efficiency and final task performance.

1. Motivation: Limitations of Traditional Clipping in Policy Gradient Optimization

Hard clipping in standard PPO or GRPO confines the importance sampling ratio, $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$, to a bounded interval $[1-\varepsilon, 1+\varepsilon]$, leading to the clipped surrogate objective:

$$L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\,\hat{A}_t\right)\right]$$

When $r_t(\theta)$ falls outside $[1-\varepsilon, 1+\varepsilon]$, the associated gradient is set to zero. This entirely cuts off gradient contributions from high-entropy (exploratory) actions ($r_t(\theta) > 1+\varepsilon$) and negative-advantage samples ($r_t(\theta) < 1-\varepsilon$), thereby:

  • Suppressing critical exploration signals at key decision points (leading to premature policy convergence)
  • Ignoring negative feedback from suboptimal choices, slowing policy correction

In complex reasoning tasks, where exploration at intermediate steps is essential and the distribution of learning signals is highly non-uniform, such aggressive gradient truncation directly impacts both coverage and convergence speed (Su et al., 11 Aug 2025).
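
For illustration, the following minimal PyTorch sketch (not the Klear-Reasoner implementation; the function name, the toy numbers, and $\varepsilon = 0.2$ are assumptions made for this example) computes the standard clipped surrogate and shows that an out-of-bound exploratory token receives zero gradient:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO/GRPO clipped surrogate (negated for minimization).

    When the min() selects the clipped branch, the surviving term is constant
    with respect to theta, so the token contributes zero gradient.
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy exploratory token: ratio ~ 1.5 > 1 + eps, positive advantage.
logp_old = torch.tensor([-1.0])
logp_new = torch.tensor([-0.595], requires_grad=True)             # exp(0.405) ~ 1.5
adv = torch.tensor([1.0])
ppo_clip_loss(logp_new, logp_old, adv).backward()
print(logp_new.grad)                                              # tensor([0.]) -- signal discarded
```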

2. Design of the GPPO Mechanism: Gradient Preservation Beyond Clipping

GPPO addresses these deficiencies by decoupling the forward pass (objective computation) from the backward pass (gradient computation). In Klear-Reasoner, when optimizing the group-relative objective during RL, the forward computation remains numerically identical to PPO due to the application of a stop-gradient operator. During the backward pass, however, rather than zeroing gradients outside the clipping region, GPPO substitutes $r_t(\theta)$ with the nearest clipping bound, ensuring a bounded but nonzero gradient for these tokens.

Formally, the GPPO objective for group-wise token-level optimization is:

$$L^{\text{GPPO}}(\theta) = \mathbb{E}_x\left[\frac{1}{\sum_j T_j}\sum_j\sum_t \min\left(\delta\,\widetilde{V}^{(j)},\ \operatorname{clip}\!\left(\delta,\ \frac{1-\varepsilon_l}{\mathrm{sg}(\delta)}\,\delta,\ \frac{1+\varepsilon_h}{\mathrm{sg}(\delta)}\,\delta\right)\widetilde{V}^{(j)}\right)\right]$$

where:

  • $\delta = r_t^{(j)}(\theta)$ is the token-level importance sampling ratio for group $j$
  • $\widetilde{V}^{(j)}$ is the (possibly normalized) advantage for sequence $j$
  • $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator, freezing the value in the backward graph but not affecting forward computation

The effect is that, during backpropagation, for tokens:

  • with $r_t(\theta) > 1+\varepsilon_h$ and positive advantage, GPPO passes a gradient corresponding to $1+\varepsilon_h$
  • with $r_t(\theta) < 1-\varepsilon_l$ and negative advantage, GPPO passes a gradient corresponding to $1-\varepsilon_l$
  • otherwise, the actual $r_t(\theta)$ is used

Hence, no token is completely masked out, and the learning signal is gently capped at the boundary, not erased.
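
A minimal PyTorch sketch of this decoupling is given below, using `detach()` as the stop-gradient. It is a reconstruction from the formula above rather than the authors' code, and the bound values $\varepsilon_l = 0.2$, $\varepsilon_h = 0.28$ are placeholders:

```python
import torch

def gppo_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Gradient-preserving clipped surrogate (illustrative sketch).

    Each bound is written as (bound / sg(ratio)) * ratio, so its forward value
    equals the ordinary clip boundary while the backward pass still carries a
    nonzero, capped gradient for out-of-bound tokens.
    """
    ratio = torch.exp(logp_new - logp_old)                # delta = r_t(theta)
    sg = ratio.detach()                                   # sg(delta): stop-gradient
    lower = (1 - eps_low) / sg * ratio                    # forward value: 1 - eps_low
    upper = (1 + eps_high) / sg * ratio                   # forward value: 1 + eps_high
    clipped = torch.min(torch.max(ratio, lower), upper)   # clip with gradient-carrying bounds
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```

Because `sg(ratio)` equals `ratio` numerically, the forward loss matches the hard-clipped surrogate exactly; only the autograd graph behind the clipping bounds differs.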

3. Addressing Exploration and Negative Feedback: Mechanistic Insights

GPPO specifically resolves two core issues:

  • Preserving Exploration Signals: By capping the gradients of high-entropy/exploratory actions at $1+\varepsilon_h$ instead of zeroing them, GPPO maintains reinforcement signals from tokens that contribute to the exploration of new or rare trajectories. This is crucial for reasoning models that must discover novel solution paths or explore diverse intermediate states.
  • Leveraging Negative Samples: By similarly capping gradients for negative-advantage tokens at $1-\varepsilon_l$ (rather than suppressing them), GPPO ensures the agent continues to learn efficiently from its missteps, particularly accelerating correction of rare or pathological failure cases.

This gradient-capping mechanism enables the policy to adjust both upwards (for improved actions) and downwards (for corrections) even outside the trusted region, without destabilizing updates.
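
As a concrete illustration (taking $\varepsilon_h = 0.28$ purely as an example value, not a reported hyperparameter), consider an exploratory token with $r_t(\theta) = 1.5$ and advantage $\widetilde{V} = 1$. Under hard clipping the surrogate evaluates to $1.28$ and contributes no gradient. Under GPPO the forward value is the same $1.28$, but the backward pass yields a gradient proportional to $(1+\varepsilon_h)\,\nabla_\theta \log \pi_\theta(a_t|s_t)$: the exploratory token still reinforces the policy, with its contribution capped at the boundary factor rather than erased. The symmetric statement holds for a negative-advantage token with $r_t(\theta) < 1-\varepsilon_l$.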

4. Mathematical Formulation and Implementation Details

The composite GPPO gradient with respect to parameters $\theta$ is:

$$\nabla_\theta L^{\text{GPPO}}(\theta) = \mathbb{E}_x\left[\frac{1}{\sum_j T_j}\sum_j\sum_t \mathcal{F}_{j,t}(\theta)\,\phi_\theta(a_{j,t}, s_{j,t})\,\widetilde{V}^{(j)}\right]$$

with

$$\mathcal{F}_{j,t}(\theta) =
\begin{cases}
1-\varepsilon_l & \text{if } \delta < 1-\varepsilon_l \text{ and } \widetilde{V}^{(j)} < 0 \\
1+\varepsilon_h & \text{if } \delta > 1+\varepsilon_h \text{ and } \widetilde{V}^{(j)} > 0 \\
\delta & \text{otherwise}
\end{cases}$$

where $\phi_\theta(a, s)$ denotes the gradient of the policy log-probability, $\nabla_\theta \log \pi_\theta(a|s)$.

The stop-gradient operator ensures that the forward loss is unchanged with respect to the classical PPO/GRPO computation—guaranteeing equivalent trust-region enforcement in value space—while only modifying the gradient flow through out-of-bound tokens during optimization. This implementation is particularly suited for RL post-training of LLMs, as it requires only local modifications in the RL loss definition—effectively a change in how autograd propagates gradients through the clipping computation.
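
The same change in gradient propagation can also be written as a custom autograd node, which makes the modification to the clipping computation explicit. The sketch below is an illustrative assumption (the class name is invented, and it treats the clip in isolation; the outer $\min(\cdot)$ with the unclipped term supplies the advantage-sign conditions of $\mathcal{F}_{j,t}$):

```python
import torch

class GradPreservingClip(torch.autograd.Function):
    """Forward: ordinary clamp of the ratio to [lo, hi].
    Backward: out-of-bound entries are rescaled so that, after the chain rule
    through r = exp(logp_new - logp_old), the effective policy-gradient factor
    becomes the boundary value (lo or hi) instead of zero."""

    @staticmethod
    def forward(ctx, ratio, lo, hi):
        ctx.save_for_backward(ratio)
        ctx.lo, ctx.hi = lo, hi
        return ratio.clamp(lo, hi)

    @staticmethod
    def backward(ctx, grad_out):
        (ratio,) = ctx.saved_tensors
        scale = torch.ones_like(ratio)                             # in-bound: identity gradient
        scale = torch.where(ratio > ctx.hi, ctx.hi / ratio, scale)
        scale = torch.where(ratio < ctx.lo, ctx.lo / ratio, scale)
        return grad_out * scale, None, None                        # no gradients for lo, hi

ratio = torch.tensor([0.5, 1.0, 1.6], requires_grad=True)
GradPreservingClip.apply(ratio, 0.8, 1.28).sum().backward()
print(ratio.grad)  # tensor([1.6000, 1.0000, 0.8000]): lo/r, 1, hi/r rather than 0, 1, 0
```

Multiplying these factors by $\partial r_t/\partial\theta = r_t\,\nabla_\theta \log \pi_\theta$ recovers the capped coefficients $1-\varepsilon_l$ and $1+\varepsilon_h$ of $\mathcal{F}_{j,t}$ above.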

5. Empirical Results in Klear-Reasoner and Impact on Reasoning

Integrating GPPO into Klear-Reasoner's RL fine-tuning led to measurable improvements in both exploration capacity and correction from negative feedback. Notably:

  • The model achieved 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5, and 58.1% on LiveCodeBench V6 (Su et al., 11 Aug 2025).
  • Iterative ablations showed that models trained with GPPO-based RL demonstrated more robust performance on long-chain tasks and were less susceptible to premature convergence or mode collapse.
  • In the supervised fine-tuning phase, the model benefited from high-quality, but lower-coverage, training data. In combination with GPPO RL, this allowed superior exploration (thanks to retained gradients for exploratory actions) and robust correction (due to preserved feedback from negative samples).

This indicates that GPPO is particularly effective when the learning environment is multimodal, reward landscapes are sparse or skewed, and exploration is otherwise prone to suppression under hard clipping.

6. Broader Theoretical and Algorithmic Context

The concept of preserving gradients at or beyond clipping boundaries is aligned with several recent trends in RL optimization:

  • Gradient-preserving or "soft" clipping variants—using smooth or capped gradients rather than binary truncations—are observed to improve sample efficiency and learning stability (Chen et al., 2022, Markowitz et al., 2023).
  • Analytical studies of clipping in nonconvex and stochastic settings have demonstrated that hard-clipped SGD introduces bias and can slow correction in the presence of persistent noise (Koloskova et al., 2023, Li et al., 2023). In GPPO, the controlled capping reduces this bias and accelerates convergence.
  • GPPO's stop-gradient decoupling retains the original trust-region behavior in expected policy divergence, while offering improved gradient coverage. This leads to more effective transfer of supervised fine-tuning to RL, especially on reasoning benchmarks that require discovering rare solution paths or correcting subtle errors.

7. Summary Table: Hard Clipping vs. Gradient-Preserving Clipping

| Feature | Hard Clipping (PPO/GRPO) | Gradient-Preserving (GPPO) |
|---|---|---|
| Gradient outside clip region | Zeroed | Capped at boundary value |
| Exploration signal | Suppressed for high $r_t$ | Retained via nonzero gradient |
| Suboptimal correction | Delayed, gradients masked | Accelerated, preserved feedback |
| Forward loss | Original trust-region surrogate | Identical via stop-gradient |
| Empirical effect (Klear-Reasoner) | Prone to premature convergence | Enhanced reasoning, exploration |

8. Conclusion

Gradient-Preserving Clipping Policy Optimization (GPPO) generalizes classical clipping in RL by preserving, rather than extinguishing, gradient signals outside the trust-region boundaries, using stop-gradient-based capping in the backward pass. This mechanism resolves critical shortcomings of prior approaches—namely, suppression of high-entropy exploratory actions and masking of negative feedback on suboptimal trajectories. Empirical evidence from Klear-Reasoner demonstrates substantial gains in mathematical and programmatic reasoning tasks, confirming that GPPO improves both exploration and learning efficiency without destabilizing training dynamics. GPPO’s theoretical underpinnings and practical realization mark it as a structurally significant advance in reinforcement learning optimization, especially for large-scale, chain-of-thought reasoning models (Su et al., 11 Aug 2025).