
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization (2508.07629v2)

Published 11 Aug 2025 in cs.LG, cs.AI, and cs.CL

Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.

Summary

  • The paper introduces Gradient-Preserving Clipping Policy Optimization (GPPO) to overcome the limitations of conventional PPO clipping in LLM reasoning.
  • It integrates refined data curation, supervised fine-tuning, and reinforcement learning strategies, achieving strong performance on AIME and LiveCodeBench benchmarks.
  • Empirical results, such as 90.5% on AIME2024 and 83.2% on AIME2025, validate GPPO's ability to enable efficient exploration and stable training.

Klear-Reasoner: Gradient-Preserving Clipping for Advanced Reasoning in LLMs

Overview

Klear-Reasoner introduces a principled approach to post-training optimization for LLMs targeting mathematical and code reasoning. The core contribution is Gradient-Preserving Clipping Policy Optimization (GPPO), which addresses the limitations of conventional clipping in policy gradient methods, specifically in PPO and its variants. The work is distinguished by a rigorous ablation of data curation, supervised fine-tuning (SFT), and reinforcement learning (RL) strategies, culminating in strong empirical results on AIME and LiveCodeBench benchmarks.

Motivation and Background

Recent advances in LLM reasoning, exemplified by models such as DeepSeek-R1 and Qwen3-8B, have relied heavily on RL post-training. However, reproducibility and stability issues persist due to incomplete disclosure of training details and the inherent limitations of standard clipping mechanisms in PPO/GRPO. Clipping, while stabilizing training, suppresses critical exploration signals and discards gradients from suboptimal trajectories, impeding both exploration and convergence.

Gradient-Preserving Clipping Policy Optimization (GPPO)

GPPO is designed to retain gradient information from all tokens, including those outside the clipping bounds, by decoupling the forward computation from the backward gradient flow. In contrast to standard PPO/GRPO, which zeroes gradients for clipped tokens, GPPO constrains the magnitude of these gradients but does not discard them. This enables:

  • Preservation of high-entropy token gradients: Facilitates exploration at critical decision points.
  • Accelerated convergence on negative samples: Allows the model to learn efficiently from suboptimal trajectories.

The GPPO loss is formulated as:

$$
\mathcal{L}^{\text{GPPO}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}} \left[ \frac{1}{\sum_{j=1}^{M} T_j} \sum_{j=1}^{M} \sum_{t=1}^{T_j} \min\!\left( \delta\,\tilde{A}^{(j)},\; \operatorname{clip}\!\left(\delta,\; \frac{1-\epsilon_l}{\operatorname{sg}(\delta)}\,\delta,\; \frac{1+\epsilon_h}{\operatorname{sg}(\delta)}\,\delta\right)\tilde{A}^{(j)} \right) \right]
$$

where $\delta$ is the token-level importance sampling ratio, $\tilde{A}^{(j)}$ is the group-relative advantage of the $j$-th sampled response, $M$ is the number of responses in a group, $T_j$ is the length of response $j$, and $\operatorname{sg}(\cdot)$ is the stop-gradient operator that blocks backpropagation through its argument.
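To make the gradient-preserving behavior concrete, the following is a minimal PyTorch sketch of a GPPO-style token loss under the formula above. The interface (token-level log-probabilities and a precomputed group-relative advantage broadcast to tokens) and the default clipping bounds are illustrative assumptions, not the authors' released implementation.

```python
import torch

def gppo_loss(logp_new, logp_old, advantages, eps_l=0.2, eps_h=0.28):
    """Minimal GPPO-style objective (sketch; interface and defaults assumed).

    logp_new:   log-probs of sampled tokens under the current policy, shape [T]
    logp_old:   log-probs under the behavior policy (treated as constant), shape [T]
    advantages: group-relative advantage broadcast to each token, shape [T]
    """
    # Importance-sampling ratio delta = pi_theta(token) / pi_old(token).
    ratio = torch.exp(logp_new - logp_old.detach())

    # Gradient-preserving clip bounds. In the forward pass ratio / sg(delta) == 1,
    # so `lower` and `upper` numerically equal (1 - eps_l) and (1 + eps_h); in the
    # backward pass, gradients still flow through `ratio`, scaled by
    # (1 +/- eps) / sg(delta), instead of being zeroed as in standard PPO/GRPO.
    sg = ratio.detach()
    lower = (1.0 - eps_l) / sg * ratio
    upper = (1.0 + eps_h) / sg * ratio
    clipped_ratio = torch.minimum(torch.maximum(ratio, lower), upper)

    # Pessimistic (min) combination inherited from PPO; negated for descent and
    # averaged over tokens, matching the 1 / sum_j T_j normalization above.
    objective = torch.minimum(ratio * advantages, clipped_ratio * advantages)
    return -objective.mean()
```

Replacing `lower` and `upper` with the constants $1-\epsilon_l$ and $1+\epsilon_h$ recovers the standard clipped objective, whose gradient is exactly zero for clipped tokens; GPPO keeps that gradient but bounds its magnitude.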

Comparative Analysis: GPPO vs. GRPO and CISPO

Empirical results demonstrate that GPPO outperforms both GRPO with Clip-Higher and CISPO in final scores and training stability on mathematical and code reasoning tasks.

Figure 1: GPPO achieves superior and more stable learning curves compared to GRPO with Clip-Higher and CISPO in mathematical RL training.

GPPO's pessimistic update strategy, inherited from PPO, suppresses overly optimistic updates for positive advantages while fully leveraging negative feedback, resulting in clearer optimization signals and more robust policy training.

Data Curation and SFT Strategy

The SFT phase employs a quality-centric data selection protocol, prioritizing high-quality sources over diversity. Key findings include:

  • High-quality, compact datasets outperform large, diverse datasets: Models fine-tuned on top-ranked sources consistently achieve higher accuracy.
  • Difficulty-dependent correctness filtering: For hard tasks, mixed correct/incorrect data improves performance, while for easy tasks, correctness filtering is beneficial (a rule sketched after this list).
  • Teacher model selection is critical: Stronger teacher models yield more effective distillation and downstream generalization.
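The difficulty-dependent filtering rule above can be summarized compactly. The sketch below uses hypothetical record fields (`difficulty`, `is_correct`) and an illustrative threshold; the paper's exact difficulty estimator is not reproduced here.

```python
def select_sft_samples(samples, hard_threshold=0.5):
    """Difficulty-dependent correctness filtering (sketch with assumed fields).

    Each sample is a dict with:
      - "difficulty": estimated difficulty in [0, 1], e.g. 1 - teacher pass rate
      - "is_correct": whether the distilled long-CoT trace reaches the right answer
    """
    kept = []
    for s in samples:
        if s["difficulty"] >= hard_threshold:
            # Hard prompts: keep correct and incorrect traces alike; imperfect
            # reasoning on hard problems still carries useful supervision.
            kept.append(s)
        elif s["is_correct"]:
            # Easy prompts: keep only verified-correct traces.
            kept.append(s)
    return kept
```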

Reinforcement Learning: Reward Design and Data Filtering

RL training incorporates several innovations:

  • Soft reward for code RL: Rewards are proportional to test-case pass rates, mitigating reward sparsity and variance and enabling learning from partially correct solutions.

Figure 2: Soft reward strategies yield higher average rewards and more stable training compared to hard reward baselines in code RL.

  • Test case filtering for code RL: Prompts with low estimated pass@16 are excluded, reducing noise and improving final performance.

Figure 3: Filtering code RL data by pass@16 improves LiveCodeBench V5 scores and stabilizes training.

  • Zero-advantage group filtering for math RL: Excluding groups with vanishing policy gradients focuses optimization on actionable feedback, enhancing generalization (see the sketch following this list).

Figure 4: Filtering zero-advantage groups in math RL leads to more stable and consistent improvements on AIME2024.
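The three rules above can be written compactly; the sketch below uses assumed field names and illustrative thresholds, and the paper's exact reward shaping and cutoffs may differ.

```python
def soft_code_reward(num_passed, num_tests):
    """Soft reward for code RL: fraction of test cases passed, instead of an
    all-or-nothing 0/1 reward (sketch)."""
    return num_passed / max(num_tests, 1)


def filter_code_prompts(prompts, min_pass_at_16=2 / 16):
    """Drop code prompts whose estimated pass@16 is very low (assumed field
    "pass_at_16"; the threshold is illustrative). Such prompts mostly inject
    noisy, near-unsolvable rollouts into training."""
    return [p for p in prompts if p["pass_at_16"] >= min_pass_at_16]


def filter_zero_advantage_groups(groups):
    """Drop rollout groups whose rewards are all identical: the group-relative
    advantage is then zero for every member, so the group contributes no
    policy gradient (assumed field "rewards")."""
    return [g for g in groups if max(g["rewards"]) > min(g["rewards"])]
```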

Ablation Studies and Hyperparameter Sensitivity

Ablations reveal that moderate SFT supervision during RL (e.g., $\alpha = 0.1$) improves utilization of positive examples and regularizes policy outputs, but excessive weighting leads to overfitting and diminished exploration. GPPO's gradient control via tunable hyperparameters ($\beta_1$, $\beta_2$) allows fine-grained adjustment of boundary losses.
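One way to realize the moderate SFT supervision described above is to add a weighted negative log-likelihood term on verified-correct rollouts to the RL objective. The sketch below assumes this form and interface, which may differ from the paper's exact formulation.

```python
import torch.nn.functional as F

def rl_loss_with_sft_regularizer(policy_loss, pos_logits, pos_labels, alpha=0.1):
    """RL objective plus a small SFT-style term on positive examples (sketch).

    pos_logits: [N, V] logits at token positions of correct responses
    pos_labels: [N]    target token ids at those positions
    alpha:      mixing weight; ~0.1 per the ablation above, larger values risk
                overfitting and reduced exploration
    """
    sft_nll = F.cross_entropy(pos_logits, pos_labels)
    return policy_loss + alpha * sft_nll
```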

Empirical Results

Klear-Reasoner-8B achieves:

  • 90.5% on AIME2024
  • 83.2% on AIME2025
  • 66.0% on LiveCodeBench V5
  • 58.1% on LiveCodeBench V6

These results match or surpass state-of-the-art models of comparable scale, even when trained with a 32K context window and evaluated at 64K using YaRN scaling.
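For context, YaRN-style extension of a 32K-trained window to 64K at inference time is typically configured through RoPE scaling. The sketch below follows the Hugging Face transformers convention with a hypothetical model id; the exact `rope_scaling` key names depend on the library version and are an assumption, not the authors' released configuration.

```python
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "Klear-Reasoner-8B"  # hypothetical id; the released checkpoint name may differ

config = AutoConfig.from_pretrained(MODEL_ID)
# YaRN RoPE scaling: extend the 32K training window to ~64K (factor 2.0) for
# evaluation. Key names ("rope_type" vs. "type") vary across transformers
# versions; treat this block as an illustrative assumption.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, config=config)
```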

Implications and Future Directions

The GPPO framework provides a principled solution to the exploration-stability trade-off in RL post-training for LLMs. By retaining and constraining gradients from all tokens, it enables efficient learning from both positive and negative samples, facilitating rapid trial-and-error and robust generalization. The findings on data curation and reward design further inform best practices for reasoning-centric LLM development.

Future work may explore:

  • Generalization of GPPO to other RL objectives and domains
  • Automated data selection strategies leveraging difficulty and correctness signals
  • Integration with scalable context extension methods for even longer reasoning chains
  • Theoretical analysis of gradient dynamics in pessimistic vs. optimistic update regimes

Conclusion

Klear-Reasoner demonstrates that principled data curation, targeted SFT, and gradient-preserving RL optimization can jointly deliver substantial improvements in long-form reasoning for LLMs. The GPPO method addresses critical limitations of traditional clipping, enabling both stable training and enhanced exploration. The empirical results validate the efficacy of these techniques and provide a foundation for further advances in reasoning-focused LLM post-training.
