Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization (2508.07629v2)

Published 11 Aug 2025 in cs.LG, cs.AI, and cs.CL

Abstract: We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks. Although there are already many excellent works related to inference models in the current community, there are still many problems with reproducing high-performance inference models due to incomplete disclosure of training details. This report provides an in-depth analysis of the reasoning model, covering the entire post-training workflow from data preparation and long Chain-of-Thought supervised fine-tuning (long CoT SFT) to reinforcement learning (RL), along with detailed ablation studies for each experimental component. For SFT data, our experiments show that a small number of high-quality data sources are more effective than a large number of diverse data sources, and that difficult samples can achieve better results without accuracy filtering. In addition, we investigate two key issues with current clipping mechanisms in RL: Clipping suppresses critical exploration signals and ignores suboptimal trajectories. To address these challenges, we propose Gradient-Preserving clipping Policy Optimization (GPPO) that gently backpropagates gradients from clipped tokens. GPPO not only enhances the model's exploration capacity but also improves its efficiency in learning from negative samples. Klear-Reasoner exhibits exceptional reasoning abilities in mathematics and programming, scoring 90.5% on AIME 2024, 83.2% on AIME 2025, 66.0% on LiveCodeBench V5 and 58.1% on LiveCodeBench V6.

Summary

  • The paper introduces GPPO, a gradient-preserving clipping algorithm that enhances exploration and learning stability in RL fine-tuning.
  • It employs a high-quality data curation and long chain-of-thought supervised fine-tuning strategy to improve reasoning in mathematics and programming.
  • Empirical results demonstrate state-of-the-art performance on benchmarks like AIME and LiveCodeBench, validating the effectiveness of the approach.

Klear-Reasoner: Gradient-Preserving Clipping Policy Optimization for Advanced Reasoning

Introduction

Klear-Reasoner introduces a comprehensive post-training pipeline for enhancing the reasoning capabilities of LLMs in mathematics and programming. The core innovation is the Gradient-Preserving Clipping Policy Optimization (GPPO) algorithm, which addresses the limitations of traditional clipping in reinforcement learning (RL) by preserving gradient information from all tokens, including those previously discarded by clipping. The paper also details a quality-centric data curation and long Chain-of-Thought (CoT) supervised fine-tuning (SFT) strategy, and provides extensive ablation studies on data, reward design, and RL optimization. Klear-Reasoner-8B achieves strong results on AIME 2024 (90.5%), AIME 2025 (83.2%), LiveCodeBench V5 (66.0%), and LiveCodeBench V6 (58.1%), surpassing or matching state-of-the-art models of comparable scale.

Data Curation and Supervised Fine-Tuning

The SFT phase is built on a compact, high-quality dataset, prioritizing data quality over diversity. Prompts are sourced from OpenThoughts, NuminaMath, and AceReason-Nemotron 1.1 for mathematics, and from OpenThoughts, OpenCodeReasoning, TACO, APPS, and Codeforces for coding. Strict deduplication and contamination filtering are applied, including 9-gram overlap removal against test sets. Responses are generated by a strong teacher model (DeepSeek-R1-0528), and all responses are retained, leveraging the finding that difficult samples, even if incorrect, can enhance learning for hard tasks.
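
As an illustration of the decontamination step, the following is a minimal sketch of 9-gram overlap filtering against benchmark test sets. The whitespace tokenization and function names here are assumptions for illustration, not the paper's implementation.

```python
from typing import Iterable, List, Set

def ngrams(tokens: List[str], n: int = 9) -> Set[tuple]:
    """Return the set of n-grams (as token tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_ngrams(test_prompts: Iterable[str], n: int = 9) -> Set[tuple]:
    """Collect all n-grams appearing in any benchmark/test prompt."""
    banned: Set[tuple] = set()
    for text in test_prompts:
        banned |= ngrams(text.lower().split(), n)
    return banned

def is_contaminated(train_prompt: str, banned: Set[tuple], n: int = 9) -> bool:
    """A training prompt is flagged if it shares any n-gram with a test set."""
    return bool(ngrams(train_prompt.lower().split(), n) & banned)

# Usage (test_sets and train_prompts are placeholders for the actual corpora):
# banned = build_test_ngrams(test_sets)
# clean = [p for p in train_prompts if not is_contaminated(p, banned)]
```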

Ablation studies demonstrate that:

  • For easy tasks, training on only correct data is optimal.
  • For hard tasks, including incorrect responses improves performance, as they provide valuable contrastive signals for exploration.
  • Using a small number of high-quality sources yields better results than aggregating diverse, lower-quality data.

Reinforcement Learning and GPPO

Limitations of Traditional Clipping

Standard PPO and its variants (e.g., GRPO) employ clipping to stabilize policy updates by truncating the importance sampling ratio. However, this approach:

  • Discards gradients from high-entropy tokens (critical for exploration) if their ratios exceed the upper bound.
  • Prevents learning from suboptimal trajectories (negative samples) if their ratios fall below the lower bound, slowing convergence.
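
To make these failure modes concrete, here is a minimal PyTorch-style sketch of the standard clipped surrogate; tensor shapes and the default clipping thresholds are illustrative.

```python
import torch

def clipped_surrogate(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Standard PPO/GRPO token-level clipped objective (to be maximized).

    When the ratio exceeds 1 + eps_high with a positive advantage, or falls
    below 1 - eps_low with a negative advantage, the min selects the clipped
    branch, whose gradient w.r.t. the policy is exactly zero: that token
    contributes nothing to the update.
    """
    ratio = torch.exp(logp_new - logp_old)                        # importance sampling ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    return torch.min(unclipped, clipped).mean()
```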

GPPO: Algorithmic Details

GPPO modifies the clipping mechanism to preserve gradient flow for all tokens, including those outside the clipping range. The forward computation remains unchanged, but in the backward pass the gradients of out-of-bound tokens are bounded at the clipping threshold rather than zeroed. This ensures:

  • High-entropy tokens continue to contribute to exploration.
  • Negative samples accelerate convergence by providing learning signals.
  • Training stability is maintained, as gradients are bounded.

The GPPO loss for token-level policy optimization is:

$$
\mathcal{L}^{\text{GPPO}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}} \left[ \frac{1}{\sum_{j=1}^{M} T_j} \sum_{j=1}^{M} \sum_{t=1}^{T_j} \min\left( \delta\,\tilde{A}^{(j)},\ \operatorname{clip}\left(\delta,\ \frac{1-\epsilon_l}{\operatorname{sg}(\delta)}\,\delta,\ \frac{1+\epsilon_h}{\operatorname{sg}(\delta)}\,\delta \right) \tilde{A}^{(j)} \right) \right]
$$

where $\delta$ is the importance sampling ratio, $\tilde{A}^{(j)}$ is the group-relative advantage, and $\operatorname{sg}(\cdot)$ denotes the stop-gradient operator.
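
A minimal PyTorch-style sketch of the gradient-preserving clip follows, using `detach()` for the stop-gradient $\operatorname{sg}(\cdot)$. The shapes, the simple token-level averaging, and the default thresholds are simplifications of the objective above, not a faithful reproduction of the paper's training code.

```python
import torch

def gppo_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.2):
    """Gradient-preserving clipped objective in the spirit of GPPO.

    The forward value matches the usual clipped surrogate, but the clip
    boundaries are written as (1 -/+ eps) / sg(ratio) * ratio, so tokens whose
    ratio lies outside the clipping range keep a bounded, non-zero gradient
    instead of being zeroed out.
    """
    ratio = torch.exp(logp_new - logp_old)          # importance sampling ratio (delta)
    sg = ratio.detach()                             # stop-gradient copy of the ratio
    lower = (1.0 - eps_low) / sg * ratio            # equals 1 - eps_low in the forward pass
    upper = (1.0 + eps_high) / sg * ratio           # equals 1 + eps_high in the forward pass
    clipped_ratio = torch.max(torch.min(ratio, upper), lower)
    surrogate = torch.min(ratio * advantage, clipped_ratio * advantage)
    # Negate because optimizers minimize; average over tokens as in the objective above.
    return -surrogate.mean()
```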

Empirical Comparison

GPPO is compared to GRPO with Clip-Higher and to CISPO. GPPO consistently achieves higher and more stable performance on both mathematical and code RL benchmarks.

Figure 1: A comparison of GPPO, GRPO w/ Clip-Higher, and CISPO in mathematical RL training, showing superior and more stable learning for GPPO.

Reward Design and RL Data Filtering

Soft vs. Hard Reward in Code RL

Sparse rewards in code RL (reward only if all test cases pass) hinder learning. Klear-Reasoner adopts a soft reward proportional to the test case pass rate, providing denser and more informative feedback. This approach:

  • Increases average reward and reduces variance.
  • Improves final code RL performance (e.g., +1.8 points on LiveCodeBench V5).

Figure 2: Soft reward strategies yield higher and more stable rewards than hard reward strategies in code RL.
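
A minimal sketch of the two reward schemes, assuming per-test pass/fail outcomes are available for each generated program; the paper's actual reward implementation may include additional shaping.

```python
from typing import List

def code_reward(test_results: List[bool], soft: bool = True) -> float:
    """Reward for a generated program given per-test pass/fail outcomes.

    Hard reward: 1.0 only if every test passes, else 0.0 (sparse).
    Soft reward: fraction of tests passed (denser signal, lower variance).
    """
    if not test_results:
        return 0.0
    if soft:
        return sum(test_results) / len(test_results)
    return 1.0 if all(test_results) else 0.0

# Example: a solution passing 7 of 10 tests gets 0.7 under the soft scheme
# but 0.0 under the hard scheme.
print(code_reward([True] * 7 + [False] * 3, soft=True))   # 0.7
print(code_reward([True] * 7 + [False] * 3, soft=False))  # 0.0
```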

Data Filtering

For code RL, prompts are filtered to retain only those with a high estimated pass@16 (≥ 0.5), reducing noise from faulty test cases. For math RL, prompts are filtered based on rule-based validation of completions. Filtering improves both learning stability and final performance.

Figure 3: Code RL performance on LiveCodeBench V5 improves with filtered data compared to unfiltered data.
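
One way to implement the pass@16 filter is with the standard unbiased pass@k estimator; the estimator choice, field names, and data layout below are assumptions of this sketch rather than the paper's exact recipe.

```python
from math import comb
from typing import Dict, List

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c are correct, is correct.
    Assumes n >= k samples were drawn per prompt."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def filter_code_prompts(prompt_stats: Dict[str, Dict[str, int]],
                        k: int = 16, threshold: float = 0.5) -> List[str]:
    """Keep prompts whose estimated pass@16 is at least the threshold.

    prompt_stats maps prompt id -> {"n": generations sampled, "c": correct};
    the field names are illustrative placeholders.
    """
    return [pid for pid, s in prompt_stats.items()
            if pass_at_k(s["n"], s["c"], k) >= threshold]
```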

Zero-Advantage Group Filtering

In GRPO, groups in which all responses have zero advantage dilute the optimization signal. Filtering out these groups leads to more stable and consistent improvements in math RL.

Figure 4: Math RL performance on AIME 2024 is more stable and improves faster when zero-advantage groups are filtered.
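
A minimal sketch of zero-advantage group filtering for GRPO-style training; the tensor layout (one row of rollout rewards per prompt group, with more than one rollout per group) is an assumption of this illustration.

```python
import torch

def filter_zero_advantage_groups(rewards: torch.Tensor, eps: float = 1e-6):
    """Drop GRPO groups whose rollouts all received the same reward.

    rewards: tensor of shape (num_groups, group_size). When every reward in a
    group is identical, the group-relative advantages are all zero and the
    group contributes no gradient, only diluting the batch statistics.
    Returns a boolean mask over groups and the advantages of the kept groups.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    keep = std.squeeze(1) > eps                     # groups with reward variance
    advantages = (rewards - mean) / (std + eps)     # group-relative advantage
    return keep, advantages[keep]
```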

Training and Scaling Considerations

Klear-Reasoner-8B is initialized from Qwen3-8B-Base, fine-tuned with long CoT SFT, and then RL fine-tuned on math and code tasks. Training is performed with a maximum sequence length of 32K tokens, but inference is extended to 64K using YaRN scaling. RL is conducted in two stages (math, then code), with joint optimization of the GPPO and SFT losses (weighted by $\alpha = 0.1$). No KL loss is used.
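
A schematic of the joint RL objective as described (the GPPO policy loss plus an SFT term weighted by α = 0.1, with no KL penalty); the exact form and inputs of the SFT term here are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def joint_rl_loss(gppo_loss: torch.Tensor,
                  sft_logits: torch.Tensor,
                  sft_labels: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Joint objective during RL: GPPO loss plus an auxiliary SFT cross-entropy
    term weighted by alpha = 0.1; no KL penalty term is added.
    Assumes sft_logits has shape (batch, seq, vocab) and sft_labels uses -100
    to mask out prompt tokens (illustrative conventions, not the paper's code)."""
    sft_loss = F.cross_entropy(sft_logits.view(-1, sft_logits.size(-1)),
                               sft_labels.view(-1),
                               ignore_index=-100)
    return gppo_loss + alpha * sft_loss
```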

The model achieves strong results even with a 32K training context, and RL fine-tuning compensates for smaller SFT data volume, outperforming models trained with much larger SFT datasets.

Theoretical and Practical Implications

The GPPO algorithm provides a principled solution to the exploration-exploitation trade-off in RL for LLMs, enabling stable yet effective learning from both positive and negative samples. The findings on data curation challenge the prevailing emphasis on data diversity, instead highlighting the importance of high-quality, internally consistent reasoning data. The reward and data filtering strategies further demonstrate the necessity of careful RL signal design for complex reasoning tasks.

Practically, these insights inform the design of future LLM training pipelines for advanced reasoning, suggesting that:

  • Data quality and difficulty-aware curation are critical.
  • RL algorithms should preserve gradient information from all samples, not just those within clipping bounds.
  • Reward shaping and data filtering are essential for stable and efficient RL.

Future Directions

Potential extensions include:

  • Scaling GPPO to larger models and more diverse reasoning domains.
  • Automated difficulty estimation and dynamic data selection during training.
  • Further exploration of reward shaping and curriculum learning in RL for LLMs.
  • Integration with more advanced context window extension techniques for even longer reasoning chains.

Conclusion

Klear-Reasoner demonstrates that principled data curation, targeted SFT, and gradient-preserving RL optimization can jointly deliver substantial improvements in long-form reasoning for LLMs. The GPPO algorithm addresses key limitations of traditional RL clipping, enabling both stable training and enhanced exploration. These contributions provide a robust foundation for future research in advanced reasoning with LLMs.