Klear-Reasoner: Advanced Multi-Step Reasoning
- Klear-Reasoner is a large language model optimized for extended reasoning in math and program synthesis, leveraging long chain-of-thought supervised fine-tuning.
- It integrates carefully curated high-quality demonstrations with Gradient-Preserving Clipping Policy Optimization (GPPO), a clipping scheme that retains learning signals from both exploratory and negative (suboptimal) outputs.
- Empirical results show improved accuracy on mathematical and coding benchmarks and robust convergence under a joint SFT and RL training strategy.
Klear-Reasoner is an LLM optimized for extended, high-fidelity multi-step reasoning, particularly in mathematical problem solving and program synthesis. It combines long chain-of-thought supervised fine-tuning with an advanced reinforcement learning (RL) post-training algorithm, Gradient-Preserving Clipping Policy Optimization (GPPO), to achieve state-of-the-art results on mathematical and coding benchmarks. The workflow emphasizes careful curation of training data, robust handling of suboptimal and exploratory outputs during RL, and comprehensive ablations to dissect the impact of core methodological components (Su et al., 11 Aug 2025).
1. Long Chain-of-Thought Supervised Fine-Tuning
Long chain-of-thought supervised fine-tuning (long CoT SFT) is the foundation of Klear-Reasoner's reasoning capacity. Rather than relying on large, diverse, but noisy datasets, the SFT phase uses a tightly curated selection of high-quality reasoning demonstrations sourced from repositories such as OpenThoughts, NuminaMath, and OpenCodeReasoning. This ensures that the model observes consistent, stepwise solution processes across extended contexts.
The SFT objective follows the standard autoregressive LLM loss applied over long sequences:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t}\log \pi_\theta\!\left(y_t \mid x,\, y_{<t}\right),$$

where $y_t$ are the output tokens and $(x, y_{<t})$ is the conditioning context (prompt plus previously generated reasoning tokens). Direct exposure to high-difficulty and even imperfect demonstrations is maintained; the inclusion of hard or partially incorrect samples introduces critical contrastive learning signals that increase the model's discrimination between faulty and valid reasoning patterns.
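A minimal PyTorch-style sketch of this token-level objective is given below; the function signature, the HuggingFace-style `.logits` interface, and the masking of prompt tokens are illustrative assumptions rather than the released training code.

```python
import torch
import torch.nn.functional as F

def long_cot_sft_loss(model, input_ids, attention_mask, prompt_lengths):
    """Autoregressive loss over long chain-of-thought demonstrations.

    input_ids:      (batch, seq_len) prompt + full reasoning trace + answer
    prompt_lengths: (batch,) number of prompt tokens to exclude from the loss
    """
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

    # Shift so that position t predicts token t+1 (standard causal-LM setup).
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Supervise only the reasoning/answer span: mask prompt tokens and padding.
    positions = torch.arange(shift_labels.size(1), device=input_ids.device)
    in_prompt = positions.unsqueeze(0) < (prompt_lengths.unsqueeze(1) - 1)
    shift_labels[in_prompt] = -100
    shift_labels[attention_mask[:, 1:] == 0] = -100

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```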
Empirically, the SFT phase alone establishes a strong performance baseline, instilling the model with structured long-form reasoning strategies that can be further refined by reinforcement learning.
2. Gradient-Preserving Clipping Policy Optimization (GPPO) in Reinforcement Learning
The RL stage is implemented via GPPO, an innovation over Proximal Policy Optimization and its variants. Traditional PPO applies a clipping mechanism to the policy update ratio to maintain a “trust region,” but this introduces two crucial issues for reasoning models:
- Entropic, high-variance tokens—often where critical exploration is needed—get clipped harshly, diminishing exploration.
- Negative (suboptimal) trajectories offer no gradient update when their importance ratio falls below the clipping threshold, discarding useful corrective signals.
To address this, GPPO modifies the backward pass: even when the forward loss uses a clipped importance ratio $r_{i,t}$, the backward (gradient) computation preserves a bounded, nonzero gradient for clipped tokens, especially for negative examples. The loss is defined as

$$\mathcal{L}_{\text{GPPO}}(\theta) = -\,\mathbb{E}\!\left[\frac{1}{\sum_i |o_i|}\sum_i \sum_t \min\!\left(r_{i,t}\,\hat{A}_i,\;\frac{\operatorname{clip}\!\big(r_{i,t},\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}}\big)}{\operatorname{sg}(r_{i,t})}\, r_{i,t}\,\hat{A}_i\right)\right],\qquad r_{i,t}=\frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t}\mid q,\,o_{i,<t})},$$

with $\hat{A}_i$ the (normalized) group advantage, $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ the clipping thresholds, and $\operatorname{sg}(\cdot)$ the stop-gradient operator.
The resulting gradient is

$$\nabla_\theta \mathcal{L}_{\text{GPPO}}(\theta) = -\,\mathbb{E}\!\left[\frac{1}{\sum_i |o_i|}\sum_i \sum_t \kappa_{i,t}\,\hat{A}_i\,\nabla_\theta r_{i,t}\right],\qquad \kappa_{i,t}=\begin{cases}\dfrac{1-\epsilon_{\text{low}}}{r_{i,t}}, & r_{i,t}<1-\epsilon_{\text{low}}\ \text{and}\ \hat{A}_i<0,\\[6pt] \dfrac{1+\epsilon_{\text{high}}}{r_{i,t}}, & r_{i,t}>1+\epsilon_{\text{high}}\ \text{and}\ \hat{A}_i>0,\\[6pt] 1, & \text{otherwise}.\end{cases}$$

This ensures that both positive exploratory and negative corrective signals persist, stabilizing and accelerating RL convergence on long reasoning chains.
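A minimal PyTorch-style sketch of this gradient-preserving clipping is shown below, assuming GRPO-style normalized group advantages and a token-level batch layout; the variable names and default thresholds are illustrative, not the reference implementation.

```python
import torch

def gppo_token_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Gradient-preserving clipped surrogate over a (batch, seq_len) token grid.

    logp_new / logp_old: log-probs of sampled tokens under the current / behavior policy
    advantages:          normalized group advantage, broadcast across tokens
    """
    ratio = torch.exp(logp_new - logp_old)

    # Standard PPO uses clip(ratio) directly, which zeroes gradients for clipped tokens.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)

    # Stop-gradient trick: the forward value equals the clipped ratio, but the
    # backward pass keeps a bounded, nonzero gradient scaled by clip(r)/r.
    grad_preserving = (clipped / ratio.detach()) * ratio

    surrogate = torch.minimum(ratio * advantages, grad_preserving * advantages)
    return -surrogate.mean()
```

For clipped tokens the forward value coincides with the PPO surrogate, while the detached denominator makes the backward pass apply exactly the bounded factor $\kappa_{i,t}$ above instead of a zero gradient.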
3. Joint Objective and RL Workflow
The RL objective combines the GPPO loss with the SFT loss, controlled by a mixing coefficient $\lambda$:

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{GPPO}}(\theta) + \lambda\,\mathcal{L}_{\text{SFT}}(\theta).$$

Ablation experiments identify an optimal setting of $\lambda$ at which SFT supervision regularizes RL updates, mitigates reward hacking, and preserves general language modeling capacities.
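The following short sketch shows how the two losses can be combined in a single optimization step; the callables, batch structures, and optimizer handling are assumptions for illustration.

```python
import torch

def joint_training_step(model, rl_batch, sft_batch, optimizer,
                        gppo_loss_fn, sft_loss_fn, lam):
    """One update combining the GPPO surrogate with lambda-weighted SFT regularization.

    gppo_loss_fn / sft_loss_fn: callables returning scalar losses for their batches
    lam:                        mixing coefficient on the SFT term
    """
    rl_loss = gppo_loss_fn(model, rl_batch)     # GPPO loss on RL rollouts
    sft_loss = sft_loss_fn(model, sft_batch)    # SFT loss on curated long-CoT data

    loss = rl_loss + lam * sft_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```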
The RL process further employs:
- Soft reward shaping (graded by test case pass rates), which yields more stable gradients than hard binary rewards;
- Filtering of zero-advantage groups to focus learning updates on informative samples (both mechanisms are sketched below).
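A brief sketch of these two mechanisms, assuming a GRPO-style grouping of rollouts per prompt, is given below; the group data structure, the pass-rate reward, and the tolerance are illustrative assumptions.

```python
import torch

def soft_reward(num_passed, num_tests):
    """Graded reward: fraction of test cases passed, instead of a hard 0/1 signal."""
    return num_passed / max(num_tests, 1)

def group_advantages(rewards):
    """GRPO-style normalized advantage within one prompt's group of rollouts."""
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def filter_zero_advantage_groups(groups, tol=1e-8):
    """Drop groups whose rewards are (nearly) identical: their advantages are ~0
    and they contribute no learning signal."""
    return [
        g for g in groups
        if torch.as_tensor(g["rewards"], dtype=torch.float32).std() > tol
    ]
```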
GPPO is contrasted with contemporary alternatives such as Clip-Higher and CISPO: GPPO preserves gradient flow and converges more robustly, employing a pessimistic update that accepts negative feedback while avoiding instability from over-optimistic changes.
4. Data Selection, Supervision Sources, and Generalization
Selecting high-quality, consistent demonstrations, preferably generated by strong teacher models (e.g., DeepSeek-R1-0528), proves more beneficial than merely increasing data diversity or quantity. For easy tasks, filtering on correctness is useful; for difficult problems, retaining unfiltered, even imperfect, CoT traces aids model calibration by providing essential negative examples for challenging multi-step reasoning.
Empirical analysis reveals that using higher-quality (stronger) teacher models in SFT correlates with improved benchmark results. Filtering strategies on both data and reward signals further fine-tune model reliability and reduce convergence times.
5. Empirical Performance on Reasoning Benchmarks
Klear-Reasoner demonstrates strong performance on mathematical reasoning and program synthesis benchmarks, including:
- AIME 2024: 90.5% accuracy (64K inference budget).
- AIME 2025: 83.2% accuracy (64K budget).
- LiveCodeBench V5: 66.0% pass rate.
- LiveCodeBench V6: 58.1% pass rate.
These figures are achieved with a base model trained with only a 32K context window, underscoring the effect of long CoT SFT, gradient-preserving exploration, and large inference budgets on robust reasoning and long-range chain maintenance.
6. Ablations and Methodological Insights
Ablation studies elucidate several key findings:
- Inclusion of non-perfect (partially incorrect) hard samples aids exploration for complex reasoning tasks.
- The GPPO mechanism’s gradient-preserving behavior enables the model to exploit negative RL signals, contrasting with conventional methods that suppress these trajectories.
- Soft reward shaping and filtering out zero-advantage groups further stabilize training dynamics and accelerate learning.
- Joint tuning with the SFT loss acts as an anchor, balancing exploration against continuity of general language modeling ability.
7. Significance and Implications
Klear-Reasoner’s methodological advances address barriers to both the reproducibility and the scaling of LLM-based reasoning. By transparently documenting the RL pipeline, dataset curation, and gradient-handling strategies, the risk of irreproducible, opaque performance claims is mitigated. The combination of structured CoT induction, robust RL exploration via GPPO, and nuanced ablation-driven refinement positions Klear-Reasoner as a leading architecture for mathematical and logical reasoning applications.
A plausible implication is that future reasoning models should preferentially adopt gradient-preserving RL strategies to unlock learning signals from both positive and negative exploration, and that careful curation—not sheer data scale—remains central for sophisticated long-form reasoning under supervision and reinforcement. These results set a precedent for subsequent work in efficient, scalable, and robust reasoning model development (Su et al., 11 Aug 2025).