Reasoning SFT Stage Overview
- Reasoning SFT Stage is the phase where models are fine-tuned with explicit chain-of-thought data to develop structured and interpretable reasoning abilities.
- It employs supervised fine-tuning on annotated reasoning traces that bias models toward stepwise explanations and provide an effective policy initialization for RL.
- The stage leverages diverse, high-quality datasets and teacher-forcing methods to enhance accuracy, generalization, and downstream performance.
A Reasoning SFT (Supervised Fine-Tuning) Stage denotes the phase in LLM or vision-LLM (VLM) training where models are fine-tuned using datasets containing explicit reasoning demonstrations—often structured as natural language chain-of-thought (CoT) traces, stepwise rationales, or formal solution procedures. The goal is to directly impart procedural and compositional reasoning by maximizing the likelihood of human-annotated, interpretable reasoning trajectories. The Reasoning SFT stage is now a principal component in modern instruction-following, mathematical, scientific, and multimodal LLMs, and its methodology, limitations, and interactions with subsequent reinforcement-based stages define the effectiveness and generalizability of advanced reasoning models.
1. Core Principles and Objectives
The Reasoning SFT stage follows unsupervised pretraining or multi-modal pretraining and is designed with several objectives:
- Instruction/Format Alignment: By training on large collections of (prompt, reasoning, answer) pairs, models are aligned to produce human-like, interpretable reasoning traces, often using a fixed response format such as
<think> ... reasoning ... </think> <answer> ... </answer>
(Tan et al., 26 Mar 2025, Li et al., 19 Jul 2025, Sun et al., 16 Apr 2025, Wang et al., 16 Oct 2025).
- Inductive Bias Towards Structured Reasoning: Explicit reasoning supervision biases models to produce stepwise decompositions and explanations, rather than single-step or surface-level answers (Tan et al., 26 Mar 2025, Sun et al., 16 Apr 2025, Pang et al., 14 Oct 2025).
- Activation of Latent Reasoning Capabilities: SFT “activates” the model’s inherent, but unexpressed, reasoning, providing the policy initialization necessary for efficient reinforcement learning or downstream reward-driven adaptation (Tan et al., 26 Mar 2025, Ou, 3 Sep 2025).
- Teacher-Forcing with Cross-Entropy Loss: The training objective is the token-level conditional log-likelihood
$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y)\sim\mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\left(y_t \mid x,\, y_{<t}\right),$$
with teacher-forcing on the correct tokens throughout each training sequence (Wang et al., 16 Oct 2025, Li et al., 19 Jul 2025).
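To make the objective concrete, the following is a minimal sketch (not any particular paper's implementation) of this teacher-forced, token-level cross-entropy loss in a PyTorch, Hugging Face-style setting; the function name compute_sft_loss, the (B, T) batching, and the prompt-masking convention are illustrative assumptions.

```python
# Minimal sketch: token-level SFT loss with teacher forcing, computed only over
# the reasoning + answer tokens. Assumes a causal LM whose forward pass returns
# an object with a .logits field (Hugging Face-style); names are illustrative.
import torch
import torch.nn.functional as F

def compute_sft_loss(model, input_ids, prompt_lens):
    """input_ids: (B, T) full sequences = prompt + <think>...</think><answer>...</answer>.
    prompt_lens: (B,) tensor with the number of prompt tokens per example (excluded from the loss)."""
    logits = model(input_ids).logits            # (B, T, V); teacher forcing: ground-truth tokens are the inputs
    shift_logits = logits[:, :-1, :]            # predict token t+1 from the prefix up to t
    shift_labels = input_ids[:, 1:].clone()

    # Mask out prompt tokens so the loss covers only the supervised reasoning trace.
    B, Tm1 = shift_labels.shape
    positions = torch.arange(Tm1, device=input_ids.device).unsqueeze(0)  # (1, T-1)
    label_mask = positions >= (prompt_lens.unsqueeze(1) - 1)             # first supervised target is token prompt_len
    shift_labels[~label_mask] = -100                                     # ignored by cross_entropy

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```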
2. Methodologies and Dataset Construction
Datasets and implementation details crucially shape the efficacy of the SFT stage.
- Data Sources and Annotation:
- Human-annotated: Curated, high-quality chain-of-thought traces, rationales, or deliberative answers, often costly to annotate at scale (Tan et al., 26 Mar 2025, Pang et al., 14 Oct 2025).
- Synthetically generated: LLMs (using prompt templates or bootstrapping) generate chain-of-thought for new inputs; these are filtered or verified before use (Li et al., 19 Jul 2025, Wang et al., 16 Oct 2025, Sun et al., 16 Apr 2025).
- Pattern-based: For patterned reasoning tasks, it suffices to annotate generic reasoning patterns and use LLMs to generate rationales according to the template (PARO), enabling annotation scaling with minimal human labor (Pang et al., 14 Oct 2025).
- Reasoning Trace Structure:
- Detailed, stepwise, and modular (e.g., in mathematical SFT: multi-step deduction, verification, error-checking; in VLMs: visually grounded CoT explanations).
- In plan-execute-style paradigms, SFT data is decomposed into separate plan and execution segments, with plan-adherence enforced during execution (Yang et al., 28 Sep 2025).
- Size and Diversity:
- Reasoning performance sharply increases with SFT dataset scale, especially for complex problems (Sun et al., 16 Apr 2025, Yoshihara et al., 11 Jul 2025).
- For mathematical LLMs, extended SFT (8–10 epochs) and a large, diverse corpus (e.g., 719K+ examples in MiroMind-M1) are critical for state-of-the-art accuracy (Li et al., 19 Jul 2025, Yoshihara et al., 11 Jul 2025).
- Quality Control and Filtering:
- Aggressive filtering by reward models and rule-based validation, with decontamination from test sets, is now standard (Li et al., 19 Jul 2025, Hao et al., 26 Jul 2025).
- Sequence (token) packing can be disabled in favor of no-packing, which works better for long CoT traces (Li et al., 19 Jul 2025).
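As a concrete illustration of the practices above, here is a hedged sketch of one data-construction step: formatting (prompt, reasoning, answer) triples into the think/answer template, rule-based answer verification, and simple n-gram decontamination against held-out test prompts. The helper verify_answer and the 8-gram overlap rule are assumptions, not any cited pipeline.

```python
# Illustrative sketch of SFT data construction: format (prompt, reasoning, answer)
# triples into <think>/<answer> targets, keep only traces whose final answer
# verifies, and drop examples that overlap with evaluation prompts.
# Helper names (verify_answer, test_prompts) are placeholders, not a specific pipeline.

def format_target(reasoning: str, answer: str) -> str:
    return f"<think>{reasoning.strip()}</think> <answer>{answer.strip()}</answer>"

def ngram_set(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1))}

def build_sft_dataset(raw_examples, test_prompts, verify_answer):
    """raw_examples: dicts with prompt / reasoning / answer / reference fields."""
    test_ngrams = set().union(*(ngram_set(p) for p in test_prompts)) if test_prompts else set()
    dataset = []
    for ex in raw_examples:
        if not verify_answer(ex["answer"], ex["reference"]):
            continue                             # rule-based validation: drop traces with wrong final answers
        if ngram_set(ex["prompt"]) & test_ngrams:
            continue                             # n-gram decontamination against the test set
        dataset.append({"prompt": ex["prompt"],
                        "target": format_target(ex["reasoning"], ex["answer"])})
    return dataset
```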
3. Empirical Effectiveness and Limitations
The direct effects (measured in diverse benchmarks) and remaining challenges of reasoning SFT are:
- Accuracy and Generalization:
- Enables significant accuracy gains over pretraining-only models on medium-difficulty reasoning (e.g., math, code) with even small-scale SFT on R1-style traces (Sun et al., 16 Apr 2025).
- For “patterned reasoning tasks,” reasoning pattern exposure suffices—annotation scale becomes sub-dominant (Pang et al., 14 Oct 2025).
- Trajectory Expansion vs. Compression:
- SFT increases the diversity and number of correct reasoning trajectories (“trajectory expansion”) (Matsutani et al., 25 Sep 2025).
- However, it also preserves diverse incorrect trajectories, so pass@1 accuracy may not improve without RL even as best-of-N performance rises; a pass@k sketch follows this list (Matsutani et al., 25 Sep 2025).
- Foundational for RL/Reward Optimization:
- SFT is necessary for stable and sample-efficient RL by providing an effective initialization that avoids reward sparsity and degenerate optimization (cold-start RL is generally much less effective) (Tan et al., 26 Mar 2025, Li et al., 19 Jul 2025, Ou, 3 Sep 2025, Yoshihara et al., 11 Jul 2025).
- Overfitting and OOD Forgetting:
- OOD reasoning peaks early during SFT and then declines with further training—a form of OOD forgetting that is not visible in standard validation loss (Jin et al., 8 Sep 2025).
- Heavy/extended SFT can “lock in” imitative, rigid reasoning patterns, especially in vision-LLMs, making further RL less effective or even detrimental (Chen et al., 10 Apr 2025, Guan et al., 15 Aug 2025).
- Expressivity Gaps:
- For very small models, SFT may be harmful if expert traces are too complex: small language models (SLMs) may be unable even to imitate them, impeding subsequent RL (Zhang et al., 20 Jun 2025).
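The trajectory-expansion observation above is usually quantified with pass@k-style metrics. The sketch below uses the standard unbiased pass@k estimator, 1 − C(n−c, k)/C(n, k), to show how best-of-N performance can rise while pass@1 stays flat; the concrete numbers are illustrative only.

```python
# Sketch of the pass@k estimate used to separate "trajectory expansion" from
# single-sample accuracy: with n sampled solutions per problem, of which c are
# correct, the unbiased estimator of pass@k is 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn without replacement
    from n generations containing c correct ones) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: SFT may widen the pool of correct trajectories without
# making the single most-likely path correct, so pass@8 improves while pass@1 is flat.
print(pass_at_k(n=16, c=4, k=1))  # ~0.25
print(pass_at_k(n=16, c=4, k=8))  # ~0.96
```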
4. Comparisons and Integration with Reinforcement Learning
SFT and RL have distinct, often complementary, influences on model reasoning:
- Exploration vs. Determinism:
- SFT’s teacher-forcing precludes exploration: the model only sees ground-truth continuations, so it cannot discover alternative solution paths or correct errors outside the labeled space (Wang et al., 16 Oct 2025).
- RL introduces exploration by rewarding semantically correct but structurally diverse solutions—even if not identical to SFT labels.
- Policy Distribution Effects:
- SFT acts as a “sledgehammer,” globally shrinking output entropy and enforcing high probability on target tokens everywhere (Fu et al., 24 Jun 2025, Matsutani et al., 25 Sep 2025).
- RL acts as a “scalpel,” modifying only high-entropy (uncertain) states and selectively refining the policy (Fu et al., 24 Jun 2025).
- Two-Stage and Single-Stage Schemes:
- Most strong models employ two-stage SFT→RL pipelines for best performance (SFT for capacity, RL for optimality and robustness) (Yoshihara et al., 11 Jul 2025, Wang et al., 16 Oct 2025, Tan et al., 26 Mar 2025, Ou, 3 Sep 2025).
- Hybrid/single-stage approaches (SRFT) dynamically weight the SFT and RL losses based on entropy, seeking to balance global structure with local exploration; a schematic of such entropy-gated weighting follows this list (Fu et al., 24 Jun 2025).
- Quantitative Gains:
- RL (e.g., RLSR) with semantic rewards outperforms SFT on instruction following; hybrid SFT+RLSR further boosts open-ended, generative performance (AlpacaEval win rates: SFT 21.0%, RLSR 26.3%, SFT+RLSR 30.7% on Qwen-7B) (Wang et al., 16 Oct 2025).
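To illustrate how a single-stage scheme might interpolate between the two objectives, the following schematic blends an SFT cross-entropy term and a simple policy-gradient surrogate with an entropy-dependent gate. This is a sketch of the general idea, not the SRFT algorithm; the gate shape, the reference entropy, and the placeholder RL term are all assumptions.

```python
# Schematic only: one way to blend an SFT cross-entropy term with a simple
# policy-gradient surrogate using an entropy-dependent gate, loosely in the
# spirit of single-stage hybrids. Not the SRFT algorithm; the gate shape, the
# reference entropy (1.0), and the placeholder RL term are assumptions.
import torch

def hybrid_loss(logits, sft_labels, rl_advantages, rl_logprobs, alpha: float = 1.0):
    """logits: (B, T, V) policy outputs; sft_labels: (B, T) targets already
    aligned with logits (-100 = ignore; shifting omitted for brevity);
    rl_advantages, rl_logprobs: (B,) per-sample reinforcement terms."""
    # Mean token-level predictive entropy of the current policy.
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1).mean()

    sft_loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), sft_labels.reshape(-1), ignore_index=-100)
    rl_loss = -(rl_advantages * rl_logprobs).mean()  # vanilla REINFORCE-style surrogate

    # Gate between the two losses as a function of entropy; the sign and shape
    # of this gate are an illustrative design choice, not a prescribed recipe.
    w_sft = torch.sigmoid(alpha * (entropy - 1.0))
    return w_sft * sft_loss + (1.0 - w_sft) * rl_loss
```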
5. Data Design, Pattern Supervision, and Emerging Paradigms
Data design strongly determines SFT effectiveness, generalization, and scalability.
- Patterned Tasks and Rationale Automation:
- For tasks with a fixed procedural pattern (classification, verification), minimal annotated rationales plus pattern-guided LLM rationale generation suffice (PARO: SFT+RLVR with LLM-generated rationales matches roughly 10× the human-rationale annotation scale) (Pang et al., 14 Oct 2025).
- Plan-Execution Decoupling:
- SFT can be structured to output plans and executions in a causally-adhered, separated manner, enabling far more efficient reasoning, cross-domain transfer, and interpretability (Yang et al., 28 Sep 2025).
- Selective and Efficient Reasoning:
- SFT methodology can encode selective behaviors (e.g., “thought dropout,” which lets the model skip reasoning on easy problems and unlocks significant efficiency gains in VLMs; a data-level sketch follows this list) (Wang et al., 22 May 2025).
- Data Quality over Quantity:
- For mathematical and code reasoning, SFT dataset quality—i.e., the correctness, clarity, and CoT depth of reasoning traces—trumps raw scale; scaling SFT in low-quality, mixed-data regimes may harm reasoning (Akter et al., 26 Sep 2025, Sun et al., 16 Apr 2025).
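As one plausible data-level realization of thought dropout (a sketch under assumptions, not the cited paper's exact recipe), the reasoning span of a target can be replaced with an empty think block with some probability, so the model learns that skipping the trace is permitted on easy inputs.

```python
# One plausible data-level realization of "thought dropout" (illustrative only):
# with probability drop_prob, replace the reasoning span with an empty <think>
# block so the model learns that skipping the trace is a legal behavior.
import random

def apply_thought_dropout(target: str, drop_prob: float = 0.3, rng=random) -> str:
    """target: a '<think> ... </think> <answer> ... </answer>' training string."""
    if rng.random() >= drop_prob:
        return target
    start = target.find("<think>")
    end = target.find("</think>")
    if start == -1 or end == -1:
        return target  # leave malformed targets untouched
    return target[:start] + "<think></think>" + target[end + len("</think>"):]
```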
6. Limitations and Best Practices
- Ceiling Effects:
- SFT alone carries reasoning performance only up to a “Hard” tier of task difficulty, with accuracy plateauing (e.g., at 65% on hard AIME24 problems) regardless of further SFT scaling or curation (Sun et al., 16 Apr 2025).
- Exceptional, out-of-domain, or “unconventional” problem solving generally requires new architectural, program-augmented, or externally conditioned training paradigms.
- SFT as Foundation, Not Panacea:
- SFT is necessary but not sufficient for state-of-the-art reasoning: without subsequent RL, semantic reward optimization, or plan-execute architecture, SFT-imparted skills are limited to imitative, rigid, or failure-prone modes on complex, compositional reasoning (Wang et al., 16 Oct 2025, Matsutani et al., 25 Sep 2025, Chen et al., 10 Apr 2025).
- Ongoing Directions:
- Optimization of SFT trajectories for variety and coverage, rather than correctness alone, and closer integration of entropy diagnostics to regulate SFT/RL phases are emerging as effective methodologies (Matsutani et al., 25 Sep 2025, Fu et al., 24 Jun 2025).
- Pattern-based rationale annotation, causally-structured plan-execute data generation, and difficulty stratification via teacher models (targeting the student's zone of proximal development, ZPD) are shaping future scalable, automated SFT pipelines (Yang et al., 28 Sep 2025, Pang et al., 14 Oct 2025, Ou, 3 Sep 2025).
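A minimal sketch of difficulty stratification with a teacher model follows, assuming a verifiable sampler and illustrative pass-rate band edges: problems whose teacher pass rate falls in a middle band are retained as a rough ZPD proxy.

```python
# Hedged sketch of difficulty stratification with a teacher model: estimate each
# problem's teacher pass rate over k samples and keep problems in a middle band
# (neither trivial nor hopeless) as a rough ZPD proxy. sample_and_check and the
# band edges (0.2, 0.8) are illustrative placeholders.

def stratify_by_difficulty(problems, sample_and_check, k: int = 8,
                           low: float = 0.2, high: float = 0.8):
    """sample_and_check(problem) -> bool: one verified teacher attempt."""
    kept = []
    for problem in problems:
        pass_rate = sum(sample_and_check(problem) for _ in range(k)) / k
        if low <= pass_rate <= high:
            kept.append((problem, pass_rate))
    return kept
```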
Summary Table: SFT Stage Roles and Trade-offs
| Aspect | SFT Stage Contribution | Limitation/Trade-off |
|---|---|---|
| Alignment/Format | Strong task and format imitation | Rigid reasoning, limited exploration |
| Reasoning Expansion | Broadens the space of correct solutions | Retains diverse incorrect trajectories |
| OOD Generalization | Peaks early in SFT, then declines | OOD forgetting with continued SFT |
| RL Foundation | Initializes stable policy for RL | Can anchor model, reduce RL capacity |
| Data Requirements | Quality of CoT/plan pattern key | Scale alone can plateau or harm |
The Reasoning SFT stage thus remains indispensable for imparting interpretable, compositional reasoning to advanced LLMs and VLMs, but must be carefully designed for task structure, data scale/quality, procedural pattern representation, and later integration with reward-driven optimization to yield robust, domain-adaptive, and generalizable reasoning models.