
Reasoning SFT Stage Overview

Updated 30 October 2025
  • Reasoning SFT Stage is the phase where models are fine-tuned with explicit chain-of-thought data to develop structured and interpretable reasoning abilities.
  • It employs supervised fine-tuning using annotated reasoning traces that bias models toward stepwise explanations and effective policy initialization for RL.
  • The stage leverages diverse, high-quality datasets and teacher-forcing methods to enhance accuracy, generalization, and downstream performance.

A Reasoning SFT (Supervised Fine-Tuning) Stage denotes the phase in LLM or vision-language model (VLM) training where models are fine-tuned on datasets containing explicit reasoning demonstrations, often structured as natural-language chain-of-thought (CoT) traces, stepwise rationales, or formal solution procedures. The goal is to directly impart procedural and compositional reasoning by maximizing the likelihood of human-annotated, interpretable reasoning trajectories. The Reasoning SFT stage is now a principal component in the training of modern instruction-following, mathematical, scientific, and multimodal LLMs, and its methodology, limitations, and interactions with subsequent reinforcement-based stages define the effectiveness and generalizability of advanced reasoning models.

1. Core Principles and Objectives

The Reasoning SFT stage follows unsupervised pretraining or multi-modal pretraining and is designed with several objectives:

  1. Instruction/Format Alignment: By training on large collections of (prompt, reasoning, answer) pairs, models are aligned to produce human-like, interpretable reasoning traces—often using a fixed response format such as
    <think> ... reasoning ... </think> <answer> ... </answer>
    (Tan et al., 26 Mar 2025, Li et al., 19 Jul 2025, Sun et al., 16 Apr 2025, Wang et al., 16 Oct 2025).
  2. Inductive Bias Towards Structured Reasoning: Explicit reasoning supervision biases models to produce stepwise decompositions and explanations, rather than single-step or surface-level answers (Tan et al., 26 Mar 2025, Sun et al., 16 Apr 2025, Pang et al., 14 Oct 2025).
  3. Activation of Latent Reasoning Capabilities: SFT “activates” the model’s inherent, but unexpressed, reasoning, providing the policy initialization necessary for efficient reinforcement learning or downstream reward-driven adaptation (Tan et al., 26 Mar 2025, Ou, 3 Sep 2025).
  4. Teacher-Forcing with Cross-Entropy Loss: The training objective is the token-level conditional log-likelihood:

\mathcal{L}_\mathrm{SFT}(\pi_\theta) = - \sum_{i=1}^{N} \log \pi_\theta(y_i \mid x, y_{<i})

with teacher-forcing on the correct tokens throughout each training sequence (Wang et al., 16 Oct 2025, Li et al., 19 Jul 2025).
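
The objective above is ordinary next-token cross-entropy computed with teacher forcing, typically with the loss masked so that only response tokens (the reasoning trace and final answer) are supervised. Below is a minimal sketch, assuming a Hugging Face-style causal LM and tokenizer; the helper names and the exact <think>/<answer> serialization are illustrative rather than taken from any of the cited papers.

```python
import torch
import torch.nn.functional as F

def build_example(tokenizer, prompt, reasoning, answer):
    """Assemble one SFT example in a <think>/<answer> response format (illustrative).

    Prompt tokens get label -100 so only the reasoning trace and answer are
    supervised, i.e., teacher forcing on y given x.
    """
    response = f"<think> {reasoning} </think> <answer> {answer} </answer>"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return torch.tensor([input_ids]), torch.tensor([labels])

def sft_loss(model, input_ids, labels):
    """Token-level conditional log-likelihood: -sum_i log p(y_i | x, y_<i)."""
    logits = model(input_ids).logits            # (1, T, V)
    shift_logits = logits[:, :-1, :]            # predict token t+1 from prefix <= t
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,                      # prompt tokens contribute no loss
    )
```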

2. Methodologies and Dataset Construction

Datasets and implementation details crucially shape the efficacy of the SFT stage.

  • Data Sources and Annotation:
  • Reasoning Trace Structure:
    • Detailed, stepwise, and modular (e.g., in mathematical SFT: multi-step deduction, verification, error-checking; in VLMs: visually grounded, CoT explanations).
    • In E²C-style paradigms, SFT data is decomposed into (plan, execution) pairs with enforced plan-adherence during execution (Yang et al., 28 Sep 2025); a schematic serialization of such records appears after this list.
  • Size and Diversity:
  • Quality Control and Filtering:
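
The plan/execution decoupling mentioned above can be made concrete with a small data-construction sketch. The record fields and tag names below are assumptions for illustration, not the exact format of Yang et al. (28 Sep 2025); the point is only that the plan is emitted first and the execution is supervised to adhere to it.

```python
from dataclasses import dataclass

@dataclass
class ReasoningRecord:
    """One supervised example with an explicit plan/execution split (illustrative)."""
    prompt: str
    plan: str        # high-level steps the execution must follow
    execution: str   # step-by-step reasoning that adheres to the plan
    answer: str

def to_training_text(rec: ReasoningRecord) -> str:
    """Serialize a record so the plan precedes, and conditions, the execution.

    Tag names are hypothetical; the key idea is that the execution is trained
    to causally adhere to the emitted plan.
    """
    return (
        f"{rec.prompt}\n"
        f"<plan> {rec.plan} </plan>\n"
        f"<think> {rec.execution} </think>\n"
        f"<answer> {rec.answer} </answer>"
    )
```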

3. Empirical Effectiveness and Limitations

The direct effects of reasoning SFT, as measured on diverse benchmarks, and its remaining challenges are:

  • Accuracy and Generalization:
    • Enables significant accuracy gains over pretraining-only models on medium-difficulty reasoning (e.g., math, code), even with small-scale SFT on R1-style traces (Sun et al., 16 Apr 2025).
    • For “patterned reasoning tasks,” exposure to the reasoning pattern suffices; annotation scale becomes secondary (Pang et al., 14 Oct 2025).
  • Trajectory Expansion vs. Compression:
    • SFT increases the diversity and number of correct reasoning trajectories (“trajectory expansion”) (Matsutani et al., 25 Sep 2025).
    • However, it also preserves diverse incorrect trajectories, so pass@1 accuracy may not improve without RL, but best-of-k performance rises (Matsutani et al., 25 Sep 2025); see the pass@k sketch after this list.
  • Foundational for RL/Reward Optimization:
  • Overfitting and OOD Forgetting:
    • OOD reasoning peaks early during SFT and then declines with further training—a form of OOD forgetting that is not visible in standard validation loss (Jin et al., 8 Sep 2025).
    • Heavy/extended SFT can “lock in” imitative, rigid reasoning patterns, especially in vision-LLMs, making further RL less effective or even detrimental (Chen et al., 10 Apr 2025, Guan et al., 15 Aug 2025).
  • Expressivity Gaps:
    • For very small models, SFT may be harmful if expert traces are too complex: small language models (SLMs) may be unable to imitate them, impeding subsequent RL (Zhang et al., 20 Jun 2025).
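
To make the trajectory-expansion observation concrete, one can sample n candidate solutions per problem and compare pass@1 with best-of-k. The sketch below uses the standard unbiased combinatorial pass@k estimator; this estimator is a common convention assumed here, not necessarily the protocol of Matsutani et al.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: SFT broadens the set of correct trajectories (higher c out of n),
# which lifts best-of-8 even when pass@1 barely moves.
n, c = 16, 4
print(pass_at_k(n, c, 1))  # 0.25  (pass@1)
print(pass_at_k(n, c, 8))  # ~0.96 (best-of-8)
```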

4. Comparisons and Integration with Reinforcement Learning

SFT and RL have distinct, often complementary, influences on model reasoning:

  • Exploration vs. Determinism:
    • SFT’s teacher-forcing precludes exploration: the model only sees ground-truth continuations, and so cannot discover alternative paths or correct errors outside the labeled space (Wang et al., 16 Oct 2025); a minimal policy-gradient sketch of this exploration step appears after this list.
    • RL introduces exploration by rewarding semantically correct but structurally diverse solutions—even if not identical to SFT labels.
  • Policy Distribution Effects:
  • Two-Stage and Single-Stage Schemes:
  • Quantitative Gains:
    • RL (e.g., RLSR) with semantic rewards outperforms SFT on instruction following; hybrid SFT+RLSR further boosts open-ended, generative performance (AlpacaEval win rates: SFT 21.0%, RLSR 26.3%, SFT+RLSR 30.7% on Qwen-7B) (Wang et al., 16 Oct 2025).
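
The exploration contrast can be illustrated with a bare-bones policy-gradient step: the model samples its own continuation, a reward is computed on the sampled text, and that sample is reinforced, so structurally different but semantically correct solutions can be rewarded. This is a generic REINFORCE sketch under an assumed reward_fn (e.g., a verifier or semantic-similarity score); it is not the specific RLSR algorithm of Wang et al.

```python
import torch

def reinforce_step(model, tokenizer, prompt, reward_fn, optimizer, max_new_tokens=256):
    """One policy-gradient update: sample a response, score it, reinforce it.

    Unlike SFT teacher forcing, the gradient flows through tokens the model
    itself chose. `reward_fn` is an assumed callable returning a scalar
    reward for (prompt, generated_text).
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=max_new_tokens,
        return_dict_in_generate=True,
    )
    seq = out.sequences                              # (1, prompt_len + gen_len)
    prompt_len = inputs["input_ids"].shape[1]
    gen_text = tokenizer.decode(seq[0, prompt_len:], skip_special_tokens=True)
    reward = reward_fn(prompt, gen_text)

    logits = model(seq).logits[:, :-1, :]            # re-score the sampled sequence
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_lp = token_lp[:, prompt_len - 1:].sum()      # log-prob of sampled continuation

    loss = -reward * gen_lp                          # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```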

5. Data Design, Pattern Supervision, and Emerging Paradigms

Data design strongly determines SFT effectiveness, generalization, and scalability.

  • Patterned Tasks and Rationale Automation:
    • For tasks with a fixed procedural pattern (classification, verification), minimal annotated rationales plus pattern-guided LLM rationale generation suffice (PARO: SFT+RLVR with LLM-generated rationales matches 10× the human-rationale scale) (Pang et al., 14 Oct 2025).
  • Plan-Execution Decoupling:
    • SFT can be structured to output plans and executions in a causally-adhered, separated manner (E²C), enabling far more efficient reasoning, cross-domain transfer, and interpretability (Yang et al., 28 Sep 2025).
  • Selective and Efficient Reasoning:
    • SFT methodology can encode selective behaviors (e.g., “thought dropout,” which allows the model to skip reasoning on easy problems, unlocking significant efficiency gains in VLMs) (Wang et al., 22 May 2025); a sketch of this idea appears after this list.
  • Data Quality over Quantity:
    • For mathematical and code reasoning, SFT dataset quality (the correctness, clarity, and CoT depth of reasoning traces) trumps raw scale; scaling SFT in low-quality, mixed-data regimes may harm reasoning (Akter et al., 26 Sep 2025, Sun et al., 16 Apr 2025).
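
As an illustration of the “thought dropout” idea referenced above, the sketch below optionally strips the reasoning span from easy training examples so the model can learn to answer some inputs directly. The dropout probability, difficulty heuristic, and empty-think convention are assumptions for the sketch, not the recipe of Wang et al. (22 May 2025).

```python
import random

def apply_thought_dropout(example, p_drop=0.3, easy_threshold=0.9):
    """Optionally remove the reasoning span so the model learns it may skip
    deliberation on easy inputs and emit the answer directly.

    `example` is assumed to carry a difficulty/confidence estimate; both the
    heuristic and the empty <think> convention are illustrative.
    """
    is_easy = example.get("estimated_solve_rate", 0.0) >= easy_threshold
    if is_easy and random.random() < p_drop:
        target = f"<think> </think> <answer> {example['answer']} </answer>"
    else:
        target = (
            f"<think> {example['reasoning']} </think> "
            f"<answer> {example['answer']} </answer>"
        )
    return {"prompt": example["prompt"], "target": target}
```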

6. Limitations and Best Practices

  • Ceiling Effects:
    • SFT alone extends reasoning up to a “Hard” tier of task difficulty, with accuracy plateauing (e.g., ~65% on hard AIME24 problems), regardless of further SFT scaling or curation (Sun et al., 16 Apr 2025).
    • Exceptional, out-of-domain, or “unconventional” problem solving generally requires new architectural, program-augmented, or externally conditioned training paradigms.
  • SFT as Foundation, Not Panacea:
  • Ongoing Directions:

Summary Table: SFT Stage Roles and Trade-offs

| Aspect | SFT Stage Contribution | Limitation / Trade-off |
|---|---|---|
| Alignment / Format | Strong task and format imitation | Rigid reasoning, limited exploration |
| Reasoning Expansion | Broadens the space of correct solutions | Retains diverse errors; not precise enough alone |
| OOD Generalization | Peaks early in SFT, then declines | OOD forgetting with continued SFT |
| RL Foundation | Initializes a stable policy for RL | Can anchor the model, reducing RL gains |
| Data Requirements | Quality of CoT/plan patterns is key | Scale alone can plateau or harm |

The Reasoning SFT stage thus remains indispensable for imparting interpretable, compositional reasoning to advanced LLMs and VLMs, but must be carefully designed for task structure, data scale/quality, procedural pattern representation, and later integration with reward-driven optimization to yield robust, domain-adaptive, and generalizable reasoning models.
