
Reasoning-Focused Supervised Fine-Tuning

Updated 16 January 2026
  • Reasoning-Focused SFT is a supervised protocol that employs explicit multi-step chain-of-thought demonstrations to improve model reasoning abilities.
  • It enhances sample efficiency and transfer across varied tasks such as mathematical reasoning, visual question answering, and code generation.
  • Advanced techniques like Critical Token Fine-Tuning and entropy-aware weighting optimize convergence and mitigate issues like reward hacking.

Reasoning-Focused Supervised Fine-Tuning (SFT) refers to a suite of supervised adaptation protocols which use explicit, high-quality demonstrations of multi-step reasoning to endow large language and multimodal models with robust deductive, inductive, or procedural reasoning abilities. Rather than focusing solely on learning answer distributions or surface instruction-following, reasoning-focused SFT leverages chain-of-thought (CoT) traces, structured rationales, or annotated intermediate steps to encourage compositional skill, transfer to out-of-distribution tasks, and resilience to reward gaming. This paradigm has established itself as a critical component in state-of-the-art pipelines for mathematical, scientific, and visual reasoning, especially when model capacity, data resources, and domain transfer are limiting factors.

1. Definition and Formalism

The canonical reasoning-focused SFT objective is next-token cross-entropy over chain-of-thought demonstrations. For a model with token-level policy $\pi_\theta(o \mid x)$ and dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}$, each $y^{(i)}$ is an expert-generated trace, often formatted as a step-by-step reasoning block followed by `<answer>…</answer>`. The loss is

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{|y^{(i)}|} \log \pi_\theta\left(y_t^{(i)} \mid x^{(i)}, y_{<t}^{(i)}\right)$$

This enforces pure imitation learning of the full, stepwise reasoning trajectory, without any extrinsic reward or preference signal (Yu et al., 14 Dec 2025). SFT is typically implemented as teacher-forcing with cross-entropy, but recent extensions use weightings based on entropy, demonstration quality, or critical-token identification.
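As a concrete illustration, the loss above can be sketched in a few lines of plain Python. The per-token log-probabilities stand in for real model outputs, so the function name and toy values are illustrative rather than drawn from any cited implementation:

```python
import math

def sft_loss(log_probs_per_example):
    """Token-level SFT cross-entropy: the mean over demonstrations of the
    summed negative log-probability of each target token.

    log_probs_per_example[i][t] stands in for the model's
    log pi_theta(y_t | x, y_<t) on token t of demonstration i.
    """
    total = -sum(sum(lp) for lp in log_probs_per_example)
    return total / len(log_probs_per_example)

# Toy check: two traces whose tokens each receive probability 0.5,
# so the loss is (3 + 2) * ln(2) / 2.
toy = [[math.log(0.5)] * 3, [math.log(0.5)] * 2]
loss = sft_loss(toy)
```

Note that the inner sum runs over the full trace, so longer demonstrations contribute proportionally more gradient signal, which is one reason the weighting schemes discussed below become relevant.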

2. Data Regimes, Architectures, and Task Scope

Research demonstrates that reasoning-focused SFT is highly sample-efficient for lightweight and mid-sized models and across a variety of architectures, including:

  • Vision-LLMs (Qwen2-VL-2B/3B/7B, Qwen2.5-VL-3B/7B)
  • Text-only LLMs (Qwen-family, LLaMA, OLMo, Flan-T5)
  • Multimodal models with bespoke vision encoders and transformer-based decoders (Yu et al., 14 Dec 2025, Tan et al., 26 Mar 2025)

SFT datasets consist of curated CoT reasoning pairs, ranging from a few thousand to tens of thousands of examples (e.g., 2K SFT examples can match 20K RL examples in VLMs), and may include domain-specific structured annotation (fixed step schemas, explicit reasoning tags) (Dong et al., 25 Jun 2025, Pang et al., 14 Oct 2025). SFT is effective across arithmetic, symbolic reasoning, code generation, visual question answering, medical QA, and hybrid tasks.
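A minimal sketch of how one such structured training record might be assembled, assuming a tag-per-step schema in the spirit of the annotations mentioned above (the tag names and helper function are illustrative):

```python
def build_sft_record(question, steps, answer):
    """Assemble one reasoning-focused SFT example: the target sequence is a
    tagged chain-of-thought followed by the final answer. The tag-per-step
    schema and helper name here are illustrative, not a cited format."""
    trace = "".join(f"<{tag}>{text}</{tag}>" for tag, text in steps)
    return {"prompt": question, "target": trace + f"<answer>{answer}</answer>"}

record = build_sft_record(
    "What is 12 * 7?",
    [("summarize", "Multiply 12 by 7."),
     ("formalize", "12 * 7 = 12 * (5 + 2) = 60 + 24 = 84")],
    "84",
)
```

Keeping the trace and the final answer in one target string means the standard next-token loss supervises every intermediate step, not just the answer span.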

3. Mechanistic Principles, Extensions, and Training Algorithms

Reasoning-focused SFT operates as a coarse-grained, global policy regularizer—sharpening the probability of demonstration tokens and lowering the entropy over the full sequence (Fu et al., 24 Jun 2025). Core techniques include:

  • Critical Token Fine-Tuning (CFT): Gradients are applied only at functionally decisive tokens, identified via counterfactual intervention and answer verification; this leads to faster convergence, superior pass@N, and improved RL initialization (Ruan et al., 13 Oct 2025).
  • Anchored Supervised Fine-Tuning (ASFT): The SFT loss is augmented with a token-level Kullback-Leibler (KL) divergence to a fixed base model, anchoring the policy and preventing runaway distributional drift (Zhu et al., 28 Sep 2025).
  • Structured Reasoning SFT: Chains are explicitly tagged (<summarize>, <formalize>, etc.), enforcing the exposure of latent procedural structure and enabling stepwise attention and reward diagnostics (Dong et al., 25 Jun 2025).
  • Pattern-Aware Rationale Autoannotation: For patterned-reasoning tasks, LLMs are prompted to generate CoT traces using a fixed step schema, eliminating costly human annotation and enabling near-equivalent performance with 10× less supervision (Pang et al., 14 Oct 2025).
  • Entropy-aware Weighting: The weight on the SFT (demonstration) loss is dynamically adapted as a function of model entropy. High entropy signals uncertainty, reducing the SFT gradient magnitude to preserve plasticity and avoid overwhelming with noisy updates, as in SRFT (Fu et al., 24 Jun 2025).
  • Preference Optimization: Direct preference objectives (e.g., ThinkPO) can further bias the model toward longer, more elaborate reasoning outputs by constructing preference pairs of long vs. short CoTs (Yang et al., 17 Feb 2025).
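Two of these extensions, critical-token masking and entropy-aware weighting, can both be viewed as per-token weights on the cross-entropy loss. The sketch below combines them under that view; the particular schedule (1 − H/H_max) and the binary critical mask are illustrative choices, not the exact formulation of any cited method:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def weighted_sft_loss(per_token, critical_mask, max_entropy):
    """Per-token weighted cross-entropy.

    per_token[t]     = (log-prob of the target token, full distribution) at step t.
    critical_mask[t] = 1 if token t is 'critical' (CFT-style), else 0.
    The entropy-aware scale (1 - H / H_max) shrinks the contribution where
    the model is uncertain; this schedule is an illustrative choice.
    """
    loss = 0.0
    for (lp, dist), critical in zip(per_token, critical_mask):
        scale = max(0.0, 1.0 - entropy(dist) / max_entropy)
        loss += -lp * critical * scale
    return loss

# A uniform distribution (maximum entropy) contributes nothing;
# a peaked distribution contributes nearly the full cross-entropy term.
uniform = [0.25] * 4
peaked = [0.9, 0.1 / 3, 0.1 / 3, 0.1 / 3]
per_token = [(math.log(0.25), uniform), (math.log(0.9), peaked)]
loss = weighted_sft_loss(per_token, [1, 1], max_entropy=math.log(4))
```

Setting the mask to all ones and the scale to a constant recovers plain SFT, which makes these extensions easy to ablate against the baseline.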

4. Empirical Characterization and Impact

A recurring finding is the data efficiency and foundational importance of reasoning SFT, particularly for small/medium models and low-resource domains. On VLMs with $\leq 7$B parameters, SFT yields +2–3% absolute accuracy gains over RL (GRPO) and enables downstream RL to avoid reward collapse and overfitting (Yu et al., 14 Dec 2025). Benchmark results:

Model / Setup      | Reasoning Task      | SFT Only (%) | RL Only (%) | SFT+RL (%) | SOTA Method
Qwen2.5-Math-7B    | Math Comp (InD avg) | 54.3         | 49.3        | 55.5       | SRFT (59.1)
Qwen2-VL-7B        | MathVision (VLM)    | 28.7         | 25.7        | 31.4       | SFT+RL
DeepSeek-Qwen-7B   | MATH500             | 87.4         | –           | 91.2       | ThinkPO

SFT also provides robust cross-modal transfer: vision-LLMs fine-tuned with reasoning-focused SFT achieve +2–4% improvement on pure-text reasoning benchmarks and maintain language understanding without harming generality (Yu et al., 14 Dec 2025). Empirical studies with meta-probing benchmarks show SFT narrows capability profiles (diagnostic spike, fact retrieval collapse) but may hinder transfer if misapplied or unregularized (Bai et al., 30 Dec 2025).

5. Best Practices, Data Selection, and Integration with RL

Current literature converges on several practical guidelines:

  • Prefer small, curated CoT corpora over bulk data: a few thousand high-quality traces can match far larger RL budgets, especially for lightweight and mid-sized models.
  • Apply reasoning-focused SFT before RL: it provides a stable initialization and helps downstream RL avoid reward collapse and overfitting.
  • Regularize against distributional drift (e.g., token-level KL-anchoring to the base model) when transferring out of domain.
  • Weight or mask the token-level loss (critical-token selection, entropy-aware scaling) rather than penalizing all positions uniformly, to preserve output diversity and exploration.
  • Use explicit reasoning tags or fixed step schemas so that traces remain auditable and support stepwise diagnostics.

6. Limitations, Variants, and Open Research Directions

Although reasoning-focused SFT is essential for initializing multi-step reasoning, several limitations are recognized:

  • Standard SFT can overfit to surface patterns, narrowing the model’s reasoning skill profile and impairing generalization to new tasks (Bai et al., 30 Dec 2025).
  • Uniform token-level cross-entropy penalizes all positions, undermining output diversity and exploration unless methods such as CFT or entropy-aware masking are applied (Ruan et al., 13 Oct 2025, Yuan et al., 6 Oct 2025).
  • In out-of-domain or reward-misaligned regimes, SFT without anchoring can drift, reducing transfer (Zhu et al., 28 Sep 2025).
  • Human rationale annotation is costly; although patterned tasks allow for pattern-aware autoannotation (PARO), adaptive tasks still require significant manual input (Pang et al., 14 Oct 2025).
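The drift issue above is what KL-anchoring targets. A minimal sketch of an anchored loss, assuming access to per-token distributions from both the policy and a frozen base model (the penalty weight and KL direction are illustrative choices, not the exact formulation of ASFT):

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def anchored_sft_loss(target_log_probs, policy_dists, base_dists, beta):
    """SFT cross-entropy plus a token-level KL penalty to a frozen base
    model, in the spirit of the anchoring described above. beta trades off
    imitation against staying close to the base distribution."""
    ce = -sum(target_log_probs)
    drift = sum(kl(p, q) for p, q in zip(policy_dists, base_dists))
    return ce + beta * drift

# With identical policy and base distributions the penalty vanishes,
# recovering the plain SFT loss; any divergence adds a positive cost.
same = [[0.5, 0.5]]
loss_anchored = anchored_sft_loss([math.log(0.5)], same, same, beta=0.1)
```

Because the penalty is token-level, it caps drift at every step of the trace rather than only on the final answer distribution.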

Active research explores dynamic and plug-in SFT mechanisms (e.g., MIFO), reward-weighted regression and auxiliary loss scaling, curriculum- and capability-aware SFT, and single-stage SFT–RL hybridization to maximize both efficiency and transfer (Yuan et al., 6 Oct 2025, Fu et al., 24 Jun 2025, Dong et al., 25 Jun 2025).

7. Contributions to the Broader Reasoning Pipeline

Reasoning-focused SFT underpins nearly all state-of-the-art reasoning development pipelines, serving both as a robust pre-training for RL and as a practical standalone approach for resource-constrained settings. It is foundational for domain adaptation, cross-modal transfer, and structuring model outputs for interpretability, and its extensions (critical-token masking, KL-anchoring, preference optimization) enable fine-grained control over reasoning capacity and knowledge retention (Ruan et al., 13 Oct 2025, Zhu et al., 28 Sep 2025, Yang et al., 17 Feb 2025). As LLM capabilities and deployment modalities broaden, SFT’s role as a flexible reservoir of compositional reasoning priors is likely to persist and expand.
