Reasoning-Focused Supervised Fine-Tuning
- Reasoning-Focused SFT is a supervised protocol that employs explicit multi-step chain-of-thought demonstrations to improve model reasoning abilities.
- It enhances sample efficiency and transfer across varied tasks such as mathematical reasoning, visual question answering, and code generation.
- Advanced techniques like Critical Token Fine-Tuning and entropy-aware weighting optimize convergence and mitigate issues like reward hacking.
Reasoning-Focused Supervised Fine-Tuning (SFT) refers to a suite of supervised adaptation protocols which use explicit, high-quality demonstrations of multi-step reasoning to endow large language and multimodal models with robust deductive, inductive, or procedural reasoning abilities. Rather than focusing solely on learning answer distributions or surface instruction-following, reasoning-focused SFT leverages chain-of-thought (CoT) traces, structured rationales, or annotated intermediate steps to encourage compositional skill, transfer to out-of-distribution tasks, and resilience to reward gaming. This paradigm has established itself as a critical component in state-of-the-art pipelines for mathematical, scientific, and visual reasoning, especially when model capacity, data resources, and domain transfer are limiting factors.
1. Definition and Formalism
The canonical reasoning-focused SFT objective is next-token cross-entropy over chain-of-thought demonstrations. For a model with token-level policy $\pi_\theta$ and dataset $\mathcal{D} = \{(x_i, y_i)\}$, each $y_i$ is an expert-generated trace—often formatted as `<think>…</think><answer>…</answer>`. The loss is

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})\right]$$
This enforces pure imitation learning of the full, stepwise reasoning trajectory, without any extrinsic reward or preference signal (Yu et al., 14 Dec 2025). SFT is typically implemented as teacher-forcing with cross-entropy, but recent extensions use weightings based on entropy, demonstration quality, or critical-token identification.
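As a concrete, framework-agnostic sketch, the objective reduces to a masked sum of per-token negative log-likelihoods in which only the demonstration tokens (not the prompt) are supervised. The token probabilities below are toy numbers, not outputs of any real model:

```python
import math

def sft_loss(token_probs, prompt_len):
    """Mean negative log-likelihood of a CoT demonstration.

    token_probs: probability the model assigns to each ground-truth token
                 (prompt tokens first, then the <think>...<answer> trace).
    prompt_len:  number of prompt tokens excluded from the loss
                 (teacher forcing supervises only the demonstration).
    """
    response_probs = token_probs[prompt_len:]
    return -sum(math.log(p) for p in response_probs) / len(response_probs)

# Toy example: 3 prompt tokens (masked out) + 4 demonstration tokens.
probs = [0.9, 0.8, 0.95,        # prompt tokens, no gradient
         0.7, 0.6, 0.8, 0.9]    # CoT + answer tokens, supervised
loss = sft_loss(probs, prompt_len=3)
```

In practice this masking is implemented by setting prompt positions to an ignore label in the cross-entropy; the weighted variants discussed below replace the uniform average with per-token weights.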
2. Data Regimes, Architectures, and Task Scope
Research demonstrates that reasoning-focused SFT is highly sample-efficient for lightweight and mid-sized models and across a variety of architectures, including:
- Vision-LLMs (Qwen2-VL-2B/3B/7B, Qwen2.5-VL-3B/7B)
- Text-only LLMs (Qwen-family, LLaMA, OLMo, Flan-T5)
- Multimodal models with bespoke vision encoders and transformer-based decoders (Yu et al., 14 Dec 2025, Tan et al., 26 Mar 2025)
SFT datasets consist of curated CoT reasoning pairs, ranging from a few thousand to tens of thousands of examples—a regime in which SFT matches RL in VLMs—and may include domain-specific structured annotation (fixed step schemas, explicit reasoning tags) (Dong et al., 25 Jun 2025, Pang et al., 14 Oct 2025). SFT is effective across arithmetic, symbolic reasoning, code generation, visual question answering, medical QA, and hybrid tasks.
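A single record in such a dataset typically pairs a prompt with a tagged trace whose final answer can be extracted programmatically for verification. The field names and tags below are illustrative, not a fixed standard:

```python
# Illustrative shape of one reasoning-focused SFT record; the schema
# and tag names are hypothetical assumptions, not from a specific dataset.
record = {
    "prompt": "If a train travels 120 km in 1.5 hours, what is its average speed?",
    "response": (
        "<think>"
        "Average speed is distance divided by time. "
        "120 km / 1.5 h = 80 km/h."
        "</think>"
        "<answer>80 km/h</answer>"
    ),
}

def extract_answer(response):
    """Pull the final answer out of a tagged trace for answer verification."""
    start = response.index("<answer>") + len("<answer>")
    end = response.index("</answer>")
    return response[start:end]
```

Keeping the answer machine-extractable is what later enables counterfactual answer verification (as in critical-token methods) and reward computation during an RL stage.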
3. Mechanistic Principles, Extensions, and Training Algorithms
Reasoning-focused SFT operates as a coarse-grained, global policy regularizer—sharpening the probability of demonstration tokens and lowering the entropy over the full sequence (Fu et al., 24 Jun 2025). Core techniques include:
- Critical Token Fine-Tuning (CFT): Gradients are applied only at functionally decisive tokens, identified via counterfactual intervention and answer verification; this leads to faster convergence, superior pass@N, and improved RL initialization (Ruan et al., 13 Oct 2025).
- Anchored Supervised Fine-Tuning (ASFT): The SFT loss is augmented with a token-level Kullback-Leibler (KL) divergence to a fixed base model, anchoring the policy and preventing runaway distributional drift (Zhu et al., 28 Sep 2025).
- Structured Reasoning SFT: Chains are explicitly tagged (<summarize>, <formalize>, etc.), enforcing the exposure of latent procedural structure and enabling stepwise attention and reward diagnostics (Dong et al., 25 Jun 2025).
- Pattern-Aware Rationale Autoannotation: For patterned-reasoning tasks, LLMs are prompted to generate CoT traces using a fixed step schema, eliminating costly human annotation and enabling near-equivalent performance with 10× less supervision (Pang et al., 14 Oct 2025).
- Entropy-aware Weighting: The weight on the SFT (demonstration) loss is dynamically adapted as a function of model entropy. High entropy signals uncertainty, reducing the SFT gradient magnitude to preserve plasticity and avoid overwhelming with noisy updates, as in SRFT (Fu et al., 24 Jun 2025).
- Preference Optimization: Direct preference objectives (e.g., ThinkPO) can further bias the model toward longer, more elaborate reasoning outputs by constructing preference pairs of long vs. short CoTs (Yang et al., 17 Feb 2025).
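Several of the extensions above reduce to re-weighting or regularizing the per-token cross-entropy. The sketch below combines three of them—critical-token masking (CFT-style), entropy-aware weighting (SRFT-style), and a KL anchor to a frozen base model (ASFT-style)—with all distributions, weights, and the `beta` coefficient being illustrative toy choices, not values from the cited papers:

```python
import math

def entropy(dist):
    """Shannon entropy of a next-token distribution (dict: token -> prob)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def kl(p, q):
    """KL(p || q) over a shared vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def weighted_sft_loss(steps, beta=0.1, max_entropy=math.log(4)):
    """Per-token SFT loss with three optional extensions:
       - critical-token masking: only tokens flagged `critical` contribute;
       - entropy-aware weighting: high predictive entropy scales the
         cross-entropy term down, preserving plasticity;
       - KL anchoring: penalize drift from a frozen base model.
    Each step is a dict with the policy distribution `pi`, the base
    model distribution `ref`, the gold token, and a criticality flag.
    """
    total, n = 0.0, 0
    for s in steps:
        if not s["critical"]:
            continue  # CFT-style: skip non-decisive tokens entirely
        w = 1.0 - entropy(s["pi"]) / max_entropy  # downweight uncertain tokens
        ce = -math.log(s["pi"][s["gold"]])
        total += w * ce + beta * kl(s["pi"], s["ref"])
        n += 1
    return total / max(n, 1)
```

In a real implementation the criticality flags come from counterfactual intervention plus answer verification, and the entropy is computed from the policy's logits at each position; the structure of the loss, however, is exactly this weighted sum.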
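Pattern-aware rationale autoannotation can likewise be sketched as a prompt builder that forces the annotating LLM through a fixed step schema. The schema wording, prompt text, and tags below are hypothetical stand-ins for whatever a given patterned task requires:

```python
# PARO-style prompt construction: the annotating LLM must follow a fixed
# step schema, so rationales come out uniformly structured without human
# annotation. The schema and wording here are illustrative assumptions.
SCHEMA = [
    "restate the problem",
    "identify the relevant rule or pattern",
    "apply the rule step by step",
    "state the final answer",
]

def build_annotation_prompt(question, schema=SCHEMA):
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(schema))
    return (
        "Produce a chain-of-thought rationale for the question below, "
        "following exactly these steps:\n"
        f"{steps}\n\n"
        f"Question: {question}\n"
        "Wrap the rationale in <think>...</think> and the final answer "
        "in <answer>...</answer>."
    )
```

Because the schema is fixed per task family, every generated trace exposes the same latent procedural structure, which is what allows near-equivalent performance with an order of magnitude less supervision.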
4. Empirical Characterization and Impact
A recurring finding is the data efficiency and foundational importance of reasoning SFT, particularly for small and mid-sized models and low-resource domains. On small VLMs (2–7B parameters), SFT yields +2–3% absolute accuracy gains over RL (GRPO) and enables downstream RL to avoid reward collapse and overfitting (Yu et al., 14 Dec 2025). Benchmark results:
| Model / Setup | Reasoning Task | SFT Only (%) | RL Only (%) | SFT+RL (%) | SOTA Method |
|---|---|---|---|---|---|
| Qwen2.5-Math-7B | Math Comp (InD avg) | 54.3 | 49.3 | 55.5 | SRFT 59.1 |
| Qwen2-VL-7B | MathVision (VLM) | 28.7 | 25.7 | 31.4 | SFT+RL |
| DeepSeek-Qwen-7B | MATH500 | 87.4 | — | 91.2 | ThinkPO |
SFT also provides robust cross-modal transfer: vision-LLMs fine-tuned with reasoning-focused SFT achieve +2–4% improvement on pure-text reasoning benchmarks and maintain language understanding without harming generality (Yu et al., 14 Dec 2025). Empirical studies with meta-probing benchmarks show that SFT narrows capability profiles (sharp gains on targeted diagnostics alongside degraded fact retrieval) and may hinder transfer if misapplied or unregularized (Bai et al., 30 Dec 2025).
5. Best Practices, Data Selection, and Integration with RL
Current literature converges on several practical guidelines:
- SFT should precede RL when initializing reasoning skills, especially for weaker models or novel domains.
- Use 2–5k high-quality, domain-matched reasoning traces to maximize data efficiency; beyond roughly 10k, returns diminish (Yu et al., 14 Dec 2025).
- Structure data with clear reason/answer demarcations, special tokens, or explicit tagging to clarify induction targets (Dong et al., 25 Jun 2025).
- For patterned tasks, employ LLM-based prompt engineering for rationale generation, leveraging pattern-aware annotation to minimize human labor (Pang et al., 14 Oct 2025).
- Augment SFT with KL-anchoring, entropy-aware weighting, or critical-token selection to avoid memorization and optimize transfer (Zhu et al., 28 Sep 2025, Fu et al., 24 Jun 2025, Ruan et al., 13 Oct 2025).
- When integrating with RL, always use SFT warm-starts, freeze RL-critical parameters during SFT interleaves, and select high-entropy (uncertain) tokens for loss application to avoid catastrophic forgetting (Yuan et al., 6 Oct 2025).
- Regularly monitor both reward and held-out accuracy to identify reward hacking or over-optimization during RL (Yu et al., 14 Dec 2025).
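The last guideline can be operationalized as a simple checkpoint-level monitor that flags runs where training reward keeps rising while held-out accuracy falls—a common signature of reward hacking. The window size and tolerance below are illustrative heuristics, not published criteria:

```python
def flags_reward_hacking(rewards, heldout_acc, window=3, tol=0.01):
    """Return True if, over the last `window` checkpoints, training reward
    increased monotonically while held-out accuracy dropped by more than
    `tol`. Illustrative heuristic for detecting reward over-optimization."""
    if len(rewards) < window or len(heldout_acc) < window:
        return False
    r, a = rewards[-window:], heldout_acc[-window:]
    reward_rising = all(r[i] < r[i + 1] for i in range(window - 1))
    acc_falling = (a[0] - a[-1]) > tol
    return reward_rising and acc_falling

# Example: reward climbs while held-out accuracy decays -> flagged.
hacked = flags_reward_hacking([0.40, 0.55, 0.70], [0.62, 0.60, 0.55])
```

Healthy runs, where reward and held-out accuracy move together, are not flagged; in practice one would log both curves and trigger early stopping or a KL-penalty increase when the flag fires.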
6. Limitations, Variants, and Open Research Directions
Although reasoning-focused SFT is essential for initializing multi-step reasoning, several limitations are recognized:
- Standard SFT can overfit to surface patterns, narrowing the model’s reasoning skill profile and impairing generalization to new tasks (Bai et al., 30 Dec 2025).
- Uniform token-level cross-entropy penalizes all positions, undermining output diversity and exploration unless methods such as CFT or entropy-aware masking are applied (Ruan et al., 13 Oct 2025, Yuan et al., 6 Oct 2025).
- In out-of-domain or reward-misaligned regimes, SFT without anchoring can drift, reducing transfer (Zhu et al., 28 Sep 2025).
- Human rationale annotation is costly; although patterned tasks allow for pattern-aware autoannotation (PARO), adaptive tasks still require significant manual input (Pang et al., 14 Oct 2025).
Active research explores dynamic and plug-in SFT mechanisms (e.g., MIFO), reward-weighted regression and auxiliary loss scaling, curriculum- and capability-aware SFT, and single-stage SFT–RL hybridization to maximize both efficiency and transfer (Yuan et al., 6 Oct 2025, Fu et al., 24 Jun 2025, Dong et al., 25 Jun 2025).
7. Contributions to the Broader Reasoning Pipeline
Reasoning-focused SFT underpins nearly all state-of-the-art reasoning development pipelines, serving both as a robust pre-training for RL and as a practical standalone approach for resource-constrained settings. It is foundational for domain adaptation, cross-modal transfer, and structuring model outputs for interpretability, and its extensions (critical-token masking, KL-anchoring, preference optimization) enable fine-grained control over reasoning capacity and knowledge retention (Ruan et al., 13 Oct 2025, Zhu et al., 28 Sep 2025, Yang et al., 17 Feb 2025). As LLM capabilities and deployment modalities broaden, SFT’s role as a flexible reservoir of compositional reasoning priors is likely to persist and expand.