SFT: Supervised Fine Tuning Overview
- Supervised Fine Tuning (SFT) is a technique that refines pre-trained models using labeled demonstration pairs to improve specialized task performance.
- It leverages token-level scoring and group-based loss weighting to improve data quality and mitigate catastrophic forgetting.
- SFT integrates with reinforcement learning and ensemble methods to achieve robust adaptation across language and vision applications.
Supervised Fine Tuning (SFT) is a foundational technique for post-training large models across vision and language domains, enabling the adaptation of a generic, pre-trained backbone to either specialized instruction adherence or enriched perceptual capabilities. The core methodology involves further updating model parameters using labeled datasets—usually comprising input–output demonstration pairs—which guide the model toward more desirable responses for end users or downstream tasks. Recent research explores not only refinements to the SFT process itself but also its connections to reinforcement learning, statistical efficiency, optimization dynamics, catastrophic forgetting, and data-centric strategies.
1. Paradigms, Objectives, and Theoretical Foundations
SFT is formalized as maximizing the log-likelihood of target (ground-truth) responses under the model's policy. For an LLM $\pi_\theta$ and a demonstration dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, the canonical objective is

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)\right].$$
This process is equivalent to behavior cloning in imitation learning and—under filtered data or with additional weighting schemes—optimizes a lower bound on the expected cumulative reward in a sparse RL setting (Qin et al., 17 Jul 2025). Specifically, if only successful or highly-ranked trajectories are kept, SFT can be shown to maximize a lower bound on the RL objective

$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right],$$

the expected return of trajectories sampled from the policy.
Recent work demonstrates that this connection can be exploited to derive tighter fidelity to reward signals, either through inverse RL based joint reward and policy learning (Li et al., 28 May 2024), importance-weighted SFT (Qin et al., 17 Jul 2025), or preference-oriented formulations leveraging baseline models as scoring references (Fan et al., 17 Dec 2024).
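To make the objective concrete, here is a minimal PyTorch sketch of the token-level negative log-likelihood used in SFT, assuming a Hugging-Face-style causal LM whose forward pass returns `.logits` and labels in which prompt positions are masked with -100:

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Token-level negative log-likelihood on demonstration pairs.

    `labels` equals `input_ids` with prompt positions set to -100 so that
    only response tokens contribute to the loss (standard SFT masking).
    """
    logits = model(input_ids).logits            # (batch, seq, vocab)
    shift_logits = logits[:, :-1, :]            # position t predicts token t+1
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,                      # skip prompt / padding tokens
    )
```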
2. Data Selection, Quality, and Token-level Granularity
The quality, diversity, and style of SFT data exert a disproportionate influence on alignment and generalization:
- Data selection: Studies show that selecting demonstration-response pairs with longer, more detailed outputs yields SFT models superior to those trained on the full dataset or on subsets selected by automated “quality” or “diversity” heuristics (Shen, 8 Feb 2024). This effect is linked to the tendency of SFT to impart conversational style rather than factual knowledge.
- Token-level mechanisms: Recent work introduces token-wise quality scoring, partitioning the data into positive (informative) and negative (possibly misleading) tokens based on an influence score computed as the loss difference between a reference and base model (Ghahrizjani et al., 6 Aug 2025). Formally, for token $y_t$ in response $y$ to prompt $x$, the score is

$$s(y_t) = \ell_{\text{base}}\!\left(y_t \mid x, y_{<t}\right) - \ell_{\text{ref}}\!\left(y_t \mid x, y_{<t}\right),$$

where $\ell$ denotes the token-level cross-entropy loss, so tokens the reference model explains better receive positive scores. Negative tokens are actively “forgotten” through a negative loss term in the final objective (see the first sketch after this list).
- Group optimization: SFT-GO groups tokens by importance (statistics-based, semantics-based, or excess-loss–based) and combines standard cross-entropy with a worst-group loss. The objective, parameterized by a mixing weight $\lambda \in [0, 1]$, is

$$\mathcal{L}_{\text{SFT-GO}}(\theta) = (1 - \lambda)\,\mathcal{L}_{\text{CE}}(\theta) + \lambda \max_{g} \mathcal{L}_{g}(\theta),$$

where the second term penalizes the group with the highest loss (Kim et al., 17 Jun 2025); see the second sketch after this list.
- Robust denoising: RobustFT introduces multi-expert collaborative detection, iterative context-enhanced relabeling, and entropy-based selection to prune or relabel noisy samples prior to SFT (Luo et al., 19 Dec 2024).
- Statistical efficiency: FisherSFT selects a subset of training data by maximizing the log-determinant of the Hessian of the log-likelihood with respect to the model’s last-layer parameters, reducing both mean and max prediction errors for a given budget (Deb et al., 20 May 2025).
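Returning to the token-level mechanisms bullet, the following PyTorch sketch illustrates influence-based token partitioning with a negative “forgetting” term; the sign convention, the helper names, and the weight `beta` are illustrative assumptions rather than the exact recipe of Ghahrizjani et al.:

```python
import torch
import torch.nn.functional as F

def per_token_nll(model, input_ids, labels):
    """Per-token NLL of the ground-truth tokens under `model` (batch x seq-1)."""
    logits = model(input_ids).logits[:, :-1, :]
    return F.cross_entropy(
        logits.transpose(1, 2), labels[:, 1:],
        ignore_index=-100, reduction="none",
    )

def influence_partitioned_loss(base_model, ref_model, input_ids, labels, beta=0.1):
    # Influence score: loss difference between base and reference models
    # (assumed convention: tokens the reference explains better score positive).
    with torch.no_grad():
        score = per_token_nll(base_model, input_ids, labels) \
              - per_token_nll(ref_model, input_ids, labels)
    nll = per_token_nll(base_model, input_ids, labels)   # training pass, with gradients
    valid = labels[:, 1:] != -100
    pos = valid & (score > 0)                            # informative tokens: learn
    neg = valid & (score <= 0)                           # misleading tokens: forget
    loss = nll[pos].mean()
    if neg.any():
        loss = loss - beta * nll[neg].mean()             # negated loss acts as forgetting term
    return loss
```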
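Similarly, a sketch of the SFT-GO-style group objective above, assuming each target position already carries a group assignment (`group_ids`) and that the mixing weight is a scalar `lam`:

```python
import torch
import torch.nn.functional as F

def sft_go_loss(logits, labels, group_ids, lam=0.5, num_groups=3):
    """Blend standard cross-entropy with the loss of the worst-performing token group.

    `group_ids` assigns each position to an importance group (e.g. statistics-,
    semantics-, or excess-loss-based); positions labeled -100 are ignored.
    """
    per_tok = F.cross_entropy(
        logits[:, :-1, :].transpose(1, 2), labels[:, 1:],
        ignore_index=-100, reduction="none",
    )
    valid = labels[:, 1:] != -100
    ce = per_tok[valid].mean()                          # standard token-averaged CE
    group_losses = []
    for g in range(num_groups):
        mask = valid & (group_ids[:, 1:] == g)
        if mask.any():
            group_losses.append(per_tok[mask].mean())   # mean loss within group g
    worst = torch.stack(group_losses).max()             # worst-group penalty
    return (1 - lam) * ce + lam * worst
```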
3. Model Dynamics, Optimization Strategies, and Generalization
Fine-tuning dynamics in SFT are shaped by both the training protocol and the structure of model updates:
- Layer-wise effects: Large-scale controlled studies (1,000+ models) reveal that performance gains correlate most strongly with targeted mid-layer updates, not maximum overall weight change (Harada et al., 17 Jun 2025). The expansion of model intrinsic dimensionality also originates in mid-layers, suggesting that the principal representational improvements occur at this depth.
- Parameter isolation and merging: CPI-FT explicitly isolates “core” parameters most updated during task-specific fine-tuning and clusters tasks by overlap of these core regions (measured by Jaccard index), fusing task-specific parameters via direct transplantation or Spherical Linear Interpolation (SLERP). This approach alleviates the “seesaw” effect (where SFT for one task degrades another) and reduces catastrophic forgetting (Wang et al., 29 Aug 2025). Parameter-selection merging across models with different training orders helps overcome data order imbalance, outperforming weighted-averaging (Ju et al., 1 Oct 2024).
- Ensemble strategies: To address overadaptation and catastrophic forgetting, ensemble methods interpolate between the pre-trained and fine-tuned solutions:

$$f_{\text{ens}} = \alpha\, f_{\text{pre}} + (1 - \alpha)\, f_{\text{ft}}, \qquad \alpha \in [0, 1],$$

where $f_{\text{pre}}$ is the pre-trained and $f_{\text{ft}}$ the fine-tuned model (Hao et al., 2 Jun 2025). This interpolation yields an improved bias–variance trade-off, often exceeding the performance of the fine-tuned model alone on both in-domain and general benchmarks.
- Proximal objectives: PSFT introduces a clipped surrogate loss inspired by PPO, with an importance sampling ratio constrained within $[1 - \epsilon, 1 + \epsilon]$, stabilizing training and preserving policy entropy for downstream RL or further optimization (Zhu et al., 25 Aug 2025); a sketch of this clipped objective follows the list.
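A hedged sketch of the proximal, PPO-inspired clipped surrogate referenced in the last bullet; the exact PSFT loss may differ, and `old_logprobs` is assumed to come from a frozen snapshot of the policy with the advantage implicitly fixed to +1 on demonstration tokens:

```python
import torch

def clipped_sft_loss(logprobs, old_logprobs, mask, eps=0.2):
    """PPO-style clipped surrogate applied to demonstration tokens.

    `logprobs`     : per-token log-probs of the ground-truth tokens, current policy.
    `old_logprobs` : same quantity from a frozen snapshot (no gradient).
    `mask`         : float tensor, 1 for response tokens, 0 elsewhere.
    The importance ratio is clipped to [1 - eps, 1 + eps], limiting how far a
    single update can move the policy and helping preserve entropy.
    """
    ratio = torch.exp(logprobs - old_logprobs)        # importance sampling ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio, clipped)         # pessimistic (clipped) objective
    return -(surrogate * mask).sum() / mask.sum()
```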
4. Integration of SFT with Reinforcement Learning and Preference Optimization
SFT’s connections to reinforcement learning are established both theoretically and through joint algorithms:
- Unified and single-stage paradigms: UFT and SRFT propose hybrid objectives that blend supervised signals (memorizing expert traces or hints) with RL-derived rewards, using either curriculum (hint-guided, with cosine decay of supervision) or entropy-based weighting schemes (Liu et al., 22 May 2025, Fu et al., 24 Jun 2025). These paradigms demonstrate accelerated convergence on long-horizon or reasoning-intensive tasks, outperforming traditional two-step (SFT→RL) approaches and empirically reducing the exponential sample complexity bottleneck.
- Prefix sampling methods: Prefix-RFT incorporates a demonstration prefix (partial ground-truth) at each training step, allowing the model to generate the remainder of the sequence via its own policy and thus unifying imitation and exploration (Huang et al., 2 Jul 2025). Entropy-based clipping strategies ensure that updates focus on the most “uncertain” tokens.
- Importance-weighted fine-tuning: By introducing auxiliary, adaptive sampling distributions, iw-SFT trains on curated data with importance weights that optimize a tighter lower bound on the RL objective and can be generalized to incorporate quality scores (Qin et al., 17 Jul 2025); a sketch follows this list.
- Reward learning from demonstrations: A bilevel IRL-based formulation combines policy and reward-model updates, where the reward can be related in closed-form to log-likelihood ratios relative to a reference model (Li et al., 28 May 2024). Connections to self-play methods are drawn, as explicit contrastive loss between demonstration and synthetic (policy-sampled) responses appears central.
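As a concrete illustration of importance-weighted fine-tuning, a sketch in which each curated demonstration's loss is reweighted by a per-example log importance weight; the clipping constant and the self-normalization step are illustrative choices, not the exact iw-SFT estimator:

```python
import torch
import torch.nn.functional as F

def iw_sft_loss(model, input_ids, labels, log_weights, clip=5.0):
    """Importance-weighted SFT: reweight each demonstration's NLL.

    `log_weights` are per-example log importance weights (e.g. derived from a
    quality score or an auxiliary sampling distribution); they are clipped and
    self-normalized to keep the estimator stable.
    """
    logits = model(input_ids).logits[:, :-1, :]
    per_tok = F.cross_entropy(
        logits.transpose(1, 2), labels[:, 1:],
        ignore_index=-100, reduction="none",
    )
    valid = (labels[:, 1:] != -100).float()
    per_seq_nll = (per_tok * valid).sum(dim=1) / valid.sum(dim=1)   # mean NLL per example
    w = torch.softmax(log_weights.clamp(max=clip), dim=0)           # normalized weights
    return (w.detach() * per_seq_nll).sum()
```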
5. Vision-Specific SFT Protocols
Transferring SFT innovations from language to vision, ViSFT adapts the instruction-tuning paradigm to vision transformers:
- Two-stage ViSFT: Stage 1 freezes the pre-trained backbone and optimizes task-specific heads for diverse in-domain annotation-rich tasks (object detection, instance segmentation, captioning). Stage 2 introduces LoRA-based updates for the backbone, with the bulk of parameters frozen, focusing learning on region-level cues absent from CLIP-style pretraining. Empirical gains are documented in OCR (+2.5% accuracy), classification, retrieval, and VQA across out-of-domain benchmarks (Jiang et al., 18 Jan 2024).
- The method circumvents the need for large region-level pretraining datasets, transferring fine-grained supervision post hoc without expensive re-annotation, and it avoids confining the new knowledge to task-specific heads by jointly tuning LoRA modules on the in-domain tasks (a minimal LoRA sketch follows).
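The stage-2 update pattern reduces to training low-rank adapters on top of a frozen backbone. The following is a generic LoRA layer sketch rather than the exact ViSFT configuration; rank `r` and scaling `alpha` are illustrative defaults:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # backbone weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # The low-rank delta is the only trainable path through this layer.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```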
6. Addressing Catastrophic Forgetting and Task Interference
Mitigating degradation of general capabilities and interference among tasks is an ongoing research imperative:
- Catastrophic forgetting: Synthetic reconstruction of the instruction distribution—via multi-model response generation and filtering—enables practitioners to fine-tune open-source models on niche domains without loss of general capabilities, even absent the original proprietary data (Ding et al., 11 Jun 2025).
- Freezing and parameter isolation: Dynamic freezing of core parameters during pipelined, lightweight final training phases prevents catastrophic forgetting as established in CPI-FT (Wang et al., 29 Aug 2025).
- Task clustering: Overlap between core parameter regions provides a quantitative basis for grouping tasks to minimize destructive interference during SFT.
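A small sketch of that clustering signal, assuming per-task parameter deltas are available; the top-fraction rule for defining a “core” region is a hypothetical choice for illustration:

```python
import torch

def core_mask(delta: torch.Tensor, top_frac: float = 0.05) -> torch.Tensor:
    """Boolean mask over the parameters with the largest absolute update for one task."""
    k = max(1, int(top_frac * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values.min()
    return delta.abs() >= thresh

def jaccard(mask_a: torch.Tensor, mask_b: torch.Tensor) -> float:
    """Jaccard index between two tasks' core-parameter regions."""
    inter = (mask_a & mask_b).sum().item()
    union = (mask_a | mask_b).sum().item()
    return inter / union if union else 0.0

# Tasks whose core regions overlap heavily are grouped together to limit interference.
```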
7. Practical Implications and Future Directions
Recent results underscore the importance of principled data curation, dynamic and token-wise weighting, robust optimization (proximal constraints, entropy preservation), and hybrid methods unifying SFT with exploration or reward-guided objectives. Comprehensive experimental releases (e.g., 1,000+ SFT models (Harada et al., 17 Jun 2025)) are expediting systematic analysis of dataset effects, task synergies, and architectures.
Open directions include:
- Broader integration of trust-region, importance-sampling, and preference-ranking objectives to further stabilize SFT on real-world, noisy, or adversarial data.
- More granular, token-level, and context-dependent data quality metrics or loss weighting in massive training regimes.
- Extension of these insights from text and vision to multi-modal domains and open-ended, interactive settings.
- Systematic, cross-model benchmarking and standardized resources to democratize reproducible research and comparative evaluation.
These developments establish SFT as a continually evolving, theoretically-motivated, and practically vital component of foundational model alignment and adaptation across modalities and domains.