Cold-Start SFT: Efficient Domain Adaptation

Updated 2 June 2026

Cold-Start SFT is a paradigm that fine-tunes pretrained models on small, targeted datasets while preventing catastrophic forgetting and preserving foundational skills.
It employs methods like conservative weighting, anchored KL-regularization, and synthetic data reconstruction to balance in-domain performance with out-of-distribution generalization.
Practical strategies include attention-based adaptation and selective layer tuning, ensuring efficient domain transfer and stable performance under data-scarce conditions.

Cold-Start Supervised Finetuning (SFT) denotes the adaptation or alignment of large pretrained models to novel domains or tasks when target data is limited and the original training distributions are largely or entirely inaccessible. This scenario is pervasive in open-source LLM adaptation, personalized recommendation, foundation model transfer (e.g., VLA models), low-resource domains (medical, legal), and many multimodal pipelines. The core challenge is to achieve sufficient in-domain performance while retaining the generalization and foundational capabilities acquired during pretraining, without catastrophic forgetting or instability arising from data scarcity, distribution shift, or overfitting. Recent developments encompass data-centric, algorithmic, and architectural strategies that explicitly address these challenges under "cold start" constraints.

1. Formalization and Key Motivations

The cold-start SFT paradigm is characterized by the fine-tuning of a pretrained model $\theta_0$ on an (often small) target dataset $\mathcal{D}_{\text{target}}$, without access to its original instruction corpus $\mathcal{D}_{\text{pretrain}}$ or SFT distribution. The typical objective is to optimize

$\mathcal{L}_{\text{SFT}}(\theta) = -\frac{1}{|\mathcal{D}_{\text{target}}|} \sum_{(x,y)\in \mathcal{D}_{\text{target}}} \log p_\theta(y|x),$

but direct minimization reliably incurs catastrophic forgetting and poor out-of-distribution (OOD) generalization (Ding et al., 11 Jun 2025, Zhu et al., 28 Sep 2025). The cold-start regime also encompasses low-budget annotation (few-shot settings), situations where the original SFT data is proprietary or unavailable, and domains where explicit replay buffers or teacher signals cannot be constructed (e.g., certain open-source foundation models, robotics).
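As a concrete reading of the objective above, the following minimal sketch computes $\mathcal{L}_{\text{SFT}}$ from per-example sequence log-probabilities. The function name `sft_loss` and the input representation (one $\log p_\theta(y|x)$ value per pair in $\mathcal{D}_{\text{target}}$) are assumptions made for illustration; in practice these values would come from a model forward pass.

```python
import math


def sft_loss(sequence_log_probs):
    """Mean negative log-likelihood over the target dataset.

    sequence_log_probs: one log p_theta(y|x) per (x, y) pair (hypothetical input).
    """
    if not sequence_log_probs:
        raise ValueError("target dataset is empty")
    return -sum(sequence_log_probs) / len(sequence_log_probs)


# Toy example: three target pairs to which the model assigns
# probabilities 0.5, 0.25 and 0.125.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.125)]
loss = sft_loss(log_probs)  # equals 2 * ln(2)
```

Because the average runs only over $\mathcal{D}_{\text{target}}$, nothing in this quantity penalizes movement away from the pretrained behavior, which is the gap the methods discussed later try to close.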

Cold-start SFT is motivated by several observations:

Direct SFT on scarce target data maximizes in-domain performance but leads to severe erosion of pretrained competencies (Zhang et al., 9 May 2026).
Naïve adaptation without distributional regularization often fails to generalize out of the annotated region, especially in tasks demanding compositional reasoning or long-horizon action (Zhu et al., 28 Sep 2025, Li et al., 23 Nov 2025).
The initial phase of post-pretraining adaptation influences not only immediate performance but downstream policy RL (as cold-start for policy-gradient methods) (Wei et al., 28 May 2025, Chen et al., 29 Oct 2025).

2. Catastrophic Forgetting, Overfitting, and Risk Bounds

A central pathology in cold-start SFT is catastrophic forgetting: the rapid compromise of foundational skills in favor of target-domain performance (Ding et al., 11 Jun 2025, Zhang et al., 9 May 2026). This arises because unconstrained updates overwrite weights critical to generalization. Quantitatively, this can be formalized via the forgetting risk under the pretraining Fisher information matrix $F$: $\mathcal{R}(g) = g^\top F g,$ where $g$ is the update direction. Vanilla SFT gradients $\nabla_\theta \mathcal{L}_{\mathrm{SFT}}$ induce large $\mathcal{R}(g)$ when high-loss, low-confidence target data pushes the model away from the pretraining optimum (Zhang et al., 9 May 2026). Experiments across flow-matching VLAs and LLMs demonstrate that vanilla SFT can reduce retention of foundational suites by $30$ points or more, an effect matched or exceeded by experience replay only at considerable data cost (Zhang et al., 9 May 2026). In LLM adaptation, catastrophic forgetting is closely linked to alignment drift from the base model’s instruction distribution, and is measurable by direct evaluation on general benchmarks (e.g., MMLU, GPQA, Math Level 5) (Ding et al., 11 Jun 2025).
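The quadratic form $\mathcal{R}(g) = g^\top F g$ can be evaluated directly on a toy example. The sketch below is illustrative only: the function name `forgetting_risk`, the two-parameter model, and the Fisher entries are hypothetical, and a real Fisher matrix would be estimated from pretraining data (often with a diagonal approximation).

```python
def forgetting_risk(g, fisher):
    """Quadratic form R(g) = g^T F g for an update direction g."""
    n = len(g)
    if len(fisher) != n or any(len(row) != n for row in fisher):
        raise ValueError("fisher must be an n x n matrix matching len(g)")
    return sum(g[i] * fisher[i][j] * g[j] for i in range(n) for j in range(n))


# Toy 2-parameter model: parameter 0 matters a lot to pretraining
# (large Fisher entry), parameter 1 matters little.
F = [[4.0, 0.0],
     [0.0, 0.25]]

risk_sensitive = forgetting_risk([1.0, 0.0], F)    # step along the sensitive direction
risk_insensitive = forgetting_risk([0.0, 1.0], F)  # step along the insensitive direction
```

Two updates of equal norm thus carry very different forgetting risk depending on how they align with directions the pretrained model is sensitive to.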

Overfitting further manifests in the memorization of demonstration idiosyncrasies, especially under data scarcity. Propositions from the reward-weighted regression (RWR) framework show that the standard SFT objective is a loose lower bound on the true RL objective, with the tightness declining as data becomes less representative (Zhu et al., 28 Sep 2025). The model is thus prone to fixate on the training region while failing to extrapolate, with OOD accuracy plateauing or degrading under continued SFT.

3. Regularization and Distributional Anchoring Methods

Several algorithmic advances target the cold-start SFT bottleneck:

Conservative SFT: ConSFT applies a per-sample exponential weighting to the loss, dynamically suppressing gradients from high-loss (low-confidence) samples. This bounds the forgetting risk $\mathcal{R}(g)$ by a multiplicative factor, mimicking trust-region clipping without explicit reference models or replay (Zhang et al., 9 May 2026). ConSFT achieves higher prior-task retention than vanilla SFT in VLA domains.
Anchored SFT (ASFT): ASFT augments the DFT (dynamic fine-tuning) objective via a reverse KL penalty toward the base model,

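$\mathcal{L}_{\text{ASFT}}(\theta) = \mathcal{L}_{\text{DFT}}(\theta) + \lambda\, D_{\mathrm{KL}}\big(p_\theta \,\|\, p_{\theta_0}\big),$

where $\lambda > 0$ sets the strength of the anchor toward the base model $\theta_0$. This controls drift, preserves bound tightness, and substantially improves stability and OOD performance across medical, math, and code domains (Zhu et al., 28 Sep 2025).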

In-Distribution Fine-Tuning (IDFT): IDFT applies token-level reweighting to the SFT loss, suppressing OOD samples and amplifying in-distribution examples, thus better matching the functional support of the pretraining distribution (Zhang et al., 12 Feb 2026).

Synthetic rehearsal and data reconstruction: Cold-Start SFT methods reconstruct a synthetic approximation to the original instruction–response distribution of the base model, mixing it with new domain data to mitigate forgetting. This is executed via multi-model sampling, cross-model likelihood scoring, and response filtering (Ding et al., 11 Jun 2025).
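To make the anchoring idea concrete, the sketch below evaluates a single-token loss made of a probability-weighted negative log-likelihood plus a reverse KL penalty $D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta_0})$ toward the base model. It is an illustration under stated assumptions, not the cited authors' implementation: weighting the token loss by the model's own target-token probability is assumed here as one reading of dynamic fine-tuning, the coefficient `lam` is a hypothetical hyperparameter, and plain Python floats carry no gradients, so the stop-gradient on the weight is only noted in a comment.

```python
import math


def reverse_kl(p_model, p_base):
    """D_KL(p_model || p_base) for two categorical distributions.

    Assumes both lists cover the same vocabulary and p_base is strictly positive.
    """
    return sum(pm * math.log(pm / pb) for pm, pb in zip(p_model, p_base) if pm > 0.0)


def anchored_token_loss(p_model, p_base, target, lam=0.1):
    """Probability-weighted NLL plus a reverse-KL anchor (illustrative assumption).

    p_model, p_base: next-token distributions of the fine-tuned and base models.
    target: index of the reference token.
    lam: hypothetical penalty weight.
    """
    p_t = p_model[target]
    # In an autograd framework the weight p_t would be detached (stop-gradient).
    weighted_nll = -p_t * math.log(p_t)
    return weighted_nll + lam * reverse_kl(p_model, p_base)


base = [0.7, 0.2, 0.1]
no_drift = anchored_token_loss(base, base, target=0)                 # KL term is zero
drifted = anchored_token_loss([0.98, 0.01, 0.01], base, target=0)    # KL term is positive
```

The KL term is zero when the fine-tuned distribution equals the base distribution and grows as the two diverge, which is what discourages drift during adaptation.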

4. Data-Centric and Architectural Strategies

Empirical findings consistently highlight the primacy of data selection, construction, and mixing:

Perplexity Minimization: Low base-model perplexity on candidate SFT data is the strongest predictor of downstream gains across language and alignment benchmarks, as measured by Pearson correlation. Data should be pre-filtered by base-model perplexity and sampled to cover in-domain variability without excessive diversity or outliers (Harada et al., 17 Jun 2025).
Synthetic Data Construction: Where no instruction-following data is available, base models can be prompted to self-generate large sets of pseudo-instructions, with responses filtered and scored across multiple models for quality. Mixing domain data at 10–30% of the training set (with the remainder drawn from synthetic general instructions) balances retention and adaptation. An excessive domain ratio or too little synthetic data reliably induces catastrophic forgetting (Ding et al., 11 Jun 2025).
Layer-wise Tuning: Updates in the mid-layers of transformer stacks are most predictive of successful task alignment. Empirical correlations peak in the middle of the network depth; freezing top and bottom layers or using LoRA adapters confined to mid-layers reduces cost without loss in accuracy (Harada et al., 17 Jun 2025).
Unsupervised Interleaving: Under extreme label scarcity, unsupervised cluster prediction over in-domain unlabeled data serves as an intermediate SFT step (Cluster & Tune). Applied as a single-epoch auxiliary task, this step nearly doubles accuracy for topical text classification when very few labeled examples are available, outperforming continued MLM and other unsupervised objectives (Shnarch et al., 2022).
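The perplexity filtering and data mixing described above can be sketched in a few lines. Everything named here is hypothetical: the record layout (`id`, `token_log_probs`), the threshold `max_ppl`, and the 20% domain share, which is simply one value inside the 10–30% range mentioned above. Token log-probabilities are assumed to have been computed beforehand with the base model.

```python
import math
import random


def perplexity(token_log_probs):
    """exp of the mean negative token log-probability under the base model."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def filter_low_ppl(examples, max_ppl):
    """Keep examples whose base-model perplexity is at most max_ppl."""
    return [ex for ex in examples if perplexity(ex["token_log_probs"]) <= max_ppl]


def mix(domain, synthetic, domain_ratio=0.2, total=10, seed=0):
    """Sample a training set with the given share of domain data; the rest is synthetic."""
    rng = random.Random(seed)
    n_domain = round(total * domain_ratio)
    picked = rng.sample(domain, n_domain) + rng.sample(synthetic, total - n_domain)
    rng.shuffle(picked)
    return picked


# Toy data: five ordinary domain examples, one high-perplexity outlier,
# and twenty synthetic general-instruction examples.
domain = [{"id": f"d{i}", "token_log_probs": [math.log(0.5)] * 4} for i in range(5)]
domain.append({"id": "outlier", "token_log_probs": [math.log(0.01)] * 4})
synthetic = [{"id": f"s{i}", "token_log_probs": [math.log(0.6)] * 4} for i in range(20)]

kept = filter_low_ppl(domain, max_ppl=10.0)            # drops the outlier (perplexity ~100)
train_set = mix(kept, synthetic, domain_ratio=0.2, total=10)
```

A hard threshold is the simplest possible filter; ranking by perplexity and keeping a fixed fraction would be an equally reasonable choice under the same assumptions.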

5. Cold-Start SFT in Multimodal and Sequential RL Pipelines

In multimodal LLMs and VLAs, cold-start SFT is foundational for initializing downstream RL. Standard supervised approaches jointly teach reasoning content and output format, but often induce harmful instruction-style overfitting and low OOD generalization (Chen et al., 29 Oct 2025, Wei et al., 28 May 2025). Preference-based cold-start methods—especially self-distilled frameworks such as SPECS—explicitly decouple surface-form alignment from core reasoning via:

Self-distilled format-focused preference pairs (Chen et al., 29 Oct 2025).
DPO-based optimization of shallow format compliance, followed by RL for deep semantic correctness.
Quantification of generalization via the GF coefficient, capturing both in-domain and OOD gains and correlating with higher RL ceilings and stability.
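For reference, the standard DPO loss for a single preference pair can be written directly from sequence log-probabilities. The sketch below shows that generic loss, not the specific SPECS pipeline; the function name, the `beta` value, and the toy log-probabilities (a well-formatted "chosen" response against a badly formatted "rejected" one) are assumptions for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*: sequence log-probabilities under the policy being trained.
    ref_logp_*: the same quantities under the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)) = softplus(-z), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: zero margin, loss equals ln(2).
neutral = dpo_loss(-6.0, -6.0, -6.0, -6.0)

# Policy already prefers the well-formatted response more than the reference does.
preferring = dpo_loss(-5.0, -9.0, -6.0, -6.0)
```

The loss falls below $\ln 2$ as soon as the policy favors the chosen response more strongly than the reference does, so format-focused pairs push the model toward format compliance without supplying any reasoning supervision.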

The two-stage "cold-start then RL" recipe is empirically validated to consistently outperform SFT-only and RL-only methods, with absolute SOTA improvements of 4–6 points on MathVista and We-Math benchmarks and stable convergence profiles (Wei et al., 28 May 2025, Chen et al., 29 Oct 2025).

6. Task Decomposition, Attention-Based Adaptation, and Data Selection

Attention pattern analysis reveals that cold-start SFT adapts LLMs primarily by modulating a sparse, task-specific set of attention heads. Empirical studies show:

SFT rapidly activates attention heads specific to new domains, and complex-task adaptation patterns are well-approximated as linear combinations of simpler task perturbations (Zhao et al., 2024).
This compositionality can be leveraged by pre-finetuning on basic tasks in proportions estimated from activation pattern analysis, before target fine-tuning.
When private target data is lacking, public data can be selected for SFT by measuring the correlation between their activation patterns and a pseudo-private seed, leading to 2–3% accuracy gains over naive selection.
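The selection step in the last point can be sketched as a correlation ranking. The sketch assumes each dataset has already been summarized as a vector of per-head activation scores; how those scores are extracted is not shown, and the dataset names, the vectors, and the helper `select_public_data` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def select_public_data(seed_pattern, candidates, top_k):
    """Rank candidate datasets by correlation with the seed's activation pattern."""
    ranked = sorted(candidates,
                    key=lambda name: pearson(seed_pattern, candidates[name]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical activation scores over four attention heads.
seed = [0.9, 0.1, 0.8, 0.0]  # pattern of the pseudo-private seed
public = {
    "code_qa": [0.8, 0.2, 0.7, 0.1],
    "poetry": [0.1, 0.9, 0.0, 0.8],
    "math_word": [0.7, 0.0, 0.9, 0.1],
}
selected = select_public_data(seed, public, top_k=2)
```

In this toy setup the two datasets whose active heads overlap with the seed are retained, while the dataset that activates the complementary heads is dropped.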

Most adaptation occurs in a few hundred steps; targeted SFT focused on attention-score matrix parameters—optionally freezing the remainder—maximizes efficiency and adaptation speed.

7. Practical Guidelines and Empirical Protocols

Summarized recommendations for cold-start SFT include:

Always compute base-model perplexity and filter for low-PPL examples (Harada et al., 17 Jun 2025).
Use synthetic data or reconstructed instruction-following corpora to preserve generalization if the original SFT set is unavailable (Ding et al., 11 Jun 2025).
Limit fine-tuning epochs (often 1–3 suffice in low-resource regimes) and monitor for early signs of overfitting; employ moderate domain data mixing (roughly 10–30% domain data, with the remainder synthetic general instructions).
Employ probability-based or entropy-sensitive weighting (ConSFT/ASFT/IDFT) to suppress high-variance or distribution-shift updates (Zhang et al., 9 May 2026, Zhang et al., 12 Feb 2026, Zhu et al., 28 Sep 2025).
Where possible, decouple surface-form alignment from reasoning by preference-based DPO cold-start and hybrid objectives (Chen et al., 29 Oct 2025).
Structure all SFT (especially in RL pipelines) to initialize policies for stability and improved exploration, not just instruction mimicry (Wei et al., 28 May 2025).
Within extremely low-label environments, incorporate unsupervised cluster-based auxiliary tasks before supervised SFT (Shnarch et al., 2022).
For parameter efficiency and scalability, utilize LoRA adapters for adaptation, focusing on mid-layers; gradient clipping and small learning rates provide additional stability (Harada et al., 17 Jun 2025, Zhu et al., 28 Sep 2025).
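The mid-layer recommendation can be turned into a simple layer-selection helper. The depth fractions used below (the middle half of the stack) are placeholders chosen for illustration, not values taken from the cited work, and the function names are hypothetical; the resulting indices could be used to decide which layers receive adapters or remain trainable.

```python
def mid_layer_indices(num_layers, start_frac=0.25, end_frac=0.75):
    """Indices of layers whose depth lies in [start_frac, end_frac) of the stack.

    The default fractions are illustrative placeholders.
    """
    if not 0.0 <= start_frac < end_frac <= 1.0:
        raise ValueError("need 0 <= start_frac < end_frac <= 1")
    start = int(num_layers * start_frac)
    end = int(num_layers * end_frac)
    return list(range(start, end))


def trainable_mask(num_layers, start_frac=0.25, end_frac=0.75):
    """True for layers to adapt, False for layers to freeze."""
    chosen = set(mid_layer_indices(num_layers, start_frac, end_frac))
    return [i in chosen for i in range(num_layers)]


layers = mid_layer_indices(32)  # layers 8 through 23 of a 32-layer stack
mask = trainable_mask(32)       # bottom 8 and top 8 layers stay frozen
```

Restricting updates this way halves the number of adapted layers in the toy configuration; the right band for a given model would need to be checked empirically.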

Cold-start SFT, as now formulated across domains, enables robust, rapid, and safe adaptation of large pretrained models under stringent resource and data-availability constraints, aligning in-domain performance with retention of general and compositional skills.