Sequential Fine-Tuning (2SFT)
- Sequential Fine-Tuning (2SFT) is a training strategy that adapts pre-trained models in ordered stages to optimize performance on downstream tasks.
- It divides fine-tuning into distinct phases, allowing for targeted improvements in domains like multilingual NLP, computer vision, and continual learning.
- The approach addresses trade-offs such as catastrophic forgetting by carefully managing transitions between sequential learning objectives.
Sequential Fine-Tuning (2SFT) refers to a family of training strategies for adapting pre-trained models to downstream tasks via a deliberate, temporally ordered division into two or more distinct fine-tuning phases. In the canonical two-stage form, a model is first adapted on a source (“Stage 1”) dataset or objective, before continuing fine-tuning on a distinct target (“Stage 2”) dataset or objective without resetting the model’s parameters. Empirically and theoretically, 2SFT has shown distinct performance profiles across multilingual NLP, computer vision, differential privacy, multi-task and continual learning, and medical imaging, with benefits and failure modes shaped by data-resource regimes, pretraining histories, and transfer settings.
1. Formal Algorithmic Structure and Variants
Sequential fine-tuning decomposes the downstream adaptation into a fixed temporal order:
- Stage 1: The model is initialized from the pre-trained state $\theta_0$ and optimized for a loss $\mathcal{L}_1$ on dataset $D_1$: $\theta_1 = \operatorname{FineTune}(\theta_0;\, \mathcal{L}_1, D_1)$.
- Stage 2: Optimization resumes from $\theta_1$ using a new loss $\mathcal{L}_2$ on $D_2$: $\theta_2 = \operatorname{FineTune}(\theta_1;\, \mathcal{L}_2, D_2)$.
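A minimal sketch of this two-stage loop, assuming a PyTorch-style model and pre-built data loaders; `pretrained_model`, `stage1_loader`, `stage2_loader`, the loss functions, and the learning-rate/epoch values are illustrative placeholders rather than settings from the cited works:

```python
import torch

def fine_tune(model, loader, loss_fn, lr, epochs):
    """Run one fine-tuning stage; parameters are updated in place (never reset)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)   # assumes the model returns logits
            loss.backward()
            optimizer.step()
    return model

# Stage 1: adapt the pre-trained weights theta_0 on the source data D_1.
model = fine_tune(pretrained_model, stage1_loader, stage1_loss, lr=2e-5, epochs=3)
# Stage 2: resume from the Stage-1 weights theta_1 on the target data D_2.
model = fine_tune(model, stage2_loader, stage2_loss, lr=2e-5, epochs=3)
```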
Alternatives include:
- Monolingual/mono-task fine-tuning: Only one dataset/phase.
- Simultaneous (joint/multitask) fine-tuning: Datasets $D_1$ and $D_2$ are merged; optimization alternates between them or samples stochastically (a minimal sketch follows this list).
- Complex 2SFT: Intermediate model consolidation, task-specific loss/adapter modules, explicit knowledge retention/regularization.
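For contrast with the sequential recipe, a minimal sketch of the simultaneous (joint) alternative mentioned above, in which each gradient step samples stochastically from the merged pool; `step_fn`, the batch lists, and the mixing probability `p_source` are illustrative placeholders:

```python
import random

def joint_fine_tune(model, source_batches, target_batches, step_fn,
                    num_steps, p_source=0.5):
    """Interleave updates on D_1 and D_2 instead of ordering them into stages."""
    for _ in range(num_steps):
        use_source = random.random() < p_source
        batch = random.choice(source_batches if use_source else target_batches)
        step_fn(model, batch)  # one optimizer step on the sampled batch
    return model
```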
Implementational examples include partitioning neural network parameters into groups (top-down “progressive unfreezing” (Tan et al., 2018)), data-regime-based language or task switching (Sammartino et al., 15 Aug 2025, Ye et al., 7 Sep 2025), privacy objective splitting (Ke et al., 29 Feb 2024), and preference-learning alternation (Fernando et al., 20 Oct 2024).
2. Core Use Cases in Transfer and Multitask Learning
Sequential fine-tuning has proven particularly salient for:
A. Cross-lingual transfer:
In euphemism detection, 2SFT (e.g., XLM-R, mBERT) transfers from high-resource L1 to low-resource L2, yielding Macro-F1 gains in the 0.01–0.03 range (e.g., EN→TR: +0.011, ZH→YO: +0.022) and up to +0.06 in special settings (EN→TR, mBERT). The benefit is especially marked for low-resource targets and when source and target languages differ typologically, although pretraining coverage is a dominant driver over genealogy (Sammartino et al., 15 Aug 2025).
B. Multi-phase LLM post-training:
Standard post-training of LLMs uses SFT on instruction data followed by preference learning (RLHF, DPO) in a sequential regimen. Analytical results confirm that 2SFT cannot reach any Pareto-optimal point for the combined objectives unless their optima coincide. Catastrophic forgetting of early-phase objectives is provable and empirically manifests as a trade-off between SFT and RLHF performance (Fernando et al., 20 Oct 2024).
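A toy numerical illustration of this sub-optimality claim (not taken from the cited paper): two scalar quadratic losses stand in for the SFT and preference objectives. Sequential gradient descent ends at the second objective's minimum, whereas descending a scalarized joint objective reaches a point between the two optima:

```python
# Two conflicting objectives with minima at theta = 0 and theta = 1.
L1 = lambda t: (t - 0.0) ** 2          # stand-in for the SFT objective
L2 = lambda t: (t - 1.0) ** 2          # stand-in for the preference objective
grad = lambda L, t, eps=1e-5: (L(t + eps) - L(t - eps)) / (2 * eps)

def descend(L, theta, lr=0.1, steps=500):
    for _ in range(steps):
        theta -= lr * grad(L, theta)
    return theta

theta = descend(L1, 0.5)                 # Stage 1: converges near 0.0
theta = descend(L2, theta)               # Stage 2: drifts to ~1.0, "forgetting" L1
joint = descend(lambda t: 0.5 * L1(t) + 0.5 * L2(t), 0.5)  # ~0.5, on the Pareto front
print(f"sequential: theta={theta:.3f}, L1={L1(theta):.3f}")
print(f"joint:      theta={joint:.3f}, L1={L1(joint):.3f}")
```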
C. Multistage continual learning/tasks:
In class-incremental continual learning (CL), sequential fine-tuning is a crucial baseline. Vanilla 2SFT leads to progressive overfitting/forgetting unless modifications such as slow learning rates (SLCA++), parameter-efficient adapters, and classifier alignment are used (Zhang et al., 15 Aug 2024). Similarly, in medical imaging, sequential adaptation (e.g., MedSeqFT) couples task-ordered adaptation with knowledge-distillation and data-similarity buffering for knowledge retention (Ye et al., 7 Sep 2025).
D. Layerwise unfreezing:
In data-sparse regimes, progressive or staged unfreezing from top (classifier) to early (convolutional or embedding) layers stabilizes feature adaptation, yielding sharp accuracy gains relative to full fine-tuning or head-only updates (Tan et al., 2018).
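A minimal sketch of top-down progressive unfreezing, assuming a PyTorch-style model whose modules have been grouped from input side to output side; `layer_groups`, `train_epochs_fn`, and the 5-epoch block length are illustrative placeholders:

```python
def progressive_unfreeze(model, layer_groups, train_epochs_fn, epochs_per_block=5):
    """Unfreeze layer groups from the classifier head backwards, training between steps.

    layer_groups:     list of module groups ordered input -> output (head last).
    train_epochs_fn:  callable running n epochs of ordinary fine-tuning (placeholder).
    """
    for p in model.parameters():           # start with every parameter frozen
        p.requires_grad = False
    for group in reversed(layer_groups):   # head first, then progressively deeper groups
        for p in group.parameters():
            p.requires_grad = True
        train_epochs_fn(model, epochs_per_block)
    return model
```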
3. Regularization, Forgetting, and Optimization Considerations
Regularization Approaches
Most vanilla 2SFT implementations do not use explicit continual-learning penalties such as EWC or distillation, instead relying on implicit regularizers (weight decay, dropout, early stopping). Extensions in continual and multi-task learning settings introduce:
- Slow learner (layerwise small LR scaling) to counteract rapid feature drift (Zhang et al., 15 Aug 2024).
- KL divergence to reference models to maintain stability under joint objective optimization (Fernando et al., 20 Oct 2024); a sketch of this penalty style follows the list.
- Knowledge distillation and maximum data similarity (MDS) buffer selection to reinforce pretraining-aligned representations (Ye et al., 7 Sep 2025).
- Parameter isolation and freezing to preserve task-specific “core” regions and reduce destructive interference (Wang et al., 29 Aug 2025).
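As an illustration of the explicit-penalty style listed above, a sketch of a single Stage-2 update that adds a KL term anchoring the model to a frozen reference copy; the classification setting and the coefficient `beta` are assumptions for illustration, not settings from the cited works:

```python
import torch
import torch.nn.functional as F

def kl_regularized_step(model, ref_model, optimizer, inputs, targets, beta=0.1):
    """One update on the new task loss plus a KL penalty toward a frozen reference model."""
    optimizer.zero_grad()
    logits = model(inputs)
    with torch.no_grad():
        ref_logits = ref_model(inputs)              # Stage-1 (or pre-trained) snapshot
    task_loss = F.cross_entropy(logits, targets)
    log_p = F.log_softmax(logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()   # KL(model || reference)
    loss = task_loss + beta * kl
    loss.backward()
    optimizer.step()
    return loss.item()
```

The slow-learner item corresponds, in the same framework, to optimizer parameter groups with a scaled-down backbone learning rate, e.g. `torch.optim.AdamW([{"params": backbone.parameters(), "lr": base_lr * 0.1}, {"params": head.parameters(), "lr": base_lr}])`, where the 0.1 scaling factor is illustrative.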
Forgetting and Sub-Optimality
Formal analysis demonstrates that in the presence of objective conflict (e.g., an SFT loss $\mathcal{L}_1$ versus a preference-learning loss $\mathcal{L}_2$):
- 2SFT yields model parameters at a nonzero distance from the Pareto front; the final solution sits effectively at the second-stage objective's minimum, far from jointly optimized solutions (Fernando et al., 20 Oct 2024).
- Catastrophic forgetting is observed in multilingual settings (YO→EN: a 0.331 Macro-F1 drop for XLM-R (Sammartino et al., 15 Aug 2025)) and in LLM preference learning (SFT accuracy drops of 6.5–8.8 points under 2SFT, versus 2.1 points for joint optimization) (Fernando et al., 20 Oct 2024).
Explicit freezing and slow adaptation substantially reduce forgetting: e.g., CPI-FT reduces performance drop on prior tasks by 76% relative to full SFT (Wang et al., 29 Aug 2025).
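A sketch of how such forgetting numbers can be measured: evaluate the Stage-1 task immediately after Stage 1 and again after Stage 2, and report the drop; `evaluate_macro_f1` and `stage2_train_fn` are hypothetical helpers:

```python
def forgetting_on_source(model, stage1_eval_data, stage2_train_fn, evaluate_macro_f1):
    """Return the Macro-F1 drop on the Stage-1 task caused by Stage-2 fine-tuning."""
    before = evaluate_macro_f1(model, stage1_eval_data)   # after Stage 1, before Stage 2
    stage2_train_fn(model)                                # Stage-2 fine-tuning, in place
    after = evaluate_macro_f1(model, stage1_eval_data)
    return before - after                                 # positive value = forgetting
```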
4. Empirical Performance and Task-Specific Results
Meta-analytical trends across domains:
| Domain | Typical 2SFT Benefit | Reference |
|---|---|---|
| Multilingual NLP | +0.01–0.03 Macro-F1 on low-resource L2 tasks | (Sammartino et al., 15 Aug 2025) |
| LLM Math | +2 pp Pass@1, +2 pp Maj1@64 on MATH; up to 58.8% acc. | (Liu et al., 2023) |
| Medical Imaging | +3.0% Dice, -10 mm HD95 on 10-task 3D segmentation | (Ye et al., 7 Sep 2025) |
| Lane Detection | +0.02–0.06 F1, 67–79% fewer training epochs to convergence | (Li et al., 2023) |
| CL Image Class. | +2.8–9.6 pp Last-Acc over SOTA CL on domain/class inc. | (Zhang et al., 15 Aug 2024) |
In vision and imaging, stagewise fine-tuning enables rapid convergence (10–12 epochs with 2SFT+PolyLoss versus ~100 epochs from scratch (Li et al., 2023)) and often higher final accuracy and F1 than head-only or all-layer baseline strategies (Tan et al., 2018). In math and procedural reasoning, staged generation and evaluation, or sequential instruction chaining, robustly boost accuracy and multi-task following capacity (Liu et al., 2023, Hu et al., 12 Mar 2024).
5. Implementation, Hyperparameters, and Canonical Recipes
Optimization Details
Representative settings across studies (a minimal optimizer-setup sketch for the NLP recipe follows this list):
- NLP & Cross-lingual: AdamW (β₁=0.9, β₂=0.999, weight decay=0.01), learning rate , batch size = 4, early stopping (patience=5 epochs), 5 train–dev–test splits (Sammartino et al., 15 Aug 2025).
- Medical Imaging (MedSeqFT): AdamW, learning rates (VoCo) or (UniMiSS+), LoRA rank = 2 for KD (Ye et al., 7 Sep 2025).
- CL (SLCA++): Learning rate scaling for backbone, SCE loss ; parameter-efficient adapters (LoRA) where only of weights are updated (Zhang et al., 15 Aug 2024).
- Lane Detection: RAdam optimizer, batch size 60, PolyLoss or weighted-CE for segmentation (Li et al., 2023).
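A minimal PyTorch sketch of the NLP & cross-lingual recipe above; the betas, weight decay, and patience follow the listed values, while the learning rate and all other choices are assumed placeholders rather than the study's settings:

```python
import torch

def make_optimizer(model, lr=2e-5):
    """AdamW with the beta/weight-decay settings listed above; lr is an assumed placeholder."""
    return torch.optim.AdamW(model.parameters(), lr=lr,
                             betas=(0.9, 0.999), weight_decay=0.01)

class EarlyStopper:
    """Stop when the dev metric has not improved for `patience` epochs (patience = 5 above)."""
    def __init__(self, patience=5):
        self.patience, self.best, self.bad_epochs = patience, float("-inf"), 0

    def should_stop(self, dev_score):
        if dev_score > self.best:
            self.best, self.bad_epochs = dev_score, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience   # True -> halt training
```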
Architectural and Data-Handling Nuances
- Buffering in Knowledge Distillation: Data buffers are assembled from MDS-selected samples (typically 5–10% of each task) in MedSeqFT (Ye et al., 7 Sep 2025).
- Progressive Unfreezing: Staged expansion from classifier-layer-only tuning towards full-network adaptation in DenseNet-121; each group is unfrozen for small epoch blocks (e.g., 5 epochs/block, 30 blocks for the full network) (Tan et al., 2018).
- Parameter Isolation: CPI-FT identifies, merges, and freezes small “core” parameter sets for each task, using spherical linear interpolation (SLERP) to interpolate the remaining weights (Wang et al., 29 Aug 2025); a minimal SLERP sketch follows this list.
- Data Augmentation: Sequential instruction data are built from synthetic intermediate chains, either manually (e.g., translate-then-answer) or via automated LLM/GPT-3.5-based augmentation of new instruction chains (Hu et al., 12 Mar 2024).
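A sketch of the spherical-interpolation step referenced in the parameter-isolation item; flattening the weight tensors into vectors and the coefficient `t` are illustrative choices, and the core-parameter identification itself is not shown:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-8):
    """Spherical linear interpolation between two weight tensors of the same shape."""
    a, b = w_a.flatten(), w_b.flatten()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.acos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))  # angle between directions
    if omega.abs() < 1e-6:                      # near-parallel: fall back to linear mixing
        return ((1 - t) * a + t * b).view_as(w_a)
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.view_as(w_a)
```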
6. Analysis of Success, Limitations, and Generalization
The main benefits of 2SFT include improved transfer in low-resource or highly-specialized settings, taskwise control over knowledge integration, and often faster convergence. However, unmitigated 2SFT is vulnerable to severe forgetting and cannot reach joint optima for objectives in tension—a direct outcome of the convexity theory in LLM post-training (Fernando et al., 20 Oct 2024). Catastrophic forgetting is especially pronounced where source and target data are highly imbalanced, or when pretraining coverage of source or target is poor (Sammartino et al., 15 Aug 2025).
In response, advanced frameworks (SLCA++, MedSeqFT, CPI-FT) systematically incorporate slow learning schedules, buffer-based KD, core-parameter freezing, and optimized parameter merging to approach or surpass joint-training upper bounds, filling most of the empirical performance gap in task-incremental or domain-incremental scenarios (Zhang et al., 15 Aug 2024, Ye et al., 7 Sep 2025, Wang et al., 29 Aug 2025).
7. Outlook and Theoretical Perspectives
The provable sub-optimality of vanilla 2SFT (or more generally, any strictly sequential optimization on conflicting objectives) in multi-objective, multi-modal or multi-phase post-training is now well established. This motivates a transition toward joint-learning approaches (e.g., XRIGHT joint SFT+PL) or hybrid fine-tuning regimens that blend the modularity and simplicity of 2SFT with the empirical robustness of regularization and knowledge preservation constraints (Fernando et al., 20 Oct 2024). Further, parameter isolation strategies leveraging per-task sensitivity and geometry-aware parameter fusion indicate a promising direction for principled multi-task LLM adaptation without catastrophic forgetting (Wang et al., 29 Aug 2025).
Although specific gains and pitfalls are architecture- and data-dependent, sequential fine-tuning remains a foundational mechanism for transferring learned knowledge under resource constraints, privacy regimes, or continual learning settings, and an active locus for algorithmic innovation in the presence of competing downstream requirements.