
Two-Stage Fine-Tuning

Updated 11 January 2026
  • Two-stage fine-tuning is a training paradigm that sequentially decomposes model adaptation into two distinct phases to optimize both generalization and task-specific performance.
  • The method decouples learning objectives: parameters adapted in Stage I are typically frozen during Stage II's specialized updates, avoiding interference between objectives.
  • Empirical studies report that this approach improves parameter efficiency and overall performance across diverse applications such as LLM instruction tuning, diffusion-based video editing, and class-imbalanced learning.

Two-stage fine-tuning is a versatile training paradigm in modern machine learning, involving the explicit sequential decomposition of model adaptation into two distinct optimization phases. This approach has arisen independently across numerous domains—including LLMs, diffusion-based video editing, multilingual reasoning, class-imbalanced classification, neural architecture search, knowledge distillation, and parameter estimation—each leveraging the structure of two-stage fine-tuning to address complex adaptation, representation, or generalization challenges. The defining characteristic is the strict separation of learning objectives or modules between the two stages, avoiding interference and optimizing synergies that are otherwise degraded under single-stage or simultaneous adaptation.

1. Formulation and Core Principles

Two-stage fine-tuning is characterized by the consecutive execution of two training or adaptation phases, with each stage serving a targeted role in the model’s overall adaptation. The first stage usually introduces new knowledge, alignment, or structure under strong constraints (e.g., parameter or modular scope, data type, or loss type), while the second stage performs specialized or task-specific adaptation, often with different trainable parameters, objectives, data regimes, or regularization strengths. Crucially, parameters or modules tuned in Stage I may be frozen in Stage II, and the two optimization steps are not interleaved.

Generalized Schematic

Let $\theta$ denote all trainable parameters of a pre-trained model, and let $\mathcal{L}_1$, $\mathcal{L}_2$ be stage-specific loss functions. The workflow is:

  1. Stage I: Optimize $\min_{\theta_1} \mathcal{L}_1(\theta_1; D_1)$, where $\theta_1 \subseteq \theta$ (often a strict subset: e.g., normalization scales, prompt vectors, head layers).
  2. Stage II: With $\theta_1$ fixed, optimize $\min_{\theta_2} \mathcal{L}_2(\theta_2; D_2)$, where $\theta_2 \cap \theta_1 = \emptyset$ or, in some cases, $\theta_2 \supseteq \theta_1$ with a different focus.

Distinct roles for $\mathcal{L}_1$ and $\mathcal{L}_2$ are essential: for instance, $\mathcal{L}_1$ may be a knowledge-injection loss, a reweighted loss for rare class amplification, or a distillation objective, while $\mathcal{L}_2$ is often a downstream task objective or another loss more attuned to the target evaluation metric.
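The schematic above can be made concrete with a toy example. The following is a minimal, hypothetical sketch (plain NumPy gradient descent on a linear model, not any paper's actual method): Stage I trains only the bias, $\theta_1 = \{b\}$, on $D_1$; the bias is then frozen while Stage II trains the disjoint weight subset, $\theta_2 = \{w\}$, on $D_2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stage-specific datasets (hypothetical: D1 for alignment, D2 for the task).
X1, y1 = rng.normal(size=(64, 3)), rng.normal(size=64)
X2, y2 = rng.normal(size=(64, 3)), rng.normal(size=64)

w = np.zeros(3)   # theta_2: task weights, trained only in Stage II
b = 0.0           # theta_1: bias, trained only in Stage I, frozen afterwards

def mse_grad_b(X, y, w, b):
    # dL/db for L = mean((Xw + b - y)^2)
    return 2.0 * np.mean(X @ w + b - y)

def mse_grad_w(X, y, w, b):
    # dL/dw for the same mean-squared-error loss
    return 2.0 * X.T @ (X @ w + b - y) / len(y)

# Stage I: optimize theta_1 = {b} under L1 on D1; theta_2 stays fixed.
for _ in range(200):
    b -= 0.1 * mse_grad_b(X1, y1, w, b)

# Stage II: theta_1 frozen; optimize theta_2 = {w} under L2 on D2.
for _ in range(200):
    w -= 0.05 * mse_grad_w(X2, y2, w, b)
```

The two loops never interleave: Stage II sees `b` only as a constant, which is the defining decoupling property described above.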

2. Domain-Specific Instantiations

The two-stage paradigm has been instantiated in a wide range of domains with distinct module, data, and loss decompositions, including but not limited to:

| Domain | Stage I Objective / Module | Stage II Objective / Module |
|---|---|---|
| Video diffusion/editing | Norm tuning for temporal alignment | Spatial adapters for per-frame fidelity |
| LLM instruction-tuning | Broad medical knowledge injection | MCQ exam adaptation |
| Multilingual LLMs | Language alignment via code-switch | English-only instruction tuning |
| Vision/NER | Architectural mutation (NAS) | Fine-tune mutated weights |
| Embodied AI | Standard RL/BC adaptation | Gradient noise, batch/sample reduction |
| Graph-to-text | Wikipedia graph-text warmup | Targeted graph-to-text fine-tuning |
| Class-imbalance learning | Head-only, weighted loss | Full model, standard loss |
| Model fusion/selection | BO hyperparam/search trajectory | BO-based model fusion (Pareto-optimal) |
| Multimodal retrieval-gen | RL for filtering irrelevant docs | RL for explainable QA and retrieval |
| Model distillation | Distill pre-training | Distill fine-tuning |

Domain-specific designs exploit the decoupling property to mitigate mutual interference, unlock parameter efficiency, and yield representations more suited for generalization, specialization, or robustness (Xia et al., 11 May 2025, Zhou et al., 2024, Zhang et al., 2024, Wang et al., 2022, ValizadehAslani et al., 2022, Jang et al., 2024, Zhao et al., 19 Dec 2025, Wu et al., 14 Mar 2025, Pezzuti et al., 28 Mar 2025, Choi et al., 2023, Wan et al., 2024, Chen et al., 3 Dec 2025, Wang et al., 2021, Wu et al., 2024, Lakshminarayanan et al., 6 Apr 2025).

3. Mathematical and Optimization Structures

The two-stage framework leverages mathematically distinct loss landscapes, parameter subsets, or modules at each stage.

Sample decomposition for LLMs, biomedical tuning (Zhou et al., 2024):

$$\mathcal{L}_{\text{Stage 1}} = -\sum_{t} \log P_\theta(y_t \mid x, y_{<t}) \qquad \mathcal{L}_{\text{Stage 2}} = -\sum_{i=1}^{4} q_i \log p_i$$

with different LoRA ranks and learning rates per stage.
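The two objectives above can be evaluated directly. The sketch below (hypothetical values, not tied to any specific model) computes the Stage 1 token-level negative log-likelihood and the Stage 2 four-way MCQ cross-entropy, where $q$ is the one-hot gold-answer distribution and $p$ the model's option probabilities:

```python
import numpy as np

def stage1_loss(token_logprobs):
    # L_Stage1 = -sum_t log P_theta(y_t | x, y_<t): negative log-likelihood
    # of the gold continuation, summed over target tokens.
    return -float(np.sum(token_logprobs))

def stage2_loss(q, p, eps=1e-12):
    # L_Stage2 = -sum_{i=1}^{4} q_i log p_i: cross-entropy between the
    # one-hot answer distribution q and the predicted option probabilities p.
    q, p = np.asarray(q), np.asarray(p)
    return -float(np.sum(q * np.log(p + eps)))

# Hypothetical values: log-probs of 3 gold tokens; an MCQ whose answer is B.
nll = stage1_loss([-0.2, -1.1, -0.4])             # ≈ 1.7
ce = stage2_loss([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])  # = -log(0.7)
```

Because $q$ is one-hot, the Stage 2 loss reduces to the negative log-probability of the correct option, which is why the two stages can share an underlying likelihood-based training loop while targeting different granularities.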

4. Motivations and Theoretical Rationale

Common motivations for two-stage fine-tuning include:

  • Mitigating interference between heterogeneous objectives that degrade each other under joint or single-stage optimization.
  • Restricting each stage to the parameter subset best suited to its role (e.g., norms, prompts, adapters, heads), improving parameter efficiency.
  • Aligning the second stage with the target evaluation metric or data regime while preserving knowledge injected in the first.

5. Empirical Findings and Impact

Empirical studies across application domains consistently report that two-stage fine-tuning offers:

  • Accuracy gains over single-stage or joint-training baselines (representative figures in Section 7).
  • Improved parameter efficiency when stages use PEFT modules such as LoRA, adapters, or soft prompts.
  • Better generalization and robustness, particularly in low-resource, multilingual, and class-imbalanced regimes.

6. Ablations, Limitations, and Practical Guidance

Ablation studies and practical guidelines highlight several key points:

  • Sequentiality is critical: Merged or joint training of both stages leads to mutual degradation or suboptimal trade-offs between objectives (Xia et al., 11 May 2025, Zhang et al., 2024, ValizadehAslani et al., 2022).
  • Right module selection per stage: PEFT modules (e.g., LoRA), soft prompts, adapters, or layer freezing are often more effective than full-model updates, but their optimal scope may be task or domain dependent (Zhou et al., 2024, Wang et al., 2022, Wan et al., 2024).
  • Loss surfaces and metric misalignment: In domains such as LLM fusion, the metric of interest may be poorly aligned with the task loss, which two-stage BO fusion overcomes (Jang et al., 2024).
  • Scaling and transfer extension: Two-stage systems scale favorably to larger models or multilingual/low-resource regimes with appropriate adjustments (Zhang et al., 2024, Chen et al., 3 Dec 2025).
  • Limitations: Some schemes rely on the quality of pretrained features, the availability of high-resource data for the first stage, synthetic data for OOD adaptation, or assume simulator accessibility (Lakshminarayanan et al., 6 Apr 2025, Chen et al., 3 Dec 2025). There is sometimes sensitivity to module design and the proportion of replay or noise-injection samples.
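As a concrete illustration of the PEFT modules mentioned above, the LoRA parameterization keeps the pre-trained weight frozen and trains only a low-rank residual, so each stage can attach its own module to the same base model. The sketch below uses assumed shapes and plain NumPy, not any particular library's API:

```python
import numpy as np

rng = np.random.default_rng(1)

d, r, alpha = 8, 2, 16              # hidden size, LoRA rank, scaling (assumed)
W = rng.normal(size=(d, d))         # frozen pre-trained weight
A = rng.normal(size=(r, d)) * 0.01  # LoRA down-projection (trainable)
B = np.zeros((d, r))                # LoRA up-projection, zero-initialized so
                                    # the adapted layer starts identical to W

def lora_forward(x):
    # y = x W^T + (alpha / r) * x A^T B^T: only A and B receive gradients,
    # leaving the base weight W untouched across both stages.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d))
assert np.allclose(lora_forward(x), x @ W.T)  # B = 0 → no drift at init
```

The zero-initialized up-projection is the standard design choice that makes Stage II start exactly from the Stage I model, which is one reason low-rank modules suit sequential schemes.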

7. Representative Algorithms and Quantitative Benchmarks

Distinct algorithmic blueprints and their measured benefits include:

  • DAPE for Video Editing: Norm-tuning (+0.20 CLIP-frame, +0.06 CLIP-text), adapter tuning for fidelity; two-stage decoupling achieves best composite metrics (Xia et al., 11 May 2025).
  • Medical LLMs: Stage I injection of 200k multilingual QA pairs; Stage II MCQ tuning; +3-17% accuracy over single-stage (Zhou et al., 2024).
  • ManiSkill2 Embodied Policy: Batch/sample noise injection in Stage II yields 3–15% absolute test gain (Gao et al., 2023).
  • LLM Generalization (ProMoT): Soft-prompt + model two-stage reduces format overfitting; up to +4.74 normalized average over conventional fine-tuning (Wang et al., 2022).
  • Class-Imbalanced Text: Head reweighting then full fine-tuning: up to +0.0161 F1 gain on tails, +0.0133 micro-F1 overall (ValizadehAslani et al., 2022).
  • Model Fusion (BOMF): Two-stage BO + fusion finds Pareto-optimal model averages, outperforming SWA by ≈ +1 pt GLUE, +0.5–1 BLEU/ROUGE (Jang et al., 2024).
  • SDF-TopoNet: Two-stage SDF regression then topological loss, yielding ≈200–400% improvements in both Dice and clDice over prior persistent-homology segmentation (Wu et al., 14 Mar 2025).

References

These primary sources provide implementation blueprints and empirical validation contexts for two-stage fine-tuning across modalities, tasks, and adaptation challenges.
