Two-Stage Fine-Tuning
- Two-stage fine-tuning is a training paradigm that sequentially decomposes model adaptation into two distinct phases to optimize both generalization and task-specific performance.
- The method decouples learning objectives between stages: parameters adapted in Stage I are typically frozen during Stage II's specialized updates, avoiding interference between objectives.
- Empirical studies report that this approach improves parameter efficiency and overall performance across diverse settings, including LLM adaptation, video editing, and class-imbalanced learning.
Two-stage fine-tuning is a versatile training paradigm in modern machine learning, involving the explicit sequential decomposition of model adaptation into two distinct optimization phases. This approach has arisen independently across numerous domains—including LLMs, diffusion-based video editing, multilingual reasoning, class-imbalanced classification, neural architecture search, knowledge distillation, and parameter estimation—each leveraging the structure of two-stage fine-tuning to address complex adaptation, representation, or generalization challenges. The defining characteristic is the strict separation of learning objectives or modules between the two stages, avoiding interference and preserving synergies that are otherwise degraded under single-stage or simultaneous adaptation.
1. Formulation and Core Principles
Two-stage fine-tuning is characterized by the consecutive execution of two training or adaptation phases, with each stage serving a targeted role in the model’s overall adaptation. The first stage usually introduces new knowledge, alignment, or structure under strong constraints (e.g., parameter or modular scope, data type, or loss type), while the second stage performs specialized or task-specific adaptation, often with different trainable parameters, objectives, data regimes, or regularization strengths. Crucially, parameters or modules tuned in Stage I may be frozen in Stage II, and the two optimization steps are not interleaved.
Generalized Schematic
Let $\theta$ denote all trainable parameters of a pre-trained model, and let $\mathcal{L}_1$, $\mathcal{L}_2$ be stage-specific loss functions. The workflow is:
- Stage I: Optimize $\min_{\theta_1} \mathcal{L}_1(\theta_1)$, where $\theta_1 \subseteq \theta$ (often a strict subset: e.g., normalization scales, prompt vectors, head layers).
- Stage II: With $\theta_1$ fixed, optimize $\min_{\theta_2} \mathcal{L}_2(\theta_2)$, where $\theta_2 \subseteq \theta \setminus \theta_1$ or, in some cases, $\theta_2 = \theta$ with a different focus.
Distinct roles for $\mathcal{L}_1$ and $\mathcal{L}_2$ are essential: for instance, $\mathcal{L}_1$ may be a knowledge-injection loss, a reweighted loss for rare class amplification, or a distillation objective, while $\mathcal{L}_2$ is often a downstream task objective or another loss more attuned to the target evaluation metric.
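The schematic above can be sketched as a minimal optimization loop. This is an illustrative toy, not any paper's implementation: parameter scopes are expressed as binary masks over a flat parameter vector, and the function names (`two_stage_finetune`, `loss1_grad`, etc.) are assumptions for exposition.

```python
import numpy as np

def two_stage_finetune(theta, loss1_grad, loss2_grad, mask1, mask2,
                       lr1=0.1, lr2=0.1, steps=100):
    """Generic two-stage fine-tuning sketch.

    Stage I updates only the parameters selected by mask1 (theta_1),
    leaving the rest frozen; Stage II then freezes theta_1 and updates
    the (typically disjoint) subset selected by mask2 (theta_2).
    The two loops are strictly sequential, never interleaved.
    """
    theta = theta.copy()
    # Stage I: minimize L1 over theta_1 only.
    for _ in range(steps):
        theta -= lr1 * loss1_grad(theta) * mask1
    # Stage II: theta_1 is now frozen; minimize L2 over theta_2 only.
    for _ in range(steps):
        theta -= lr2 * loss2_grad(theta) * mask2
    return theta
```

With disjoint masks, each stage can drive its own objective to optimality without the gradient interference that joint training of $\mathcal{L}_1 + \mathcal{L}_2$ would introduce.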
2. Domain-Specific Instantiations
The two-stage paradigm has been instantiated in a wide range of domains with distinct module, data, and loss decompositions, including but not limited to:
| Domain | Stage I Objective / Module | Stage II Objective / Module |
|---|---|---|
| Video diffusion/editing | Norm tuning for temporal alignment | Spatial adapters for per-frame fidelity |
| LLM instruction-tuning | Broad medical knowledge injection | MCQ exam adaptation |
| Multilingual LLMs | Language alignment via code-switch | English-only instruction tuning |
| Vision/NER | Architectural mutation (NAS) | Fine-tune mutated weights |
| Embodied AI | Standard RL/BC adaptation | Gradient noise, batch/sample reduction |
| Graph-to-text | Wikipedia graph-text warmup | Targeted graph-to-text fine-tuning |
| Class-imbalance learning | Head-only, weighted loss | Full model, standard loss |
| Model Fusion/Selection | BO hyperparam/search trajectory | BO-based model fusion (Pareto-optimal) |
| Multimodal retrieval-gen | RL for filtering irrelevant docs | RL for explainable QA and retrieval |
| Model distillation | Distill pre-training | Distill fine-tuning |
Domain-specific designs exploit the decoupling property to mitigate mutual interference, unlock parameter efficiency, and yield representations more suited for generalization, specialization, or robustness (Xia et al., 11 May 2025, Zhou et al., 2024, Zhang et al., 2024, Wang et al., 2022, ValizadehAslani et al., 2022, Jang et al., 2024, Zhao et al., 19 Dec 2025, Wu et al., 14 Mar 2025, Pezzuti et al., 28 Mar 2025, Choi et al., 2023, Wan et al., 2024, Chen et al., 3 Dec 2025, Wang et al., 2021, Wu et al., 2024, Lakshminarayanan et al., 6 Apr 2025).
3. Mathematical and Optimization Structures
The two-stage framework leverages mathematically distinct loss landscapes, parameter subsets, or modules at each stage.
- Parameter isolation/freeze: Explicit parameter scopes per stage (e.g., normalize-and-adapt, head-then-body, prompt-then-model).
- Customized loss functions: Stage I may use Huber loss on residuals (Xia et al., 11 May 2025), margin-based reweighting (ValizadehAslani et al., 2022), language modeling (Chen et al., 3 Dec 2025), or custom regularizers (BO/fusion, (Jang et al., 2024)); Stage II typically employs task-specific CE or structured RL (MMRAG, (Zhao et al., 19 Dec 2025)).
- Batch and schedule modulation: Learning rate decay, noisy gradient injection (smaller batch, reduced sample size) in Stage II for regularization and better generalization (Gao et al., 2023).
- Adapter, prompt, or side-branch modules: Parameter-efficient adapters (LoRA, PEFT), soft prompts, or alignment layers are developed and frozen or replaced between stages (Zhang et al., 2024, Zhou et al., 2024, Xia et al., 11 May 2025).
A representative decomposition for biomedical LLM tuning (Zhou et al., 2024) applies LoRA adapters in both stages, with different LoRA ranks and learning rates chosen per stage.
4. Motivations and Theoretical Rationale
Common motivations for two-stage fine-tuning include:
- Decoupling mutually adverse adaptations: In video editing, temporal norm tuning and spatial detail enhancement conflict if trained jointly; separating them allows each to reach optimality without degradation (Xia et al., 11 May 2025).
- Mitigating overfitting and improving generalization: Stage II with more gradient noise or targeted replay (e.g., reweighted batches) combats sharp minima and catastrophic forgetting (Gao et al., 2023, Li et al., 2024, ValizadehAslani et al., 2022).
- Boosted adaptation for under-represented modalities or languages: Dedicated alignment or pre-adaptation stages allow low-resource languages or classes to benefit from richer pretraining (Zhang et al., 2024, Chen et al., 3 Dec 2025, ValizadehAslani et al., 2022).
- Efficient and modular parameter utilization: PEFT, soft prompts, adapters: only a small subset of parameters is trained in each stage, reducing memory and time cost (Zhou et al., 2024, Xia et al., 11 May 2025, Wan et al., 2024, Zhang et al., 2024).
- Improved multitask and out-of-distribution (OOD) resilience: By priming models on general or partially known information, two-stage approaches enhance OOD capabilities and context transfer (Li et al., 2024, Chen et al., 3 Dec 2025, Wang et al., 2022).
5. Empirical Findings and Impact
Empirical studies across application domains consistently report that two-stage fine-tuning offers:
- Performance gains versus conventional or single-stage fine-tuning, especially on minority-domain or minority-class metrics, and in cross-lingual or OOD scenarios (Zhou et al., 2024, Xia et al., 11 May 2025, Zhang et al., 2024, Wu et al., 14 Mar 2025, Pezzuti et al., 28 Mar 2025, Jang et al., 2024).
- Reduction in overfitting and better retention of pretrained knowledge or skills, as seen in class balance (ValizadehAslani et al., 2022), knowledge replay (Li et al., 2024), or modularity (Wang et al., 2022).
- Parameter- and compute-efficiency, often requiring less than 1% of full model updates per stage, and enabling single-GPU or limited-resource adaptation where appropriate (Zhou et al., 2024, Xia et al., 11 May 2025, Wan et al., 2024, Wu et al., 2024).
- Established best practices for task freezing/unfreezing, adaptive reweighting, selection of checkpointing strategies, and robust model fusion (Jang et al., 2024, Wu et al., 14 Mar 2025, Kim et al., 2022).
6. Ablations, Limitations, and Practical Guidance
Ablation studies and practical guidelines highlight several key points:
- Sequentiality is critical: Merged or joint training of both stages leads to mutual degradation or suboptimal trade-offs between objectives (Xia et al., 11 May 2025, Zhang et al., 2024, ValizadehAslani et al., 2022).
- Right module selection per stage: PEFT modules (e.g., LoRA), soft prompts, adapters, or layer freezing are often more effective than full-model updates, but their optimal scope may be task or domain dependent (Zhou et al., 2024, Wang et al., 2022, Wan et al., 2024).
- Loss surfaces and metric misalignment: In domains such as LLM fusion, the metric of interest may be poorly aligned with the task loss, which two-stage BO fusion overcomes (Jang et al., 2024).
- Scaling and transfer extension: Two-stage systems scale favorably to larger models or multilingual/low-resource regimes with appropriate adjustments (Zhang et al., 2024, Chen et al., 3 Dec 2025).
- Limitations: Some schemes rely on the quality of pretrained features, the availability of high-resource data for the first stage, synthetic data for OOD adaptation, or assume simulator accessibility (Lakshminarayanan et al., 6 Apr 2025, Chen et al., 3 Dec 2025). There is sometimes sensitivity to module design and the proportion of replay or noise-injection samples.
7. Representative Algorithms and Quantitative Benchmarks
Distinct algorithmic blueprints and their measured benefits include:
- DAPE for Video Editing: Norm-tuning (+0.20 CLIP-frame, +0.06 CLIP-text), adapter tuning for fidelity; two-stage decoupling achieves best composite metrics (Xia et al., 11 May 2025).
- Medical LLMs: Stage I injection of 200k multilingual QA pairs; Stage II MCQ tuning; +3-17% accuracy over single-stage (Zhou et al., 2024).
- ManiSkill2 Embodied Policy: Batch/sample noise injection in Stage II yields 3–15% absolute test gain (Gao et al., 2023).
- LLM Generalization (ProMoT): Soft-prompt + model two-stage reduces format overfitting; up to +4.74 normalized average over conventional fine-tuning (Wang et al., 2022).
- Class-Imbalanced Text: Head reweighting then full fine-tuning: up to +0.0161 F1 gain on tails, +0.0133 micro-F1 overall (ValizadehAslani et al., 2022).
- Model Fusion (BOMF): Two-stage BO + fusion finds Pareto-optimal model averages, outperforming SWA by ≈ +1 pt GLUE, +0.5–1 BLEU/ROUGE (Jang et al., 2024).
- SDF-TopoNet: Two-stage SDF regression then topological loss, yielding ≈200–400% improvements in both Dice and clDice over prior persistent-homology segmentation (Wu et al., 14 Mar 2025).
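The batch/sample-reduction schedule attributed to Gao et al. (2023) can be sketched as a two-stage SGD schedule. This is an illustrative toy on a least-squares objective, not the embodied-policy setup itself; the batch sizes, learning rates, and step counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sgd_stage(w, X, y, batch, lr, steps):
    """One fine-tuning stage: mini-batch SGD on a least-squares loss.
    A smaller `batch` injects more gradient noise per step."""
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch, replace=False)
        w = w - lr * X[idx].T @ (X[idx] @ w - y[idx]) / batch
    return w

def two_stage_schedule(X, y):
    """Stage I adapts with large, stable batches; Stage II deliberately
    shrinks the batch (one could also reduce the sample pool) so that
    gradient noise acts as a regularizer near convergence, biasing the
    solution away from sharp minima."""
    w = np.zeros(X.shape[1])
    w = sgd_stage(w, X, y, batch=64, lr=0.05, steps=400)  # Stage I
    w = sgd_stage(w, X, y, batch=8, lr=0.01, steps=400)   # Stage II
    return w
```

The noise injection in Stage II is paired with a reduced learning rate so the iterates stay in the neighborhood found by Stage I while still sampling flatter regions of the loss surface.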
References
- DAPE: Dual-Stage Parameter-Efficient Fine-Tuning for Consistent Video Editing with Diffusion Models (Xia et al., 11 May 2025)
- Towards Democratizing Multilingual LLMs For Medicine Through A Two-Stage Instruction Fine-tuning Approach (Zhou et al., 2024)
- LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Reasoning (Zhang et al., 2024)
- Two-stage LLM Fine-tuning with Less Specialization and More Generalization (Wang et al., 2022)
- Two-Stage Fine-Tuning: A Novel Strategy for Learning Class-Imbalanced Data (ValizadehAslani et al., 2022)
- Model Fusion through Bayesian Optimization in LLM Fine-Tuning (Jang et al., 2024)
- MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation (Zhao et al., 19 Dec 2025)
- SDF-TopoNet: A Two-Stage Framework for Tubular Structure Segmentation via SDF Pre-training and Topology-Aware Fine-Tuning (Wu et al., 14 Mar 2025)
- Exploring the Effectiveness of Multi-stage Fine-tuning for Cross-encoder Re-rankers (Pezzuti et al., 28 Mar 2025)
- Incremental Few-Shot Object Detection via Simple Fine-Tuning Approach (Choi et al., 2023)
- Metadata-Enhanced Speech Emotion Recognition: Augmented Residual Integration and Co-Attention in Two-Stage Fine-Tuning (Wan et al., 2024)
- Adapting LLMs to Low-Resource Tibetan: A Two-Stage Continual and Supervised Fine-Tuning Study (Chen et al., 3 Dec 2025)
- Stage-wise Fine-tuning for Graph-to-Text Generation (Wang et al., 2021)
- PanAdapter: Two-Stage Fine-Tuning with Spatial-Spectral Priors Injecting for Pansharpening (Wu et al., 2024)
- Fine Tuning a Data-Driven Estimator (Lakshminarayanan et al., 6 Apr 2025)
- Two-stage architectural fine-tuning with neural architecture search using early-stopping in image classification (Kim et al., 2022)
These primary sources provide implementation blueprints and empirical validation contexts for two-stage fine-tuning across modalities, tasks, and adaptation challenges.