
Two-Phase Transfer-Learning Approach

Updated 5 December 2025
  • The paper demonstrates that a sequential two-phase transfer-learning framework significantly improves model generalization and mitigates overfitting compared to single-stage adaptation.
  • The methodology involves first adapting a pre-trained model on a controlled proxy dataset and then fine-tuning it on a specialized target domain, yielding measurable accuracy gains.
  • The approach has been empirically validated across domains such as vision, health informatics, and physics-informed modeling, consistently enhancing convergence and reducing error rates.

A two-phase transfer-learning approach is a structured methodology in which model adaptation occurs in sequential stages, each with distinct objectives and target data domains. Unlike single-step fine-tuning, the two-phase strategy decomposes domain shifts or task adaptation into more manageable sub-problems, typically resulting in superior generalization, robustness under scarce data, and strong empirical performance across a range of applications, including vision, health informatics, kernel regression, Bayesian optimization, and physics-informed modeling.

1. Formal Definition and Paradigm Scope

A two-phase transfer-learning protocol consists of:

  1. Phase I (Initial Transfer / Pre-adaptation): The model is adapted from a generic source domain (often large-scale natural datasets, simulations, or previously annotated corpora) to an intermediate target that serves as a controlled proxy, with characteristics that are simpler than, or more closely matched to, the ultimate task.
  2. Phase II (Final Transfer / Task-Specific Adaptation): The intermediate model is further fine-tuned or re-adapted to the primary task domain, which may be smaller, more specialized, less labeled, or subject to greater domain-specific artifacts.

This decomposition addresses large domain discrepancies by distributing representational and statistical shifts over two transfer steps, yielding both improved convergence and resistance to overfitting or catastrophic forgetting.
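
The protocol amounts to the same adaptation routine applied twice to increasingly specific data. A minimal sketch in PyTorch (loaders, epoch counts, and learning rates are illustrative placeholders, not values from any cited paper):

```python
import torch
from torch import nn, optim

def fine_tune(model, loader, epochs, lr, device="cpu"):
    """One transfer phase: supervised fine-tuning on a single data domain."""
    model.to(device).train()
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Phase I: adapt the generically pretrained backbone to the intermediate proxy domain.
# model = fine_tune(model, proxy_loader, epochs=20, lr=1e-4)
# Phase II: re-adapt to the specialized target domain, typically with a lower learning rate.
# model = fine_tune(model, target_loader, epochs=20, lr=1e-5)
```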

2. Methodological Instantiations and Backbone Architectures

Two-phase transfer learning is implemented across model families and methodological frameworks:

  • Deep CNNs: The typical setup uses a backbone (e.g., ResNet-50 in "Improving automatic endoscopic stone recognition..." (Lopez-Tiro et al., 2023), or AlexNet in "Feature Representation Analysis..." (Suzuki et al., 2018)) pretrained on ImageNet. Phase I fine-tunes all layers on a proxy dataset ("ex-vivo CCD-camera stones" or "CUReT textures"), followed by Phase II fine-tuning on the final medical or domain-specific dataset ("endoscopic stones," "DLD CT patches").
  • Attention and Multi-View Fusion: After the two transfer phases, branches are fused (SUR/SEC) via late-fusion blocks (max-pooling or concatenation+FC), with optional attention enhancement modules, such as CBAM (Lopez-Tiro et al., 2023).
  • Kernel Ridge Regression (KRR): SATL (Lin et al., 22 Feb 2024) sets Phase I to source KRR estimation and Phase II to target offset learning using the same Gaussian RBF kernel, with adaptive bandwidth and regularization tuned on Sobolev smoothness grids through validation or Lepski's method (see the kernel sketch after this list).
  • Bayesian Optimization (GP Surrogates): TransBO (Li et al., 2022) executes Phase I as supervised cooperative weighting of multiple source GP models to form a linear ensemble, and Phase II as adaptive blending of the source surrogate with a target GP, using cross-validation ranking-loss minimization.
  • Physics-Informed Neural Networks (PINNs): In (Yeregui et al., 28 Mar 2025), Phase I pretrains a PINN on analytical physics constraints (SPM PDEs, boundary conditions), and Phase II fine-tunes only select, physically interpretable parameters against field data, freezing all branch/trunk weights.
  • Parameter-Efficient Transfer Learning (PETL): A two-stage alignment-adaptation paradigm (Zhao et al., 2023) first tunes LayerNorm scales/shifts for feature distribution matching, then uses Taylor-expansion-based scores to select and adapt only the most task-relevant channels via tiny adapters (sketched below).
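
The SATL-style decomposition for kernel regression can be illustrated with an offset estimator: Phase I fits KRR on the source sample, Phase II fits a second KRR to the target residuals. The sketch below is a simplified stand-in (cross-validated grids replace the paper's Lepski-type smoothness adaptation, and the hyperparameter grids are placeholders):

```python
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

def fit_phase(X, y, gammas, lambdas):
    """Fit an RBF-kernel ridge regressor with bandwidth/regularization chosen by CV."""
    grid = GridSearchCV(KernelRidge(kernel="rbf"),
                        {"gamma": gammas, "alpha": lambdas}, cv=5)
    grid.fit(X, y)
    return grid.best_estimator_

def two_phase_krr(X_src, y_src, X_tgt, y_tgt,
                  gammas=(0.1, 1.0, 10.0), lambdas=(1e-3, 1e-2, 1e-1)):
    f_src = fit_phase(X_src, y_src, gammas, lambdas)      # Phase I: source estimator
    residual = y_tgt - f_src.predict(X_tgt)               # target offset signal
    f_off = fit_phase(X_tgt, residual, gammas, lambdas)   # Phase II: offset estimator
    return lambda X: f_src.predict(X) + f_off.predict(X)  # combined transfer estimator
```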
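
For the PETL paradigm, the two stages touch disjoint, tiny parameter sets. A simplified, self-contained sketch (module and function names are hypothetical; the actual method operates on a ViT backbone with the channel selection described by the authors):

```python
import torch
import torch.nn as nn

def stage1_align(backbone: nn.Module):
    """Stage 1: freeze everything except LayerNorm scale/shift (distribution alignment)."""
    for p in backbone.parameters():
        p.requires_grad = False
    for m in backbone.modules():
        if isinstance(m, nn.LayerNorm) and m.elementwise_affine:
            m.weight.requires_grad = True
            m.bias.requires_grad = True

def channel_importance(features, grads):
    """First-order Taylor score per channel: |activation * gradient|, averaged over batch/tokens."""
    return (features * grads).abs().mean(dim=(0, 1))

class TinyAdapter(nn.Module):
    """Stage 2: bottleneck adapter applied only to the top-K selected channels."""
    def __init__(self, channel_idx, hidden=8):
        super().__init__()
        self.register_buffer("idx", torch.as_tensor(channel_idx, dtype=torch.long))
        self.down = nn.Linear(len(channel_idx), hidden)
        self.up = nn.Linear(hidden, len(channel_idx))

    def forward(self, x):                        # x: (batch, tokens, channels)
        out = x.clone()
        sel = x[..., self.idx]
        out[..., self.idx] = sel + self.up(torch.relu(self.down(sel)))
        return out
```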

3. Domain Shift Decomposition and Theoretical Rationale

A critical motivation is progressive reduction of domain shift:

  • Stepwise Shift Handling: The model is first transferred from a broad, generic domain (e.g., ImageNet) to a proxy domain that is closer to the target yet still comparatively well-structured (e.g., ex-vivo, low-noise, controlled acquisition), and then finally adapted to the small, noisily sampled, artifact-rich task domain. This progressive approach achieves higher discriminative power and avoids negative transfer compared to direct single-step adaptation (Lopez-Tiro et al., 2023).
  • Distributional Robustness: Each phase tackles a smaller statistical discrepancy, allowing more stable and semantically meaningful feature learning (e.g., edges in phase I, fine textures in phase II for medical vision (Suzuki et al., 2018)).
  • Minimax Optimality in Regression: SATL derives explicit upper and lower excess risk bounds for the two-phase kernel regression estimator, showing that the transfer-inspired offsets yield convergence rates that optimally combine sample sizes and smoothness exponents of both source and target (Lin et al., 22 Feb 2024).
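
A schematic of how such a bound typically reads (illustrative only, not the paper's exact statement): with source/target sample sizes $n_S$, $n_T$, smoothness exponent $s$ for the source regression function, and $s_\delta$ for the source-target offset,

$$
\mathbb{E}\,\big\|\hat{f} - f_{T}\big\|_{L^{2}}^{2} \;\lesssim\; n_{S}^{-\frac{2s}{2s+1}} \;+\; n_{T}^{-\frac{2s_{\delta}}{2s_{\delta}+1}},
$$

so the shared component is learned at the rate afforded by the large source sample, while the small target sample only has to resolve the offset.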

4. Training Protocols, Layer Freezing, and Attention Mechanics

Training details vary across applications:

| Study/Method | Freezing/Update Policy | Loss Function(s) |
|---|---|---|
| Endoscopic stones, ResNet-50 (Lopez-Tiro et al., 2023) | Full fine-tuning in Phases I/II; feature layers frozen during fusion | Cross-entropy (patch-based) |
| SATL, KRR (Lin et al., 22 Feb 2024) | Fixed kernel; adaptive regularization λ in both phases | Squared error with kernel regularization |
| Physics PINN (Yeregui et al., 28 Mar 2025) | PINN weights frozen after Phase I; only selected physical parameters/FFNN unfrozen in Phase II | PDE + boundary-condition residuals, voltage tracking |
| PETL, ViT (Zhao et al., 2023) | LayerNorm only in Stage 1; top-K tiny adapters in Stage 2 | Cross-entropy with channel selection |
| SHM MLP (Tsialiamanis et al., 2022) | Pretrained first layer frozen; only the task head updated in Phase II | Categorical cross-entropy |

Attention modules, e.g., CBAM, are inserted post-transfer to further refine spatial/channel feature saliency, substantially improving late-fusion accuracy (Lopez-Tiro et al., 2023).
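
The freezing policies above share a common pattern: after Phase I, most weights are frozen and only a small, explicitly named parameter subset is exposed to the Phase II optimizer. A minimal PyTorch sketch (the keyword strings are illustrative placeholders, not names from the cited models):

```python
import torch.nn as nn

def freeze_for_phase2(model: nn.Module, trainable_keywords=("head", "physical")):
    """Freeze the Phase I backbone; keep only matching parameter groups trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)
    # Pass only the still-trainable parameters to the Phase II optimizer.
    return [p for p in model.parameters() if p.requires_grad]
```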

5. Quantitative Gains and Empirical Validation

Two-phase approaches consistently outperform single-phase or naïve baselines:

  • Kidney Stone Classification (Lopez-Tiro et al., 2023): Two-step TL improves single-view accuracy by 13-16 pp over training from scratch, compared with a 9.6 pp gain for a naïve mixed-domain baseline; the fused multi-view model reaches 91.25% accuracy (vs. 80% for the baselines).
  • Medical CT Texture Classification (Suzuki et al., 2018): Two-stage transfer achieves 96.01% accuracy (F1=0.9724), outperforming single-stage transfer by 0.4% absolute, with improved robustness at all sampled training fractions.
  • SHM Damage Localization (Tsialiamanis et al., 2022): Partitioned/transfer scheme yields 98.82% combined test accuracy and increased feature-space separability, with faster convergence than direct scratch training.
  • Kernel Hypothesis Transfer (Lin et al., 22 Feb 2024): SATL attains excess risk matching the theoretical minimax bound under unknown smoothness, with error decaying as the sum of source/target rate exponents.
  • Parameter-Efficient Transfer (Zhao et al., 2023): TTC-tuning reaches VTAB-1k mean accuracy 74.8% with only 0.19M tunable parameters, outperforming SSF (73.1%, 0.24M params) and full fine-tuning.
  • Bayesian Optimization (Li et al., 2022): TransBO achieves the fastest rank descent in HPO and NAS benchmarks, with no negative transfer observed thanks to the adaptive target-weight (p_T) blending.
  • Physics-Informed Battery Estimation (Yeregui et al., 28 Mar 2025): Final parameter estimation error <4% (Raspberry Pi deployment), with bulk PINN weights frozen and only physical/FFNN layers tuned at field time.

6. Extensions, Generalizations, and Limitations

Two-phase schemes generalize to multi-view and multi-modal settings (MRI T1/T2, mammography CC/MLO) where each domain is incrementally closer to the final imaging conditions (Lopez-Tiro et al., 2023). Methods such as TransBO and SATL can operate in sequential multi-source settings, recursively updating transfer weights. PINN-based approaches extend to low-cost edge devices for real-time physical parameter tracking (Yeregui et al., 28 Mar 2025).

Limitations include:

  • Necessity for suitable intermediate domains or proxies.
  • Complexity of validation- or gradient-based channel selection (PETL).
  • Difficulty in applications with severe source-target dissimilarity (generative model-based TL (Yamaguchi et al., 2022)).
  • Non-convexity of some optimization stages (L2T (Wei et al., 2017)).
  • Structure-specific transfer masks/hyperparameters (MSGTL (Mendes et al., 2020)).

7. Insights and Theoretical Implications

Empirical and theoretical evidence supports that progressive or decomposed transfer accelerates convergence, preserves semantic feature hierarchies, improves generalization in small-sample scenarios, and reduces overfitting in high-dimensionality, low-data contexts. Two-phase strategies align with meta-cognitive and supervised aggregation concepts, as exemplified in reflection learning (L2T (Wei et al., 2017)) and probabilistic fine-tuning (MSGTL (Mendes et al., 2020)).

The core principle—that learning domain-appropriate structure in intermediate phases substantially increases the quality and stability of final adaptation—translates across architectures, learning models, and practical deployments.
