
Fine-Tuning Pre-Trained Models

Updated 4 December 2025
  • Fine-tuning pre-trained models is the process of adapting a model originally trained on generic data to a specific task by updating all or a subset of its parameters.
  • Parameter-efficient techniques such as BitFit, adapters, and structured pruning enable significant resource savings while retaining or improving task performance.
  • Robustness strategies, including regularization, calibration, and fairness adjustments, help mitigate issues like catastrophic forgetting and distribution shifts.

Fine-tuning pre-trained models denotes the paradigm in which a model first trained on a large, generic source distribution is subsequently adapted to a target task of interest, often with limited labeled data. Fine-tuning is central to modern deep learning for language, vision, 3D geometry, and code, enabling rapid transfer of general representations to specialized domains. Despite its empirical utility, fine-tuning introduces complex trade-offs: catastrophic forgetting, data efficiency, robustness under distribution shift, fairness across demographic groups, and engineering choices such as parameter efficiency and stability.

1. Fundamental Principles and Standard Fine-Tuning Regimes

Fine-tuning proceeds by initializing a model with pre-trained parameters (typically learned on large-scale, often task-agnostic, datasets) and performing adaptation—via full or partial parameter updates—on a downstream (target) dataset. Standard regimes include full model fine-tuning (all parameters adapted), linear probing (only classification head trained), and partial tuning (convex combinations, adapters, or prompt-based transfer).
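A minimal PyTorch sketch of the two extreme regimes follows; the torchvision ResNet-50 backbone and the 10-class head are illustrative assumptions, not tied to any cited paper:

```python
import torch.nn as nn
from torchvision import models

def build_linear_probe(num_classes: int = 10) -> nn.Module:
    # Linear probing: freeze every pre-trained weight, train only a new head.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for param in model.parameters():
        param.requires_grad = False  # backbone stays fixed
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # fresh head, trainable
    return model

def build_full_finetune(num_classes: int = 10) -> nn.Module:
    # Full fine-tuning: every parameter, backbone included, is updated.
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model
```

Partial-tuning schemes (adapters, prompts) sit between these extremes, training small added modules while the backbone stays frozen (see Section 2).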

Theoretically, the excess risk in fine-tuning depends on four factors: (1) the proximity of source and target task gradients ("gradient closeness"), (2) the smoothness of the target loss surface, (3) the Polyak–Łojasiewicz property ensuring convergence, and (4) stochasticity of the gradients. Excess risk bounds (cf. Liu et al., 2021) reveal that the benefit from pre-training diminishes as the target data size increases and the source–target gap widens. However, including carefully selected pre-training examples during fine-tuning tightens these generalization bounds, especially when target data are scarce (Liu et al., 2021).
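As a purely schematic illustration of how these four factors enter (this is not the exact bound of Liu et al., 2021):

```latex
% Schematic excess-risk decomposition: n is the target sample size,
% \Delta_{S \to T} measures source-target gradient closeness, L is the
% smoothness constant, \mu the Polyak-Lojasiewicz constant, and \sigma^2
% the gradient-noise variance.
\[
  \mathcal{E}\bigl(\hat{\theta}_{\mathrm{ft}}\bigr)
  \;\lesssim\;
  \underbrace{\Delta_{S \to T}}_{\text{source--target gap}}
  \;+\;
  \underbrace{\frac{L}{\mu}\cdot\frac{\sigma^{2}}{n}}_{\text{optimization and noise}}
\]
```

The qualitative reading matches the text: larger n shrinks the stochastic term, while a wider source–target gap inflates the first term.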

2. Parameter-Efficient Fine-Tuning and Pruning

The drive for parameter efficiency motivates multiple methods that freeze large portions of the pre-trained model, allowing only minor components to adapt. Notable approaches include:

  • BitFit: Only bias terms and the classification head are optimized, with the backbone frozen. BitFit matches or outperforms full fine-tuning on standard NLP tasks such as GLUE, even with <0.2% trainable parameters (Doering et al., 8 Jan 2024); see the sketch after this list.
  • Adapter modules: Small bottleneck networks are inserted after each Transformer sub-layer, with all original weights frozen. Adapter parameter count is on the order of 2% of the total model, with stable performance when optimally tuned (Doering et al., 8 Jan 2024).
  • Parameter-Efficient Fine-Tuning (PEFT) for 3D: In the 3D domain, Point-PEFT introduces domain-specific prompt tokens and geometry-aware adapters. Only 4–5% of parameters are adapted, achieving superior or equal accuracy to full fine-tuning on classification and segmentation tasks (Tang et al., 2023).
  • Structured Model Pruning: Frameworks such as TransTailor iteratively prune task-irrelevant filters (via Taylor-based importance scoring) and then re-optimize the sub-network to enhance target-data fit while reducing FLOPs and overcapacity. TransTailor produces smaller, tailored sub-models that surpass both pruning baselines and full fine-tuning on real-world benchmarks (Liu et al., 2021).
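A hedged sketch of the BitFit recipe on a Hugging Face encoder; the checkpoint name ("bert-base-uncased"), the two-label head, and the name-matching rule are illustrative assumptions:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # placeholder checkpoint and label count
)

for name, param in model.named_parameters():
    # Train only bias terms and the classification head; freeze everything else.
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # on the order of 0.1-0.3%
```

The same freezing pattern extends to adapters: insert small bottleneck modules and set requires_grad = True only for their parameters.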

These methods are summarized in the following table:

| Method | % Parameters Tuned | Notable Benefits |
| --- | --- | --- |
| Full FT | 100 | Maximum capacity and potential adaptation |
| BitFit | ~0.2 | Stability, strong performance, resource-lite |
| Adapters | ~2 | Modular, multi-task flexibility |
| Point-PEFT (3D) | 4–5 | 95–97% parameter savings, no accuracy drop |
| Pruning (TransTailor) | 50–80 (pruned) | Smaller, faster, often improved accuracy |

3. Regularization, Robustness, and Calibration

Naively fine-tuning a pre-trained model adapts it to downstream tasks, but doing so can induce catastrophic forgetting, overfitting, and loss of robustness:

  • Catastrophic Forgetting & Distributional Robustness: End-to-end fine-tuning of the full backbone often degrades out-of-distribution (OOD) accuracy, even below that of the original pre-trained model. WiSE-FT-LP restores robustness by interpolating in weight-space between pre-trained and fine-tuned weights, followed by a frozen-backbone, linear head re-fit—preserving strong backbone features while recovering in-distribution performance (Zhang et al., 25 Apr 2024); see the interpolation sketch after this list.
  • Semantic Drift and Overfitting: DR-Tune addresses feature drift between pre-trained and fine-tuned encoders. The classification head is regularized to maintain low loss on the original pre-trained feature distribution—calibrated via Procrustes rotation and class translation. Distribution regularization (head only) ensures smoother, more transferable decision boundaries (Zhou et al., 2023).
  • Feature Discrimination Alignment: In few-shot transfer, fine-tuning can distort generalizable features, especially under distribution shift. FD-Align explicitly aligns category-independent (spurious) feature subspaces during fine-tuning, aiming to preserve generalizability (Song et al., 2023).
  • Fine-Tuning and Class Subset Calibration: When fine-tuning is performed on a subset of the original classes, the resulting model retains the ability to distinguish the absent classes, but their logits are systematically underestimated, harming absent-class accuracy. A simple post hoc logit shift (γ) restores accuracy to or above pre-trained levels, showing that calibration, not forgetting, is central to transfer in this regime (Mai et al., 24 Sep 2024).
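A minimal sketch of the WiSE-FT-style interpolation step, assuming two checkpoints with matching state-dict keys; the mixing coefficient alpha is a tunable assumption, and the subsequent linear-head re-fit is indicated only as a comment:

```python
import torch

def interpolate_weights(pretrained_sd, finetuned_sd, alpha=0.5):
    """theta = (1 - alpha) * theta_pre + alpha * theta_ft, tensor by tensor."""
    blended = {}
    for key, w_pre in pretrained_sd.items():
        w_ft = finetuned_sd[key]
        if torch.is_floating_point(w_pre):
            blended[key] = (1.0 - alpha) * w_pre + alpha * w_ft
        else:
            blended[key] = w_ft  # integer buffers (e.g., batch counters) are copied
    return blended

# Usage (hypothetical names): load the blended weights, freeze the backbone,
# then re-fit only the linear head on in-distribution data (the "LP" step).
# model.load_state_dict(interpolate_weights(pre_sd, ft_sd, alpha=0.5))
```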

4. Specialization: Robustness, Fairness, and Noisy Labels

  • Robustness to Label Noise: High label noise can distort learned representations during fine-tuning. The TURN algorithm first runs a probing/linear-head step to select clean data (via a per-class GMM over CE losses; see the sketch after this list), then fine-tunes the entire model on the purified subset. TURN achieves better accuracy than sophisticated clean-label learning approaches at a fraction of their computational cost (Ahn et al., 2023).
  • Fairness Across Demographics: Vanilla fine-tuning may inadvertently increase demographic biases, even when the original pre-trained model is fair. By measuring Fisher information for each head parameter per demographic group and neutralizing group-imbalanced weights, followed by low-rank reparameterization, bias-mitigated fine-tuning reduces both fairness metric gaps (Δ_DP, Δ_EO) and parameter count by >90% (Zhang et al., 1 Mar 2024).
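A hedged sketch of the per-class GMM selection step in the spirit of TURN; the two-component split, the posterior threshold tau, and scikit-learn as the GMM backend are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_clean_indices(losses, labels, tau=0.5):
    """Fit a 2-component GMM per class on per-sample CE losses; keep samples
    whose posterior for the low-loss ("clean") component exceeds tau."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        gmm = GaussianMixture(n_components=2, random_state=0).fit(losses[idx, None])
        clean = int(np.argmin(gmm.means_.ravel()))             # low-loss mode
        p_clean = gmm.predict_proba(losses[idx, None])[:, clean]
        keep.extend(idx[p_clean > tau].tolist())
    return np.array(sorted(keep))
```

Full-model fine-tuning then proceeds on only the indices this step returns.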

5. Algorithms Leveraging Structure, Causality, and Additional Pre-Training Data

  • Structure-aware and Concept-wise Fine-Tuning (for code and vision): SAT regularizes code models to align their first-layer attention maps to gold-standard AST distance matrices, delivering smoothed structure loss in multi-task training and direct BLEU/EM gains in code summarization and translation, especially in low-data regimes (Wu et al., 11 Apr 2024). In vision, Concept-Tuning maximizes mutual information across rare-feature (patch) channels and applies front-door causal adjustment—via dedicated attention and InfoNCE losses—to suppress spurious feature reliance and negative transfer, substantially outperforming previous SOTA (Yang et al., 2023).
  • Span Fine-Tuning and KNN-augmented Objectives: Span Fine-tuning for PrLMs partitions input sequences via an n-gram dictionary and augments Transformer outputs with hierarchical CNN span features, significantly lifting GLUE and NER benchmark scores without expensive span-based pre-training (Bao et al., 2021). KNN-BERT introduces a momentum-contrastive objective yielding clustered embeddings, then interpolates between a linear classifier and nearest-neighbor votes in representation space (see the sketch after this list), outperforming the baseline under both clean and adversarial conditions (Li et al., 2021).
  • Re-use of Pre-Training Data and Optimal Transport Selection: Rather than discarding source data, including a subset of pre-training data selected via unbalanced optimal transport—matching class/cluster centroids between source and target domains—yields universally improved generalization bounds and up to 3% absolute gain in accuracy, particularly when target data are scarce (Liu et al., 2021).
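A hedged sketch of the classifier/kNN interpolation at prediction time, in the spirit of KNN-BERT; the memory bank of (embedding, label) pairs, k, and the mixing weight lam are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def knn_interpolated_probs(query, bank_embs, bank_labels, logits,
                           k=8, lam=0.5, num_classes=2):
    """p = lam * softmax(logits) + (1 - lam) * similarity-weighted kNN vote.
    Shapes: query (d,), bank_embs (N, d), bank_labels (N,) int64,
    logits (num_classes,)."""
    sims = F.cosine_similarity(query[None, :], bank_embs)     # (N,) similarities
    top_sims, top_idx = sims.topk(k)
    weights = F.softmax(top_sims, dim=0)                      # soft neighbor votes
    knn_probs = torch.zeros(num_classes)
    knn_probs.scatter_add_(0, bank_labels[top_idx], weights)  # accumulate per class
    return lam * F.softmax(logits, dim=-1) + (1.0 - lam) * knn_probs
```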

6. Stability, Federated, and Hyperparameter-Optimized Fine-Tuning

  • Stability Analysis and Regularization: Full fine-tuning is unstable under dataset or random seed perturbations. Theoretical results show that stability increases with sample size, smoothness, and initialization closeness. Multi-head loss (ensembles of parallel linear heads), maximal-margin regularization, and self-unsupervised re-training (SURT) on the unlabeled target distribution demonstrably temper this instability and reduce run-to-run variance (Fu et al., 2023).
  • Federated and Communication-Efficient Fine-Tuning: In federated settings with resource-constrained clients, SFPrompt partitions the pre-trained model between client and server, updating only client-side parameters and soft prompts. Combined with sample-wise dataset pruning and local loss-driven updates, SFPrompt maintains near-oracle test accuracy with <1% of the local FLOPs and halves the communication cost vs. naïve federated fine-tuning (Cao et al., 24 Jul 2024).
  • Hyperparameter and Checkpoint Fusion: Direct averaging of NLP model checkpoints commonly fails due to the mismatch between loss and metric landscapes. Multi-objective Bayesian optimization in parameter space, optimizing for task metric(s) as well as loss, produces fused models (BOMF) outperforming all constituent checkpoints and uniform averages; see the fusion sketch after this list. Two-stage BO (outer for hyperparameters, inner for fusion) is computationally efficient because hyperparameter optima transfer across LoRA/freeze levels (Jang et al., 11 Nov 2024).
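A minimal sketch of the fusion step that such methods search over, i.e., a convex combination of fine-tuned checkpoints; the Bayesian-optimization loop proposing the coefficients is omitted, and coeffs stands in for one candidate it might evaluate:

```python
import torch

def fuse_checkpoints(state_dicts, coeffs):
    """Weighted average of matching tensors across K checkpoints."""
    assert len(state_dicts) == len(coeffs)
    assert abs(sum(coeffs) - 1.0) < 1e-6  # convex combination
    fused = {}
    for key, w0 in state_dicts[0].items():
        if torch.is_floating_point(w0):
            fused[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
        else:
            fused[key] = w0  # copy non-float buffers from the first checkpoint
    return fused
```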

7. Practical Guidelines and Future Directions

  • Selection of Fine-Tuning Schemes: Full fine-tuning maximizes adaptation but incurs instability and parameter cost; BitFit and adapters suit resource constraints or streaming tasks (Doering et al., 8 Jan 2024). Layer freezing exploits the empirical finding that lower Transformer layers change minimally across downstream tasks—even in code models—allowing up to 80% parameter savings with negligible or positive performance delta (Shi et al., 2023); see the layer-freezing sketch after this list.
  • Regularization and Robustness: Head-only regularization (DR-Tune, fairness methods), weight-space ensembling, and logit calibration provide improved trade-offs between ID accuracy, OOD robustness, and fairness—with little computational overhead.
  • Handling Specific Challenges: Algorithms such as TURN (for noisy labels), Concept-Tuning (for negative transfer), and ModelFusion-BOMF (for loss/metric landscape alignment) exhibit task-specific solution strategies rooted in theoretical and empirical insights.
  • Open Problems: Principled handling of semantic drift under extreme domain shift, automated selection of calibration parameters (e.g., per-class γ in logit correction), and generalization of causal structure-based regularizers remain unsolved. Deeper theoretical analysis of feature space evolution under partial or domain-limited fine-tuning is also needed.
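A hedged sketch of lower-layer freezing on a Hugging Face BERT-style encoder; the checkpoint name and the 8-of-12 split are illustrative assumptions:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")  # placeholder checkpoint

NUM_FROZEN = 8  # freeze embeddings plus the lowest 8 of 12 encoder layers
for param in model.embeddings.parameters():
    param.requires_grad = False
for layer in model.encoder.layer[:NUM_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False
```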

Fine-tuning pre-trained models encompasses a spectrum of methods and considerations: parameter efficiency, safeguarding generalization, robustness to label and task noise, fairness, stability, specialization to domain structure, and computational constraints. Ongoing research continues to expand the theoretical and methodological toolkit for safely and efficiently adapting large, general-purpose models to specific, and sometimes adverse, downstream environments.
