
Parameter-Efficient Transfer Learning

Updated 3 March 2026
  • Parameter-efficient transfer learning (PETL) comprises methods that adapt large pre-trained models to new tasks by updating only a small fraction (0.1–5%) of their parameters.
  • It employs techniques such as adapters, LoRA, and prompt tuning to reduce memory and storage requirements while maintaining near full fine-tuning accuracy.
  • PETL offers scalable, robust solutions for multi-task, lifelong, and federated learning, making it ideal for resource-constrained environments like edge devices.

Parameter-efficient transfer learning (PETL) refers to a collection of methodologies that adapt large pre-trained models to multiple downstream tasks by training or introducing only a small, task-specific set of additional parameters per task, while keeping the majority of the base model's parameters fixed. Unlike full fine-tuning, which requires a separate, fully updated model for every task, PETL methods drastically reduce both memory usage and storage requirements, facilitate multi-task and continual learning, and support scalable deployment in multi-tenant environments. PETL methods match, and on some benchmarks surpass, full fine-tuning accuracy while updating as little as 0.1–5% of the underlying model weights (Houlsby et al., 2019; Zeng et al., 2023; Lin et al., 2024; Nguyen et al., 2025; Mai et al., 2024).

1. Core Concepts and Motivations

The growth of model size in natural language processing, vision, speech, and multimodal applications has made standard full fine-tuning increasingly inefficient for practical deployments. Storing a separate copy of a multi-hundred-million-parameter model for each task or user is both computationally and memory prohibitive. PETL methods address these challenges by:

  • Promoting parameter and storage efficiency: Only a tiny task-specific “delta” (e.g., adapters, prompt vectors, low-rank updates) is learned per task, reducing the tuning and storage overhead to as little as 0.04–3% of the model size (Houlsby et al., 2019, Mai et al., 2024).
  • Enabling rapid multi-task and lifelong adaptation: New tasks can be added without risk of catastrophic forgetting, since the backbone remains fixed, and the memory footprint per task scales linearly with the number of trainable parameters in the PETL module (Houlsby et al., 2019).
  • Improving generalization and robustness: By limiting the capacity of adaptation and leveraging the powerful general-purpose features of pre-trained backbones, PETL provides strong implicit regularization—especially advantageous in low-data or out-of-domain settings (Mai et al., 2024).
  • Supporting practical constraints: PETL is especially advantageous for edge devices, federated learning, privacy-sensitive settings, and multi-user systems where updating and storing full models is infeasible (Mudrakarta et al., 2018, Shysheya et al., 2022).

2. Representative PETL Methodologies

PETL encompasses a rich family of architectural techniques, with the most prominent being adapters, prompt-based methods, low-rank reparameterization, masking/pruning, hypernetwork-based tuning, and hybrid/conditional approaches.

Method | Updatable Params (% of base) | Key Mechanism
------ | ---------------------------- | -------------
Houlsby Adapter | 1–4% | Bottleneck MLPs after attention/FFN
LoRA | 0.5–2% | Low-rank decomposition of weight updates
Prompt/Prefix | <0.1% | Prepend trainable soft tokens
BitFit, LayerNorm | 0.04–0.2% | Tune bias/scale parameters only
Masked PETL | 0.01–1% (bit-level storage) | Binary masking of shared PETL blocks
Hypernetworks | ~0.05% | Generate adapter weights per layer or task
S2A/Conditional | <1% | Structure-/activation-efficient PETL

2.1 Adapter Modules

Introduced in NLP with BERT, adapter modules are small feed-forward networks (typically two-layer bottlenecks) inserted in parallel or sequentially after each sub-layer (e.g., attention or FFN). The canonical bottleneck adapter applies a learnable down-projection, a nonlinearity, and an up-projection to the sub-layer's output, followed by a residual addition (Houlsby et al., 2019). For BERT_BASE with bottleneck dimension m = 64, only about 1.1% of parameters are updated per task, with GLUE performance within 0.4% of full fine-tuning.
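As a rough sketch of where the savings come from, the following counts the parameters of Houlsby-style bottleneck adapters on a BERT_BASE-sized backbone. The 110M backbone size, two adapters per block, and the omission of layer-norm parameters are simplifying assumptions, so the fraction differs somewhat from the paper's exact figure but lands in the 1–4% band of the table above:

```python
# Back-of-the-envelope parameter count for bottleneck adapters.
# Assumed sizes: BERT_BASE-like backbone (d_model = 768, 12 layers, ~110M params).
D_MODEL = 768
N_LAYERS = 12
BACKBONE_PARAMS = 110_000_000

def adapter_params(d: int, m: int) -> int:
    """Down-projection (d*m weights + m biases), nonlinearity (no params),
    up-projection (m*d weights + d biases)."""
    return (d * m + m) + (m * d + d)

def adapter_overhead(m: int, adapters_per_layer: int = 2) -> float:
    """Fraction of backbone parameters added when an adapter follows
    both the attention and FFN sub-layers of every block."""
    total = adapter_params(D_MODEL, m) * adapters_per_layer * N_LAYERS
    return total / BACKBONE_PARAMS

print(f"m=64 adds {adapter_overhead(64):.2%} of the backbone per task")
```

Shrinking the bottleneck m trades capacity for parameter count, which is the knob behind the range reported for adapters.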

2.2 Prompt and Prefix Tuning

Prompt tuning adapts only a set of trainable “soft tokens,” appended at various locations (input, per-layer keys/values) in Transformer models. It thereby exposes the backbone to new tasks by conditioning it on learned synthetic context (Ding et al., 2024, Yu et al., 2022). Prefix or deep prompt tuning operates at each layer, with minimal (<0.1–0.5%) parameter cost, but sometimes lower maximum capacity.
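The parameter cost is easy to see concretely. In this minimal sketch (the token count, hidden size, and backbone size are illustrative assumptions), only the soft-token embeddings are trainable:

```python
# Prompt tuning trains only n_tokens embedding vectors of size d_model.
D_MODEL = 768
BACKBONE_PARAMS = 110_000_000  # assumed BERT_BASE-scale backbone

def prompt_params(n_tokens: int, d_model: int = D_MODEL) -> int:
    """Trainable parameters: one learned embedding per soft token."""
    return n_tokens * d_model

fraction = prompt_params(20) / BACKBONE_PARAMS
print(f"20 soft tokens: {prompt_params(20):,} params ({fraction:.4%} of backbone)")
```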

2.3 Low-Rank Adaptation (LoRA) and Reparameterization

LoRA and similar techniques (e.g., FacT) compute rank-constrained updates to key projection matrices (typically W_q, W_v) as ΔW = A·B, with A ∈ ℝ^{d×r}, B ∈ ℝ^{r×d}, where r ≪ d. Only the small factors are learned per task, allowing sub-1% overhead with minimal computation and high accuracy (Ding et al., 2024; Cappellazzo et al., 2023).
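A minimal plain-Python sketch of the factorization (dimensions shrunk from the typical d = 768 for readability; initializing B to zero follows common LoRA practice so the adapted model starts identical to the frozen backbone):

```python
import random

def matmul(A, B):
    """Plain-Python matrix product for lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_init(d: int, r: int, seed: int = 0):
    """A ~ small Gaussian, B = 0, so the update ΔW = A·B is zero at the start."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d)]
    B = [[0.0] * d for _ in range(r)]
    return A, B

d, r = 64, 4
A, B = lora_init(d, r)
delta_W = matmul(A, B)                 # ΔW ∈ R^{d×d}, rank ≤ r
trainable, dense = 2 * d * r, d * d    # factors vs. a full update of W
print(trainable, "trainable vs", dense, "dense params")
```

Only A and B are stored and optimized; the frozen W and the low-rank ΔW are added at inference (or merged once after training, so LoRA adds no inference latency).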

2.4 Masked and Shared PETL

PROPETL (Zeng et al., 2023) learns a single set of PETL weights (adapter, LoRA, etc.) and uses learned 1-bit binary masks to extract sparse, task- or layer-specific sub-networks. This achieves 9–30× storage savings (down to 0.1% of bit-level storage) and reveals systematic overparameterization even within small PETL modules.
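The storage arithmetic behind this style of sharing can be sketched as follows (module size and task count are illustrative, not PROPETL's exact configuration):

```python
# One shared set of 32-bit PETL weights plus a 1-bit binary mask per task,
# versus storing full 32-bit PETL weights separately for every task.

def storage_bits_shared(petl_params: int, n_tasks: int) -> int:
    return petl_params * 32 + n_tasks * petl_params * 1

def storage_bits_separate(petl_params: int, n_tasks: int) -> int:
    return n_tasks * petl_params * 32

p, n = 1_000_000, 20
saving = storage_bits_separate(p, n) / storage_bits_shared(p, n)
print(f"~{saving:.1f}x bit-level storage saving across {n} tasks")
```

The saving approaches 32× as the number of tasks grows, consistent with the 9–30× range reported.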

2.5 Hypernetwork and MoE-Based PETL

SaS (Nguyen et al., 2025) uses a combination of (a) a single low-rank, cross-layer shared module and (b) compact hypernetworks that generate per-layer, task-specific module weights, achieving strong performance at <0.05% parameter cost. In multi-task settings, PEMT (Lin et al., 2024) leverages a mixture-of-experts (MoE) framework to combine adapters from diverse source tasks via a learned gating mechanism indexed by task correlation.
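A toy illustration of the hypernetwork idea (a single linear map with made-up dimensions, not the SaS architecture): per-layer adapter weights are generated on the fly from small learned layer embeddings, so only the hypernetwork and the embeddings are ever stored.

```python
import random

def make_hypernet(embed_dim: int, out_dim: int, seed: int = 0):
    """Return a generator mapping a layer embedding to that layer's
    adapter weights, plus the hypernetwork's own parameter count."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(embed_dim)]
    def generate(layer_embedding):
        # adapter_weights[j] = sum_i embedding[i] * W[i][j]
        return [sum(layer_embedding[i] * W[i][j] for i in range(embed_dim))
                for j in range(out_dim)]
    return generate, embed_dim * out_dim

gen, hyper_params = make_hypernet(embed_dim=4, out_dim=128)
layer_weights = gen([0.1, -0.2, 0.3, 0.0])  # generated, never stored
n_layers, per_layer = 12, 128
print(hyper_params, "stored vs", n_layers * per_layer, "generated")
```

The stored parameter count (hypernetwork plus tiny embeddings) stays fixed as layers are added, whereas per-layer modules would grow linearly.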

2.6 Conditional, Structured, and Memory-Efficient PETL

Methods such as S2A (Jin et al., 2025) and CoDA (Lei et al., 2023) target GPU memory and inference speed by combining bias-only/side modules with quantized activations, or by routing activations through only a subset of the network at each layer, achieving 4–10× memory reductions and 2–8× inference speedups.
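A minimal sketch of the activation-quantization half of this recipe (symmetric 8-bit rounding with illustrative values; not the exact S2A scheme):

```python
def quantize_int8(xs):
    """Symmetric quantization: map the largest magnitude to +/-127."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # avoid scale == 0
    return [round(x / scale) for x in xs], scale

def dequantize(q, scale):
    return [v * scale for v in q]

acts = [0.51, -1.27, 0.0, 0.98]   # example activations
q, s = quantize_int8(acts)
recon = dequantize(q, s)
err = max(abs(a - b) for a, b in zip(acts, recon))
print(q, f"max abs error {err:.4f}")
```

Storing activations at 8 bits instead of 32 gives a 4× activation-memory reduction on its own; combining it with bias-only or side modules pushes savings toward the reported range.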

3. Parameter Efficiency, Storage, and Computation Trade-offs

Parameter-efficient transfer learning reduces incremental memory/storage requirements per task. For example, adapter-tuning on BERT_LARGE for GLUE requires only 3.6% new parameters per task (12 M of 330 M), whereas full fine-tuning requires an entire copy per task. PETL’s effectiveness is evident in multi-task and lifelong learning: the total storage needed for N tasks is P_base + N·P_PETL (compared to N·P_full in standard fine-tuning) (Houlsby et al., 2019, Zeng et al., 2023).
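The scaling can be checked directly with the BERT_LARGE numbers above (the count of nine GLUE tasks is an assumption for illustration):

```python
# Total storage for N tasks: one frozen backbone plus N small PETL deltas,
# versus one full fine-tuned copy per task.

def petl_storage(p_base: int, p_petl: int, n_tasks: int) -> int:
    return p_base + n_tasks * p_petl

def full_ft_storage(p_full: int, n_tasks: int) -> int:
    return n_tasks * p_full

base, delta, n = 330_000_000, 12_000_000, 9
print(petl_storage(base, delta, n), "vs", full_ft_storage(base, n))
```

Here PETL needs 438M parameters in total against 2,970M for per-task full fine-tuning, and the gap widens linearly with every additional task.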

Bit-level storage can be further minimized in masked PETL approaches, as in PROPETL, by sharing 32-bit PETL weights across layers and tasks and storing only task-specific binary masks (~0.01–0.1% bit-level storage) (Zeng et al., 2023). In speech and TTS models, inserting adapters or hypernetwork-generated modules after every block enables rapid multilingual or cross-speaker adaptation with ~2.5% of the weights (Li et al., 2024).

Computation overhead also depends on the method's design. Some PETL designs (e.g., S2A, CoDA) are intentionally crafted to reduce training/inference cost by activating only a subset of the backbone or restructuring module interactions to minimize memory or FLOPs (Jin et al., 2025; Lei et al., 2023).

4. Empirical Performance and Domain-Extensive Studies

Careful ablation and benchmarking have demonstrated that PETL methods, when properly tuned, consistently match or outperform full fine-tuning, especially on low-shot or out-of-domain tasks. For example, on the VTAB-1K suite with ViT-B/16, all leading PETL methods—Adapters, LoRA, BitFit, ConvPass, RepAdapter—deliver within <1.5% accuracy of each other and outperform full fine-tuning by 2–5 points, with only a tiny fraction (0.04–1.5%) of parameters updated (Mai et al., 2024).

PETL achieves these results across modalities, including NLP (GLUE), vision (VTAB-1K), and speech, and delivers additional benefits:

  • Complementary error patterns: Even when averages match, different PETL modules yield distinct prediction boundaries, motivating model ensembling (Mai et al., 2024).
  • Robustness to distribution shift: PETL outperforms full FT in OOD scenarios (e.g., CLIP shifts), with further gains possible by linear interpolation in weight or logit space (Mai et al., 2024).
  • Scalability in multi-task/federated settings: Modular PETL paradigms such as patch-based tuning (Mudrakarta et al., 2018) and FiLM-based federated personalization (Shysheya et al., 2022) enable rapid adaptation while transmitting only minimal model updates.
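The interpolation trick mentioned above can be sketched in a few lines (alpha and the logits are illustrative; in practice alpha is tuned on held-out data):

```python
# Logit-space interpolation between a frozen zero-shot backbone and a
# PETL-tuned model, trading in-distribution accuracy for OOD robustness.

def interpolate_logits(zero_shot, fine_tuned, alpha: float):
    """alpha = 0 -> pure zero-shot, alpha = 1 -> pure fine-tuned."""
    return [(1 - alpha) * z + alpha * f for z, f in zip(zero_shot, fine_tuned)]

zs = [2.0, 1.0, -0.5]   # zero-shot logits for 3 classes
ft = [0.5, 3.0, -1.0]   # PETL-tuned logits
mix = interpolate_logits(zs, ft, alpha=0.5)
print(mix)  # [1.25, 2.0, -0.75]
```

The same convex combination applied to parameters rather than logits gives the weight-space variant.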

5. Practical Design, Limitations, and Recommendations

Best-practice guidelines for efficient and effective PETL include:

  • Method selection by budget and task: For tight parameter constraints (<0.1%), tune only biases and LayerNorm (BitFit, LayerNorm-Tune); for higher capacity at a modest budget (0.5–2%), use LoRA, Adapter, or ConvPass; for minimal active memory or hardware constraints, apply quantized/structured PETL (S2A, CoDA) (Mai et al., 2024; Jin et al., 2025).
  • Insertion points matter: In Transformers, adapters are most effective in upper layers and after attention/FFN, while in vision/spectrogram models, parallel modules before attention or in MLPs perform best (Houlsby et al., 2019, Cappellazzo et al., 2023).
  • Layer-wise and cross-layer parameter sharing: Hypernetwork- and binary-mask-based approaches enable effective trade-offs between specificity and capacity while minimizing parameter count (Zeng et al., 2023; Nguyen et al., 2025).
  • Distribution alignment: Two-stage approaches that first re-align LayerNorm statistics and then tune only the most task-relevant channels yield robust gains, especially across large domain shifts (Zhao et al., 2023).
  • Ensembling for accuracy: Because PETL modules induce complementary inductive biases, simple ensemble schemes of multiple PETL methods yield consistent gains over any single method (Mai et al., 2024).
  • Task- and domain-specific modules: For structure- or context-sensitive tasks (e.g., vision-language navigation), VI-specific PETL modules (e.g., Historical/Cross-modal Boosters) are effective (Qiao et al., 2023).

Known challenges and open questions include:

  • Optimal selection of PETL ranks, module size, and insertion location is backbone- and task-dependent and may require dataset-specific tuning (Nguyen et al., 2025).
  • Trade-off between capacity and induced regularization: increasing PETL size improves performance only up to a certain threshold.
  • For large-scale multi-expert approaches, inference cost scales with the number of source adapters, requiring future work on pruning or reparameterization (Lin et al., 2024).

6. Theoretical Understanding and Mechanisms

Recent theoretical analyses have begun to clarify when PETL (and more generally, parameter-transfer) yields gains versus when it induces negative transfer. Key findings include:

  • Universal feature alignment: Transfer is highly beneficial when upstream and downstream tasks share strong universal features; parameter efficiency amplifies this utility when the shared subspace is well aligned (Yuan et al., 2025).
  • Overparameterization in PETL modules: Even small adapters or LoRA blocks are often overparameterized, as revealed by masking/prototyping methods; significant pruning is possible with no accuracy drop (Zeng et al., 2023).
  • Distributional shift alignment: Shared low-rank modules (e.g., in SaS) mitigate pretrain–downstream mismatch, explaining their superior performance relative to per-layer-only PETL (Nguyen et al., 2025).

7. Future Directions

Active areas of research include:

  • Unified PETL frameworks: Generalized U-Tuning frameworks enable mathematically clean compositional PETL, supporting future modularity and extensibility (Jiang et al., 2023).
  • Dynamic and context-aware PETL: Hypernetworks, mixture-of-experts, and conditional-activation PETL enable dynamic adaptation to task-, instance-, or layer-level signals (Lin et al., 2024; Nguyen et al., 2025; Lei et al., 2023).
  • Activation- and memory-efficient PETL: Structured design and quantization (e.g., S2A) enable PETL for resource-constrained deployment while maintaining accuracy (Jin et al., 2025).
  • Cross-modal and foundation model PETL: Parameter efficiency is being extended to large multimodal, music, and speech foundation models, with new benchmarks and methods (Ding et al., 2024, Cappellazzo et al., 2023).
  • Theoretical guarantees: Characterizing the statistical and computational regimes in which PETL yields improvements, including necessary and sufficient conditions for positive transfer, remains an open theoretical endeavor (Yuan et al., 2025).
  • Ensemble and robustness techniques: Weight or logit space ensembles further enhance OOD robustness and accuracy, suggesting hybrid PETL-FT deployment strategies (Mai et al., 2024).

Parameter-efficient transfer learning, by reconciling scalability, accuracy, and resource constraints, has become a cornerstone schema for leveraging foundation models at scale, laying the groundwork for continual, federated, and multi-domain model adaptation in a wide variety of real-world contexts.
