
Parameter-Efficient Learning (PET)

Updated 8 December 2025
  • Parameter-Efficient Learning (PET) is a set of adaptation strategies that insert small, task-specific modules like adapters, prompts, or LoRA into pre-trained models while keeping most weights frozen.
  • It substantially reduces training cost, memory footprint, and catastrophic forgetting, making it well suited to both single-task and continual learning in domains such as NLP and vision.
  • PET leverages automated module placement and structural optimization techniques to enhance efficiency, achieving competitive performance with minimal additional parameters.

Parameter-Efficient Learning (PET) describes a collection of adaptation protocols for transferring large, pre-trained models to downstream tasks by introducing a small set of new task-specific parameters, while keeping the majority of the original model weights frozen. PET allows for rapid, memory-aware adaptation of foundation models (including LLMs, vision transformers, ConvNets, and multimodal encoders) for both single-task and continual learning regimes. PET comprises modules such as adapters, prompts, low-rank updates (LoRA), and various forms of attention or residual modifications, which can be inserted at distinct positions within the backbone model. This strategy dramatically reduces the training cost, storage footprint, and catastrophic forgetting compared to full-model fine-tuning, and supports rapid instantiation of a large number of personalized or task-adapted models.

1. Core Principles and Methodological Taxonomy

Parameter-efficient adaptation involves designing PET modules (or "blocks") to be trained independently per task while the backbone parameters Θ are held fixed. Standard PET modules include:

  • Prompt-Tuning: Learnable tokens P ∈ ℝ^{ℓ×d} are prepended to the input sequence and only these parameters are updated; the remaining model weights are static. This paradigm is widely used for NLP and has been ported to vision in visual prompt tuning (e.g., VPT, PATT) (Zhao et al., 16 Jan 2024, Yu et al., 2022).
  • LoRA (Low-Rank Adaptation): Within each targeted linear (or convolutional) layer, LoRA introduces a low-rank residual ΔW = B A, with B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k}, and r ≪ min(d, k), so that ΔW ∈ ℝ^{d×k}. Only A and B are trainable (Zhao et al., 16 Jan 2024, Du et al., 11 Dec 2024); a minimal sketch is given after this list.
  • Adapters: Small bottleneck networks consisting of a down-projection and an up-projection with a nonlinearity in between. Adapters are inserted sequentially after, or in parallel with, attention and MLP sublayers (Gao et al., 2023, Chen et al., 2022, Yu et al., 2022).
  • Prefix-tuning: Task-specific prefix vectors are prepended to the keys and values of attention layers, often generated via a learned MLP (Yu et al., 2022).
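
As a concrete illustration of the list above, the following PyTorch sketch wraps a frozen nn.Linear with a trainable low-rank residual in the LoRA style; the class name, rank, scaling factor, and initialization are illustrative assumptions rather than any specific paper's implementation.

```python
# Minimal LoRA sketch (illustrative, not a reference implementation).
# The frozen weight W is augmented with a trainable low-rank residual
# ΔW = B A, so only A and B receive gradients.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                     # freeze the backbone layer
            p.requires_grad = False
        d_out, d_in = base.weight.shape                      # W ∈ ℝ^{d×k} with d=d_out, k=d_in
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # A ∈ ℝ^{r×k}
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B ∈ ℝ^{d×r}, zero-initialized
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus scaled low-rank update.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)


# Usage: wrap an existing projection, e.g. a query projection in attention.
lora_q = LoRALinear(nn.Linear(768, 768), r=8)
out = lora_q(torch.randn(2, 16, 768))                        # (batch, seq, d)
```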

PET modules can be arbitrarily placed within the network (Hu et al., 2022, Su et al., 2023), and their instantiation, placement, and type can be optimized via differentiable search (e.g., S³PET) to maximize accuracy under parameter constraints.

2. Continual and Lifelong Learning with PET

PET is a primary strategy for continual learning (CL) in large models, enabling both task-specific adaptation and global knowledge retention. In the CL setting, the main challenges are mitigating catastrophic forgetting, balancing stability and plasticity, and enabling forward and backward transfer.

Representative frameworks include:

  • SAPT: Maintains a pool of PET blocks (e.g., LoRA or prompts) for each task, and uses a shared attention mechanism to aggregate and select these blocks at both training and inference. The attention is computed via a learned query q_t and per-block keys k_i, with softmax mixing across tasks. The learning objective combines the task loss with a KL regularization term that preserves historical attention distributions, thus virtually eliminating forgetting (Zhao et al., 16 Jan 2024); a sketch of this mixing step is given after the list.
  • LAE: Comprises three stages—learning (PET per new task), accumulation (momentum update of PET module for integration across tasks), and ensemble (combining online and offline PET experts at inference by confidence fusion). LAE is PET-agnostic and shows strong continual learning performance for diverse blocks (Adapters, LoRA, Prefix) (Gao et al., 2023).
  • SAFE: Splits PET parameters into "slow" (stability-focused, fixed after initial CL session) and "fast" (plastic, adapted per new task) modules. Feature alignment and cross-classification losses enforce compatibility between slow and fast learners, with entropy-based aggregation at inference to dynamically weigh predictions (Zhao et al., 4 Nov 2024).
  • Gradient Projection (PEGP): Introduces orthogonal gradient projection for PET parameter updates to ensure that new task learning is orthogonal to the feature subspace of past tasks, guaranteeing minimal forgetting. PEGP applies in principle to all main PET paradigms (Qiao et al., 22 May 2024).
  • HiDe-PET: Hierarchically decomposes the continual learning objective into within-task prediction, task-identity inference, and task-adaptive prediction, and assigns task-specific and task-shared PET modules to explicitly optimize each component. Auxiliary representation memory (class centroids or variance statistics) enables efficient recovery of representations and further reduction of forgetting (Wang et al., 7 Jul 2024).
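
The following PyTorch sketch illustrates the shared-attention mixing step described for SAPT above: per-task PET block outputs are combined with softmax weights computed from a learned query against per-block keys. Block interfaces, pooling, and shapes are assumptions for illustration, not the authors' code; the KL term over stored attention distributions would be added to the task loss during training.

```python
# Hedged sketch of shared-attention mixing over per-task PET blocks.
# Each block is assumed to map hidden states (batch, seq, d) to an output
# of the same shape at the PET insertion point.
import torch
import torch.nn as nn


class SharedAttentionPETPool(nn.Module):
    def __init__(self, pet_blocks: nn.ModuleList, d_model: int):
        super().__init__()
        self.blocks = pet_blocks                             # one PET block per seen task
        self.keys = nn.Parameter(torch.randn(len(pet_blocks), d_model))  # per-block keys k_i
        self.query_proj = nn.Linear(d_model, d_model)        # generates the query q_t

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        q = self.query_proj(h.mean(dim=1))                   # (batch, d) pooled query
        attn = torch.softmax(q @ self.keys.T, dim=-1)        # (batch, num_blocks)
        outs = torch.stack([blk(h) for blk in self.blocks], dim=1)  # (B, N, S, D)
        return (attn[:, :, None, None] * outs).sum(dim=1)    # attention-weighted mixture
```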

3. Parameter-Efficiency and Task Adaptation Metrics

PET blocks reduce the number of trainable parameters by one to three orders of magnitude relative to full fine-tuning. Parameter budgets for representative modules are:

| PET Module | Parameters per Layer | Typical Usage in PLMs |
| --- | --- | --- |
| Prompt-Tuning | ℓ × d | ℓ = 10–300, d = 768–4096 |
| LoRA | 2 × r × (d + k) | r = 4–16, d, k = 768–4096 |
| Adapter | 2 × d × d_b (bottleneck width d_b) | d_b = 16–96, per layer |
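
The per-layer counts in the table can be sanity-checked with simple arithmetic; the sizes below are representative assumptions (a 768-dimensional backbone), not values from a specific paper.

```python
# Back-of-the-envelope parameter counts per layer for the modules above.
d = k = 768          # hidden / projection dimensions of the frozen layer
ell = 100            # number of learnable prompt tokens
r = 8                # LoRA rank
d_b = 64             # adapter bottleneck width

prompt_params = ell * d               # learnable prompt tokens
lora_params = 2 * r * (d + k)         # e.g., LoRA on two projections per layer
adapter_params = 2 * d * d_b          # down- plus up-projection (biases ignored)

print(prompt_params, lora_params, adapter_params)
# 76800 24576 98304  (all far below the millions of parameters per backbone layer)
```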

Practical experiments indicate that PET methods such as LoRA or prompt-tuning add only ≈0.2–0.4M parameters to models with ≈100–1000M backbone parameters (Du et al., 11 Dec 2024, Yu et al., 2022, Ding et al., 28 Nov 2024). Table 1, abridged from (Du et al., 11 Dec 2024), highlights typical parameter budgets and accuracy outcomes:

| Method | Extra Params | Top-1 Acc (FGVC, ViT-B/16) |
| --- | --- | --- |
| Full fine-tuning | 85.98M | 88.54% |
| ALoRE | 0.15M | 91.60% |
| LoRA | 0.44M | — |
| SSF | 0.39M | — |

Recent work demonstrates that PET outperforms or matches full fine-tuning in regimes with few-shot data, small memory, or domain shift, while dramatically improving training and memory efficiency (Du et al., 11 Dec 2024, Zhao et al., 2023, Zhang et al., 9 Jul 2024, Ding et al., 28 Nov 2024). Notably, extensions such as Dynamic Subset Tuning support per-step re-selection of “active” parameters under a strict budget with competitive or superior performance relative to LoRA or Prompt-Tuning, and smoothly interpolate over the full parameter-efficiency spectrum (Stahlberg et al., 13 Nov 2024).
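
As a rough illustration of per-step subset re-selection under a parameter budget, the sketch below masks all but the k largest-magnitude gradients before each optimizer step. The selection criterion and function name are assumptions for illustration; Dynamic Subset Tuning's actual selection rule may differ.

```python
# Hedged sketch: update only a dynamically chosen subset of k parameters per step.
import torch


def apply_dynamic_subset_mask(model: torch.nn.Module, k: int) -> None:
    grads = [p.grad.reshape(-1) for p in model.parameters() if p.grad is not None]
    flat = torch.cat(grads).abs()
    if k >= flat.numel():
        return                                           # budget covers every parameter
    threshold = flat.topk(k).values.min()                # k-th largest gradient magnitude
    for p in model.parameters():
        if p.grad is not None:
            p.grad.mul_((p.grad.abs() >= threshold).to(p.grad.dtype))


# Usage inside a training loop, before optimizer.step():
#   loss.backward()
#   apply_dynamic_subset_mask(model, k=budget)
#   optimizer.step()
```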

4. PET in Vision, Music, Multimodal, and Dense-Prediction Architectures

PET methods have been systematically extended from NLP to vision and multimodal domains:

  • Vision Transformers (ViT): Adapters (AdaptFormer), LoRA, prompt-tuning (VPT), and prefix-tuning variants (PATT, BAPAT) are applied at specific positions (attention, FFN, normalization) based on empirically measured positional sensitivity (Yu et al., 2022, Zhang et al., 9 Jul 2024, Du et al., 11 Dec 2024). For ConvNets, Conv-Adapter uses depthwise and pointwise bottlenecks to maintain spatial locality (Chen et al., 2022); a convolutional adapter sketch is given after this list.
  • Dense Prediction (Object Detection/Segmentation): PET can be deployed in dense-prediction pipelines such as Cascade Mask R-CNN and UPerNet, providing up to a 60% reduction in peak training memory and a 26% reduction in computation time (E³VA), with no more than a 0.5–1.5% drop relative to full fine-tuning (Yin et al., 2023).
  • Music Foundation Models: Adapters, LoRA, and prompt-based PET outperform both probing and full fine-tuning on auto-tagging, with an order of magnitude less training cost (Ding et al., 28 Nov 2024).
  • Vision-Language Models: PET enables rapid adaptation of large vision-language models (e.g., CLIP) for cross-modal retrieval by inserting multimodal adapters trained with contrastive objectives, achieving a >98% reduction in trainable parameters and significant improvements in retrieval metrics (Yuan et al., 2023).
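
A rough PyTorch sketch of a convolutional adapter in the spirit of Conv-Adapter follows: a depthwise convolution preserves spatial locality, and pointwise down/up projections form the bottleneck. Kernel size, widths, and the residual placement are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of a convolutional adapter added as a residual branch
# next to a frozen convolutional block.
import torch
import torch.nn as nn


class ConvAdapterBlock(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 16, kernel_size: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)  # pointwise down-projection
        self.act = nn.GELU()
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)    # pointwise up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adapter branch; the wrapped backbone convolution stays frozen.
        return x + self.up(self.act(self.down(self.depthwise(x))))


# Usage: adapter = ConvAdapterBlock(256); y = adapter(frozen_conv_block(x))
```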

5. Automated Search and Structural Optimization in PET

Discovering the optimal PET configuration is itself a nontrivial combinatorial optimization problem. S³PET formalizes this as a bi-level differentiable architecture search, introducing binary gates over candidate module positions with a continuous relaxation (the Binary Concrete distribution) so that strict parameter budgets can be respected during search (Hu et al., 2022). The search jointly optimizes module placement, type (LoRA, Adapter, Bias, LN-scale), and layer, generally producing structures that significantly outperform manually or randomly assigned PET configurations, especially under extremely tight budgets (as low as 0.01% of backbone parameters).
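
The gate below is a minimal sketch of the Binary Concrete (relaxed Bernoulli) relaxation used for differentiable on/off decisions over candidate PET positions; the class, temperature, and budget proxy are assumptions for illustration rather than the S³PET implementation.

```python
# Hedged sketch: a differentiable binary gate for one candidate PET position.
import torch
import torch.nn as nn


class ConcreteGate(nn.Module):
    def __init__(self, temperature: float = 0.5):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1))   # logit of the keep-probability
        self.temperature = temperature

    def forward(self) -> torch.Tensor:
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)     # Logistic(0, 1) sample
            return torch.sigmoid((self.log_alpha + noise) / self.temperature)
        return (self.log_alpha > 0).float()             # hard decision at evaluation time

    def expected_on(self) -> torch.Tensor:
        return torch.sigmoid(self.log_alpha)            # differentiable budget proxy


# A gated PET position computes h + gate() * pet_block(h); summing expected_on()
# over all gates gives a differentiable surrogate for the parameter budget.
```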

6. Theoretical Perspectives, Regularization, and Model Scaling

PET can be recast in optimal control terms: the backbone is seen as a dynamical system, PET parameters as controls, and learning minimizes a sum of end-state cost (e.g., task loss) and, sometimes, intermediate “running cost” terms (Chen et al., 2023). Recent advances use latent stochastic bridges as regularizers over hidden-state trajectories, producing performance and generalization gains across PET types.
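
Schematically, this optimal-control reading can be written as a terminal-cost plus running-cost objective over the layer-wise hidden-state trajectory; the notation below is an illustrative restatement, not the exact loss of (Chen et al., 2023).

```latex
% Hidden states h_l evolve through frozen layers f_l, modulated by PET controls.
\min_{\theta_{\mathrm{PET}}}\;
  \mathbb{E}_{(x,y)}\!\left[
    \underbrace{\mathcal{L}\bigl(h_L, y\bigr)}_{\text{terminal (task) cost}}
    \;+\; \sum_{l=0}^{L-1} \underbrace{R\bigl(h_l, \theta_{\mathrm{PET}}\bigr)}_{\text{running cost / regularizer}}
  \right],
\qquad
h_{l+1} = f_l\bigl(h_l;\, \Theta_l,\, \theta_{\mathrm{PET}}\bigr), \quad h_0 = x .
```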

Scaling laws for PET modules reveal that, as model size increases, sensitivity to parameter position disappears and the number of tunable parameters needed to reach full fine-tuning accuracy decreases. On large backbones (3B–11B parameters), randomly allocated or sparsely assigned PET blocks perform nearly as well as optimal placements, and phase transitions in adaptation occur at fixed threshold budgets (Su et al., 2023).

7. PET Extensions: Speed, Memory, Modularity, and Application Domains

PET’s parameter efficiency is increasingly complemented by techniques targeting inference/runtime efficiency and modularity:

  • Token Redundancy Reduction (FPET): PET modules equipped with plug-in token merging at mid-network layers cut both compute (FLOPs) and memory by up to 48% without compromising accuracy (Kim et al., 26 Mar 2025); a simplified merging step is sketched after this list.
  • Disentangled Approaches (SynQT): Factorization into task-specific query synthesis and frozen, query-only feature extractors enables training with drastically reduced memory, suitable for memory-constrained platforms (Zhang et al., 9 Jul 2024).
  • Few-shot Modular Tuning (PETapter): Modular PET-style heads plug atop existing PEFT backbones (e.g., Adapters, LoRA), maintaining interpretability, shareability, and competitive few-shot accuracy with minimal GPU utilization (Rieger et al., 6 Dec 2024).
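
The sketch below illustrates token redundancy reduction in a simplified form: the r most similar (even, odd) token pairs are averaged, shrinking the sequence before subsequent layers. It is an illustrative reduction step, not FPET's merging module; in particular, token order is not preserved here, and a real implementation would track positions.

```python
# Hedged sketch of a simple token-merging step (assumes an even sequence length).
import torch


def merge_similar_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    # x: (batch, seq, dim). Pair tokens at even/odd positions and merge the
    # r pairs with the highest cosine similarity by averaging them.
    a, b = x[:, 0::2], x[:, 1::2]                            # (B, S/2, D) each
    sim = torch.cosine_similarity(a, b, dim=-1)              # (B, S/2)
    merge_idx = sim.topk(r, dim=-1).indices                  # most redundant pairs
    keep_mask = torch.ones_like(sim, dtype=torch.bool)
    keep_mask.scatter_(1, merge_idx, False)

    idx = merge_idx[..., None].expand(-1, -1, x.size(-1))    # (B, r, D) gather index
    merged = 0.5 * (a.gather(1, idx) + b.gather(1, idx))
    n = sim.size(1)
    kept_a = a[keep_mask].view(x.size(0), n - r, x.size(-1))
    kept_b = b[keep_mask].view(x.size(0), n - r, x.size(-1))
    return torch.cat([kept_a, kept_b, merged], dim=1)        # (B, S - r, D)
```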

References

  • SAPT: "SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of LLMs" (Zhao et al., 16 Jan 2024)
  • ALoRE: "ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts" (Du et al., 11 Dec 2024)
  • LAE: "A Unified Continual Learning Framework with General Parameter-Efficient Tuning" (Gao et al., 2023)
  • FPET: "Faster Parameter-Efficient Tuning with Token Redundancy Reduction" (Kim et al., 26 Mar 2025)
  • S³PET: "Sparse Structure Search for Parameter-Efficient Tuning" (Hu et al., 2022)
  • SAFE: "SAFE: Slow and Fast Parameter-Efficient Tuning for Continual Learning with Pre-Trained Models" (Zhao et al., 4 Nov 2024)
  • Conv-Adapter: "Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets" (Chen et al., 2022)
  • SDE regularizers: "Stochastic Bridges as Effective Regularizers for Parameter-Efficient Tuning" (Chen et al., 2023)
  • Model scaling: "Exploring the Impact of Model Scaling on Parameter-Efficient Tuning" (Su et al., 2023)
  • HiDe-PET: "HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning" (Wang et al., 7 Jul 2024)
  • Dynamic Subset Tuning: "Dynamic Subset Tuning: Expanding the Operational Range of Parameter-Efficient Training for LLMs" (Stahlberg et al., 13 Nov 2024)
  • PETapter: "PETapter: Leveraging PET-style classification heads for modular few-shot parameter-efficient fine-tuning" (Rieger et al., 6 Dec 2024)
  • SynQT: "Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach" (Zhang et al., 9 Jul 2024)