
Parameter-Efficient Finetuning

Updated 3 January 2026
  • Parameter-efficient finetuning is a strategy that adapts large pre-trained models by updating only a small subset of parameters or adding lightweight modules, preserving most original weights.
  • It distinguishes between addition-based methods (e.g., adapters, LoRA, weighted sums) and selection-based approaches (e.g., GPS, FPS, magnitude-based masks), each tuned to balance efficiency and performance.
  • This approach offers significant reductions in computational and storage demands while maintaining or even surpassing full-model finetuning accuracy across tasks in NLP, vision, speech, and medical imaging.

Parameter-efficient finetuning (PEFT) encompasses a class of methods that adapt large pre-trained models to downstream tasks by updating only a small subset of parameters or introducing lightweight modules, while keeping the majority of the original model weights frozen. PEFT was developed in direct response to the computational, storage, and generalization limitations of full-model finetuning, particularly in resource-constrained scenarios and domains where labeled data is limited. PEFT methods are now established as state-of-the-art approaches for transfer learning across domains such as natural language processing, vision, speech, and medical imaging, and have catalyzed both theoretical and practical advancements in efficient adaptation of foundation models.

1. Core Methodologies and Architectures

PEFT includes both addition-based and selection-based approaches. Addition-based methods introduce new trainable modules (adapters, low-rank factors, or prompt embeddings), while selection-based methods fine-tune a task-dependent or task-agnostic sparse subset of existing parameters.

Major Addition-Based Approaches

  • Adapters (Series/Parallel): Compact bottleneck MLP modules inserted after each Transformer sublayer. Update rule: $h' = h + U\,\phi(V h)$, typically with a down-projection $V \in \mathbb{R}^{m \times d}$, a nonlinearity $\phi$, and an up-projection $U \in \mathbb{R}^{d \times m}$. Only $U, V$ are trained (Lashkarashvili et al., 2024). A minimal implementation sketch follows this list.
  • Low-Rank Adaptation (LoRA): Augments selected weight matrices with a rank-$r$ trainable update $W' = W + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. Applied to attention and feed-forward blocks; only $A, B$ are trained (Lashkarashvili et al., 2024).
  • Weighted Sum (WS): Learns one scalar per layer and aggregates layer outputs as $h_{\mathrm{out}} = \sum_{l=1}^L \beta_l h^{(l)}$, where each $\beta_l$ is a softmax-normalized trainable scalar (Lashkarashvili et al., 2024).
  • Gate/Mask-based (Weight Gating, PEFT masks): Learns per-feature or per-weight scalars that gate the output, e.g., $h' = \sigma(g) \odot h$ (Lashkarashvili et al., 2024).
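
A minimal PyTorch sketch of the series-adapter and LoRA updates defined above; the class names, default sizes, and initialization choices are illustrative assumptions rather than a reference implementation from the cited works.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Series adapter: h' = h + U * phi(V h); only the adapter weights are trained."""
    def __init__(self, d_model: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # V: d -> m
        self.act = nn.ReLU()                        # phi
        self.up = nn.Linear(bottleneck, d_model)    # U: m -> d

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual bottleneck update


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable rank-r update W' = W + B A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

In both cases only the small added modules receive gradients, so optimizer state and per-task checkpoints scale with the adapter/LoRA parameters rather than with the base model.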

Major Selection-Based Approaches

  • Gradient-based Parameter Selection (GPS): Ranks parameters by task-specific gradient magnitude, selects the top-$k$ for unstructured or neuron-level fine-tuning, and freezes the rest (Zhang et al., 2023).
  • Feedforward-based Parameter Selection (FPS): Ranks weights by $|w| \cdot |a|$ (weight magnitude times input-activation magnitude, averaged over data) using only forward passes, then selects the top-$k$ for fine-tuning (Yang et al., 31 Oct 2025). A scoring sketch follows this list.
  • Magnitude-based/Task-agnostic Sparse Masks (PaFi): Selects the top-$k$ smallest-magnitude weights to update, with the mask shared across all tasks (Liao et al., 2023).
  • SAM (Second-order Approximation Method): Selects coordinates maximizing the surrogate gradient/Hessian ratio $(\nabla L_i)^2 / h_i$ from a second-order Taylor expansion (Fu et al., 2022).
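
A hedged sketch of forward-only, FPS-style scoring and global top-$k$ mask construction, assuming a PyTorch model whose nn.Linear layers are being ranked and a data loader yielding (input, label) batches; the hook logic, number of calibration batches, and thresholding are simplifications of the cited procedure, not its exact algorithm.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fps_scores(model: nn.Module, data_loader, num_batches: int = 8, device: str = "cpu"):
    """Score each Linear weight by |w_ij| * |a_j|, using forward passes only."""
    act_sums, hooks = {}, []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            def hook(m, inp, out, key=name):
                # accumulate per-feature |a| (constant scale factors do not affect ranking)
                a = inp[0].abs().reshape(-1, inp[0].shape[-1]).mean(dim=0)
                act_sums[key] = act_sums.get(key, 0) + a
            hooks.append(mod.register_forward_hook(hook))
    for i, (x, _) in enumerate(data_loader):      # a few calibration batches suffice
        model(x.to(device))
        if i + 1 >= num_batches:
            break
    for h in hooks:
        h.remove()
    return {name: mod.weight.abs() * act_sums[name].unsqueeze(0)   # |w| * |a|, broadcast over rows
            for name, mod in model.named_modules()
            if isinstance(mod, nn.Linear) and name in act_sums}

def topk_mask(scores: dict, frac: float = 0.005):
    """Mark the globally top-scoring fraction of weights as trainable."""
    flat = torch.cat([s.flatten() for s in scores.values()])
    k = max(1, int(frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    return {name: s >= threshold for name, s in scores.items()}
```

The resulting boolean masks can then be enforced during fine-tuning, e.g., by zeroing the gradients of unselected weights after each backward pass, which leaves the model architecture and inference latency unchanged.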

These architectures may be composed (e.g., combinations of adapters, LoRA, and soft gating), or hybridized with frequency-domain/spectral methods (e.g., sDCTFT, CDVFT) for parameter reductions based on signal compaction (Shen et al., 2024, Ding et al., 1 May 2025).

2. Parameter and Computational Efficiency

PEFT methods are distinguished by the small fraction $r = \frac{\text{trainable}}{\text{total}}$ of parameters tuned per task, and by the reduction in memory, storage, and FLOPs compared to full finetuning (FT).

| Method | Extra Params (%) | Inference Latency | FLOPs Overhead | Notable Figures |
|---|---|---|---|---|
| Full FT | 100 | 1.0x | 1.0x | 90M (wav2vec2), 311M (HuBERT) (Lashkarashvili et al., 2024) |
| Bottleneck Adapter (BA) | 1.0–1.3 | 1.01x | + small | 1.2M/3.2M (Lashkarashvili et al., 2024) |
| LoRA | 1.1–1.4 | 1.0x | 1.0–1.2x | 1.3M/3.5M (Lashkarashvili et al., 2024) |
| WS+WG (only) | <0.01 | 1.0x | + negligible | 12–25 params (Lashkarashvili et al., 2024) |
| BA+LoRA+WS+WG | 2.2–2.8 | 1.01x | + small | 2.5M/6.7M (Lashkarashvili et al., 2024) |
| Sparse selection (GPS/FPS) | 0.25–0.77 | 1.0x | 1.0x | 0.25–0.77% of weights (Zhang et al., 2023, Yang et al., 31 Oct 2025) |
| LayerNorm-only | 0.015 | 1.0x | none | 51k/333M (BERT) (ValizadehAslani et al., 2024) |
| sDCTFT | $10^{-3}$–$10^{-2}$ | 1.01x | + DCT/iDCT | 50k/8B (LLaMA3.1-8B) (Shen et al., 2024) |
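
For reference, the trainable fraction $r$ used throughout the table can be checked directly on any PyTorch model, assuming the PEFT wrappers have already set requires_grad on the frozen weights:

```python
def trainable_fraction(model) -> float:
    """r = trainable / total parameter count, as tabulated above."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```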

On standard speech and language tasks, BA+LoRA+WS+WG matches or slightly exceeds full-FT accuracy at $r \leq 3\%$ (e.g., 71.88% vs. 68.53% on IEMOCAP for HuBERT), and selection-based methods (GPS, FPS) achieve up to $+3.2\%$ over full FT in vision (Lashkarashvili et al., 2024, Zhang et al., 2023, Yang et al., 31 Oct 2025). LayerNorm-only PEFT realizes near-full performance with an order of magnitude fewer parameters than BitFit (ValizadehAslani et al., 2024).

3. Empirical Performance Across Domains

PEFT methods have been rigorously evaluated in diverse domains:

  • Speech Emotion Recognition (SER): BA+LoRA+WS+WG yielded 3.4% better accuracy on HuBERT and 0.7% on wav2vec2 over full FT, with only 2–3% of parameters updated. For dimensional attributes (CCC), BA+LoRA matches or slightly trails FT for valence/arousal/dominance (Lashkarashvili et al., 2024).
  • NLP (GLUE, BERT): LayerNorm-only tuning achieves dev/test accuracy within 1–2 points of full FT on all metrics, and BitFit performs similarly with a slightly higher parameter budget (ValizadehAslani et al., 2024).
  • Vision (ViT, FGVC, VTAB-1k): Sparse selection-based GPS/FPS outperform both full FT and addition-based PEFT (e.g., Adapter, LoRA), yielding +3.24% and +1.11% mean accuracy gains on FGVC and VTAB-1k, respectively, while tuning $<1\%$ of parameters (Zhang et al., 2023, Yang et al., 31 Oct 2025).
  • Domain Adaptation for SER: Two-stage PEFT (Stage 1: acted data; Stage 2: natural data with BA/LoRA frozen) mitigates catastrophic forgetting and dramatically improves natural-domain accuracy (e.g., IEMOCAP improvised accuracy rises from 48.4% to 58.9–73.3% while retaining source accuracy) (Lashkarashvili et al., 2024).
  • Parameter Budget Sensitivity: PEFT performance typically saturates at $r \approx 2\text{–}3\%$ for medium-sized models and tasks; further parameter increases yield diminishing returns or promote overfitting. Gating modules (WS, WG) can still provide measurable accuracy gains at $r \ll 0.01\%$ in resource-constrained settings (Lashkarashvili et al., 2024).

4. Theoretical Foundations and Selection Criteria

Multiple works interpret PEFT as a form of sparse regularization, trading off expressivity with algorithmic stability and generalization (Fu et al., 2022):

  • Sparsity as regularization: Sparsifying updates imposes a quadratic $\ell_2$ penalty on the frozen coordinates, bounding pointwise hypothesis stability. Theorem: for $p$-sparsity, stability increases as $p \rightarrow 0$, enhancing generalization up to the underfitting threshold.
  • Parameter selection: Rule-based PEFT (BitFit, adapters, LoRA) is agnostic to task data. Projection-based and data-adaptive methods (GPS, SAM, FPS) select coordinates by task-specific gradients, Fisher information, or activation–weight products:
    • GPS: selects the highest-magnitude gradient entries computed from a proxy loss (Zhang et al., 2023).
    • FPS: forward-only; selects the highest $|w| \cdot |a|$ scores (Yang et al., 31 Oct 2025).
    • SAM: selects entries maximizing $(\nabla L_i)^2 / h_i$, where $h_i$ is the diagonal Hessian entry (Fu et al., 2022); a short derivation follows this list.
    • Fisher LayerNorm: Fisher-information-based pruning of LayerNorm parameters achieves negligible loss at extreme sparsity (ValizadehAslani et al., 2024).
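
To make the SAM criterion above concrete, a brief derivation under the diagonal-Hessian approximation used in the cited work (a sketch, with notation matching the bullet above):

```latex
% Diagonal second-order Taylor expansion around the pretrained weights w_0:
L(w_0 + \Delta) \;\approx\; L(w_0) + \sum_i \nabla L_i\,\Delta_i + \tfrac{1}{2}\sum_i h_i\,\Delta_i^2 .
% Optimizing a single coordinate i in isolation gives \Delta_i^\ast = -\nabla L_i / h_i,
% with resulting loss change
\Delta L_i \;=\; -\,\frac{(\nabla L_i)^2}{2\,h_i},
% so ranking coordinates by (\nabla L_i)^2 / h_i selects those offering the
% largest estimated loss reduction per tuned parameter.
```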

Empirically, selection-based PEFT approaches often outperform or match full FT while offering a direct trade-off between efficiency and expressivity.

5. Specialized Adaptation and Domain/Task Generalization

PEFT is extensible to transfer settings and underpins state-of-the-art stabilization and adaptation recipes:

  • Domain adaptation: Two-stage learning (Stage 1: adaptation to source-rich domain; Stage 2: partial update on target, freezing "core" PEFT modules like LoRA/BA) preserves source knowledge while injecting target signals. This strategy prevents catastrophic forgetting and is validated on cross-corpus SER (Lashkarashvili et al., 2024).
  • Multilingual/continual adaptation: Adapter-based PEFT with parameter-level learning rate scaling (e.g., LAFT-URIEL, which uses URIEL-based distances for adaptive LR) increases multilingual positive transfer while reducing catastrophic forgetting in continual learning (Badola et al., 2022).
  • Instruction tuning: In LLMs, BA+LoRA+WS+WG or LayerNorm-only tuning is sufficient to support broad transfer and memorization, though performance on complex reasoning tasks may require higher parameter budgets or alternative strategies (e.g., sDCTFT, spectral PEFT) (He, 2024, Shen et al., 2024).
  • Edge devices and CNNs: Adapter-based PEFTs (e.g., LoRA, DoRA) port to convolutional models, saving up to 95% of update FLOPs on edge-optimized architectures despite only achieving half the memory gains observed in Transformer LLMs (Slamanig et al., 31 Jul 2025).

6. Practical Guidelines and Implementation Considerations

A robust set of implementation best practices and domain-specific recommendations has emerged:

  • Default PEFT Configurations: For speech and language, use BA+LoRA+WS+WG as the default; if memory is tight, WS+WG alone ($\ll 0.01\%$ of parameters) still provides significant gains (Lashkarashvili et al., 2024).
  • Selection-based Efficiency: When possible, favor data-driven subnetwork selection for minimal storage, latency, and full task adaptation (Yang et al., 31 Oct 2025, Zhang et al., 2023).
  • Adapter placement & size: For Transformers, place adapters after feed-forward blocks; select the bottleneck rank or hidden size to match memory constraints (common ranks: 4–16; BA/LoRA hidden size $m \ll d$) (Lashkarashvili et al., 2024).
  • Parameter freezing in adaptation: In domain or task transfer (e.g., acted-to-natural emotion), freeze the core capacity (e.g., BA+LoRA) during the adaptation stage to retain source knowledge (Lashkarashvili et al., 2024); see the sketch after this list.
  • Hyperparameters: Learning rates for PEFT modules can often be increased roughly 10× over full FT due to the lower risk of overfitting; use early stopping to prevent under- or over-training (ValizadehAslani et al., 2024).
  • When to prefer selection-based: For federated scenarios, on-device requirements, or when inference latency must be strictly unchanged, selection-based PEFT (e.g., GPS, FPS, PaFi) is ideal.
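
A sketch of the stage-two freezing recipe referenced in the parameter-freezing guideline above, assuming PEFT parameters were registered under names containing "lora", "adapter", "gate", or "layer_weight"; this name-matching convention is hypothetical and must be adapted to how modules are actually registered in a given codebase.

```python
import torch.nn as nn

def enter_stage_two(model: nn.Module) -> None:
    """Stage-2 adaptation: freeze core PEFT capacity (BA/LoRA) and keep only
    the lightweight gating / weighted-sum scalars trainable."""
    for name, param in model.named_parameters():
        lowered = name.lower()
        if "lora" in lowered or "adapter" in lowered:
            param.requires_grad = False   # retain source-domain knowledge
        elif "gate" in lowered or "layer_weight" in lowered:
            param.requires_grad = True    # small budget for target-domain signal
```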

7. Open Challenges and Emerging Directions

Open questions remain concerning:

  • Task/domain-agnostic PEFT: Unified masks or adapter placements that achieve robust transfer across tasks/domains.
  • Subnetwork sharing and multi-task PEFT: How to optimally select overlapping parameter subsets for multitask or continual adaptation.
  • Interaction with quantization/compression: Combining PEFT with QAT or low-bit inference, especially for edge or federated deployments.
  • Theoretical bounds on expressivity: Precise characterization of the minimal parameter budget required per domain/task to match full FT.
  • Automated architecture search and scheduling: AutoPEFT pipelines to select PEFT primitives, parameter counts, and update schedules based on validation curves or layer sensitivities (Chen et al., 2023, He et al., 2023).

Across domains, PEFT now represents the state-of-the-art approach for scalable, accurate adaptation of foundation models, offering a principled trade-off between resource constraints and adaptation fidelity (Lashkarashvili et al., 2024, Zhang et al., 2023, ValizadehAslani et al., 2024, Yang et al., 31 Oct 2025, Balne et al., 2024).
