
Additive PEFT: Modular Fine-Tuning for Large Models

Updated 17 January 2026
  • Additive PEFT is a modular adaptation method that appends lightweight, trainable modules to a fixed large model for efficient task specialization.
  • It leverages techniques such as Adapters, LoRA, and soft prompts to achieve minimal training cost while preserving the backbone's performance.
  • This approach supports scalable deployment across language, vision, and speech domains by reducing both compute and memory overhead.

Additive parameter-efficient fine-tuning (PEFT) comprises a family of techniques for adapting large pre-trained models by introducing small, trainable modules whose outputs are added to the model's main computational graph. In contrast to sparse or selective tuning (where only a subset of existing weights is updated) and to reparameterization methods, additive PEFT leaves the backbone parameters entirely frozen and confines task adaptation to lightweight, task-specific additions. This natively supports modularity, efficient memory usage, and minimal training/inference overhead, enabling scalable and sustainable deployment of large models across tasks, languages, and domains.

1. Core Definitions and Taxonomy

Additive PEFT refers to approaches in which auxiliary modules or parameter blocks are introduced alongside the pre-trained model, with the original backbone weights $W_0$ left unchanged. These modules are trained to produce, for each forward pass, an additional term $\Delta h$ or $\Delta W$ that is added into the corresponding residual stream or parameter tensor of the network. The resulting architecture is typically of the form

$$\text{Output} = \text{Backbone}(x) + \text{AdditiveModule}(x,\ \theta_\Delta)$$

where only $\theta_\Delta$ is trainable (Prottasha et al., 19 Apr 2025, Sabry et al., 2023).
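
The frozen-backbone constraint is straightforward to express in code. The following PyTorch sketch (module names, dimensions, and the toy components are illustrative assumptions, not drawn from any cited paper) freezes the backbone and optimizes only the additive module whose output is summed with the backbone's.

```python
import torch
import torch.nn as nn

class AdditivePEFTModel(nn.Module):
    """Illustrative wrapper: frozen backbone plus a trainable additive module."""

    def __init__(self, backbone: nn.Module, additive_module: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.additive_module = additive_module
        # Freeze the backbone: only theta_Delta (the additive module) receives gradients.
        for p in self.backbone.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = Backbone(x) + AdditiveModule(x, theta_Delta)
        return self.backbone(x) + self.additive_module(x)

# Toy instantiation (shapes are arbitrary).
backbone = nn.Linear(768, 768)
bottleneck = nn.Sequential(nn.Linear(768, 16), nn.ReLU(), nn.Linear(16, 768))
model = AdditivePEFTModel(backbone, bottleneck)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # only the additive parameters are optimized
```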

Common subtypes include:

  • Serial and parallel adapters inserted at the attention or FFN sub-layers.
  • Low-rank adaptation (LoRA) of weight matrices.
  • Soft prompt and prefix tuning at the input or within self-attention layers.
  • Direct weight-update adapters such as HiWi and physics-informed adapters such as MoPPA.

This taxonomy is structurally characterized in PEFT-Ref’s modular reference architecture, where additive modules are instantiated either at the embedding, attention, or FFN sub-layer, always with the backbone frozen (Sabry et al., 2023).

2. Mathematical Formulation and Module Structure

Standard additive PEFT modules instantiate parameter addition at the sub-layer or parameter level, with mathematical realizations as follows:

  • Adapters (serial):

$$\mathrm{Adapter}(x) = x + B\,\sigma(A\,x)$$

where $A \in \mathbb{R}^{r \times d}$, $B \in \mathbb{R}^{d \times r}$, and $\sigma$ is a nonlinear activation (Xue et al., 5 Apr 2025, Prottasha et al., 19 Apr 2025).

  • LoRA (Low-Rank Adaptation):

$$W = W_0 + \Delta W, \qquad \Delta W = B A$$

with $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d, k)$. Only $A$ and $B$ are trainable (Xue et al., 5 Apr 2025, Prottasha et al., 19 Apr 2025, Saha et al., 1 Jan 2026).

  • Soft Prompting:

$$\text{Input} = [P;\ E(x)]$$

where $P \in \mathbb{R}^{l_p \times d}$ is a learned block of input embeddings prepended to the token embeddings $E(x)$ (Prottasha et al., 19 Apr 2025, Sabry et al., 2023).

  • HiWi Adapters (direct weight update):

$$W' = W + \Delta W, \quad \Delta W = f(W\, W_{\downarrow})\, W_{\uparrow}$$

where $W_{\downarrow} \in \mathbb{R}^{d \times r}$, $W_{\uparrow} \in \mathbb{R}^{r \times d}$, and $f$ is a point-wise nonlinearity (Liao et al., 2023).

  • Physics-based Additive Adapter (MoPPA):

$$W_{\mathrm{MoPPA}} = \sum_{i=1}^{3} \alpha_i W_i, \quad W_i = \mathrm{DCT}^{-1}\big(g_i(\omega) \odot \mathrm{DCT}(W_0)\big)$$

with $g_i$ derived from heat, wave, and Poisson equations, and $\alpha_i$ learned mixture weights (Wang et al., 2024).

Adapters can be serial or parallel; LoRA modifies attention and FFN projections; prompt/prefix tuning operates at the input or within self-attention layers.
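
For concreteness, the sketch below implements the serial adapter, LoRA, and soft prompt formulations above in PyTorch. Ranks, dimensions, scaling, and initialization choices are illustrative assumptions rather than settings from the cited papers.

```python
import torch
import torch.nn as nn

class SerialAdapter(nn.Module):
    """Adapter(x) = x + B * sigma(A * x), with a rank-r bottleneck."""
    def __init__(self, d: int, r: int = 16):
        super().__init__()
        self.down = nn.Linear(d, r)     # A : d -> r
        self.up = nn.Linear(r, d)       # B : r -> d
        self.act = nn.ReLU()
        nn.init.zeros_(self.up.weight)  # zero-init the up-projection: adapter starts as the identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class LoRALinear(nn.Module):
    """y = x W0^T + scale * x (B A)^T, with W0 frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d_out, r))        # B in R^{d x r}; zero-init => Delta W = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

class SoftPrompt(nn.Module):
    """Prepends l_p learned prompt vectors P to the token embeddings E(x)."""
    def __init__(self, l_p: int, d: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(l_p, d) * 0.02)

    def forward(self, embeds: torch.Tensor) -> torch.Tensor:  # embeds: (batch, seq, d)
        prompt = self.P.unsqueeze(0).expand(embeds.size(0), -1, -1)
        return torch.cat([prompt, embeds], dim=1)             # Input = [P; E(x)]
```

In a transformer, SerialAdapter would typically sit after an attention or FFN sub-layer, LoRALinear would wrap the frozen query/value or FFN projections, and SoftPrompt would act on the embedding output before the first block.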

3. Comparative Properties and Trade-Offs

Empirical and architectural studies report several key properties and trade-offs for additive PEFT:

| Method | Typical trainable param % | Latency change | Expressivity / performance |
|---|---|---|---|
| Serial Adapter | 2–6% | +10–15% | ≲1 pt drop vs. full FT, sometimes a slight gain |
| LoRA | 0.5–1% | ≈0% (mergeable) | ≲1 pt drop; matches full FT on GLUE |
| Soft Prompt | 0.1–0.5% | 0% | 1–5 pt lower, especially in classification |
| HiWi (bias) | 0.03% | 0% | ≈ full FT when combined with PaFi |
| MoPPA | ≈0.2–0.3% | +DCT/IDCT cost | surpasses LoRA by 2.1–7.6 pts on VTAB |
| Prefix Tuning | 0.5–1% | +attention overhead | weaker on small tasks, strong for generation* |

*See (Prottasha et al., 19 Apr 2025, Sabry et al., 2023, Liao et al., 2023, Wang et al., 2024).

Adapters and LoRA dominate for classification and moderate-resource settings. Soft/prefix prompting is especially effective in low-resource settings and for sequence generation. Physics-based adapters (e.g., MoPPA) yield strong improvements in high-frequency or high-rank domains (Wang et al., 2024).

Compute and memory savings derive from (a) the small number of parameters that receive gradients and optimizer state, and (b) negligible inference overhead, especially when additive modules can be merged into the backbone post-training (e.g., LoRA, HiWi). Serial adapters add some latency because of the extra MLP in each layer, whereas LoRA's low-rank addition is parallelizable and incurs minimal runtime cost.
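
Both effects can be made concrete with a small sketch: the rank-$r$ factors account for only a small fraction of a layer's parameters, and the learned delta can be folded into the frozen weight after training so inference runs a plain linear layer. The `merge_lora` helper and the numbers below are illustrative assumptions, not a specific library API.

```python
import torch
import torch.nn as nn

def merge_lora(base: nn.Linear, A: torch.Tensor, B: torch.Tensor, scale: float = 1.0) -> nn.Linear:
    """Fold Delta W = B A into the frozen weight (merge-and-forget)."""
    merged = nn.Linear(base.in_features, base.out_features, bias=base.bias is not None)
    with torch.no_grad():
        merged.weight.copy_(base.weight + scale * (B @ A))   # W = W0 + Delta W
        if base.bias is not None:
            merged.bias.copy_(base.bias)
    return merged

# Parameter accounting for an illustrative 768 x 768 projection at rank r = 8.
d, r = 768, 8
base = nn.Linear(d, d)
A = torch.randn(r, d) * 0.01        # trainable during adaptation; plain tensors here for brevity
B = torch.zeros(d, r)
added = A.numel() + B.numel()       # 2 * d * r = 12,288 added parameters
frozen = sum(p.numel() for p in base.parameters())
print(f"trainable fraction: {added / (added + frozen):.2%}")  # about 2% for this layer
merged = merge_lora(base, A, B)     # no extra module remains at inference time
```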

4. Method-Specific Extensions and Optimizations

Several advanced additive PEFT methods implement further parameter, compute, or training optimizations:

  • FISH-Tuning: Augments additive PEFT (LoRA/Adapter) by measuring Fisher information for all newly added parameters, updating only the most informative (top-k) subset as determined by the diagonal empirical Fisher matrix. This yields task accuracy gains even at extreme sparsity (e.g., LoRA-FISH with 0.0099% trainable parameters scores 1.8 points above vanilla LoRA) (Xue et al., 5 Apr 2025); a minimal masking sketch appears below.
  • GRIT (Geometry-Aware Additive PEFT): Applies K-FAC natural gradient preconditioning, periodic reprojection onto Fisher eigendirections, and dynamic rank adaptation on LoRA modules. Empirically, GRIT achieves the same performance as LoRA while reducing the trainable parameter count by up to 80% in some tasks, with lower update drift and improved retention (Saha et al., 1 Jan 2026).
  • HiWi (Direct Weight Adapters): Adapts weight matrices directly; after training, only the "delta" is stored, yielding a per-task adaptation at negligible inference cost and storage (0.03% for bias-level HiWi) (Liao et al., 2023).
  • MoPPA (Mixture of Physical Priors Adapter): Blends three interpretable DCT-domain physical priors (heat, wave, Poisson) with a dynamic routing regularization. Achieves higher accuracy than LoRA and SPT-LoRA on vision transfer and classification tasks, with only 0.26M parameters (Wang et al., 2024).

These techniques build on additive PEFT's modularity and further reduce the overhead of updating every added parameter uniformly, supporting per-layer or per-task sparsity and target-specific flexibility.
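
The Fisher-guided selection idea can be sketched as follows, under simplifying assumptions (a single mini-batch loss, squared gradients as the diagonal empirical Fisher proxy); this is not the FISH-Tuning reference implementation.

```python
import torch

def fisher_topk_masks(loss: torch.Tensor, peft_params: list, keep_ratio: float = 0.1) -> list:
    """Score newly added PEFT parameters with squared gradients (a diagonal empirical
    Fisher proxy) and build binary masks keeping only the top-k most informative entries."""
    grads = torch.autograd.grad(loss, peft_params)
    scores = torch.cat([g.pow(2).flatten() for g in grads])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()
    return [(g.pow(2) >= threshold).float() for g in grads]

# During training, the masks zero out gradients of the unselected entries, e.g.:
#   for p, mask in zip(peft_params, masks):
#       p.grad.mul_(mask)
```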

5. Application Domains and Empirical Performance

Additive PEFT methods are now pervasive across language, vision, and speech domains.

  • Language: On GLUE tasks, Adapter and LoRA techniques come within 0.8–1% of full fine-tuning accuracy on BERT/RoBERTa backbones while training only 0.5–6% as many parameters (Prottasha et al., 19 Apr 2025, Xue et al., 5 Apr 2025, Liao et al., 2023).
  • Multilingual ASR: For language extension, additive adapters placed in specific encoder layers surpass LoRA and mask-based methods: on 5 new languages, adapters yield an average word error rate (WER) of 17.3% vs. 34.2% for LoRA, at roughly 11M extra parameters per language. Prompts or bias-only adapters are ineffective for ASR (Liu et al., 2024).
  • Vision: MoPPA adapters on ViT-B deliver a 7.6-point average gain over LoRA at similar parameter budget on VTAB-1K; dynamic combination of PDE-inspired priors enables high-rank/frequency adaptation, outperforming previous low-rank and SPT-LoRA baselines (Wang et al., 2024).
  • Federated and Continual Learning: Additive PEFT’s modularity and the task-agnostic mask generation in PaFi support multi-task scenarios, yielding a new state of the art for transfer, continual, and federated adaptation (Liao et al., 2023).

6. Open Challenges, Recommendations, and Best Practices

Principal challenges in additive PEFT involve:

  • Module Placement: Determining which layers to augment, and which tasks require which module forms (serial/parallel adapter, LoRA, prompt, etc.) (Sabry et al., 2023, Prottasha et al., 19 Apr 2025).
  • Rank/Bottleneck Selection: Optimal values of the rank $r$ (in LoRA/adapters) are data- and architecture-dependent; dynamic rank adaptation (as in GRIT) is one solution (Saha et al., 1 Jan 2026).
  • Initialization and Stability: Zero-initialization of low-rank or up-projection matrices and careful learning rate control enhance stability (Prottasha et al., 19 Apr 2025).
  • Overfitting in Low-Data Settings: Additive modules (Adapters, LoRA) regularize parameter updates and generalize better in small data regimes compared to prompt tuning.
  • Interaction with Gating/Hybrid Methods: Introducing multiplicative gates can destabilize sparsity-based improvements (e.g., FISH with UniPELT) (Xue et al., 5 Apr 2025).
  • Expressivity in High-Rank/High-Frequency Domains: Pure low-rank methods may underperform; domain-informed additive modules (e.g., MoPPA) address this (Wang et al., 2024).

Empirically, recommendations include leveraging additive adapters (serial/parallel) or low-rank updaters (LoRA/GRIT) for classification or high-resource tasks, physics-inspired adapters for high-frequency vision domains, and Fisher/geometry-aware sparsification to further reduce update budgets without loss in accuracy or convergence speed. At inference time, "merge-and-forget" strategies (as in HiWi, LoRA, GRIT) eliminate any runtime latency or RAM penalty.

7. Future Directions and Theoretical Perspectives

Additive PEFT continues to be extended along several vectors:

  • Automated Module Placement and Rank Selection: Automated methods for determining the optimal additive module location and size remain an open area (Prottasha et al., 19 Apr 2025).
  • Federated, Multi-lingual, and Continual Learning: Fully modular additive PEFT architectures (e.g., PELE) support new-task and new-language adaptation, with evidence that adapters are the most robust for cross-lingual and low-resource continual cases (Liu et al., 2024).
  • Domain- and Modality-Informed Additive Modules: Theoretical and empirical motivation for domain-specific additive PEFT (e.g., physics-based priors, dynamic routing) is growing (Wang et al., 2024).
  • Robustness and Theoretical Guarantees: Understanding why additive adaptation subspaces suffice (optimal LoRA rank, expressivity of MLP adapters) is a focus of ongoing theoretical work (Prottasha et al., 19 Apr 2025, Saha et al., 1 Jan 2026).
  • Efficient Sparsification and Compression: Fisher-guided or curvature-aware pruning of new parameters (FISH, GRIT) is increasing trainable parameter efficiency and reducing drift (Xue et al., 5 Apr 2025, Saha et al., 1 Jan 2026).

Additive PEFT constitutes a principled, practical, and highly adaptable mechanism for adapting large-scale pre-trained models, supporting diverse task configurations with strict efficiency guarantees and compelling empirical performance. Its modularity, extensibility, and rigorous mathematical underpinnings continue to drive advances in scalable model adaptation and deployment (Xue et al., 5 Apr 2025, Prottasha et al., 19 Apr 2025, Sabry et al., 2023, Liao et al., 2023, Wang et al., 2024, Saha et al., 1 Jan 2026, Liu et al., 2024).
