PEFT Modules: Efficient Fine-Tuning
- PEFT modules are specialized components that adapt large pre-trained models by updating only a small fraction of parameters, preserving downstream accuracy while sharply reducing compute and storage costs.
- They employ diverse strategies such as adapters, low-rank updates, soft prompts, and selective tuning to balance expressivity and minimal resource usage.
- Empirical benchmarks reveal that methods like LoRA and BitFit achieve near full fine-tuning performance with drastically reduced computational and storage overhead.
Parameter-Efficient Fine-Tuning (PEFT) modules are specialized, low-overhead components designed to adapt large pre-trained models to new tasks by updating or injecting only a small fraction of trainable parameters while keeping the base model weights frozen. The purpose of PEFT modules is to drastically reduce the computational, storage, and cross-task deployment costs associated with full-parameter fine-tuning, without sacrificing downstream accuracy or stability, especially on LLMs and other foundation model architectures (Belanec et al., 2 Dec 2025). The modern ecosystem of PEFT modules includes a diversity of strategies—adapters, low-rank updates, soft prompts, selective rescalers, and combinatorial variants—all engineered to provide an optimal trade-off between expressivity, parameter efficiency, task modularity, and system integration (Han et al., 21 Mar 2024, Sabry et al., 2023).
1. Taxonomy and Architectural Classes of PEFT Modules
PEFT modules are categorized principally according to their architectural locus (input, hidden, output), structural pattern (additive, reparameterization, selective), and functional mechanism:
- Additive modules insert small bottleneck MLPs (adapters), soft prompts, or prefix encodings. Adapters typically use a down–up projection sandwich in each transformer block, either after feed-forward or attention sub-layers (Han et al., 21 Mar 2024, Belanec et al., 2 Dec 2025).
- Reparameterization modules utilize low-rank decompositions such as LoRA, QLoRA, DoRA, PiSSA, and SVFT, targeting the principal linear transformations in each block by augmenting with compact update matrices or singular components (Belanec et al., 2 Dec 2025).
- Selective modules (BitFit, IA³, LNTuning) minimize parameter footprint by tuning only biases, LayerNorm parameters, or per-channel scaling vectors (Belanec et al., 2 Dec 2025, Sabry et al., 2023).
- Soft prompt modules (Prompt Tuning, Prefix Tuning, P-Tuning) introduce trainable embeddings at the token or attention-key level without changing model structure (Belanec et al., 2 Dec 2025, Sabry et al., 2023).
- Gradient-projection and orthogonality-based modules (GaLore, OFT) constrain the update direction via subspace projection or orthogonal transformations.
A comprehensive modular reference summarizing these variants is provided by "PEFT-Ref" (Sabry et al., 2023).
| Module Type | Core Update Mechanism | Typical Overhead |
|---|---|---|
| LoRA | $W' = W_0 + BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times d}$ | $2dr$ per targeted $d \times d$ matrix |
| Adapter | $h' = h + W_{\text{up}}\,\sigma(W_{\text{down}} h)$ bottleneck after sub-layers | $2dr$ per adapter |
| Prompt/Prefix | Prepend trainable prompt/prefix vectors | $l \cdot d$ (Prompt), $2\,l \cdot d$ per layer (Prefix) |
| BitFit | Tune only the bias $b$ in each layer | $d$ per layer |
| IA³ | Learned vectors $l_k$, $l_v$, $l_{\text{ff}}$ rescale attention and FFN activations | Small per layer (a few thousand) |
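As a rough illustration of how these overheads compare, the following minimal sketch counts trainable parameters for a single $d \times d$ projection under a few of the methods above. The hidden size, rank, and prompt length are illustrative choices, not tied to any specific model in the cited benchmarks.

```python
# Illustrative parameter counts for one d x d projection matrix
# (hypothetical dimensions; real overhead depends on which layers are targeted).
d = 4096      # hidden size of the frozen projection
r = 8         # LoRA / adapter bottleneck rank
l = 20        # number of soft prompt vectors

full_ft = d * d                # full fine-tuning of the projection
lora    = 2 * d * r            # B (d x r) + A (r x d)
adapter = 2 * d * r + d + r    # down/up projections plus biases
prompt  = l * d                # l prompt embeddings at the input
bitfit  = d                    # bias vector only

for name, n in [("full", full_ft), ("LoRA", lora), ("adapter", adapter),
                ("prompt", prompt), ("BitFit", bitfit)]:
    print(f"{name:8s} {n:>12,d} params ({100 * n / full_ft:.3f}% of full)")
```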
2. Canonical PEFT Modules and Update Equations
Reparameterization-Based Modules
- LoRA:
$W' = W_0 + \Delta W = W_0 + BA$, with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ and $r \ll \min(d, k)$ (Belanec et al., 2 Dec 2025). A minimal implementation sketch follows after this list.
- QLoRA:
Same update as LoRA, but the frozen base weights are stored in 4-bit NF4 precision and gradients are backpropagated through the quantized copy, with paged optimizers handling memory spikes.
- DoRA:
Decomposes $W_0$ into magnitude and direction and applies LoRA to the normalized direction: $W' = m \cdot \dfrac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$, where $m$ is a trainable per-column magnitude vector.
- OFT (Orthogonal Fine-Tuning):
Maintains an orthogonality constraint on the weight update via the Cayley transform: $W' = R\,W_0$, with $R = (I + Q)(I - Q)^{-1}$, where $Q$ is skew-symmetric (Belanec et al., 2 Dec 2025).
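As a concrete reference for the LoRA update above, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The $\alpha/r$ scaling and the zero-initialized $B$ follow the common LoRA recipe; the class name and hyperparameter defaults are illustrative, not drawn from any of the cited frameworks.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer W0 plus a trainable low-rank update B @ A (minimal sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():     # keep W0 (and bias) frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init => no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W0^T + b + scaling * x A^T B^T, equivalent to W' = W0 + (alpha/r) * B A
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing projection and train only A and B.
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
y = layer(torch.randn(2, 512))
```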
Adapter-Based Modules
- Bottleneck Adapter:
$h' = h + W_{\text{up}}\,\sigma(W_{\text{down}} h)$, applied after attention or FFN sublayers.
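A minimal sketch of the residual down–up bottleneck described above follows; the placement (after attention or FFN) and the choice of GELU nonlinearity and bottleneck width are illustrative design choices, not prescribed by the cited work.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """h' = h + W_up * GELU(W_down * h): residual bottleneck adapter (minimal sketch)."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # near-identity behavior at initialization
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))

# Typically inserted after the attention or feed-forward sub-layer of each block.
adapter = BottleneckAdapter(d_model=768, bottleneck=64)
out = adapter(torch.randn(2, 16, 768))
```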
Selective PEFT
- BitFit:
Only biases are trainable (Belanec et al., 2 Dec 2025).
- IA³:
Scales attention and FFN activations with learned vectors: attention becomes $\text{softmax}\!\left(\frac{Q (l_k \odot K)^\top}{\sqrt{d_k}}\right)(l_v \odot V)$, and the FFN output is rescaled as $(l_{\text{ff}} \odot \sigma(W_1 x))\,W_2$.
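The sketch below illustrates both selective mechanisms in plain PyTorch: a BitFit-style helper that leaves only bias terms trainable, and an IA³-style per-channel rescaling module initialized to the identity. Where such scales are inserted (keys, values, FFN activations) follows the descriptions above; the wiring and names here are illustrative.

```python
import torch
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> None:
    """BitFit-style selection: freeze everything, then re-enable only bias terms."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias") or name == "bias"

class IA3Scale(nn.Module):
    """IA3-style per-channel rescaling y = l * x, with l initialized to ones."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale   # broadcasts over batch and sequence dimensions

# Example: BitFit on a small MLP; only its biases receive gradients.
mlp = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
apply_bitfit(mlp)
print([n for n, p in mlp.named_parameters() if p.requires_grad])  # ['0.bias', '2.bias']
```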
Soft Prompting
- Prompt Tuning:
Prepends learnable embeddings as input tokens; these are input-side only.
- Prefix Tuning:
For each attention layer, prepends learned keys/values to the KV slots, enabling per-layer prompt control.
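A minimal prompt-tuning sketch follows. It shows only the input-side concatenation (prefix tuning would instead inject learned key/value vectors into every attention layer); the prompt length, dimensionality, and initialization scale are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Prompt tuning (input-side only): prepend l trainable embeddings to the token embeddings."""

    def __init__(self, prompt_len: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model) -> (batch, prompt_len + seq_len, d_model)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

# The frozen model consumes the concatenated sequence; only `prompt` is updated.
soft = SoftPrompt(prompt_len=20, d_model=768)
extended = soft(torch.randn(4, 32, 768))  # -> (4, 52, 768)
```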
3. Modular Design and Extensibility
Frameworks such as PEFT-Factory (Belanec et al., 2 Dec 2025) implement a modular, pluggable ecosystem where each PEFT method is encapsulated as a module with a configuration object and implementation class. This enables both uniform benchmarking and rapid extension:
- Plugin API: New methods integrate via a `PeftConfig` and `BaseTuner` pair, placed in a discovered subdirectory and registered dynamically.
- Argument Aggregation: Unified management of hyperparameters, seeds, and PEFT-specific tuning options.
- Experimentation: Supports >100 Transformer families for model diversity.
The plug-and-play nature of this architecture is a key enabler for reproducible, controlled PEFT research and benchmarking (Belanec et al., 2 Dec 2025).
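The `PeftConfig`/`BaseTuner` pairing mirrors the config-plus-tuner extension pattern of the Hugging Face peft library. As a hedged illustration of that general workflow (not of PEFT-Factory's own registration mechanics), the sketch below attaches a LoRA configuration to a causal LM; the checkpoint name and target module names are model-dependent assumptions.

```python
# Sketch of the config/tuner pattern used by Hugging Face `peft`, which the
# PeftConfig + BaseTuner plugin API described above extends.
# The checkpoint and target_modules below are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling numerator (alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # projections to augment (model-dependent)
)

model = get_peft_model(base, config)   # wraps the frozen base with LoRA modules
model.print_trainable_parameters()     # reports trainable vs. total parameters
```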
4. Practices for Composition and Combination
The recent literature demonstrates that PEFT modules can be algebraically composed to yield complex behaviors—distribution generalization, multitasking, unlearning, domain adaptation—through module-space operations such as addition, scaling, and negation (Zhang et al., 2023, Patel et al., 24 Jan 2025):
- Addition: $\Delta W_{\text{merge}} = \Delta W_1 + \Delta W_2 = B_1 A_1 + B_2 A_2$ (for LoRA)
- Negation: $\Delta W_{\text{neg}} = -\Delta W$, removing the corresponding skill from the merged model
Compositionality enables zero-cost transfer: new skills can be added by merging parameter deltas from independent modules, and unwanted skills can be "subtracted". Composed modules often outperform individual ones across distribution shifts, multitask, and adversarial transfer settings (Zhang et al., 2023). The approach generalizes over several architectures, including LoRA, IA³, and adapters (Patel et al., 24 Jan 2025).
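A minimal sketch of this module-space arithmetic, operating directly on the weight deltas of two trained modules, is shown below. It assumes both modules were trained on the same base model with matching parameter shapes; the helper name, toy tensors, and weighting coefficients are illustrative.

```python
import torch

def combine_deltas(delta_a: dict, delta_b: dict, w_a: float = 1.0, w_b: float = 1.0) -> dict:
    """Weighted addition of two PEFT weight deltas (negation = a weight of -1)."""
    assert delta_a.keys() == delta_b.keys(), "modules must target the same parameters"
    return {k: w_a * delta_a[k] + w_b * delta_b[k] for k in delta_a}

# Toy deltas standing in for merged LoRA updates (B @ A) from two task modules.
delta_task1 = {"layer.0.weight": torch.randn(64, 64) * 0.01}
delta_task2 = {"layer.0.weight": torch.randn(64, 64) * 0.01}

added   = combine_deltas(delta_task1, delta_task2)            # skill addition
negated = combine_deltas(delta_task1, delta_task2, w_b=-1.0)  # "subtract" task 2
```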
5. Empirical Benchmarks, Task Coverage, and Module Trade-Offs
PEFT benchmarking spans diverse classification, reasoning, math, code, and NLU tasks, with standardized metrics:
- Classification: Accuracy, F1
- Open/Generative tasks: Token accuracy, ROUGE, BLEU, CodeBLEU
- Efficiency: PSCP metric captures trade-offs among trainable parameter count, peak memory, and inference time (Belanec et al., 2 Dec 2025).
Empirical findings:
- BitFit, with ~0.01% parameter overhead, matches or exceeds more complex methods on basic classification and simple math.
- LoRA and variants (DoRA, PiSSA) reach full-finetune performance at 0.1–1% overhead for large-scale generative tasks.
- Soft prompt methods (Prefix Tuning, P-Tuning variants) are highly parameter-efficient and perform well in low-data/few-shot settings, but are less robust on structured reasoning and coreference-style tasks (e.g., WSC).
- Adapter networks provide a modular mid-ground for tasks requiring moderate parameterization (1–3% overhead) and support task separation (Belanec et al., 2 Dec 2025).
Experimental results (LLaMA-3.2-1B-Instruct):
| Method | SST-2 | CoLA | WSC | SVAMP |
|---|---|---|---|---|
| BitFit | 97.5 | 86.9 | 55.2 | 92.3 |
| IA³ | 95.3 | 85.3 | 3.6 | 84.1 |
| PrefixTuning | 96.3 | 88.8 | 0.8 | 91.4 |
BitFit demonstrates strong performance given its extremely low parameter cost. PrefixTuning is competitive except on certain structured reasoning tasks such as WSC (Belanec et al., 2 Dec 2025).
6. Best Practices, Guidelines, and Deployment Strategies
Based on large-scale cross-method studies (Belanec et al., 2 Dec 2025), key recommendations are:
- For simple tasks and single-token labels, start with BitFit for maximal efficiency.
- For generative, QA, or summarization tasks, favor LoRA and its variants to achieve near-full fine-tune performance with minimal trainable parameters.
- Use soft prompts (Prefix, P-Tuning) in few-shot or extremely low-data settings, but expect accuracy degradation on tasks requiring structured reasoning.
- Prefer adapter networks when a moderate parameter budget and task modularity are important.
- Continuously monitor the PSCP score (or equivalent) to achieve the desired accuracy–memory–inference time trade-off.
- Leverage unified configuration and plugin APIs to facilitate reproducible benchmarking, extensibility, and fair comparisons across new PEFT modules.
The PEFT-Factory ecosystem exemplifies these best practices, enabling single-config experimental control and fair evaluation over 27 datasets and 19 PEFT methods (Belanec et al., 2 Dec 2025).
7. Open Directions and Composition Beyond Scalar Weighting
PEFT module composition is not limited to arithmetic operations (addition, negation, scaling); more sophisticated schemes such as weighted summation, non-linear fusion, Fisher-weighted merging, fusion adapters applied after merging, and mode-connectivity-based interpolation remain active areas of investigation (Patel et al., 24 Jan 2025, Zhang et al., 2023). Challenges in scaling composition arise from non-orthogonality, interference, and mismatched module magnitudes. Promising avenues include module compositionality for multi-trait, style, or domain transfer, human-in-the-loop selection of composite deltas, and learning small fusion adapters post-merge to address residual interference. Modular design principles are critical for both research progress and practical downstream deployment at scale (Patel et al., 24 Jan 2025, Zhang et al., 2023, Belanec et al., 2 Dec 2025).