AdapterFusion: Modular PEFT for Transfer
- AdapterFusion is a parameter-efficient framework that enables modular fusion of frozen, task-specific adapters using lightweight attention.
- The architecture employs a two-stage process: first, training independent adapters; then, fusing their outputs for non-destructive knowledge transfer.
- Empirical results show AdapterFusion improves performance in NLU, code, and ASR tasks while significantly reducing parameter overhead.
AdapterFusion is a parameter-efficient fine-tuning (PEFT) architecture designed to facilitate modular, non-destructive knowledge transfer across diverse tasks or domains in large pre-trained models. By training compact task- or domain-specific adapters and subsequently learning to fuse their outputs via lightweight attention modules, AdapterFusion enables efficient, composition-based transfer on new tasks without altering either the base model or source adapters. This approach is now foundational for multi-task adaptation, modular bias mitigation, and cross-lingual or cross-modal transfer in both natural language and code domains.
1. Architectural Principles and Mathematical Formulation
AdapterFusion operates within a two-stage framework. In the first stage, a set of adapters $\{A_1, \dots, A_N\}$ is trained independently on distinct source tasks or domains, each inserted after the feed-forward network (FFN) in every Transformer block, while the core model parameters $\Theta$ are frozen. Each adapter $A_n$ at layer $l$ applies a bottleneck transformation: $A_n^l(h^l) = U_n^l\, f(D_n^l h^l) + h^l$, with $D_n^l \in \mathbb{R}^{d_b \times d}$, $U_n^l \in \mathbb{R}^{d \times d_b}$, and $d_b \ll d$ for adapter bottleneck size $d_b$.
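A minimal PyTorch sketch of such a bottleneck adapter is given below; the dimensions, the choice of nonlinearity, and the module name are illustrative assumptions, not the reference implementation:

```python
# Minimal sketch of a bottleneck adapter (assumed PyTorch; hyperparameters are
# illustrative). Down-project, apply a nonlinearity, up-project, add a residual.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model: int = 768, d_bottleneck: int = 48):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)   # D_n^l
        self.up = nn.Linear(d_bottleneck, d_model)     # U_n^l
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model), the hidden state after the FFN sub-layer
        return self.up(self.act(self.down(h))) + h     # residual connection
```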
In the second stage, AdapterFusion injects, per Transformer layer, a new set of fusion parameters $\{Q^l, K^l, V^l\}$ that attend over the fixed adapter outputs. Given the post-FFN hidden state $h^l$ (the "query") and the adapter outputs $\{A_1^l(h^l), \dots, A_N^l(h^l)\}$ (keys and values), AdapterFusion computes:

$$s_n^l = \operatorname{softmax}_n\!\big((h^l Q^l)\cdot(A_n^l(h^l)\,K^l)\big), \qquad o^l = \sum_{n=1}^{N} s_n^l\,\big(A_n^l(h^l)\,V^l\big).$$

The fused output $o^l$ is then residually added to $h^l$ and forms the output of the Transformer block. Only the fusion parameters $\{Q^l, K^l, V^l\}$ are updated during this knowledge composition step; all base model and adapter weights remain frozen (Pfeiffer et al., 2020, Saberi et al., 2023, Esmaeili et al., 3 Nov 2025).
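The following sketch illustrates this per-layer fusion attention under the formulation above; tensor shapes, module names, and the absence of normalization details are assumptions rather than the published implementation:

```python
# Minimal sketch of the per-layer fusion module (assumed PyTorch). The post-FFN
# hidden state acts as the query; the frozen adapter outputs act as keys/values.
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)  # Q^l
        self.key = nn.Linear(d_model, d_model)    # K^l
        self.value = nn.Linear(d_model, d_model)  # V^l

    def forward(self, h: torch.Tensor, adapter_outs: list[torch.Tensor]) -> torch.Tensor:
        # h: (batch, seq, d); adapter_outs: N tensors of shape (batch, seq, d)
        a = torch.stack(adapter_outs, dim=2)           # (batch, seq, N, d)
        q = self.query(h).unsqueeze(2)                 # (batch, seq, 1, d)
        k, v = self.key(a), self.value(a)              # (batch, seq, N, d)
        scores = (q * k).sum(-1)                       # dot product per adapter -> (batch, seq, N)
        attn = scores.softmax(dim=-1).unsqueeze(-1)    # (batch, seq, N, 1)
        o = (attn * v).sum(dim=2)                      # weighted sum -> (batch, seq, d)
        return o + h                                   # residual into the block output
```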
2. Training and Implementation Workflow
The established AdapterFusion pipeline is:
- Knowledge Extraction: Independently train each adapter $A_n$ for a source domain/task $n$ on its own data $D_n$, holding the base model parameters $\Theta$ fixed.
- Knowledge Composition: (a) Freeze both $\Theta$ and all adapters $\{A_n\}$. (b) Introduce and train the per-layer fusion modules $\{Q^l, K^l, V^l\}$ solely on the target task, typically via a cross-entropy or next-token prediction loss; optimization commonly uses AdamW with a small learning rate and weight decay $0.01$ (Esmaeili et al., 3 Nov 2025, Saberi et al., 2023, Pfeiffer et al., 2020, Han et al., 2024). A minimal training sketch follows this list.
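The sketch below illustrates the knowledge-composition step, assuming a PyTorch model whose fusion parameters can be identified by name; the model, data loader, and `"fusion"` naming convention are hypothetical placeholders:

```python
# Minimal sketch of fusion-only training (assumed PyTorch). All base-model and
# adapter weights are frozen; only fusion parameters receive gradient updates.
import torch

def train_fusion(model, target_loader, epochs: int = 3, lr: float = 5e-5):
    # Freeze everything, then re-enable gradients for fusion parameters only.
    for p in model.parameters():
        p.requires_grad = False
    fusion_params = [p for n, p in model.named_parameters() if "fusion" in n]
    for p in fusion_params:
        p.requires_grad = True

    opt = torch.optim.AdamW(fusion_params, lr=lr, weight_decay=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for batch in target_loader:
            logits = model(batch["input_ids"], attention_mask=batch["attention_mask"])
            loss = loss_fn(logits, batch["labels"])
            loss.backward()
            opt.step()
            opt.zero_grad()
```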
The bottlenecked adapters and lightweight fusion heads enforce strong parameter efficiency. For a 12-layer model with hidden size $d = 768$, a single adapter adds roughly $2 d d_b$ parameters per layer (on the order of $0.1$M for typical bottleneck sizes), while each fusion module adds $3 d^2 \approx 1.8$M parameters per layer. The number of trainable parameters is thus an order of magnitude smaller than full fine-tuning, while enabling rich composition of latent knowledge.
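As a back-of-the-envelope check (plain Python; the bottleneck size is an assumed typical value, and biases and layer norms are ignored):

```python
# Rough parameter counts for the figures above.
d, d_b, layers = 768, 48, 12

adapter_per_layer = 2 * d * d_b   # down- and up-projection: ~74k (~0.07M)
fusion_per_layer = 3 * d * d      # Q, K, V matrices: ~1.77M

print(f"adapter/layer: {adapter_per_layer/1e6:.2f}M, total: {layers*adapter_per_layer/1e6:.2f}M")
print(f"fusion/layer:  {fusion_per_layer/1e6:.2f}M, total: {layers*fusion_per_layer/1e6:.2f}M")
```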
3. Empirical Performance and Use Cases
Across diverse domains—natural language understanding (NLU), multilingual code summarization, speech recognition, and bias mitigation—AdapterFusion consistently matches or outperforms strong baselines, especially in data-scarce target tasks:
- GLUE/SuperGLUE (NLU): AdapterFusion improves mean dev-set accuracy by +1.5–1.8 points over full fine-tuning or single-task adapters, with the largest gains on low-resource tasks (e.g., RTE: +12.2 pts) (Pfeiffer et al., 2020, Frohmann et al., 2023).
- Code Summarization/Method Naming: AdapterFusion delivers +0.6–1.7 BLEU and +1–2 F1 improvements over LoRA and monolingual adapters, with a 4× parameter savings (Saberi et al., 2023, Esmaeili et al., 3 Nov 2025).
- Speech Recognition (ASR): AdapterFusion variants achieve an 8% mean WER reduction compared to full fine-tuning, with only 17–23% of model parameters updated (Ngai et al., 2023).
- Bias Mitigation: Modular AdapterFusion (DAM) reduces protected-attribute leakage (e.g., gender 64.4% → 53.3%) without sacrificing task accuracy, enabling plug-in fairness control (Kumar et al., 2023).
A compact summary of PEFT performance appears below:
| Domain (Metric) | Trainable Params (% of full model) | Improvement over Baseline |
|---|---|---|
| Code (BLEU-4) | ~18% | +0.6–1.7 BLEU |
| NLU (Accuracy) | ~9% | +1.5–1.8 pts |
| Speech (WER) | ~17–23% | –8% rel. WER |
| Bias Mitigation | ~1–13% | –10–20% protected attribute leakage |
On extremely limited data (few-shot; 30 examples), however, AdapterFusion, with its relatively large fusion module, can be outperformed by more aggressive parameter- or representation-merging strategies (e.g., MerA or ScaLearn); in standard PEFT regimes it still outperforms monolithic adapters (He et al., 2023, Frohmann et al., 2023).
4. Strengths, Limitations, and Extensions
Strengths:
- Non-destructive and modular: Adapters and the backbone remain untouched; dynamic reuse and plug-in composition are possible (Han et al., 2024).
- Layer-wise interpretability: Attention scores provide explicit insight into source contributions (Esmaeili et al., 3 Nov 2025).
- Parameter efficiency: Fusion incurs only a fraction of full-model tuning overhead, enabling rapid iteration and serving (Esmaeili et al., 3 Nov 2025, Saberi et al., 2023).
- Integrates into multi-tenant inference: Multiple tasks/domains share a backbone; per-task overhead is limited to the fusion head and selected adapters (Han et al., 2024).
Limitations:
- Target bias: Without explicit regularization, the fusion head often collapses to the adapter matching the target input's domain, limiting cross-domain transfer (Saberi et al., 2023, Esmaeili et al., 3 Nov 2025). A plausible implication is suboptimal knowledge utilization for low-resource targets.
- Parameter scaling: Each additional target task requires its own fusion head; parameter and computational overhead scale linearly with the number of adapters/tasks (Frohmann et al., 2023).
- Few-shot learning: In minimal data settings, AdapterFusion's overhead can introduce overfitting or underutilization, with simple adapters or MerA often preferred (He et al., 2023).
Notable extensions include AdvFusion, which employs adversarial masking to enforce cross-adapter knowledge sharing, and Audio-AdapterFusion, which adapts the architecture for ASR while eliminating the need for task-ID routing at inference (Saberi et al., 2023, Esmaeili et al., 3 Nov 2025, Ngai et al., 2023). A sketch of the adversarial-masking idea appears below.
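The following is a minimal sketch of the adversarial-masking idea, assuming the mask is applied by suppressing the target-domain adapter's attention score during an initial training phase; the cited AdvFusion work may differ in its exact mechanism:

```python
# Minimal sketch (assumed, illustrative): mask the target-domain adapter's score
# so that fusion is forced to draw knowledge from the other adapters.
import torch

def fuse_scores_with_mask(scores: torch.Tensor, target_idx: int, mask_target: bool) -> torch.Tensor:
    # scores: (batch, seq, N) raw fusion attention scores over N adapters
    if mask_target:
        scores = scores.clone()
        scores[..., target_idx] = float("-inf")   # exclude the target-domain adapter
    return scores.softmax(dim=-1)                 # renormalize over the remaining adapters
```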
5. Applications in Natural Language, Code, and Beyond
AdapterFusion has been validated across a spectrum of domains:
- Natural Language (GLUE, SuperGLUE, argument mining, sentiment analysis, NLI): Transfer between tasks via reusable NLU adapters (Pfeiffer et al., 2020, Frohmann et al., 2023).
- Multilingual Code Understanding and Generation: Composition of language adapters enables robust performance for cross-language code summarization, method naming, and translation in CodeBERT/GraphCodeBERT (Esmaeili et al., 3 Nov 2025, Saberi et al., 2023).
- Bias and Fairness Control: DAM leverages AdapterFusion to provide attribute-specific debiasing in transformer models without catastrophic forgetting, and with plug-and-play attribute masking at inference (Kumar et al., 2023); see the sketch after this list.
- Speech Recognition: Multi-task ASR uses AdapterFusion-like adapters for task-agnostic deployment, outperforming single-adapter routing (Ngai et al., 2023).
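A minimal sketch of plug-and-play attribute masking at inference is given below; the adapter names and selection helper are hypothetical illustrations, not the DAM implementation:

```python
# Minimal sketch (assumed): debiasing adapters are simply dropped from the
# fusion input when their attribute should not be controlled at inference time.
import torch

def select_adapter_outputs(adapter_outs: dict[str, torch.Tensor],
                           active: set[str]) -> list[torch.Tensor]:
    # adapter_outs maps adapter names (e.g. "task", "gender_debias") to their outputs.
    return [out for name, out in adapter_outs.items() if name in active]

# Usage: fuse only the task adapter, or the task plus a gender-debiasing adapter.
# outs = select_adapter_outputs(adapter_outs, {"task", "gender_debias"})
# o = fusion(h, outs)
```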
A recurring theme is AdapterFusion's value in leveraging pre-trained, domain-specialized adapters for rapid adaptation to new domains or cross-domain amalgamation—while minimizing computation and catastrophic forgetting.
6. Comparative Perspective and Evolving Alternatives
While AdapterFusion remains a cornerstone of PEFT-based knowledge composition, recent work has identified contexts where the cost of its per-task fusion modules (which scale quadratically with the hidden dimension) or the expressiveness of attention-based fusion may be suboptimal:
- ScaLearn: Achieves similar or better transfer with two orders of magnitude fewer fusion parameters, via simple scaling coefficients instead of full attention fusion (Frohmann et al., 2023); a minimal sketch of this idea follows the list.
- MerA: For few-shot scenarios, direct alignment and merging of adapter weights yields superior accuracy and lower overhead, particularly when sources are "same-track" (task-matched) (He et al., 2023).
- Compacter, LoRA: Alternative PEFT mechanisms sometimes outperform AdapterFusion for long-sequence code generation and translation (Esmaeili et al., 3 Nov 2025).
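The contrast with attention-based fusion can be seen in the following sketch of scaling-coefficient composition, assuming one learned scalar per source adapter per layer; this is an illustration of the general idea rather than the ScaLearn implementation:

```python
# Minimal sketch (assumed): a learned scalar per source adapter scales and sums
# the adapter outputs, reducing fusion parameters to N scalars per layer.
import torch
import torch.nn as nn

class ScalarFusion(nn.Module):
    def __init__(self, num_adapters: int):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_adapters) / num_adapters)

    def forward(self, h: torch.Tensor, adapter_outs: list[torch.Tensor]) -> torch.Tensor:
        a = torch.stack(adapter_outs, dim=0)                  # (N, batch, seq, d)
        o = (self.weights.view(-1, 1, 1, 1) * a).sum(dim=0)   # weighted sum over adapters
        return o + h
```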
A plausible implication is that AdapterFusion occupies a “sweet spot” between maximally flexible on-the-fly composition and parameter-minimal static merging, and that hybrid schemes (e.g., adversarial masking, AdapterDrop) may further enhance its performance envelope (Esmaeili et al., 3 Nov 2025, Saberi et al., 2023, Han et al., 2024).
7. Summary and Future Directions
AdapterFusion formalizes a modular learning paradigm within the PEFT ecosystem, supporting scalable, non-destructive, and interpretable transfer learning. Its attention-based composition of frozen adapters supports multi-domain and multi-attribute transfer with minimal additional parameters. Open challenges remain in achieving deeper cross-domain knowledge integration, enhancing few-shot efficiency, and scaling to very large adapter libraries. Ongoing research continues to explore parameter-efficient alternatives and adversarial or multi-modal extensions, ensuring AdapterFusion's continuing relevance in modern adaptation and transfer learning pipelines (Pfeiffer et al., 2020, Saberi et al., 2023, Esmaeili et al., 3 Nov 2025, Han et al., 2024).