
MoA: Mixture-of-Adapters for Efficient Adaptation

Updated 9 June 2026
  • MoA is a suite of parameter-efficient fine-tuning techniques that leverages multiple lightweight adapter experts with dynamic routing to improve performance and generalization.
  • It employs routing strategies such as softmax-based dense routing and top-k sparse selection to enable expert specialization, bias mitigation, and task adaptation without full model retraining.
  • Empirical results demonstrate that MoA methods boost accuracy and adaptability across vision, language, speech, and multimodal domains while adding under 10% extra parameters.

Mixture-of-Adapters (MoA) refers to a family of techniques for parameter-efficient fine-tuning and adaptation of large neural models, in which a set of adapter “experts”, typically lightweight parameterized modules, is inserted at key locations in a frozen or partially-tuned backbone architecture and their outputs are dynamically combined per input or token via a learned routing or weighting mechanism. MoA methods generalize classic adapter-based tuning by leveraging specialization and diversity among adapter modules, often improving performance, generalization, and robustness without incurring the full cost of dense mixtures or full model fine-tuning. The approach has seen rapid adoption across vision, language, speech, multimodal, and continual learning domains, with numerous architectural, optimization, and application variants.

1. Core Architectural Principles

Classic adapter tuning inserts a single low-rank or bottleneck module (adapter) in each transformer block, updating only these lightweight layers while the backbone remains frozen. The Mixture-of-Adapters paradigm generalizes this by replacing the single adapter with a set of $N$ adapters (experts) per insertion point, each potentially with different initializations, capacities, or architectures. The outputs of these adapters are fused via a learned routing mechanism, often softmax-based gating, top-$k$ sparse selection, or a more complex routing network. The MoA output at a given location is:

$$O(x) = x + \sum_{i=1}^{N} \alpha_i(x)\, A_i(x)$$

where $A_i(x)$ denotes the output of adapter expert $i$ for input $x$, and $\alpha_i(x)$ is a data-dependent weight, typically parameterized by a lightweight gating network.

Adapters themselves may follow bottleneck designs (down-projection → nonlinearity → up-projection), parallel or serial insertion, Kronecker or low-rank factorization (as in LoRA/KAdaptation), or convolutional or attention-based variants depending on modality (Zhang et al., 2023, Diao et al., 2023, Wang et al., 2024, Cao et al., 6 Jun 2025, Cappellazzo et al., 2024).
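The mixture formula above can be made concrete with a small sketch. The following is a minimal, illustrative implementation in plain Python, assuming bottleneck experts with a ReLU nonlinearity and a linear softmax gate; the class names (`BottleneckAdapter`, `DenseMoA`), the dimensions, and the random initialization are hypothetical choices for illustration and are not taken from any of the cited papers.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class BottleneckAdapter:
    """Hypothetical minimal expert: down-projection -> ReLU -> up-projection."""
    def __init__(self, d, r, rng):
        self.down = rand_matrix(r, d, rng)  # d -> r
        self.up = rand_matrix(d, r, rng)    # r -> d

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, h)

class DenseMoA:
    """O(x) = x + sum_i alpha_i(x) * A_i(x) with a linear softmax gate (assumed)."""
    def __init__(self, d, r, n_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [BottleneckAdapter(d, r, rng) for _ in range(n_experts)]
        self.gate = rand_matrix(n_experts, d, rng)  # one logit per expert

    def weights(self, x):
        return softmax(matvec(self.gate, x))

    def __call__(self, x):
        alpha = self.weights(x)
        out = list(x)  # residual connection
        for a, expert in zip(alpha, self.experts):
            out = [o + a * e for o, e in zip(out, expert(x))]
        return out

moa = DenseMoA(d=8, r=2, n_experts=4)
x = [0.5] * 8
y = moa(x)
alpha = moa.weights(x)
```

In this dense variant every expert runs for every input, so compute grows linearly with the number of experts; the gate only decides how much each expert contributes.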

2. Routing and Specialization Mechanisms

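The key differentiator of MoA is a routing network or mechanism that determines the mixture weights $\alpha_i(x)$ for each input. Several flavors appear in the systems surveyed here, including dense softmax or sigmoid gating, top-$k$ sparse selection, stochastic per-batch sampling, and soft slot assignment.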

These mechanisms allow adapters to specialize for domains, input features, classes, or tasks, and to adaptively blend pre-trained and newly-acquired knowledge (Diao et al., 2023, Lee et al., 2023).
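As an illustration of top-k sparse selection, the sketch below keeps the k largest gate logits, applies a softmax over only those, and assigns zero weight to every other expert. This is one common variant (renormalizing over the selected experts); other systems apply the softmax first and then select, so treat the function `top_k_gate` as a hypothetical example rather than the routing rule of any specific paper.

```python
import math

def top_k_gate(logits, k):
    """Sparse mixture weights: softmax over the k largest logits, 0.0 elsewhere."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits (ties keep the lower index first).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exp = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exp.values())
    return [exp[i] / z if i in exp else 0.0 for i in range(len(logits))]

weights = top_k_gate([2.0, -1.0, 0.5, 2.0], k=2)
active = [i for i, w in enumerate(weights) if w > 0.0]
```

Only the experts listed in `active` need to be evaluated, which is where the compute savings of sparse routing come from.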

3. MoA Methodological Variants

Numerous architectural and training instantiations of MoA have been proposed:

  • Mixture of Sparse Adapters (MoSA): A dense adapter is partitioned into non-overlapping sparse modules, each stochastically sampled and updated. After training, modules are merged for efficient inference, achieving superior performance with no increase in inference cost (Zhang et al., 2023).
  • Mixture-of-Domain Adapters (MixDA): Original and domain-specific adapters are computed in parallel and dynamically fused via a gating network. A two-stage learning protocol mitigates catastrophic forgetting while retaining both domain-specific and general knowledge (Diao et al., 2023).
  • Heterogeneous MoA for LLMs: Experts are architecturally diverse (LoRA at various sites, bottleneck adapters, prompt tuning). This prevents collapse and load imbalance, with soft or sparse gating for efficiency-performance trade-off (Cao et al., 6 Jun 2025).
  • Self-Expansion and Continual Learning: Modular adapters are expanded only on demand, monitored via autoencoder-based distribution shift detection, with routers learned for efficient reuse and minimal growth (Wang et al., 2024).
  • Adapter Pruning and Weight-Space Mixing: Adapters trained for specific domains can be mixed via weight-space averaging, with empirical generalizability linked to sign agreement across adapter weights, and improved by pruning (Nguyen et al., 2024).
  • Task/Dataset Bias Mitigation: SMoA sparsely activates top-k sub-adapters per token, enabling specialization to mitigate specific dataset biases (Liu et al., 2023).
  • Multimodal/Multitask Fusion: MoA schemes support shared-adapter banks with task-customized routing for unified but adaptive multi-task training (Zhu et al., 2024).
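The partition-and-merge idea described for MoSA can be sketched on a flat parameter vector. The example below is a toy under stated assumptions: parameter indices are split at random into disjoint groups, each step samples one group and updates only its entries with a placeholder gradient, and the groups are finally merged into a single dense vector. The helper names and the update rule are hypothetical; the actual MoSA training procedure is described in Zhang et al. (2023).

```python
import random

def partition_indices(n_params, n_modules, rng):
    """Randomly split parameter indices into disjoint, near-equal groups."""
    idx = list(range(n_params))
    rng.shuffle(idx)
    return [sorted(idx[m::n_modules]) for m in range(n_modules)]

def merge_modules(modules, n_params):
    """Combine disjoint sparse modules into one dense parameter vector."""
    dense = [0.0] * n_params
    for module in modules:
        for i, value in module.items():
            dense[i] = value
    return dense

rng = random.Random(0)
n_params, n_modules = 12, 3
groups = partition_indices(n_params, n_modules, rng)
# Each sparse module stores only the entries it owns.
modules = [{i: 0.0 for i in g} for g in groups]

# Toy training loop: each step samples one module and updates only its entries.
for step in range(30):
    m = rng.randrange(n_modules)
    for i in modules[m]:
        # Placeholder gradient step pulling each weight toward 1.0.
        modules[m][i] -= 0.1 * (modules[m][i] - 1.0)

merged = merge_modules(modules, n_params)
```

Because the groups are disjoint and together cover every index, the merged vector has exactly the size of one dense adapter, which is why inference cost does not grow with the number of sparse modules.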

A summary of representative design dimensions across published MoA systems:

| Study / System | Adapter Placement | Routing/Gating | Adapter Diversity | Sparse or Dense | Specialization |
|---|---|---|---|---|---|
| MoSA (Zhang et al., 2023) | Transformers (visual) | Stochastic (per batch) | Masked sparse modules | Sparse | Modular permutations |
| MixDA (Diao et al., 2023) | FFN (language) | MLP with sigmoid or softmax | Domain, original, task | Dense | Domain/Task |
| MoA (Cao et al., 6 Jun 2025) | LLM layers | Linear + sigmoid / sparse | Q/K/V/FFN, prompts | Both | Structural |
| SEMA (Wang et al., 2024) | ViT blocks | Softmax over adapters | Added on shift | Dense/Sublinear | Task/distribution |
| Soft-MoA (Cappellazzo et al., 2024) | AST layers (audio) | Soft slot assignment | Identical | Soft | Input/slot |
| SMoA (Liu et al., 2023) | All attention/FFN (NLP) | Linear, top-k softmax | Standard | Sparse | Bias-specific |
| TTS MoA (Fujita et al., 2024) | Decoder, variance (TTS) | Linear-softmax (speaker) | Bottleneck | Sparse/Soft | Speaker |
| TC-MoA (Zhu et al., 2024) | Vision encoder/decoder | Task-specific top-k softmax | Shared, per-task gate | Sparse | Fusion task |

4. Training Protocols and Objectives

MoA tuning typically follows these regimes:

  • Frozen Backbone: All backbone parameters are kept fixed; only the adapters and routers are trained.
  • Adapter Parameterization: Each expert is a small parameter module (e.g., up/down projections, bottleneck MLP, LoRA factors, or miniature Convpass blocks in ViTs).
  • Regularization: Auxiliary objectives are used for load balancing (MoE loss), output consistency, feature alignment, cosine decorrelation (to avoid expert collapse), or mutual information regularization for multi-source fusion (Zhang et al., 2023, Cui et al., 2023, Zhu et al., 2024).
  • Task/Dataset-Aware Learning: Specialized losses, including prototype-calibrated contrastive terms, sampling/distance penalties, or knowledge distillation, may be incorporated to encourage domain invariance, specialization, or retention (Diao et al., 2023, Lee et al., 2023, Cui et al., 2023).
  • Efficiency Strategies: Merging sparse modules after training, pruning, and sublinear expansion further enhance efficiency and scalability (Zhang et al., 2023, Wang et al., 2024, Nguyen et al., 2024).
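To make the load-balancing regularizer mentioned above more tangible, here is a sketch of one widely used form from the sparse mixture-of-experts literature (the Switch Transformer style auxiliary loss, N · Σ f_i · P_i). It is an assumption that this particular form is representative; the cited MoA papers may use different balancing terms, and the function name and inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, n_experts):
    """Auxiliary loss N * sum_i f_i * P_i.

    router_probs: per-token list of routing probabilities over experts.
    assignments:  per-token index of the expert the token was dispatched to.
    f_i is the fraction of tokens sent to expert i; P_i is the mean routing
    probability for expert i. The value is 1.0 under perfectly uniform routing
    and grows as routing concentrates on few experts.
    """
    n_tokens = len(router_probs)
    frac = [0.0] * n_experts
    for a in assignments:
        frac[a] += 1.0 / n_tokens
    mean_prob = [
        sum(p[i] for p in router_probs) / n_tokens for i in range(n_experts)
    ]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

# Uniform routing over two experts versus routing that collapsed onto expert 0.
balanced = load_balance_loss([[0.5, 0.5]] * 4, [0, 1, 0, 1], n_experts=2)
collapsed = load_balance_loss([[0.9, 0.1]] * 4, [0, 0, 0, 0], n_experts=2)
```

In training, a term like this would typically be added to the task loss with a small coefficient so that it discourages expert collapse without dominating the main objective.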

5. Empirical Results and Generalization

Across vision, language, TTS, speech, and multimodal tasks, MoA consistently improves over baseline adapters and, in many cases, full fine-tuning. Empirical patterns include:

  • Visual Recognition: MoSA achieves accuracy gains of 1–2.5 pp over the best prior methods (AdaptFormer, LoRA, full fine-tuning) with no increase in inference or storage cost, across the evaluated datasets (FGVC, VTAB-1k, GICD) (Zhang et al., 2023).
  • Domain Generalization: Adapter mixtures improve out-of-distribution generalization, provide flatter loss surfaces (lower Hessian eigenvalues), and strategically allocate capacity to simple or complex regions of an input (e.g. foreground vs background tokens) (Lee et al., 2023).
  • NLP Domain Adaptation: MixDA exceeds classic adapters and full-tune baselines by 2–6 pts (50.0% avg. vs 44.2–48.9%) on out-of-domain and few-shot benchmarks, with strong gains in transfer and knowledge-intensive tasks (Diao et al., 2023).
  • Multimodal and ASR: MOSA matches or surpasses much heavier, monolithic projectors in LLM-based ASR, with sharp improvements in data-limited target languages (e.g. 15% relative WER reduction at 60% of baseline parameter count) (Li et al., 26 Aug 2025).
  • Continual Learning: Self-expanding MoA variants achieve higher accuracy (e.g. 86.98% on CIFAR-100) with sublinear growth, showing that adapters can be efficiently reused and expanded with minimal forgetting (Wang et al., 2024, Yu et al., 2024).
  • Low-Resource & Zero-Shot Adaptation: In TTS, MoA allows strong adaptation with <10% trainable parameters and as little as one minute of new-speaker data (Mehrish et al., 2023, Fujita et al., 2024).
  • Bias Mitigation: SMoA demonstrates improved robustness and interpretability against multiple known dataset biases in NLI and paraphrase tasks (Liu et al., 2023).
  • Multitask and Multi-Source Fusion: TC-MoA outperforms competing PEFT and single-adapter approaches on cross-domain fusion (multi-modal, multi-exposure, multi-focus) by learning both shared and task-specific representations (Zhu et al., 2024).
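Parameter-overhead claims such as "under 10% trainable parameters" can be sanity-checked with simple arithmetic. The sketch below counts the parameters of bottleneck experts plus a linear router per layer; every number in it (hidden size 768, bottleneck rank 16, 4 experts, 12 layers, and a backbone of roughly 86M parameters) is a hypothetical configuration chosen for illustration, not a setting reported in the cited papers.

```python
def bottleneck_params(d, r):
    """Weights and biases of a down-projection (d -> r) and up-projection (r -> d)."""
    return (d * r + r) + (r * d + d)

def moa_params(d, r, n_experts, n_layers):
    """Total added parameters: experts plus one linear router per layer."""
    router = d * n_experts + n_experts
    return n_layers * (n_experts * bottleneck_params(d, r) + router)

# Hypothetical configuration (assumed values, for illustration only).
added = moa_params(d=768, r=16, n_experts=4, n_layers=12)
backbone = 86_000_000  # assumed round figure for a ViT-B-sized backbone
overhead = added / backbone
print(f"added parameters: {added:,} ({overhead:.2%} of backbone)")
```

Under these assumptions the added parameters are a small fraction of the backbone; the fraction scales linearly with the number of experts, the bottleneck rank, and the number of insertion points.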

6. Analysis of Specialization, Generalizability, and Efficiency

MoA methods exhibit several consistent technical properties:

  • Specialization: Adapters naturally develop expert roles (e.g., domain, class, or bias specialization), made explicit via gating patterns or analyzed via correlation of expert usage and weight sign agreement (Liu et al., 2023, Nguyen et al., 2024, Li et al., 26 Aug 2025).
  • Efficiency: Merging or sparse gating enables scalability with minimal redundancy. Merged adapters after training match the capacity of dense models with the cost of a single adapter (Zhang et al., 2023, Wang et al., 2024).
  • Generalization: Selective or pruned adapter mixtures (using sign agreement metrics) minimize in-domain accuracy drop; large naive mixtures degrade, but careful selection can reduce the drop to <3 percentage points (Nguyen et al., 2024).
  • Capacity vs. Overfitting: Mixtures mitigate both under- and overfitting by tuning the degree of specialization (number and type of adapters) and using auxiliary balancing losses (Zhang et al., 2023, Cao et al., 6 Jun 2025, Lee et al., 2023).
  • Interpretability: Analysis of gating and expert roles reveals interpretable assignment of subspaces, domains, or input factors.
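The link between sign agreement and weight-space mixing noted above can be illustrated with a toy metric. The sketch below defines sign agreement as the fraction of weight positions where all adapters share the same nonzero sign, and mixes adapters by plain averaging. Both definitions are simplifying assumptions for illustration; the exact metric and pruning procedure used by Nguyen et al. (2024) may differ.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def sign_agreement(adapters):
    """Fraction of positions where all adapters share the same nonzero sign."""
    n = len(adapters[0])
    agree = 0
    for j in range(n):
        signs = {sign(a[j]) for a in adapters}
        if len(signs) == 1 and 0 not in signs:
            agree += 1
    return agree / n

def average_weights(adapters):
    """Weight-space mixing by element-wise averaging."""
    k = len(adapters)
    return [sum(col) / k for col in zip(*adapters)]

# Two hypothetical flattened adapters; they disagree in sign only at position 1.
a = [0.4, -0.2, 0.1, -0.3]
b = [0.6, 0.2, 0.3, -0.1]
score = sign_agreement([a, b])
mixed = average_weights([a, b])
```

Positions where signs conflict tend to cancel under averaging (position 1 here averages to roughly zero), which gives some intuition for why low sign agreement could hurt a naive mixture.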

7. Practical Considerations and Adoption

Best practices include:

MoA is widely adopted in vision (ViT and Swin transformers), speech (TTS, ASR), NLP (LLMs, PLMs), multimodal models (CLIP), continual learning, and multi-source fusion. Open-source implementations and toolkits now support MoA integration in standard PEFT stacks and transformer libraries.


References

  • "MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning" (Zhang et al., 2023)
  • "Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories" (Diao et al., 2023)
  • "Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning" (Wang et al., 2024)
  • "MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models" (Cao et al., 6 Jun 2025)
  • "Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters" (Cappellazzo et al., 2024)
  • "Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning" (Nguyen et al., 2024)
  • "ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation" (Mehrish et al., 2023)
  • "SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases" (Liu et al., 2023)
  • "Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters" (Lee et al., 2023)
  • "Lightweight Zero-shot Text-to-Speech with Mixture of Adapters" (Fujita et al., 2024)
  • "Exploring Training on Heterogeneous Data with Mixture of Low-rank Adapters" (Zhou et al., 2024)
  • "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters" (Yu et al., 2024)
  • "Task-Customized Mixture of Adapters for General Image Fusion" (Zhu et al., 2024)
  • "MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR" (Li et al., 26 Aug 2025)