MoA: Mixture-of-Adapters for Efficient Adaptation
- MoA is a suite of parameter-efficient fine-tuning techniques that leverages multiple lightweight adapter experts with dynamic routing to improve performance and generalization.
- It employs routing strategies such as softmax-based dense routing and top-k sparse selection to enable expert specialization, bias mitigation, and task adaptation without full model retraining.
- Empirical results demonstrate that MoA methods boost accuracy and adaptability across vision, language, speech, and multimodal domains while adding under 10% extra parameters.
Mixture-of-Adapters (MoA) refers to a family of techniques for parameter-efficient fine-tuning and adaptation of large neural models, in which a set of adapter “experts”, typically lightweight parameterized modules, is inserted at key locations in a frozen or partially-tuned backbone architecture and their outputs are dynamically combined per input or token via a learned routing or weighting mechanism. MoA methods generalize classic adapter-based tuning by leveraging specialization and diversity among adapter modules, often improving performance, generalization, and robustness without incurring the full cost of dense mixtures or full model fine-tuning. The approach has seen rapid adoption across vision, language, speech, multimodal, and continual learning domains, with numerous architectural, optimization, and application variants.
1. Core Architectural Principles
Classic adapter tuning inserts a single low-rank or bottleneck module (adapter) in each transformer block, updating only these lightweight layers while the backbone remains frozen. The Mixture-of-Adapters paradigm generalizes this by replacing the single adapter with a set of adapters (experts) per insertion point, each potentially with different initializations, capacities, or architectures. The outputs of these adapters are fused via a learned routing mechanism, often softmax-based gating, top-$k$ sparse selection, or more complex routing networks. The MoA output at a given location is:
$$y(x) = \sum_{i=1}^{N} g_i(x)\, A_i(x)$$
where $N$ is the number of adapter experts at that location, $A_i(x)$ denotes the output of adapter expert $i$ for input $x$, and $g_i(x)$ is a data-dependent weight, typically parameterized by a lightweight gating network.
Adapters themselves may follow bottleneck designs (down-projection → nonlinearity → up-projection), parallel or serial insertion, Kronecker- or low-rank factorization (as in LoRA/KAdaptation), or variant convolutional or attention-based structures depending on modality (Zhang et al., 2023, Diao et al., 2023, Wang et al., 2024, Cao et al., 6 Jun 2025, Cappellazzo et al., 2024).
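As a rough illustration of this section, the sketch below combines several bottleneck adapters through a softmax gate, so the output is a weighted sum of expert outputs with weights that depend on the input. It is a minimal pure-Python sketch, not any cited system's implementation: the class names `BottleneckAdapter` and `DenseMoA`, the ReLU nonlinearity, the single linear gate, and the random initialization are all assumptions made for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class BottleneckAdapter:
    """Hypothetical expert: down-projection -> ReLU -> up-projection."""

    def __init__(self, d_model, d_bottleneck, rng):
        self.W_down = rand_matrix(d_bottleneck, d_model, rng)
        self.W_up = rand_matrix(d_model, d_bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.W_down, x)]
        return matvec(self.W_up, hidden)


class DenseMoA:
    """Hypothetical dense mixture: every expert runs, a linear gate weights them."""

    def __init__(self, d_model, d_bottleneck, n_experts, seed=0):
        rng = random.Random(seed)
        self.experts = [
            BottleneckAdapter(d_model, d_bottleneck, rng) for _ in range(n_experts)
        ]
        self.W_gate = rand_matrix(n_experts, d_model, rng)  # assumed linear router

    def gate(self, x):
        """Input-dependent mixture weights that sum to 1."""
        return softmax(matvec(self.W_gate, x))

    def __call__(self, x):
        weights = self.gate(x)
        outputs = [expert(x) for expert in self.experts]
        # Weighted sum of expert outputs, one coordinate at a time.
        return [
            sum(w * out[j] for w, out in zip(weights, outputs))
            for j in range(len(x))
        ]


moa = DenseMoA(d_model=8, d_bottleneck=2, n_experts=3, seed=0)
mixed = moa([0.5] * 8)
```

In a real model the experts and gate would be trained tensors inside a frozen backbone; the lists here only make the data flow explicit.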
2. Routing and Specialization Mechanisms
What distinguishes MoA is the routing network or mechanism that determines the mixture weights for each input. Several flavors exist:
- Soft (Dense) Routing: All experts are linearly combined using a softmax or sigmoid over expert scores derived from each input (or token). This allows for differentiable end-to-end learning and smooth specialization (Cappellazzo et al., 2024, Cao et al., 6 Jun 2025, Diao et al., 2023).
- Sparse Routing / Top-$k$: Only the $k$ highest-scoring experts receive non-zero weight for each input (sparse gating), reducing compute and encouraging greater specialization (Zhang et al., 2023, Liu et al., 2023, Yu et al., 2024).
- Hierarchical Routing: Gating is factorized, e.g. some modules are always active (dense), others are sparsified, to address data scarcity and expert dilution (Zhang et al., 2023).
- Domain- or Prompt-Aware Routing: Incorporates learnable prompts or external metadata (e.g. speaker embedding, task/domain ID) into the gating logic, enabling adaptive selection by context (Liu et al., 6 Mar 2025, Fujita et al., 2024).
- Task-Adaptive / Expansion Routers: In continual/multi-task settings, routers grow with tasks and can support dynamic or sublinear expansion based on distribution shift (Wang et al., 2024, Zhou et al., 2024).
These mechanisms allow adapters to specialize for domains, input features, classes, or tasks, and to adaptively blend pre-trained and newly-acquired knowledge (Diao et al., 2023, Lee et al., 2023).
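To make the sparse routing variant concrete, the following sketch keeps only the k highest-scoring experts, renormalizes their scores with a softmax, and skips every other expert entirely. The function names `top_k_gate` and `sparse_mixture` are hypothetical, and renormalizing over the selected experts only is one common choice rather than something the cited papers are confirmed to share.

```python
import math


def top_k_gate(scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are a softmax over the selected scores only, so they sum to 1.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of experts")
    selected = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in selected)
    exps = {i: math.exp(scores[i] - m) for i in selected}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sparse_mixture(x, experts, scores, k):
    """Evaluate only the selected experts and combine their outputs."""
    weights = top_k_gate(scores, k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        y = experts[i](x)  # unselected experts are never called
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy experts that scale the input by a fixed factor (placeholders for adapters).
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
routed = sparse_mixture([1.0, 1.0], toy_experts, scores=[2.0, 0.5, 1.0, -1.0], k=2)
```

With k equal to the number of experts this reduces to dense softmax routing, which is one way to see the two as ends of the same trade-off between compute and coverage.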
3. MoA Methodological Variants
Numerous architectural and training instantiations of MoA have been proposed:
- Mixture of Sparse Adapters (MoSA): A dense adapter is partitioned into non-overlapping sparse modules, each stochastically sampled and updated. After training, modules are merged for efficient inference, achieving superior performance with no increase in inference cost (Zhang et al., 2023).
- Mixture-of-Domain Adapters (MixDA): Original and domain-specific adapters are computed in parallel and dynamically fused via a gating network. A two-stage learning protocol prevents catastrophic forgetting and supports both domain specialization and generalization (Diao et al., 2023).
- Heterogeneous MoA for LLMs: Experts are architecturally diverse (LoRA at various sites, bottleneck adapters, prompt tuning). This prevents collapse and load imbalance, with soft or sparse gating for efficiency-performance trade-off (Cao et al., 6 Jun 2025).
- Self-Expansion and Continual Learning: Modular adapters are expanded only on demand, monitored via autoencoder-based distribution shift detection, with routers learned for efficient reuse and minimal growth (Wang et al., 2024).
- Adapter Pruning and Weight-Space Mixing: Adapters trained for specific domains can be mixed via weight-space averaging, with empirical generalizability linked to sign agreement across adapter weights, and improved by pruning (Nguyen et al., 2024).
- Task/Dataset Bias Mitigation: SMoA sparsely activates top-k sub-adapters per token, enabling specialization to mitigate specific dataset biases (Liu et al., 2023).
- Multimodal/Multitask Fusion: MoA schemes support shared-adapter banks with task-customized routing for unified but adaptive multi-task training (Zhu et al., 2024).
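The partition-then-merge idea described for MoSA in the list above can be pictured with a toy example: parameter indices are split into disjoint sets, each step samples one set and updates only its entries, and the sets are merged into a single dense vector afterwards. This is only a loose reading of that one-sentence description, not the published algorithm; the helper names, the uniform sampling, and the constant placeholder gradient are all assumptions.

```python
import random


def make_partition(n_params, n_modules, rng):
    """Split parameter indices into n_modules disjoint index sets."""
    idx = list(range(n_params))
    rng.shuffle(idx)
    return [sorted(idx[m::n_modules]) for m in range(n_modules)]


def train_step(modules, partition, grads, lr, rng):
    """Sample one sparse module and update only the entries it owns."""
    m = rng.randrange(len(modules))
    for i in partition[m]:
        modules[m][i] -= lr * grads[i]
    return m


def merge_modules(modules, n_params):
    """Merge disjoint sparse modules back into one dense parameter vector."""
    dense = [0.0] * n_params
    for module in modules:
        for i, value in module.items():
            dense[i] = value  # index sets are disjoint, so nothing is overwritten
    return dense


rng = random.Random(0)
n_params, n_modules = 12, 3
partition = make_partition(n_params, n_modules, rng)
modules = [{i: 0.0 for i in part} for part in partition]
for _ in range(20):
    grads = [1.0] * n_params  # placeholder gradient, not a real loss
    train_step(modules, partition, grads, lr=0.1, rng=rng)
dense_adapter = merge_modules(modules, n_params)
```

Because the index sets do not overlap, the merged vector has the same size as a single dense adapter, which is the property behind the "no increase in inference cost" claim.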
A summary of representative design dimensions across published MoA systems:
| Study / System | Adapter Placement | Routing/Gating | Adapter Diversity | Sparse or Dense | Specialization |
|---|---|---|---|---|---|
| (Zhang et al., 2023) MoSA | Transformers (visual) | Stochastic (per batch) | Masked sparse mods | Sparse | Modular perms |
| (Diao et al., 2023) MixDA | FFN (language) | MLP/Sigmoid or Softmax | Domain, orig, task | Dense | Domain/Task |
| (Cao et al., 6 Jun 2025) MoA | LLM layers | Linear+sigmoid/sparse | Q/K/V/FFN, prompts | Both | Structural |
| (Wang et al., 2024) SEMA | ViT blocks | Softmax over adapters | Added on shift | Dense/Sublinear | Task/distr. |
| (Cappellazzo et al., 2024) Soft-MoA | AST layers (audio) | Soft slot assignment | Identical | Soft | Input/slot |
| (Liu et al., 2023) SMoA | All attention/FFN (NLP) | Linear, top-k softmax | Standard | Sparse | Bias-specific |
| (Fujita et al., 2024) TTS MoA | Decoder, variance (TTS) | Linear-softmax (speaker) | Bottleneck | Sparse/Soft | Speaker |
| (Zhu et al., 2024) TC-MoA | Vision encoder/decoder | Task-specific top-k softmax | Shared, per-task gate | Sparse | Fusion task |
4. Training Protocols and Objectives
MoA tuning typically follows these regimes:
- Frozen Backbone: All backbone parameters are kept fixed; only the adapters and routers are trained.
- Adapter Parameterization: Each expert is a small parameter module (e.g., up/down projections, bottleneck MLP, LoRA factors, or miniature Convpass blocks in ViTs).
- Regularization: Auxiliary objectives are used for load balancing (MoE loss), output consistency, feature alignment, cosine decorrelation (to avoid expert collapse), or mutual information regularization for multi-source fusion (Zhang et al., 2023, Cui et al., 2023, Zhu et al., 2024).
- Task/Dataset-Aware Learning: Specialized losses, including prototype-calibrated contrastive terms, sampling/distance penalties, or knowledge distillation, may be incorporated to encourage domain invariance, specialization, or retention (Diao et al., 2023, Lee et al., 2023, Cui et al., 2023).
- Efficiency Strategies: Merging sparse modules after training, pruning, and sublinear expansion further enhance efficiency and scalability (Zhang et al., 2023, Wang et al., 2024, Nguyen et al., 2024).
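The load-balancing regularizer mentioned above is not specified in detail here, so the sketch below uses one widely known form as a stand-in: the Switch-Transformer-style auxiliary loss, which multiplies the fraction of tokens dispatched to each expert by the mean gate probability for that expert. Treat the choice of this particular loss, and the function name `load_balance_loss`, as assumptions rather than what any cited MoA paper uses.

```python
def load_balance_loss(gate_probs):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i.

    gate_probs: one list of per-expert probabilities per token.
    f_i is the fraction of tokens whose top-1 expert is i,
    P_i is the mean gate probability assigned to expert i.
    """
    n_tokens = len(gate_probs)
    n_experts = len(gate_probs[0])
    counts = [0] * n_experts
    for probs in gate_probs:
        top1 = max(range(n_experts), key=lambda i: probs[i])
        counts[top1] += 1
    f = [c / n_tokens for c in counts]
    P = [sum(probs[i] for probs in gate_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))


balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4
balanced_loss = load_balance_loss(balanced)    # tokens split evenly
collapsed_loss = load_balance_loss(collapsed)  # every token picks expert 0
```

The loss is smallest when routing is uniform and grows as tokens concentrate on a few experts, so adding it (scaled by a small coefficient) to the task loss discourages expert collapse.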
5. Empirical Results and Generalization
Across vision, language, TTS, speech, and multimodal tasks, MoA consistently improves over baseline adapters and, in many cases, full fine-tuning. Empirical patterns include:
- Visual Recognition: MoSA achieves accuracy gains of 1–2.5 pp over the best prior methods (AdaptFormer, LoRA, full fine-tuning) with no increase in inference cost or storage, across the evaluated datasets (FGVC, VTAB-1k, GICD) (Zhang et al., 2023).
- Domain Generalization: Adapter mixtures improve out-of-distribution generalization, provide flatter loss surfaces (lower Hessian eigenvalues), and strategically allocate capacity to simple or complex regions of an input (e.g. foreground vs background tokens) (Lee et al., 2023).
- NLP Domain Adaptation: MixDA exceeds classic adapters and full-tune baselines by 2–6 pts (50.0% avg. vs 44.2–48.9%) on out-of-domain and few-shot benchmarks, with strong gains in transfer and knowledge-intensive tasks (Diao et al., 2023).
- Multimodal and ASR: MOSA matches or surpasses much heavier, monolithic projectors in LLM-based ASR, with sharp improvements in data-limited target languages (e.g. 15% relative WER reduction at 60% of baseline parameter count) (Li et al., 26 Aug 2025).
- Continual Learning: Self-expanding MoA variants achieve higher accuracy (e.g. 86.98% on CIFAR-100) with sublinear growth, showing that adapters can be efficiently reused and expanded with minimal forgetting (Wang et al., 2024, Yu et al., 2024).
- Low-Resource & Zero-Shot Adaptation: In TTS, MoA allows strong adaptation with <10% trainable parameters and as little as one minute of new-speaker data (Mehrish et al., 2023, Fujita et al., 2024).
- Bias Mitigation: SMoA demonstrates improved robustness and interpretability against multiple known dataset biases in NLI and paraphrase tasks (Liu et al., 2023).
- Multitask and Multi-Source Fusion: TC-MoA outperforms competing PEFT and single-adapter approaches on cross-domain fusion (multi-modal, multi-exposure, multi-focus) by learning both shared and task-specific representations (Zhu et al., 2024).
6. Analysis of Specialization, Generalizability, and Efficiency
MoA methods exhibit several consistent technical properties:
- Specialization: Adapters naturally develop expert roles (e.g., domain, class, or bias specialization), made explicit via gating patterns or analyzed via correlation of expert usage and weight sign agreement (Liu et al., 2023, Nguyen et al., 2024, Li et al., 26 Aug 2025).
- Efficiency: Merging or sparse gating enables scalability with minimal redundancy. Merged adapters after training match the capacity of dense models with the cost of a single adapter (Zhang et al., 2023, Wang et al., 2024).
- Generalization: Selective or pruned adapter mixtures (using sign agreement metrics) minimize in-domain accuracy drop; large naive mixtures degrade, but careful selection can reduce the drop to <3 percentage points (Nguyen et al., 2024).
- Capacity vs. Overfitting: Mixtures mitigate both under- and overfitting by tuning the degree of specialization (number and type of adapters) and using auxiliary balancing losses (Zhang et al., 2023, Cao et al., 6 Jun 2025, Lee et al., 2023).
- Interpretability: Analysis of gating and expert roles reveals interpretable assignment of subspaces, domains, or input factors.
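The link between sign agreement and weight-space mixing can be illustrated with a small sketch: measure the fraction of coordinates on which all adapters share the same sign, then average the adapters coordinate-wise. The exact metric used by Nguyen et al. is not given in this article, so the definition of `sign_agreement` below is an assumption, and adapters are flattened to plain lists for simplicity.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_agreement(adapters):
    """Fraction of coordinates where every adapter has the same sign."""
    n = len(adapters[0])
    agree = sum(
        1 for j in range(n) if len({sign(a[j]) for a in adapters}) == 1
    )
    return agree / n


def average_weights(adapters):
    """Coordinate-wise mean of several flattened adapters."""
    k = len(adapters)
    return [sum(a[j] for a in adapters) / k for j in range(len(adapters[0]))]


adapter_a = [1.0, -2.0, 0.5, -0.1]
adapter_b = [0.5, -1.0, -0.5, -0.3]
agreement = sign_agreement([adapter_a, adapter_b])
mixed_adapter = average_weights([adapter_a, adapter_b])
```

Coordinates with conflicting signs partly cancel when averaged (the third entry here averages to zero), which gives some intuition for why low agreement would hurt a naive mixture.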
7. Practical Considerations and Adoption
Best practices include:
- Adapter Diversity: Heterogeneous experts (different architectures or positions in the layer) outperform homogeneous sets, avoiding representational collapse and load imbalance (Cao et al., 6 Jun 2025).
- Routing Simplicity: Linear routers/MLPs with softmax or sparse selection suffice; domain or task prompts can be injected for greater adaptation (Liu et al., 6 Mar 2025).
- Sparse vs. Soft Fusion: Soft fusion maximizes accuracy at moderate compute overhead; sparse variants (top-2 or stochastically sampled) balance cost and performance (Zhang et al., 2023, Liu et al., 2023, Cao et al., 6 Jun 2025).
- Trainability: MoA scales well to large numbers of tasks or domains when adapters are reused and expanded judiciously (Wang et al., 2024, Zhou et al., 2024).
- Parameter Budgets: MoA typically adds <10% of the base parameters; efficiency and scaling depend on adapter size, number, and placement (Cappellazzo et al., 2024, Mehrish et al., 2023, Fujita et al., 2024).
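The parameter-budget point can be checked with simple arithmetic. The sketch below counts trainable parameters for a bank of bottleneck adapters plus a linear router in each layer and compares the total to a backbone size. All concrete numbers (hidden size 768, bottleneck 16, 4 experts, 12 layers, an 86M-parameter backbone) are hypothetical values chosen for illustration, not figures from the cited papers.

```python
def bottleneck_params(d_model, d_bottleneck, bias=True):
    """Parameters in one down-projection + up-projection adapter."""
    n = 2 * d_model * d_bottleneck
    if bias:
        n += d_bottleneck + d_model
    return n


def moa_params(d_model, d_bottleneck, n_experts, n_layers):
    """Adapters plus a linear router (weights and biases) in every layer."""
    router = d_model * n_experts + n_experts
    per_layer = n_experts * bottleneck_params(d_model, d_bottleneck) + router
    return n_layers * per_layer


def overhead_fraction(extra_params, backbone_params):
    """Added parameters as a fraction of the frozen backbone."""
    return extra_params / backbone_params


# Hypothetical configuration, roughly the scale of a base-size transformer.
extra = moa_params(d_model=768, d_bottleneck=16, n_experts=4, n_layers=12)
fraction = overhead_fraction(extra, backbone_params=86_000_000)
print(f"{extra} added parameters, {fraction:.2%} of the backbone")
```

Under these assumptions the overhead stays well below the 10% mark; it scales linearly with the number of experts, the bottleneck width, and the number of insertion points.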
MoA is widely adopted in vision (ViTs, Swin transformers), speech (TTS, ASR), NLP (LLMs, PLMs), multi-modal models (CLIP), continual learning, and multi-source fusion. Open-source implementations and toolkits now support MoA integration in standard PEFT stacks and transformer libraries.
References
- "MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning" (Zhang et al., 2023)
- "Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models Memories" (Diao et al., 2023)
- "Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning" (Wang et al., 2024)
- "MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models" (Cao et al., 6 Jun 2025)
- "Efficient Fine-tuning of Audio Spectrogram Transformers via Soft Mixture of Adapters" (Cappellazzo et al., 2024)
- "Generalizability of Mixture of Domain-Specific Adapters from the Lens of Signed Weight Directions and its Application to Effective Model Pruning" (Nguyen et al., 2024)
- "ADAPTERMIX: Exploring the Efficacy of Mixture of Adapters for Low-Resource TTS Adaptation" (Mehrish et al., 2023)
- "SMoA: Sparse Mixture of Adapters to Mitigate Multiple Dataset Biases" (Liu et al., 2023)
- "Domain Generalization Using Large Pretrained Models with Mixture-of-Adapters" (Lee et al., 2023)
- "Lightweight Zero-shot Text-to-Speech with Mixture of Adapters" (Fujita et al., 2024)
- "Exploring Training on Heterogeneous Data with Mixture of Low-rank Adapters" (Zhou et al., 2024)
- "Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters" (Yu et al., 2024)
- "Task-Customized Mixture of Adapters for General Image Fusion" (Zhu et al., 2024)
- "MOSA: Mixtures of Simple Adapters Outperform Monolithic Approaches in LLM-based Multilingual ASR" (Li et al., 26 Aug 2025)