Modality-Aware Mixture-of-Experts

Updated 26 March 2026

MAMoE is a modality-aware architecture that partitions expert subnetworks to route tokens based on their input modality for specialized processing.
It employs dynamic routing mechanisms (modality tokens, modality-specific biases, and hypernetworks) with sparse top-K selection to improve efficiency and adaptability.
Empirical studies report gains in accuracy and compute efficiency across tasks such as remote sensing, ASR, and vision-language modeling.

Modality-Aware Mixture-of-Experts (MAMoE) refers to a class of sparse neural architectures in which expert subnetworks, gating functions, and token-to-expert routing are explicitly conditioned on the modality of the input. In contrast to vanilla MoE frameworks, which treat all tokens identically, MAMoE systems partition expert capacity, routing pathways, or both so that experts specialize along modality boundaries, which supports adaptive fusion, improved efficiency, and robust downstream performance in multimodal settings. The term covers a range of implementations across computer vision, language modeling, audio/speech, 3D, and knowledge graph settings, unified by explicit conditioning on input modality during expert routing, parameter allocation, or both.

The defining feature of MAMoE architectures is their partitioned expert structure and modality-conditional token routing. Consider a Transformer backbone in which the standard feed-forward network (FFN) modules are replaced with MoE blocks. The MAMoE approach modifies both expert pool definition and routing logic as follows:

Expert Grouping: Given $m$ modalities, experts are partitioned into (possibly overlapping) modality-specific groups. For example, in MAPEX for remote sensing, experts $1..k$ serve modality 1, experts $k+1..2k$ serve modality 2, etc., with an optional "shared" expert bridging all modalities (Hanna et al., 10 Jul 2025). In MoST for speech-text LLMs, each MoE block contains disjoint groups for text and audio, plus one shared expert (Lou et al., 15 Jan 2026). MoMa for mixed-modal Transformers maintains separate expert groups for each modality (e.g., 4 text and 4 image experts) (Lin et al., 2024).
Routing/Gating Mechanisms: The router takes as input not only the token hidden state but also an explicit or implicit modality identifier (e.g., a [MOD_m] embedding, a routing mask, or a modality-specific bias vector). Routing can be modality-conditioned by:
- Concatenating modality tokens to the token representation prior to gating (Hanna et al., 10 Jul 2025).
- Applying disjoint routing masks so that only relevant expert groups are eligible for a given token (Lin et al., 2024, Lou et al., 15 Jan 2026, Lee et al., 13 Feb 2026).
- Adding trainable modality-specific biases to router logits, leading to soft modality-specialization (Xia et al., 6 Jun 2025).
- Employing hypernetworks that generate per-token, per-modality routing parameters (Jing et al., 28 May 2025).
- Using dynamic RUS-context (Redundancy/Uniqueness/Synergy) analysis for time-varying multimodal routing (Han et al., 30 Sep 2025).
Expert Selection: Routing is typically sparse top-$K$ (e.g., $K=1$ or $K=2$), sometimes using straight-through one-hot selection to maximize specialization (Li et al., 27 Nov 2025). For vision-language LLMs, vision tail tokens activate more experts to address long-tailed distribution effects (Cai et al., 2 Jul 2025).

These mechanisms ensure that tokens are primarily dispatched to experts best suited for their modality, with shared or cross-modal experts optionally supporting information fusion.
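The masked-routing variant described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the function name `route_token`, the expert layout in `groups`, and the choice to renormalise the kept gate weights are all assumptions made for illustration, and the code uses plain Python lists rather than a tensor library.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(logits, modality, groups, shared=(), k=1):
    """Return {expert_index: gate_weight} for a single token.

    logits   : router scores, one per expert
    modality : key into `groups`
    groups   : modality -> expert indices eligible for that modality
    shared   : expert indices every token may use
    k        : number of experts to activate (sparse top-K)
    """
    eligible = sorted(set(groups[modality]) | set(shared))
    # Routing mask: experts outside the eligible set never enter the softmax.
    probs = softmax([logits[j] for j in eligible])
    ranked = sorted(zip(eligible, probs), key=lambda ep: ep[1], reverse=True)[:k]
    # Assumption: renormalise the kept gates so that they sum to 1.
    norm = sum(p for _, p in ranked)
    return {j: p / norm for j, p in ranked}

# Hypothetical layout: experts 0-1 serve text, 2-3 serve images, 4 is shared.
groups = {"text": [0, 1], "image": [2, 3]}
logits = [0.1, 2.0, 5.0, -1.0, 0.5]

text_route = route_token(logits, "text", groups, shared=[4], k=1)
image_route = route_token(logits, "image", groups, shared=[4], k=2)
print(text_route)
print(image_route)
```

In this toy input, expert 2 has the highest raw score, yet the text token cannot reach it because the mask removes it before selection. A soft alternative would keep all experts eligible and add a modality-specific bias to `logits` instead of masking.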

2. Training Objectives, Pruning, and Specialization

Pre-Training and Fine-Tuning Losses

MAMoE models combine standard task-driven loss terms with regularizers to promote expert utilization balance and specialization:

Self-Supervised/Masked Pre-Training: For vision or remote sensing, masked autoencoding is performed with independent random masks per modality, and a reconstruction loss applied across all unmasked positions (Hanna et al., 10 Jul 2025).
Task Losses and Load Balancing: For classification, cross-entropy; for sequence modeling, next-token prediction loss. Load-balancing loss penalizes expert under- or over-utilization, typically $\mathcal{L}_{\text{load}} = \sum_{j=1}^{E} (U_j - 1/E)^2$, with $U_j$ the usage fraction of expert $j$.
Auxiliary Regularizers:
- KL-based Modality Divergence: For soft specialization, the SMAR framework imposes KL-divergence regularization between empirical expert-usage distributions of different modalities, maintaining a target separation band (Xia et al., 6 Jun 2025).
- Mutual Information Disentanglement: To encourage diverse features among experts per modality, the CLUB upper bound is minimized to disentangle outputs (Zhang et al., 2024).
- Interaction-Aware Losses: Weakly supervised losses (triplet, contrastive, or cosine similarity) enforce uniqueness, redundancy, or synergy specialization among interaction experts (Xin et al., 25 May 2025, Han et al., 30 Sep 2025).
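As a rough illustration of two of the regularizers listed above, the sketch below computes the squared-deviation load-balancing loss from hard expert assignments and a KL divergence between the expert-usage distributions of two modalities. The helper names, the `eps` smoothing, and in particular the hinge-style `band_penalty` are assumptions for illustration; the exact form of the separation band used by SMAR is not given here.

```python
import math

def usage_fractions(assignments, num_experts):
    """U_j: fraction of routed tokens sent to expert j."""
    counts = [0] * num_experts
    for j in assignments:
        counts[j] += 1
    total = len(assignments)
    return [c / total for c in counts]

def load_balance_loss(usage):
    """L_load = sum_j (U_j - 1/E)^2, which is zero for uniform usage."""
    e = len(usage)
    return sum((u - 1.0 / e) ** 2 for u in usage)

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q); eps smoothing avoids log(0) for unused experts."""
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    zp, zq = sum(p), sum(q)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(p, q))

def band_penalty(kl, low, high):
    """Hypothetical hinge penalty that is zero while low <= kl <= high."""
    return max(0.0, low - kl) + max(0.0, kl - high)

balanced = usage_fractions([0, 1, 2, 3], num_experts=4)
skewed = usage_fractions([0, 0, 0, 1], num_experts=4)
print(load_balance_loss(balanced), load_balance_loss(skewed))

# Fully separated usage for two modalities gives a large divergence.
text_usage = [0.5, 0.5, 0.0, 0.0]
image_usage = [0.0, 0.0, 0.5, 0.5]
separation = kl_divergence(text_usage, image_usage)
print(separation, band_penalty(separation, low=0.5, high=2.0))
```

In training code these quantities would be computed from differentiable router probabilities rather than hard counts, so that gradients can flow back to the router.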

Modality-Aware Pruning

A distinctive facet of certain MAMoE systems is post hoc pruning:

Expert Importance Measurement: For each modality $m$, compute the average routing probability $\bar{p}_{m,j}$ for expert $j$ over a calibration set.
Pruning Rule: Retain only the top $k$ experts per modality (plus shared experts if present), discarding the rest and dropping patch-embedding layers for unused modalities (Hanna et al., 10 Jul 2025).
Efficiency Gains: Pruning can reduce the FFN parameter count and inference FLOPs by over 66% in vision/remote sensing applications.

This enables deployment of compact, highly specialized models targeting specific modality configurations, especially valuable for tasks with limited data or compute budgets.
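The pruning procedure can be sketched in a few lines. This is an illustrative reading of the importance-then-rank rule, not the MAPEX implementation: the function names, the `keep_per_modality` parameter, and the calibration data (including the modality labels) are hypothetical.

```python
def average_routing_probability(samples, num_experts):
    """Mean routing probability per expert, grouped by modality.

    samples: iterable of (modality, routing_probs) pairs from a calibration set.
    Returns a dict modality -> list of per-expert averages.
    """
    sums, counts = {}, {}
    for modality, probs in samples:
        acc = sums.setdefault(modality, [0.0] * num_experts)
        for j, p in enumerate(probs):
            acc[j] += p
        counts[modality] = counts.get(modality, 0) + 1
    return {m: [s / counts[m] for s in acc] for m, acc in sums.items()}

def experts_to_keep(avg_probs, target_modalities, keep_per_modality, shared=()):
    """Indices of experts retained for the modalities the deployment needs."""
    keep = set(shared)
    for m in target_modalities:
        scores = avg_probs[m]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        keep.update(ranked[:keep_per_modality])
    return sorted(keep)

# Made-up calibration records for two modalities and four experts.
calibration = [
    ("optical", [0.7, 0.2, 0.05, 0.05]),
    ("optical", [0.5, 0.4, 0.05, 0.05]),
    ("sar", [0.1, 0.1, 0.2, 0.6]),
    ("sar", [0.0, 0.1, 0.3, 0.6]),
]
avg = average_routing_probability(calibration, num_experts=4)

optical_only = experts_to_keep(avg, ["optical"], keep_per_modality=1)
both = experts_to_keep(avg, ["optical", "sar"], keep_per_modality=1)
print(optical_only, both)
print("experts removed for optical-only model:", 1 - len(optical_only) / 4)
```

Experts outside the returned index list would be dropped from every MoE block, and the router's output dimension would shrink to match.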

3. Empirical Efficacy and Ablation Studies

Benchmark Performance Highlights

MAMoE architectures have achieved state-of-the-art or highly competitive results across a broad set of tasks:

Remote Sensing: MAPEX with MAMoE surpasses SatMAE and Scale-MAE on land cover classification and flood/wildfire detection, with up to +3% accuracy gain and a 4x reduction in parameter count (Hanna et al., 10 Jul 2025).
Speech-Text Models: MoST outperforms models like SpiritLM and Moshi on ASR/TTS benchmarks and Spoken QA, and achieves leading audio language modeling accuracy (Lou et al., 15 Jan 2026). Decoder-only Conformer with MAMoE reduces WER on LibriSpeech from 6.0% to 5.6% and cuts average Common Voice WER from 12.2% to 10.6% (Lee et al., 13 Feb 2026).
Large Vision-LLMs: Modality-specific routers (LTDR) and expert-oversampling for long-tailed vision tokens yield +1.2% accuracy (StableLM) and +0.9% for pure vision, outperforming MoE-LLaVA (Cai et al., 2 Jul 2025). EvoMoE's expert evolution and hypernetwork routers yield consistent 1–1.5% accuracy gains on VQA, GQA, SQA, and TextVQA benchmarks (Jing et al., 28 May 2025).
3D Multi-Modal Understanding: MoE3D (with top-1 routing and interleaved cross-modal fusion) achieves state-of-the-art mIoU and QA scores (Li et al., 27 Nov 2025).
Multimodal Knowledge Graphs: In MMKG completion, per-modality, per-relation expert mixtures and CLUB loss yield up to 21% MRR improvement vs. baselines (Zhang et al., 2024).
Time Series Modeling: Expert Modulation (MoME) in multi-modal prediction delivers 5–40% relative improvement on trend and forecasting tasks vs. token-level fusion or unimodal baselines, with minimal parameter overhead (Zhang et al., 29 Jan 2026).

Ablations and Interpretability

Routing Mechanisms: Modality-aware routing outperforms positional and deterministic alternatives by 2–5% on various modalities (Hanna et al., 10 Jul 2025). Dynamic routers (hypernetworks) yield higher specialization and better robustness than static linear routers (Jing et al., 28 May 2025).
Expert Count and Sharing: Increasing modality-specific experts improves performance up to an optimal $k$, after which overcapacity degrades accuracy for small datasets (Hanna et al., 10 Jul 2025). Shared experts can offer marginal gains at cost of extra parameters (Lou et al., 15 Jan 2026).
Pruning and Specialization: Pruned models preserve (and sometimes improve) task accuracy with dramatically fewer active parameters. Heatmaps of routing probabilities verify that experts indeed specialize per modality (Hanna et al., 10 Jul 2025, Lou et al., 15 Jan 2026).
Interpretability: Interaction-aware MAMoE (I2MoE, Time-MoE) provides both sample-level and global insights into which experts drive predictions and how interaction patterns (uniqueness, redundancy, synergy) map onto expert usage (Xin et al., 25 May 2025, Han et al., 30 Sep 2025).

4. Efficiency, Scalability, and Limitations

Computation and Memory Efficiency

MAMoE's architecture provides substantial efficiency benefits compared to dense or naïve multimodal MoEs:

Model	Approach	FFN FLOPs Savings	Zero-Shot Loss/Accuracies	Observed Limitation
MoMa-1.4B (Lin et al., 2024)	4 text + 4 image experts	3.7x overall (2.6x text/5.2x image)	Improves per-token/COCO/QA	Sensitive to input mix for load-balance
MAPEX (Hanna et al., 10 Jul 2025)	Modality-pruned experts	2–4x with per-modality pruning	+2–3% downstream gain	Requires expert usage calibration
MAMoE/ASR (Lee et al., 13 Feb 2026)	Disjoint pools (speech/text), top-1	Active parameters comparable to dense, with higher total capacity	0.4-point absolute WER reduction on LibriSpeech (6.0% to 5.6%)	Only two modalities shown

Across implementations, each modality group constrains eligible experts, reducing router complexity and avoiding cross-modal interference. The main design tradeoff is the need for close alignment between input token distributions and expert pools for optimal utilization.
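The efficiency figures in this article are reported in two forms: as savings factors (for example 3.7x) and as percentage reductions (for example "over 66%"). The short sketch below converts between the two, under the assumption that an N-fold saving means the cost drops to 1/N of the baseline. The function names are illustrative.

```python
def reduction_from_factor(factor):
    """Fraction of cost removed when cost drops to 1/factor of the baseline."""
    return 1.0 - 1.0 / factor

def factor_from_reduction(reduction):
    """Savings factor implied by removing `reduction` (0..1) of the cost."""
    return 1.0 / (1.0 - reduction)

# Savings factors reported for MoMa in the table above.
for label, factor in [("overall", 3.7), ("text", 2.6), ("image", 5.2)]:
    print(f"{label}: {factor}x -> {reduction_from_factor(factor):.1%} fewer FLOPs")

# A 66% reduction corresponds to a factor just under 3x.
print(f"66% reduction -> {factor_from_reduction(0.66):.2f}x")
```

Under that assumption, a 3.7x saving corresponds to roughly a 73% reduction, and a 66% reduction to a little under 3x, so the pruning and routing figures quoted here are of similar magnitude.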

Limitations

Scalability: Too many experts per modality (especially if the dataset is small) leads to underutilized capacity and overfitting (Hanna et al., 10 Jul 2025). Some routing schemes, such as mixture-of-depths (MoD), are sensitive to router error rates and require auxiliary routers at inference (Lin et al., 2024).
Applicability: Most MAMoE systems are restricted to a fixed set of modalities (commonly vision and text, or speech and text). Extensions to arbitrary or dynamically varying sets remain an open direction.
Training Stability: Over-regularizing with load-balance or auxiliary KL terms may degrade performance if not tuned properly (Li et al., 27 Nov 2025, Xia et al., 6 Jun 2025).

5. Theoretical and Methodological Advances

Beyond improved performance, MAMoE frameworks have introduced new methodological contributions:

Dynamic Token-Aware Routing: Hypernetwork-based routers enable per-token, per-modality generation of routing weights, adapting to both modality and token features (Jing et al., 28 May 2025).
Information-Theoretic Guidance: Partial Information Decomposition (PID) and RUS metrics (Redundancy/Uniqueness/Synergy) guide expert specialization and facilitate explicit mapping from multimodal interactions to routing (Xin et al., 25 May 2025, Han et al., 30 Sep 2025).
Expert Evolution: Rather than naive FFN cloning, expert evolution dynamically produces a set of diverse experts via gradient mixing and adaptive initialization, overcoming the issue of functionally indistinguishable experts in large LLMs (Jing et al., 28 May 2025).
Interaction-Aware Weak Supervision: Weakly supervised contrastive losses enforce targeted expert specialization in interaction-aware MAMoE, supporting unique/redundant/synergistic feature disentanglement and interpretability (Xin et al., 25 May 2025).

These advances enable the design of MAMoE modules that not only deliver efficiency and accuracy gains but also provide mechanisms for understanding and manipulating modality interactions.
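To make the idea of a hypernetwork-based router concrete, the following sketch generates the router's weight matrix from the concatenation of a modality embedding and the token itself, so the gating weights depend on both. It is a generic illustration and not the EvoMoE architecture: the class name `HyperRouter`, the single linear hypernetwork, the dimensions, and the random initialisation are all assumptions.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

class HyperRouter:
    """Router whose weights are produced by a small hypernetwork (hypothetical)."""

    def __init__(self, hidden_dim, num_experts, mod_dim, modalities, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.num_experts = num_experts
        # One embedding per modality (random here, learned in practice).
        self.mod_embed = {m: [rng.gauss(0, 1) for _ in range(mod_dim)]
                          for m in modalities}
        # Hypernetwork: one linear map from [modality embedding ; token]
        # to the flattened (num_experts x hidden_dim) router weights.
        in_dim = mod_dim + hidden_dim
        self.hyper = [[rng.gauss(0, 0.1) for _ in range(in_dim)]
                      for _ in range(num_experts * hidden_dim)]

    def router_weights(self, token, modality):
        flat = matvec(self.hyper, self.mod_embed[modality] + list(token))
        d = self.hidden_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.num_experts)]

    def gate(self, token, modality):
        """Softmax gate over experts using the generated router weights."""
        logits = matvec(self.router_weights(token, modality), token)
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        return [e / z for e in exps]

router = HyperRouter(hidden_dim=4, num_experts=3, mod_dim=2,
                     modalities=["text", "image"])
token = [0.5, -1.0, 0.25, 2.0]
gate_text = router.gate(token, "text")
gate_image = router.gate(token, "image")
print(gate_text)
print(gate_image)
```

The same token receives different gate distributions depending on its modality label, which is the behaviour a static linear router cannot express without separate per-modality parameters.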

6. Application Domains and Future Directions

MAMoE architectures have been applied to an expansive array of multimodal tasks and domains:

Remote sensing and earth observation (MAPEX) (Hanna et al., 10 Jul 2025)
Multimodal and multilingual ASR/TTS (MoST, decoder-only Conformer) (Lou et al., 15 Jan 2026, Lee et al., 13 Feb 2026)
Large vision-LLMs (LTDR, EvoMoE, MoMa) (Cai et al., 2 Jul 2025, Jing et al., 28 May 2025, Lin et al., 2024)
3D multi-modal scene understanding (MoE3D) (Li et al., 27 Nov 2025)
Multimodal knowledge graph completion (Zhang et al., 2024)
Multi-modal medical and activity recognition (Time-MoE) (Xin et al., 25 May 2025, Han et al., 30 Sep 2025)
Multi-modal time series forecasting (Zhang et al., 29 Jan 2026)

Open challenges and prospective research directions include:

Extension to $N$-modality scenarios with advanced divergence regularization (e.g., Wasserstein over KL) (Xia et al., 6 Jun 2025).
Joint or continual learning of dynamic expert pool configurations and routing regularization (Xia et al., 6 Jun 2025, Han et al., 30 Sep 2025).
Improved training and inference robustness under domain shift, variable modality availability, and real-world data imbalance (Cai et al., 2 Jul 2025, Lin et al., 2024).
Modularization for plug-and-play addition/removal of modalities and real-time expert pool adaptation.

7. Synthesis and Research Impact

The progression from homogeneous, static MoE designs to modality- and interaction-aware mixture-of-expert architectures marks a significant advance in scalable multimodal learning. MAMoE modules have demonstrated that explicit modality conditioning in both expert partitioning and router design can substantially improve model performance, parameter utilization, and interpretability across diverse tasks. The flexibility to prune, analyze, and dynamically adapt expert structure per modality enables practitioners to match model complexity to application needs and hardware constraints.

Continued development in this space is poised to further integrate principles from information theory, sparse modeling, and transferable representation learning, making MAMoE a foundational methodology for the next generation of efficient, adaptive, and interpretable multimodal neural networks.

References:

"MAPEX: Modality-Aware Pruning of Experts for Remote Sensing Foundation Models" (Hanna et al., 10 Jul 2025)
"SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal LLMs Preserving Language Capabilities" (Xia et al., 6 Jun 2025)
"MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts" (Lou et al., 15 Jan 2026)
"MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding" (Li et al., 27 Nov 2025)
"Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning" (Zhang et al., 2024)
"Decoder-only Conformer with Modality-aware Sparse Mixtures of Experts for ASR" (Lee et al., 13 Feb 2026)
"Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-LLM" (Cai et al., 2 Jul 2025)
"EvoMoE: Expert Evolution in Mixture of Experts for Multimodal LLMs" (Jing et al., 28 May 2025)
"I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts" (Xin et al., 25 May 2025)
"Guiding Mixture-of-Experts with Temporal Multimodal Interactions" (Han et al., 30 Sep 2025)
"MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts" (Lin et al., 2024)
"Multi-Modal Time Series Prediction via Mixture of Modulated Experts" (Zhang et al., 29 Jan 2026)