
MoE with Modality-Aware Routing

Updated 7 April 2026
  • The paper introduces a sparse, multimodal neural architecture that replaces standard feedforward layers with modality-partitioned MoE blocks to enhance expert specialization.
  • It details routing mechanisms using modality-specific masks, learnable biases, and dynamic gating to efficiently allocate tokens from text, image, audio, and other modalities.
  • The work outlines progressive training paradigms and auxiliary losses that improve robustness under missing-modality conditions and scale performance on large multimodal benchmarks.

A Mixture-of-Experts (MoE) with Modality-Aware Routing is a sparse, scalable neural architecture that dispatches representations from diverse data modalities (text, image, audio, etc.) to specialized parameter submodules, or "experts," using routing mechanisms that explicitly incorporate modality information. This paradigm is designed to improve model efficiency, parameter utilization, and specialization while retaining multi-modal reasoning and generalization. Modality-aware routing leverages the intrinsic structure of different data types to inform both the design of expert banks and the gating policies that select which experts process particular inputs. Such systems achieve significant empirical gains in large-scale vision-language and unified multimodal models, maintain robustness under heterogeneous or missing-modality conditions, and mitigate cross-modal interference.

1. Formal Architecture and Modality-Aware Routing Mechanisms

The MoE with modality-aware routing builds upon standard Transformer or encoder-decoder architectures by replacing feedforward sublayers with MoE blocks. Each block contains a set of experts, typically two-layer feedforward networks (FFN), and a router (gating network) that assigns input tokens to a sparse subset of experts. Modality-aware approaches partition the expert pool along modality axes and condition routing decisions on modality identifiers or token type, ensuring that, for example, vision tokens primarily activate vision-specialized experts, while textual tokens route to language-specialized experts.

A representative mathematical formulation as in MoST is:

$$g_t = \mathrm{softmax}(W_g h_t + b_g), \qquad g'_t = g_t \odot M_{m_t}, \qquad I_t = \operatorname{TopK}(g'_t, K),$$

with $M_{m_t}$ the modality-aware binary mask specifying eligible experts for token $t$ of modality $m_t$. The output for token $t$ is:

$$y_{\mathrm{mamoe},t} = \sum_{k \in I_t} w_{t,k}\, E_k(h_t) + E_{\mathrm{shared}}(h_t),$$

where the $E_k$ are expert FFNs and $E_{\mathrm{shared}}$ is a shared, cross-modal expert facilitating information flow between modalities (Lou et al., 15 Jan 2026).
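The masked routing above can be sketched in a few lines of NumPy. This is a minimal illustration of the MoST-style formulation, not the paper's implementation; all names (`modality_aware_moe`, `masks`, `shared_expert`) are chosen here for exposition.

```python
import numpy as np

def modality_aware_moe(h, modality_ids, W_g, b_g, masks, experts, shared_expert, k=2):
    """Dispatch each token to the top-k experts permitted by its modality mask.

    h            : (T, d) token hidden states
    modality_ids : (T,) integer modality id per token
    W_g, b_g     : router parameters, shapes (d, E) and (E,)
    masks        : (M, E) binary mask M_m of eligible experts per modality
    experts      : list of E expert functions, each (d,) -> (d,)
    shared_expert: cross-modal expert function, (d,) -> (d,)
    """
    logits = h @ W_g + b_g                               # router logits
    g = np.exp(logits - logits.max(-1, keepdims=True))
    g /= g.sum(-1, keepdims=True)                        # softmax gate g_t
    out = np.zeros_like(h)
    for t in range(h.shape[0]):
        g_masked = g[t] * masks[modality_ids[t]]         # g'_t = g_t ⊙ M_{m_t}
        top = np.argsort(g_masked)[-k:]                  # I_t = TopK(g'_t, K)
        w = g_masked[top] / g_masked[top].sum()          # renormalized gate weights
        out[t] = sum(w_i * experts[e](h[t]) for w_i, e in zip(w, top))
        out[t] += shared_expert(h[t])                    # shared expert always fires
    return out
```

Because the mask zeroes the gate for out-of-modality experts before the top-k step, a text token can never be dispatched to a vision-only expert, while the shared expert provides the cross-modal path unconditionally.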

Variants include per-token hard masking with top-k selection (Li et al., 2024), soft modality biases (learnable additive vectors in the router logits) (Xia et al., 6 Jun 2025), and explicit sub-networks or gating conditioned on domain codes (Shu et al., 21 Jan 2026).

2. Expert Specialization and Progressive Training Paradigms

Effective MoE with modality-aware routing requires both expert architectural specialization and a tailored training schedule. Common patterns involve:

  • Modality-specific expert groups: Experts are pre-aligned to process only representations from a single modality (e.g., text, vision, audio, or speech), preventing cross-modal contamination and improving specialization (Lou et al., 15 Jan 2026, Lin et al., 2024).
  • Shared (cross-modal) experts: Complement modality experts with shared experts that promote information transfer and enable handling of mixed or ambiguous tokens (Lou et al., 15 Jan 2026, Wang et al., 13 Aug 2025).
  • Progressive or multi-stage training: Typical strategies include:

    1. Connector alignment: Dedicated modules map each modality’s encoder output to a unified token space (e.g., mapping CLIP visual features via MLP or Q-Former modules), trained with cross-entropy loss on paired modality-text data (Li et al., 2024).
    2. Modality-specific expert training: Each set of experts is fine-tuned only on its targeted modality data (e.g., LoRA-tuning per-modality FFNs on mode-specific instructions), yielding modality-aware routing preferences (Li et al., 2024, Jing et al., 28 May 2025).
    3. Unified joint tuning: The complete multimodal MoE architecture is fine-tuned on mixed-modal datasets, with only sparse activation via top-k routing and load-balancing/auxiliary regularization where necessary (Li et al., 2024, Xia et al., 6 Jun 2025).
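The three stages above can be organized as a freeze/unfreeze schedule over parameter groups. The sketch below is purely illustrative: the group names (`connector`, `experts`, `router`) and the dict-based parameter representation are assumptions of this example, not an API from any of the cited papers.

```python
def set_trainable(groups, trainable_names):
    """Freeze every parameter group except those named in trainable_names."""
    for name, params in groups.items():
        for p in params:
            p["requires_grad"] = name in trainable_names

def run_schedule(groups, stages):
    """Apply each stage's freeze/unfreeze pattern in order; a real pipeline
    would run one fine-tuning pass per stage between these calls."""
    history = []
    for stage in stages:
        set_trainable(groups, stage["train"])
        history.append({n: all(p["requires_grad"] for p in ps)
                        for n, ps in groups.items()})
    return history

# Three-stage schedule mirroring: connector alignment ->
# per-modality expert tuning -> unified joint tuning.
stages = [
    {"name": "connector_alignment",    "train": {"connector"}},
    {"name": "modality_expert_tuning", "train": {"experts"}},
    {"name": "unified_joint_tuning",   "train": {"connector", "experts", "router"}},
]
```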

For handling missing or arbitrary modality combinations, specialized routers (e.g., S-Router in Flex-MoE) enforce mapping of particular input-modality sets to dedicated experts with fixed top-1 assignment (Yun et al., 2024).
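A minimal sketch of this idea is a lookup table from observed-modality subsets to dedicated expert indices with deterministic top-1 assignment. The helper names below are hypothetical, not Flex-MoE's actual API.

```python
from itertools import combinations

def build_subset_router(modalities):
    """Enumerate all non-empty subsets of the modality set and assign
    each subset its own dedicated expert index."""
    table = {}
    for r in range(1, len(modalities) + 1):
        for subset in combinations(sorted(modalities), r):
            table[frozenset(subset)] = len(table)   # one expert per combination
    return table

def s_route(table, observed):
    """Fixed top-1 routing: the set of observed modalities deterministically
    selects its dedicated expert, regardless of token content."""
    return table[frozenset(observed)]
```

Note that the table grows as $2^M - 1$ in the number of modalities $M$, which is the memory/parameter scaling limitation discussed in Section 6.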

3. Routing Networks, Modality Cues, and Gating Strategies

A critical component distinguishing modality-aware MoE systems is the explicit encoding of modality information into the routing process. Examples include:

  • Linear or MLP routers with modality-specific parameters: token features are projected through a per-modality weight matrix and bias tied to each modality (Lin et al., 2024).
  • Modality masks or indicator vectors in the gating: For a token of a given modality, the router applies a mask so that only experts in that modality's group (or the shared group) can be routed to, with gating probabilities zeroed elsewhere (Lou et al., 15 Jan 2026, Wang et al., 13 Aug 2025).
  • Trainable bias vectors per modality: SMAR introduces a learnable bias vector for each modality (vision, text) into the router logits of every token, encouraging divergence between the expert routing distributions of different modalities and reinforcing specialization (Xia et al., 6 Jun 2025).
  • Dynamic/hypernetwork routers: Hypernetwork-based routers as in EvoMoE produce per-token routing weights using token embeddings and modality-specific hypernets, with parameter generation conditioned on token modality (Jing et al., 28 May 2025).
  • Domain codes and FiLM layers: For image analysis, domain-coding in routers—where a code vector encodes the input’s modality and is applied via feature-wise affine transformations (FiLM)—provides a mechanism for pixel- or patch-wise modality‐adaptive gating (Shu et al., 21 Jan 2026).

Top-k hard gating (with or without load-balancing loss), softmax weighting, and stochastic expert selection (e.g., RL-based GRPO, Gumbel-Softmax, STE) are all deployed to sparsify the expert activation and/or promote diversity (Ko et al., 26 Mar 2026, Han et al., 30 Sep 2025).
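Two of these ingredients, soft per-modality logit biases and Gumbel-Softmax relaxation, can be combined in a short sketch. This is an illustrative composition under assumed names, not the exact SMAR or GRPO formulation.

```python
import numpy as np

def soft_biased_route(h, W_g, modality_bias, modality_ids, tau=1.0, rng=None):
    """Soft modality-aware routing: a learnable per-modality bias shifts the
    router logits (it biases, rather than masks, expert eligibility), and
    Gumbel-Softmax sampling yields a stochastic, differentiable relaxed
    one-hot distribution over experts.

    h             : (T, d) token features
    W_g           : (d, E) router weights
    modality_bias : (M, E) trainable bias vector per modality
    modality_ids  : (T,) modality id per token
    tau           : Gumbel-Softmax temperature (lower -> closer to one-hot)
    """
    rng = rng or np.random.default_rng()
    logits = h @ W_g + modality_bias[modality_ids]
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-9) + 1e-9)
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(-1, keepdims=True))
    return y / y.sum(-1, keepdims=True)     # (T, E) relaxed routing weights
```

Unlike the hard mask of Section 1, every expert keeps a nonzero probability here, which is what lets gradients flow to the bias vectors and preserves routing flexibility for ambiguous tokens.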

4. Extension to Heterogeneous, Temporal, or Missing-Modality Contexts

Beyond text and vision, modality-aware MoE frameworks are employed for complex settings such as:

  • Speech–text integration (MAMoE): Routing is partitioned explicitly by audio (speech) vs. text token, with both intra-modality and cross-modality (shared) experts, enabling superior ASR, TTS, and spoken QA results (Lou et al., 15 Jan 2026).
  • Arbitrary missing-modality combinations: Flex-MoE employs a missing-modality bank and a router that is specialized (S-Router) to handle each possible subset of observed modalities, fixing the top-1 expert accordingly (Yun et al., 2024).
  • Modality-adaptive feature extraction and fusion: UniRoute recasts both feature extraction and fusion as routing problems, using pixel-level or spatially localized routers guided by domain codes to select appropriate experts and fusion primitives, with discrete gating enforced via a straight-through estimator (Shu et al., 21 Jan 2026).
  • Temporal interactions: Time-MoE routers dispatch tokens or frames based not just on their static modality but also on quantified temporal interaction metrics (redundancy, uniqueness, synergy), allowing explicit modeling of lagged cross-modal dependencies (Han et al., 30 Sep 2025).
  • Heterogeneous traffic modeling: TrafficMoE leverages separate MoE modules per modality (packet headers, payloads), with dynamic routing and context-driven multimodal aggregation, highlighting the paradigm’s flexibility in non-canonical multi-modal data (He et al., 31 Mar 2026).

5. Training Objectives, Auxiliary Losses, and Performance Metrics

The overall learning objective couples standard supervised (or self-supervised) losses with regularizers specific to the MoE structure and modality-aware specialization, including:

  • Task loss: Cross-entropy on next-token or classification targets, ASR/TTS loss for speech–text (Lou et al., 15 Jan 2026), MLM or masked modeling objectives for language/vision (Xia et al., 6 Jun 2025, He et al., 31 Mar 2026).
  • Load-balancing penalties: To prevent expert collapse and guarantee parameter utilization, losses such as that from GShard or Switch Transformers penalize deviation from uniform expert usage (Li et al., 2024, Xia et al., 6 Jun 2025).
  • Modality specialization constraints: KL-divergence or JSD between routing distributions for different modalities (SMAR loss), as well as redundancy/uniqueness/synergy auxiliary losses to align experts with information-theoretic roles (Xia et al., 6 Jun 2025, Han et al., 30 Sep 2025).
  • Entropy regularization, self-distillation, and gating confidence penalties: To improve both specialization and confidence in routing (Shu et al., 21 Jan 2026).
  • RL-based policy optimization: MoE-GRPO optimizes the routing policy directly using reward feedback for expert selection, with modality-aware masking to efficiently direct exploration (Ko et al., 26 Mar 2026).
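For concreteness, the widely used Switch-Transformer-style balancing penalty can be written in a few lines. This is a sketch of that generic loss, not necessarily the exact regularizer used by any one of the cited systems.

```python
import numpy as np

def load_balancing_loss(gate_probs, expert_assignment, num_experts):
    """Switch-Transformer-style auxiliary loss: E * sum_e f_e * P_e, where
    f_e is the fraction of tokens routed (top-1) to expert e and P_e is the
    mean router probability assigned to e. Uniform usage minimizes the loss
    at 1.0; routing collapse onto one expert drives it toward E.

    gate_probs        : (T, E) router softmax probabilities
    expert_assignment : (T,) top-1 expert index per token
    """
    T = gate_probs.shape[0]
    f = np.bincount(expert_assignment, minlength=num_experts) / T  # token fraction
    P = gate_probs.mean(axis=0)                                    # mean gate prob
    return num_experts * float(np.dot(f, P))
```

Because $f$ is non-differentiable, gradients reach the router only through $P$; the dot product still penalizes any expert that receives both a large token share and large average probability.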

Empirical results are consistently reported across multimodal benchmarks—visual QA, speech recognition/generation, temporal reasoning, clinical and sensor fusion—demonstrating improved parameter efficiency, accuracy on multi-modal tasks, reduction of performance bias under mixed-modality, and enhanced expert specialization (Li et al., 2024, Lou et al., 15 Jan 2026, Xia et al., 6 Jun 2025, Shu et al., 21 Jan 2026, Yun et al., 2024, Wang et al., 13 Aug 2025, Jing et al., 28 May 2025).

6. Empirical Patterns, Scalability, and Limitations

The integration of modality-aware routing in MoE architectures reveals several recurring empirical properties:

  • Specialization and separation: Ablation/routing visualizations (e.g., expert activation heatmaps, entropy and Gini coefficients) confirm that modality tokens cluster onto their respective expert groups, or onto shared experts for mixed content (Lou et al., 15 Jan 2026, Li et al., 2024, Wang et al., 13 Aug 2025).
  • Efficient scaling: Sparse MoE with modality-partitioned expert banks achieves significant FLOPs reduction (e.g., 3.7× in MoMa (Lin et al., 2024)), especially as the number of experts increases or as the expert activation per token (top-k) scales up (Li et al., 2024, Lin et al., 2024).
  • Robustness to missing modalities: Assignment of unique experts per available-modality combination, and the use of a modality bank for completion, yields strong gains (5–15 pt accuracy) in health data with systematically missing channels (Yun et al., 2024).
  • Performance resilience: Progressive and joint tuning strategies, as well as regularizers (e.g., SMAR), maintain or enhance language capabilities even under very low pure-text ratios (86.6% retention with 2.5% text in (Xia et al., 6 Jun 2025)).
  • Design limitations: Memory and parameter growth for missing-modality banks (exponential in modality count), static assignment rigidity, and training cost with deep or highly partitioned expert banks are key challenges (Yun et al., 2024).

Overall, MoE with modality-aware routing has demonstrated general applicability, scalability, and interpretability in multi-modal architectures. Extensions toward dynamic, context-adaptive routers, reinforcement learning-driven expert selection, and multi-resolution temporal composition continue to be actively explored for further improvements in efficiency and generalization.


Key References:

  • "Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts" (Li et al., 2024)
  • "UniRoute: Unified Routing Mixture-of-Experts for Modality-Adaptive Remote Sensing Change Detection" (Shu et al., 21 Jan 2026)
  • "MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts" (Lou et al., 15 Jan 2026)
  • "SMAR: Soft Modality-Aware Routing Strategy for MoE-based Multimodal LLMs Preserving Language Capabilities" (Xia et al., 6 Jun 2025)
  • "Flex-MoE: Modeling Arbitrary Modality Combination via the Flexible Mixture-of-Experts" (Yun et al., 2024)
  • "MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts" (Lin et al., 2024)
  • "EvoMoE: Expert Evolution in Mixture of Experts for Multimodal LLMs" (Jing et al., 28 May 2025)
  • "MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision LLMs" (Wang et al., 13 Aug 2025)
  • "MoE-GRPO: Optimizing Mixture-of-Experts via Reinforcement Learning in Vision-LLMs" (Ko et al., 26 Mar 2026)
  • "Guiding Mixture-of-Experts with Temporal Multimodal Interactions" (Han et al., 30 Sep 2025)
  • "TrafficMoE: Heterogeneity-aware Mixture of Experts for Encrypted Traffic Classification" (He et al., 31 Mar 2026)
  • "Routing Experts: Learning to Route Dynamic Experts in Multi-modal LLMs" (Wu et al., 2024)
