Modality-Specific Mixture-of-Experts

Updated 4 July 2026

MS-MoE is a family of sparse multimodal architectures that condition expert selection and computation on modality signals rather than treating all tokens uniformly.
Architectural patterns range from shared expert pools with modality-controlled routing to partitioned and hybrid designs that balance specialization and cross-modal transfer.
Training strategies incorporate modality-aware routing modifications with auxiliary losses to prevent expert collapse and enhance performance across diverse domains.

Modality-Specific Mixture-of-Experts (MS-MoE) denotes a family of sparse multimodal architectures in which expert selection, expert computation, or both are conditioned by modality information rather than treating all tokens as exchangeable. Across recent work, the central aim is to preserve modality-specific inductive structure while retaining selective cross-modal transfer. In some instantiations, experts are partitioned into modality-specific groups; in others, a single expert pool is steered by an auxiliary modality through routing shifts, affine modulation, or routing regularization. This makes MS-MoE distinct from dense shared-parameter fusion and from token-level fusion that mixes modalities in one latent token space and depends on fine-grained alignment (Zhang et al., 29 Jan 2026).

1. Definition and conceptual boundaries

In the broadest formulation, an MS-MoE combines a sparse MoE backbone with modality-aware routing constraints. The defining property is not a single fixed architecture, but the use of modality signals to determine which experts are eligible, preferred, or modulated during processing. That property appears in several forms: text-conditioned routing for time-series experts, modality-partitioned expert banks in vision-language and speech-text models, and hybrid designs with modality-specific plus shared experts for cross-modal transfer (Xia et al., 6 Jun 2025).

A recurrent distinction in the literature is between modality specificity and token-level fusion. In token-level fusion, temporal patches, image patches, or other modality tokens are concatenated or cross-attended in a shared space. In MS-MoE, the cross-modal interaction may instead occur at the level of expert selection or expert function. The time-series formulation "Expert Modulation" is explicit on this point: the auxiliary modality does not need to be fused into the temporal tokens; rather, it shifts routing scores and modulates expert outputs, so that text steers temporal experts without turning them into joint text-time experts (Zhang et al., 29 Jan 2026).

Several papers also delimit MS-MoE against hard architectural partitioning. "SMAR" argues that hard splitting experts by rule can degrade language ability because pretrained expert knowledge is severed and substantial visual pretraining may be needed to refill the partitioned experts; its alternative is soft routing regularization via modality routing distributions rather than hard expert assignment (Xia et al., 6 Jun 2025). Conversely, "MoMa" and "MoE-TTS" intentionally use hard modality constraints: image tokens are restricted to image experts and text tokens to text experts in the former, while text tokens always use frozen text experts and speech tokens always use speech experts in the latter (Lin et al., 2024); (Xue et al., 15 Aug 2025).

A common misconception is therefore that MS-MoE always means one expert pool per modality. The literature does not support that simplification. MS-MoE can be realized by separate expert groups, by a shared pool with modality-aware regularization, or by a single modality-specialist pool controlled by another modality. The unifying criterion is modality-conditioned sparse computation, not a unique topological template.

2. Architectural patterns

Recent systems instantiate MS-MoE through a small number of recurring architectural patterns. These patterns differ mainly in where modality information enters: before routing, inside the experts, or through auxiliary losses.

Pattern	Representative systems	Defining characteristic
Shared pool, modality-controlled	MoME, EvoMoE	One expert pool; modality changes routing or expert behavior
Partitioned pools	MoMa, MoE-TTS	Tokens are restricted to modality-designated experts
Partitioned pools plus shared experts	MoIIE, MoST	Intra-modality experts coexist with shared cross-modal experts
Modality-specific plus cross-modal banks	DynFS-MoE	Separate modality banks and explicit cross-modal experts

The first pattern is exemplified by the time-series "Expert Modulation" formulation, where the expert pool remains temporal, but text modulates both routing and expert outputs. Router Modulation adds a text-dependent shift to base routing scores, and Expert-independent Linear Modulation applies a per-expert affine transform to the selected expert outputs. The experts remain functionally time-series specialists; text operates as a controller (Zhang et al., 29 Jan 2026). "EvoMoE" adopts a related shared-pool view in multimodal LLMs, but makes routing modality-aware through modality-specific hypernetworks that generate token-conditioned router parameters, so image and text tokens induce different expert distributions without creating disjoint expert pools (Jing et al., 28 May 2025).

The second pattern uses explicit modality partitioning. "MoMa" divides the FFN experts into text and image groups; a token’s modality determines its group, and routing occurs only within that group. Cross-modal exchange is delegated to shared attention rather than shared FFNs (Lin et al., 2024). "MoE-TTS" goes further and removes learned gating entirely: routing is deterministic and token-level, with text tokens always using the frozen text pathway and speech tokens always using a duplicated speech pathway over Q, K, V, O, FFN, and LN components (Xue et al., 15 Aug 2025).
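The deterministic end of this pattern can be illustrated with a minimal sketch. It assumes each token carries a modality tag and that every modality owns one complete pathway; the names `route_by_modality` and `make_scaling_pathway` are hypothetical, and the scaling functions are toy stand-ins for real attention, FFN, and normalization stacks rather than anything taken from "MoE-TTS".

```python
from typing import Callable, Dict, List

Vector = List[float]
Pathway = Callable[[Vector], Vector]


def make_scaling_pathway(scale: float) -> Pathway:
    """Toy stand-in for a full modality pathway (attention, FFN, norm)."""
    def pathway(x: Vector) -> Vector:
        return [scale * value for value in x]
    return pathway


def route_by_modality(tokens: List[Vector], modalities: List[str],
                      pathways: Dict[str, Pathway]) -> List[Vector]:
    """Send each token to the pathway named by its modality tag.

    There is no router network and no gate weight: the tag alone decides
    which parameters process the token.
    """
    if len(tokens) != len(modalities):
        raise ValueError("need exactly one modality tag per token")
    return [pathways[tag](x) for x, tag in zip(tokens, modalities)]


# Hypothetical setup: the text pathway is the identity, speech doubles its input.
pathways = {"text": make_scaling_pathway(1.0), "speech": make_scaling_pathway(2.0)}
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modalities = ["text", "speech", "text"]
outputs = route_by_modality(tokens, modalities, pathways)
```

Because the dispatch is a lookup rather than a learned gate, there is nothing to balance or regularize at the routing level; any cross-modal exchange has to happen elsewhere in the model.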

The third pattern introduces both modality-specific and shared experts. In "MoIIE", image tokens can route to vision experts or shared inter-modality experts, while text tokens can route to language experts or the same shared experts. The reported layerwise behavior is that shallow layers favor intra-modality experts and deeper layers increasingly select shared experts, which is presented as evidence that cross-modal alignment is concentrated later in the stack (Wang et al., 13 Aug 2025). "MoST" adopts an analogous speech-text design with disjoint text and audio expert groups plus a parallel shared expert MLP whose output is added to the routed mixture output (Lou et al., 15 Jan 2026).

A fourth pattern adds explicit cross-modal expert banks. "DynFS-MoE" for post-traumatic epilepsy diagnosis has separate fMRI experts, sMRI experts, and cross-modal experts over fused functional-structural embeddings. Its Modality-Class Mixture-of-Experts gate then produces class-conditioned weights over these banks, so modality usage varies by binary task and class (Ding et al., 15 Jun 2026). This suggests that MS-MoE can be indexed not only by modality but also by task or diagnostic objective.

3. Routing and mathematical formulations

Most MS-MoE systems preserve the standard MoE computation pattern—router scores, sparse expert selection, weighted aggregation—but modify the gating law using modality information. In "Expert Modulation", if $x_p$ is a time token and $z$ is a pooled text-conditioning vector, the router is shifted as

$g(x_p \mid Z) = g_0(x_p) + W_G z,$

followed by either a softmax or Top-$K$ masking. Expert outputs are then modulated by

$E_i(x_p \mid Z) = \gamma_i(Z) \cdot E_i(x_p) + \beta_i(Z),$

and the mixture becomes

$\mathrm{MoME}(x_p \mid Z) = \sum_i \lambda_i\, g_i(x_p \mid Z)\, E_i(x_p \mid Z).$

This is a canonical example of function-level multimodal control rather than representation-level fusion (Zhang et al., 29 Jan 2026).
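The three equations above can be traced in a small pure-Python sketch. Several details are assumptions made only for illustration: experts are plain linear maps, $\gamma_i(Z)$ and $\beta_i(Z)$ are scalar functions of the pooled text vector `z` (parameterized here by hypothetical vectors `A[i]` and `B[i]`), and the unspecified factor $\lambda_i$ is fixed to 1.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modulated_moe(x: Vector, z: Vector, W_base: Matrix, W_G: Matrix,
                  experts: List[Matrix], A: Matrix, B: Matrix,
                  top_k: int = 1) -> Tuple[Vector, List[int]]:
    # Router modulation: g(x | Z) = g_0(x) + W_G z.
    scores = [b + s for b, s in zip(matvec(W_base, x), matvec(W_G, z))]
    # Top-K masking, then a softmax over the surviving scores.
    selected = sorted(range(len(scores)), key=lambda i: scores[i],
                      reverse=True)[:top_k]
    gates = softmax([scores[i] for i in selected])
    output = [0.0] * len(experts[0])
    for gate, i in zip(gates, selected):
        gamma = 1.0 + dot(A[i], z)  # assumed form of gamma_i(Z)
        beta = dot(B[i], z)         # assumed form of beta_i(Z)
        h = matvec(experts[i], x)   # unmodulated expert output E_i(x)
        for j, h_j in enumerate(h):
            # lambda_i is taken as 1 in this sketch.
            output[j] += gate * (gamma * h_j + beta)
    return output, selected


# Toy configuration: three experts over a 2-d time token, 1-d text vector.
x = [2.0, 1.0]
W_base = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_G = [[0.0], [0.0], [5.0]]
experts = [[[1.0, 0.0], [0.0, 1.0]],
           [[2.0, 0.0], [0.0, 2.0]],
           [[0.0, 1.0], [1.0, 0.0]]]
A = [[0.5], [0.5], [0.5]]
B = [[0.1], [0.1], [0.1]]

out_no_text, picked_no_text = modulated_moe(x, [0.0], W_base, W_G, experts, A, B)
out_text, picked_text = modulated_moe(x, [1.0], W_base, W_G, experts, A, B)
```

With a zero text vector the layer reduces to an ordinary top-1 MoE over the temporal experts; a non-zero text vector both redirects the token to a different expert and rescales that expert's output, while the time token itself is never concatenated with text.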

Partitioned-pool systems instead restrict the admissible expert set. In "MoIIE", a modality-specific router computes

$G^{\mathcal{M}}(x_n) = \mathrm{Softmax}(\mathrm{top\text{-}K}(x_n \cdot W_g^{\mathcal{M}})),$

after which a text token selects from language plus shared experts and an image token selects from vision plus shared experts. The sparse forward map is therefore modality-gated before aggregation, not merely regularized after the fact (Wang et al., 13 Aug 2025). "MoST" formalizes this as masked routing over 64 routed experts split 50/50 between text and audio, with a shared expert added in parallel:

$y_{\mathrm{mamoe}} = y_{\mathrm{routed}} + E_{\mathrm{shared}}(h).$

The modality mask zeroes out experts outside the token’s modality-specific group before Top-$K$ selection (Lou et al., 15 Jan 2026).
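A minimal sketch of masked routing with a parallel shared expert follows. It reads "zeroing out" as removing out-of-group experts from the candidate set before Top-$K$ selection (equivalent to setting their logits to negative infinity), applies the softmax after the Top-$K$ step as in the "MoIIE" formula, and uses toy scaling functions as experts. The function names and the four-expert layout are hypothetical.

```python
import math
from typing import Callable, Dict, List, Set, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def scale_expert(scale: float) -> Expert:
    """Toy expert that multiplies its input by a constant."""
    return lambda x: [scale * v for v in x]


def masked_topk_gates(logits: Vector, allowed: Set[int],
                      top_k: int) -> Dict[int, float]:
    """Top-K over the allowed experts only, then softmax over the survivors."""
    eligible = [i for i in range(len(logits)) if i in allowed]
    if not eligible:
        raise ValueError("modality group has no experts")
    selected = sorted(eligible, key=lambda i: logits[i], reverse=True)[:top_k]
    m = max(logits[i] for i in selected)
    exps = {i: math.exp(logits[i] - m) for i in selected}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def modality_aware_moe(h: Vector, logits: Vector, modality: str,
                       groups: Dict[str, Set[int]], routed: List[Expert],
                       shared: Expert,
                       top_k: int = 1) -> Tuple[Vector, Dict[int, float]]:
    gates = masked_topk_gates(logits, groups[modality], top_k)
    y_routed = [0.0] * len(h)
    for i, gate in gates.items():
        y_routed = [acc + gate * v for acc, v in zip(y_routed, routed[i](h))]
    # Shared expert runs for every token and is added to the routed output.
    y = [r + s for r, s in zip(y_routed, shared(h))]
    return y, gates


# Hypothetical layout: experts 0-1 belong to text, experts 2-3 to audio.
groups = {"text": {0, 1}, "audio": {2, 3}}
routed = [scale_expert(1.0), scale_expert(2.0), scale_expert(3.0), scale_expert(4.0)]
shared = scale_expert(0.5)
h = [1.0, 2.0]
logits = [0.1, 0.3, 5.0, 4.0]

y_text, gates_text = modality_aware_moe(h, logits, "text", groups, routed, shared)
y_audio, gates_audio = modality_aware_moe(h, logits, "audio", groups, routed, shared)
```

Note that the text token ends up on expert 1 even though the audio experts have far larger logits: the mask is applied before selection, so out-of-group scores never compete.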

Several papers refine this baseline in different directions. "SMAR" augments router logits with trainable modality-aware bias vectors $b_v$ and $b_t$, computes modality routing distributions from top-$k$ dispatch statistics, and constrains the symmetric KL divergence between text and vision routing distributions to stay within a tolerance band (Xia et al., 6 Jun 2025). "SMoES" uses dynamic soft modality scores derived either from attention propagation or Gaussian feature statistics, then maximizes inter-bin mutual information between modality scores and expert bins rather than directly forcing expert identities (Bo et al., 27 Apr 2026). "Guiding Mixture-of-Experts with Temporal Multimodal Interactions" generalizes the routing signal further by computing temporal redundancy, uniqueness, and synergy (RUS) and then conditioning the router logits on an interaction context, RUSContext, which combines attention over pairwise redundancy/synergy and a GRU over uniqueness sequences (Han et al., 30 Sep 2025).
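The tolerance-band idea attributed to "SMAR" can be sketched as follows. The hinge-style penalty, the unhalved symmetric KL, the smoothing constant, and the band limits are all assumptions chosen for illustration; the source only states that the divergence between modality routing distributions is kept within a band. A trainable version would also need differentiable (soft) dispatch statistics rather than the hard counts used here.

```python
import math
from typing import List, Sequence


def routing_distribution(dispatch: Sequence[Sequence[int]], num_experts: int,
                         eps: float = 1e-6) -> List[float]:
    """Normalized expert-usage histogram from per-token top-k selections."""
    counts = [eps] * num_experts  # smoothing keeps the KL terms finite
    for selected in dispatch:
        for expert in selected:
            counts[expert] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def symmetric_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) + KL(q || p)."""
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))


def band_penalty(distance: float, low: float, high: float) -> float:
    """Zero inside [low, high]; grows linearly outside the band."""
    return max(0.0, low - distance) + max(0.0, distance - high)


# Hypothetical top-2 dispatch records for three text and three vision tokens.
text_dispatch = [[0, 1], [0, 1], [0, 2]]
vision_dispatch = [[2, 3], [2, 3], [1, 3]]
p_text = routing_distribution(text_dispatch, num_experts=4)
p_vision = routing_distribution(vision_dispatch, num_experts=4)
distance = symmetric_kl(p_text, p_vision)
penalty = band_penalty(distance, low=0.5, high=2.0)
```

The lower limit pushes the two modalities toward different experts, while the upper limit stops them from separating completely, which is the soft counterpart of a hard partition.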

These variants collectively show that modality-aware routing can be expressed as masking, logit shifting, bias injection, interaction-aware context augmentation, or direct expert-output modulation. A plausible implication is that the term MS-MoE now covers a spectrum from hard eligibility constraints to soft distributional shaping.

4. Training objectives and regularization

The task loss in MS-MoE remains domain-specific—cross-entropy for language generation or classification, regression losses for forecasting—but recent work emphasizes auxiliary terms that stabilize specialization. In multimodal time-series forecasting, the full objective can include forecasting loss, load-balancing, entropy or sparsity penalties, and modulation regularization:

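$\mathcal{L} = \mathcal{L}_{\mathrm{forecast}} + \lambda_{\mathrm{lb}}\,\mathcal{L}_{\mathrm{lb}} + \lambda_{\mathrm{ent}}\,\mathcal{L}_{\mathrm{ent}} + \lambda_{\mathrm{mod}}\,\mathcal{L}_{\mathrm{mod}}.$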

The stated rationale is that forecast loss drives accuracy, routing regularizers balance utilization and specialization, and modulation regularization prevents over-amplification by text (Zhang et al., 29 Jan 2026).
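As a rough illustration of how such a composite objective might be assembled, the sketch below sums a forecasting loss with three weighted auxiliary terms. Every concrete choice is an assumption: mean squared error for forecasting, a Switch-style balance term (number of experts times the sum of dispatch fraction times mean gate probability), mean routing entropy as the sparsity penalty, and a penalty on the distance of the modulation parameters from the identity transform. The weights are placeholders, not values from the cited paper.

```python
import math
from typing import List


def mse(pred: List[float], target: List[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def load_balance_loss(gates: List[List[float]], top1: List[int]) -> float:
    """Switch-style term: E * sum_e (dispatch fraction_e * mean gate prob_e)."""
    n, num_experts = len(gates), len(gates[0])
    frac = [top1.count(e) / n for e in range(num_experts)]
    mean_prob = [sum(g[e] for g in gates) / n for e in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


def routing_entropy(gates: List[List[float]]) -> float:
    """Mean per-token entropy of the gate distribution."""
    total = 0.0
    for g in gates:
        total += -sum(p * math.log(p) for p in g if p > 0.0)
    return total / len(gates)


def modulation_penalty(gammas: List[float], betas: List[float]) -> float:
    """Penalize scale far from 1 and shift far from 0 (identity modulation)."""
    terms = [(g - 1.0) ** 2 + b ** 2 for g, b in zip(gammas, betas)]
    return sum(terms) / len(terms)


def total_loss(pred: List[float], target: List[float],
               gates: List[List[float]], top1: List[int],
               gammas: List[float], betas: List[float],
               lam_lb: float = 0.01, lam_ent: float = 0.01,
               lam_mod: float = 0.01) -> float:
    return (mse(pred, target)
            + lam_lb * load_balance_loss(gates, top1)
            + lam_ent * routing_entropy(gates)
            + lam_mod * modulation_penalty(gammas, betas))


# Toy batch: two tokens, two experts, perfectly balanced routing.
gates = [[0.5, 0.5], [0.5, 0.5]]
top1 = [0, 1]
loss = total_loss([1.0, 2.0], [1.5, 2.5], gates, top1,
                  gammas=[1.2, 0.9], betas=[0.1, 0.0])
```

The balance term reaches its minimum of 1.0 under uniform routing and rises as traffic concentrates, which is why a large weight on it can work against deliberate specialization.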

By contrast, some papers argue that conventional load balancing can be counterproductive when modality specialization is the primary goal. "SMAR" reports that adding load balancing reduced the minimum modality-routing-distance and slightly degraded language scores, so its final model omits the load-balancing term and relies on the symmetric-KL tolerance-band penalty instead (Xia et al., 6 Jun 2025). "MoE3D" makes a closely related observation in 3D understanding: router $z$-loss is more impactful than a separate load-balancing loss, and combining both slightly harms performance, which the paper attributes to conflicting constraints that reduce specialization (Li et al., 27 Nov 2025).
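For readers unfamiliar with the term, the router $z$-loss in its commonly used form is the batch mean of the squared log-sum-exp of the router logits. The sketch below implements that standard definition; whether "MoE3D" uses exactly this variant or weighting is an assumption.

```python
import math
from typing import List


def logsumexp(logits: List[float]) -> float:
    """Numerically stable log(sum(exp(logits)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def router_z_loss(batch_logits: List[List[float]]) -> float:
    """Mean over tokens of logsumexp(router logits) squared."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)


# Same routing preferences, different logit magnitudes.
small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
large = router_z_loss([[10.0, 10.0, 10.0, 10.0]])
```

Both toy inputs produce the same uniform gate distribution, yet the second is penalized more heavily. The term constrains logit scale rather than expert utilization, so unlike a balancing loss it does not directly push traffic toward uniformity.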

Other objectives explicitly target modality structure rather than uniformity. "SMoES" optimizes a language-modeling loss plus per-bin load balancing and an inter-bin mutual-information term,

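$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda_{\mathrm{LB}}\,\mathcal{L}_{\mathrm{LB}} + \lambda_{\mathrm{MI}}\,\mathcal{L}_{\mathrm{MI}},$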

where the MI term aligns soft modality scores with expert-bin specialization (Bo et al., 27 Apr 2026). "DynFS-MoE" introduces a conditional mutual information regularizer $\mathcal{L}_{\mathrm{CMI}}$ so that patch-expert interactions become class-distinctive, yielding an objective of the form

$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda_{\mathrm{CMI}}\,\mathcal{L}_{\mathrm{CMI}}.$

Here the goal is interpretability and class-conditioned routing rather than balanced expert traffic (Ding et al., 15 Jun 2026).
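The mutual-information idea behind the "SMoES" term can be illustrated with a plug-in estimate over a small joint table. This sketch assumes a scalar soft modality score per token (the probability that the token is visual) and a hard expert-bin assignment; it is not the estimator used in either paper, and the conditional variant in "DynFS-MoE" would additionally condition on the class label.

```python
import math
from typing import List


def soft_joint(scores: List[float], bins: List[int],
               num_bins: int) -> List[List[float]]:
    """Accumulate soft modality mass per expert bin.

    Row 0 collects visual mass (score), row 1 text mass (1 - score).
    """
    joint = [[0.0] * num_bins for _ in range(2)]
    for score, b in zip(scores, bins):
        joint[0][b] += score
        joint[1][b] += 1.0 - score
    return joint


def mutual_information(joint: List[List[float]]) -> float:
    """Plug-in estimate of I(modality; bin) in nats from an unnormalized table."""
    total = sum(sum(row) for row in joint)
    p = [[v / total for v in row] for row in joint]
    p_row = [sum(row) for row in p]
    p_col = [sum(p[r][c] for r in range(len(p))) for c in range(len(p[0]))]
    mi = 0.0
    for r, row in enumerate(p):
        for c, v in enumerate(row):
            if v > 0.0:
                mi += v * math.log(v / (p_row[r] * p_col[c]))
    return mi


# Specialized routing: visual tokens use bin 0, text tokens use bin 1.
mi_aligned = mutual_information(soft_joint([1.0, 1.0, 0.0, 0.0], [0, 0, 1, 1], 2))
# Unspecialized routing: each bin receives one visual and one text token.
mi_mixed = mutual_information(soft_joint([1.0, 0.0, 1.0, 0.0], [0, 0, 1, 1], 2))
```

In both toy cases the two bins carry equal traffic, so a balancing loss cannot tell them apart; only the mutual-information term distinguishes specialized from unspecialized routing.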

Taken together, these training choices indicate a persistent design tension: too little regularization risks expert collapse, but too much uniformization can suppress the very modality specialization that motivates MS-MoE. The literature does not present a single universally optimal solution.

5. Empirical evidence across application domains

Empirical results indicate that MS-MoE yields gains in several problem families, but the mechanism of improvement depends on the domain. In multimodal time-series prediction, "Multi-Modal Time Series Prediction via Mixture of Modulated Experts" reports that its Expert Modulation method improves MTBench Finance short-horizon trend prediction from a best baseline of 49.315% to 66.849% for 3-way classification, improves MTBench Weather long-horizon forecasting to MSE 11.823 and MAE 2.620, and improves TimeMMD Environment forecasting to MAPE 15.434 and MAE 8.317. The same paper reports lower training time per iteration than token-fusion variants on MT-Finance, with EM at approximately 0.47s versus early fusion at approximately 1.19s (Zhang et al., 29 Jan 2026).

In 3D scene understanding, "Uni3D-MoE" reports EM@1 = 30.8 and CIDEr = 97.6 on ScanQA, EM@1 = 57.2 on SQA3D, and balanced expert loads under a sparsity-aware balancing loss, with qualitative routing analyses showing preferences such as voxel/point-cloud-heavy experts and RGB/BEV-heavy experts (Zhang et al., 27 May 2025). "MoE3D" reports 44.4% mIoU on ScanRefer and 48.8% mIoU on Multi3DRefer, surpassing SegPoint by +1.1% and +6.1% mIoU respectively, and SQA3D performance of EM = 56.0 and EM-R = 58.9 under top-1 sparse routing with router $z$-loss (Li et al., 27 Nov 2025).

In large vision-language models, the empirical picture is more mixed but still favorable to modality-aware sparsity. "SMAR" reports 86.6% language retention with only 2.5% pure-text data, compared with 81.6% without auxiliary loss and 82.8% with load balancing alone, while maintaining multimodal scores such as VQA-v2 82.4 and MMBench 72.7 (Xia et al., 6 Jun 2025). "SMoES" reports average gains of 0.9% on multimodal tasks and 4.2% on language-only tasks across four MoE-based VLMs, together with a 56.1% reduction in expert-parallel communication overhead and a 12.3% throughput improvement (Bo et al., 27 Apr 2026). "MoMa" reports pre-training FLOPs savings of 3.7x overall, with 2.6x for text and 5.2x for image processing, for a 1.4B model with four text experts and four image experts relative to a compute-equivalent dense baseline, outperforming a mixed-modal expert-choice MoE on the same training budget (Lin et al., 2024).

Medical and speech applications show that the same design principle extends beyond standard VLM settings. "MedMoE" reports zero-shot classification scores including 78.32% on RSNA, 69.32% on Breast ultrasound, and state-of-the-art CT view results of 40.00%, 28.32%, and 26.83% on axial, coronal, and sagittal datasets, while using hard top-1 routing conditioned on report embeddings (Chopra et al., 10 Jun 2025). "DynFS-MoE" reports AUC 0.84±0.10 and F1 0.81±0.08 on HC vs. PTE, together with statistically significant gains over the best baseline on PTE-related tasks (Ding et al., 15 Jun 2026). "MoST" reports ASR WER 2.0 on LibriSpeech-clean and competitive spoken QA scores, and "MoE-TTS" reports out-of-domain Overall Alignment 3.75±0.097, exceeding ElevenLabs at 3.39±0.101 and MiniMax at 3.30±0.091 on its curated OOD description set (Lou et al., 15 Jan 2026); (Xue et al., 15 Aug 2025).

A cautious reading of these numbers suggests that MS-MoE is particularly beneficial when modalities differ strongly in statistical structure, token density, or reliability, and when uniform processing would otherwise induce interference or poor allocation of capacity.

6. Limitations, controversies, and future directions

Despite strong results, the literature identifies several unresolved issues. Expert collapse remains a recurring risk. Router modulation in time-series models can over-concentrate on a few experts if not regularized, and strong load balancing does not necessarily help because time-series specialization may benefit from non-uniform utilization (Zhang et al., 29 Jan 2026). In VLMs, threshold misconfiguration can also induce collapse: "SMAR" reports that overly low tolerance bands lead to routing collapse onto a single expert in early layers (Xia et al., 6 Jun 2025).

A second controversy concerns the role of balancing losses. Some systems use them successfully—"Uni3D-MoE" attributes relatively balanced token assignment proportions to its sparsity-aware balancing loss—whereas others report that balancing interferes with specialization or language retention (Zhang et al., 27 May 2025). This is not a contradiction so much as an indication that the interaction between routing constraints and modality structure is highly architecture-dependent. "SMoES" explicitly positions its mutual-information regularizer as a way to preserve load balancing while still obtaining coherent specialization (Bo et al., 27 Apr 2026).

A third issue is whether modality specificity alone is sufficient for cross-modal reasoning. Several papers say no. "AsyMoE" argues that a modality-specific MoE by itself struggles with hierarchical text-image relations and that deeper language experts drift toward parametric memory; it therefore adds hyperbolic inter-modality experts and evidence-priority language experts, reporting average accuracy gains of 26.58% over vanilla MoE and 15.45% over modality-specific MoE, with 25.45% fewer activated parameters than dense models (Zhang et al., 16 Sep 2025). Likewise, "MoIIE" shows that purely modality-only expert pools are inferior to designs that include shared inter-modality experts, especially on knowledge QA and hallucination robustness (Wang et al., 13 Aug 2025).

Future directions named in the literature are correspondingly broad: adding further modalities such as images, audio, EEG, or DTI; using domain tags or learned adapters for domain adaptation; designing confidence-aware modulation for ambiguous auxiliary signals; scaling interaction-aware routing to fully cross-lagged temporal settings; and improving uncertainty quantification or deployment efficiency (Zhang et al., 29 Jan 2026); (Ding et al., 15 Jun 2026); (Han et al., 30 Sep 2025). A plausible synthesis is that MS-MoE is evolving from simple modality partitioning toward structured conditional computation in which modality, task, temporal interaction, and evidence reliability jointly shape sparse expert activation.