Mixture of Modality Experts (MoME)

Updated 26 January 2026

MoME is a class of architectures that integrates modality-specialized experts using dynamic gating to fuse heterogeneous data.
It employs advanced routing, hierarchical mixtures, and modality-aware regularization to enhance interpretability and robustness.
MoME frameworks achieve state-of-the-art performance in domains like medical imaging, speech-text models, and knowledge graph completion.

A Mixture of Modality Experts (MoME) is a class of architectural primitives and frameworks designed for flexible, adaptive, and interpretable fusion of heterogeneous data modalities. MoME constructs typically combine modality-specialized expert modules with a gating or routing mechanism that dynamically selects or weights expert outputs according to task context, input properties, or cross-modal interactions. Such architectures have advanced state-of-the-art performance in domains including medical imaging, speech-text integration, multimodal LLMs, and knowledge graph completion, due to their capacity for modular specialization, scalable fusion, and robustness to missing or variable inputs.

1. Core Architectural Principles

The defining structural principle of MoME architectures is the integration of multiple expert networks—each tailored to a specific data modality (e.g., MRI sequence, text, image, audio)—within a shared backbone, such as a Transformer or other deep neural architecture. These experts are modulated by routers or gating mechanisms, which compute mixture weights or hard decisions based on input context, task instructions, or interaction signals. Key patterns in MoME implementations include:

Modality-specialized experts: Each expert Eₘ processes inputs from a distinct modality or a predefined combination of modalities. Experts may be instantiated as convolutional networks, transformers, MLPs, or task-specific modules (Lou et al., 15 Jan 2026, Zhang et al., 2024, Xiong et al., 2024).
Routing/gating mechanisms: Gate scores are typically computed via MLPs or attention layers; the gating function may depend on global description vectors, instance- or token-level features, or recognized interaction patterns (Chopra et al., 10 Jun 2025, Liu et al., 21 Jan 2025, Xin et al., 25 May 2025). Gating weights are frequently normalized by softmax, and routing can be soft or hard.
Hierarchical or multi-level mixture: Some frameworks implement multi-stage or multi-layer mixtures (e.g., per-layer in transformer blocks, per-resolution in U-Net-style architectures, or hierarchical fusion based on interaction types) (Rezvani et al., 30 Oct 2025, Shen et al., 2024).
Modality-aware load balancing: To avoid collapse (where gating converges to a subset of experts), regularization terms encourage balanced routing or explicitly penalize expert overuse (Lou et al., 15 Jan 2026, Zhang et al., 27 May 2025).
Fusion paradigms: Outputs of selected experts are fused by weighted summation, concatenation, or cross-attention, optionally followed by aggregation or downstream prediction heads.
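
The patterns above can be illustrated with a minimal sketch in plain Python. It assumes one linear expert per modality, a gate that scores each present modality from its own features, and weighted-sum fusion. The function name `mome_forward` and all parameter names are hypothetical and do not come from any of the cited frameworks.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mome_forward(inputs, experts, gate_weights):
    """Fuse the modalities that are present by gated weighted summation.

    inputs:       dict modality -> feature vector (absent modalities omitted)
    experts:      dict modality -> weight matrix of that modality's linear expert
    gate_weights: dict modality -> vector used to score that modality's features
    """
    present = sorted(inputs)
    # One gate logit per present modality, computed from its own features.
    logits = [sum(g * v for g, v in zip(gate_weights[m], inputs[m])) for m in present]
    # Softmax runs over present modalities only, so weights still sum to 1
    # when some modalities are missing.
    weights = softmax(logits)
    outputs = [matvec(experts[m], inputs[m]) for m in present]
    dim = len(outputs[0])
    fused = [sum(w * out[d] for w, out in zip(weights, outputs)) for d in range(dim)]
    return fused, dict(zip(present, weights))

# Illustrative toy parameters (assumed values, not taken from any paper).
experts = {"image": [[1.0, 0.0], [0.0, 1.0]], "text": [[0.0, 1.0], [1.0, 0.0]]}
gate_weights = {"image": [0.5, 0.5], "text": [0.5, 0.5]}
fused, weights = mome_forward(
    {"image": [1.0, 2.0], "text": [1.0, 2.0]}, experts, gate_weights
)
print(fused, weights)
```

Normalizing only over the modalities that are present is one simple way to tolerate missing inputs. The cited frameworks use more elaborate mechanisms, such as learned routers and missing-modality banks.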

2. Gating, Routing, and Specialization Mechanisms

MoME systems employ gating mechanisms to select or weight expert outputs, often in direct relation to the modalities present and the semantic demands of the input or task. Formulations range from simple modality-indexed hard routing (e.g., VLMo (Bao et al., 2021)) to context-dependent soft or sparse top-K gating (e.g., Uni3D-MoE (Zhang et al., 27 May 2025), Flex-MoE (Yun et al., 2024), MoST (Lou et al., 15 Jan 2026)). Representative mathematical forms include:

Report-conditioned gating (MedMoE):

$\mathbf{g} = \mathrm{softmax}(W_2\,\mathrm{ReLU}(W_1\,\mathbf{t}_g))$

where $\mathbf{t}_g$ encodes diagnostic report semantics, and gating selects among $K$ experts (Chopra et al., 10 Jun 2025).
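
The gating formula above can be sketched directly in plain Python as a two-layer MLP followed by a softmax. The matrices `W1` and `W2`, the toy input, and the helper `hard_select` (an argmax over the soft weights) are illustrative assumptions. The source does not specify how MedMoE turns the weights into a hard selection.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def report_gate(t_g, W1, W2):
    """Compute g = softmax(W2 ReLU(W1 t_g)); W2 has one row per expert (K rows)."""
    hidden = relu(matvec(W1, t_g))
    return softmax(matvec(W2, hidden))

def hard_select(g):
    """Assumed hard-routing rule: pick the expert with the largest gate weight."""
    return max(range(len(g)), key=lambda k: g[k])

# Toy example with a 2-dimensional report embedding and K = 3 experts.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
g = report_gate([2.0, -1.0], W1, W2)
print(g, hard_select(g))
```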

Sparse token-level routing (Uni3D-MoE):

$\pi_i^{(e)} = \mathrm{softmax}_e(w_e^\top f_i)$

with top-K selection for computational efficiency and adaptive specialization per modality and per query (Zhang et al., 27 May 2025).
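
A minimal sketch of this kind of sparse routing follows. It assumes that the softmax runs over experts, that $f_i$ is a per-token feature vector, and that the weights of the K retained experts are renormalized to sum to 1. The renormalization step and the name `top_k_route` are assumptions, not details confirmed by the source.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(f_i, expert_vectors, k=2):
    """Return {expert_index: weight} for the k highest-scoring experts.

    f_i:            feature vector of one token
    expert_vectors: list of routing vectors w_e, one per expert
    """
    logits = [sum(w * v for w, v in zip(w_e, f_i)) for w_e in expert_vectors]
    probs = softmax(logits)
    # Keep the k experts with the largest routing probability.
    top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
    # Renormalize over the kept experts (an assumed convention).
    mass = sum(probs[e] for e in top)
    return {e: probs[e] / mass for e in top}

# Toy example: 4 experts, 2-dimensional token feature.
expert_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
routes = top_k_route([2.0, 1.0], expert_vectors, k=2)
print(routes)
```

Only the experts returned by the router need to be evaluated for a given token. With 8 experts and K = 2, for example, a quarter of the experts are active per token.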

Modality-aware batch-level and instance-level routing (Flex-MoE):

Two schemes are used: a generalized router ($\mathcal{G}$-Router) for all modalities and a specialized router ($\mathcal{S}$-Router) for observed subsets. Mixtures reflect the observed combinations, and a “missing modality bank” supports arbitrary input configurations (Yun et al., 2024).

Interaction-aware gating (I2MoE):

Weights are computed over fusion experts encoding uniqueness, synergy, and redundancy; interpretability is achieved by inspecting weight distributions at the sample and dataset levels (Xin et al., 25 May 2025).
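
The sample-level and dataset-level inspection described for interaction-aware gating can be sketched as follows. The three-expert layout, the use of a softmax over raw gate logits, and the function names are simplifying assumptions for illustration. The actual I2MoE expert set and reweighting network may differ.

```python
import math

# Assumed simplification: one expert per interaction type.
INTERACTION_EXPERTS = ("uniqueness", "synergy", "redundancy")

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_weights(logits):
    """Sample-level explanation: the weight given to each interaction expert."""
    return dict(zip(INTERACTION_EXPERTS, softmax(logits)))

def dataset_weights(all_logits):
    """Dataset-level explanation: mean weight per interaction expert."""
    per_sample = [sample_weights(logits) for logits in all_logits]
    n = len(per_sample)
    return {name: sum(s[name] for s in per_sample) / n for name in INTERACTION_EXPERTS}

# Hypothetical gate logits for three samples.
gate_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
print(sample_weights(gate_logits[0]))
print(dataset_weights(gate_logits))
```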

3. Representative Methodologies and Application Domains

MoME architectures have found substantial adoption and success in the following domains:

Medical Imaging and Vision-Language Fusion: Adaptively routable experts over multi-scale CT, MRI, or cross-modal features, leveraging clinical text or report context for routing, e.g. MedMoE (Chopra et al., 10 Jun 2025), brain lesion segmentation MoME (Zhang et al., 2024, Rezvani et al., 30 Oct 2025), breast cancer multiparametric MRI analysis (Luo et al., 2024).
Speech-Text Multimodal LLMs: Specialized text and audio expert groups plus shared experts, gated via explicit modality masks and per-token indicators, as in MoST (Lou et al., 15 Jan 2026).
Entity Representation and Knowledge Graph Completion: Relation-guided modality knowledge experts with mutual-information-based disentanglement, supporting adaptive, relation-aware entity embeddings for MMKG completion (Zhang et al., 2024).
Multimodal LLMs: Task-specialized vision and language expert modules embedded in LLMs for robust OCR, VQA, document and chart QA, e.g. MoME (Shen et al., 2024), VLMo (Bao et al., 2021).
Arbitrary Modality Combinations and Robustness to Missing Data: Flex-MoE introduces the missing modality bank and expert-per-combination routing, enabling scalable, robust inference even when only partial modality sets are available (Yun et al., 2024).
Multimodal Interaction and Fake News Detection: Hierarchical mixture-of-experts encoding canonical interaction scenarios (agreement, disagreement, alignment/misalignment), with gating on interaction signals for robust fusion and interpretability (Liu et al., 21 Jan 2025).
Sleep Staging: The three-pathway MoME in sDREAMER enables classification from EEG alone, EMG alone, or their fusion, with self-distillation to improve cross-modal alignment and performance in both single-channel and multi-channel inference (Chen et al., 27 Jan 2025).
3D Scene Understanding: Token-level sparse MoE routing for 3D LLMs integrating RGB, depth, BEV, point cloud, and voxel representation, as in Uni3D-MoE (Zhang et al., 27 May 2025).

4. Empirical Performance and Scalability

MoME-based models are consistently reported to outperform non-adaptive, monolithic, or single-expert fusion baselines. Key empirical results include:

State-of-the-art performance on medical benchmarks: MedMoE achieves SOTA in alignment and retrieval across CT, US, MRI datasets (Chopra et al., 10 Jun 2025); breast MRI MoME matches or exceeds radiologist-level detection performance (Luo et al., 2024); MoME segmentation frameworks outperform multi-talent and universal U-Net baselines on a variety of lesion types (Zhang et al., 2024, Rezvani et al., 30 Oct 2025).
Scalable adaptation and robustness: Flex-MoE achieves superior classification with up to +7.6% accuracy gain over prior FuseMoE in full-modality and variable modality settings while using fewer parameters (Yun et al., 2024). Token-level routing allows sparse activation of expert weights (25% utilized per inference in Uni3D-MoE) for computational efficiency (Zhang et al., 27 May 2025).
Ablation studies underscore the necessity of expert specialization, gating, and regularization: Performance drops significantly when expert sets are collapsed, gating is static or non-adaptive, or regularizers are omitted (Xin et al., 25 May 2025, Zhang et al., 2024, Lou et al., 15 Jan 2026).
Generalization and interpretability: MoME frameworks display increased zero-shot generalization to unseen datasets/modalities and interpretable fusion, e.g. via Shapley values, gating inspection, or t-SNE visualization of expert assignments (Luo et al., 2024, Xin et al., 25 May 2025).

5. Interpretability and Information-Theoretic Analysis

Interpretability is central in several MoME variants. Information-theoretic objectives for gating (e.g., mutual-information minimization, PID-inspired regularizers) encourage experts to capture distinct interaction patterns or modality-specific features (Xin et al., 25 May 2025, Zhang et al., 2024). Direct inspection of gating scores provides real-time insight into model decisions, facilitating sample-level and global explanations.

I2MoE assigns explicit weights to uniqueness, synergy, and redundancy experts, supporting both local and global interpretation of multimodal interactions (Xin et al., 25 May 2025).
Medical imaging MoME models leverage saliency via integrated gradients and Shapley values for lesion and modality contribution explanation (Luo et al., 2024).
MoST reports lower gating entropy, lower Gini coefficients, and more equitable expert utilization, indicating that specialization and capacity are balanced among expert groups (Lou et al., 15 Jan 2026).
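
Diagnostics of this kind can be computed from router outputs alone. The sketch below uses the standard definitions of Shannon entropy (for one gating distribution) and the Gini coefficient (for per-expert usage counts). How the cited work computes its statistics is not specified in the source, so these functions are generic assumptions rather than a reproduction.

```python
import math

def gating_entropy(weights):
    """Shannon entropy (in nats) of one gating distribution; 0 means one-hot routing."""
    return -sum(w * math.log(w) for w in weights if w > 0)

def gini(usage):
    """Gini coefficient of non-negative per-expert usage; 0 means perfectly even use."""
    xs = sorted(usage)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard closed form over values sorted in ascending order.
    ranked = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * ranked) / (n * total) - (n + 1) / n

# Sharp routing for a single token (low entropy) ...
print(gating_entropy([0.97, 0.01, 0.01, 0.01]))
# ... combined with even expert usage across many tokens (low Gini).
print(gini([250, 251, 249, 250]))
```

Low per-token entropy together with a low Gini coefficient over aggregate usage corresponds to routing that is decisive for each input yet spread evenly across experts.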

6. Limitations, Extensions, and Open Challenges

Although MoME architectures offer compelling advantages, several limitations and open research avenues persist:

Expert number and bank size scalability: As modalities and their combinations proliferate, the required bank size and expert allocation can become prohibitive (bank size scales as $2^{|M|}\cdot|M|$ in Flex-MoE) (Yun et al., 2024).
Handling unseen or sparse modality combinations: Fixed expert-per-combination schemes may lack support for previously unobserved or rarely represented modality sets (Yun et al., 2024).
Expert collapse and overfitting: Without adequate balancing losses or curriculum learning, gating can overselect a subset of experts and degrade specialization (Xin et al., 25 May 2025, Zhang et al., 2024).
Imputation and missing data: Quality of real-time imputation (e.g., missing modality bank) is sensitive to learning and representation methodology (Yun et al., 2024).
Computational efficiency: MoME modules impose nontrivial overhead (e.g., a reported increase from 7.5 GB to 38 GB associated with MoME's gating network (Zhang et al., 2024)), but this can be reduced using sparse routing and efficient expert selection.
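
To make the bank-size scaling in the first item above concrete, the short sketch below evaluates $2^{|M|}\cdot|M|$ for a few modality counts. The function name `bank_size` is hypothetical, and the code evaluates only the stated formula, not any Flex-MoE implementation detail.

```python
def bank_size(num_modalities):
    """Evaluate 2^|M| * |M| for a given number of modalities |M|."""
    return (2 ** num_modalities) * num_modalities

for m in range(2, 9):
    print(m, bank_size(m))
```

Under this formula, going from 4 to 8 modalities raises the bank size from 64 to 2048 entries, which illustrates why the scheme becomes costly as modalities proliferate.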

Potential future directions include hierarchical mixtures, dynamic expert allocation, improved imputation for missing modalities, self-supervised pretraining of experts, and systematic contrastive alignment prior to multi-modal fusion (Zhang et al., 27 May 2025, Zhang et al., 2024, Yun et al., 2024).

7. Summary Table: MoME Variants and Key Features

Framework / Paper	Domain	Expert Structure	Routing / Gating Mechanism
MedMoE (Chopra et al., 10 Jun 2025)	Medical VL grounding	Multi-scale conv experts	Report-conditioned hard/soft router
MoST (Lou et al., 15 Jan 2026)	Speech-text LLM	Text/audio/shared experts	Modality-masked sparse router
Flex-MoE (Yun et al., 2024)	Arbitrary multimodal	Per-combination experts	Generalized + Specialized routers
VLMo (Bao et al., 2021)	Vision-language pretraining	Vision, language, VL experts	Input-indexed hard gating
Uni3D-MoE (Zhang et al., 27 May 2025)	3D scene understanding	8-expert sparse MoE	Token-level top-2 router with load balancing
MoME (Luo et al., 2024)	Breast MRI management	Sparse/soft modality experts	Modality-sequential + soft fusion
MoME (Zhang et al., 2024)	Brain lesion segmentation	Modality-specific U-Net experts	Hierarchical UNet gating network
I2MoE (Xin et al., 25 May 2025)	Multimodal interaction	PID-inspired interaction experts	Reweighting MLP, interpretability
MIMoE-FND (Liu et al., 21 Jan 2025)	Fake news detection	Hierarchical iMoE blocks	Interaction-class gating

Each variant tailors its experts to different application domains or technical goals, but the unifying MoME paradigm (specialization, mixture, adaptive routing, and modularity) remains central to multimodal model advances across current research fronts.