MoME Transformer: Mixture-of-Modality Experts
- MoME Transformer is a neural architecture that uses modality-specialized experts to process heterogeneous data like images, text, and clinical signals.
- It employs advanced gating methods—such as soft, hard, and instance-level routing—to enable both specialized within-modality processing and effective cross-modal fusion.
- Empirical results show significant improvements in accuracy and efficiency across varied tasks including medical diagnostics, vision-language, and 3D understanding.
A Mixture-of-Modality-Experts (MoME) Transformer is a neural architecture that integrates modality-specialized expert modules into the Transformer framework, enabling dynamic, adaptive, or hard-routed fusion of heterogeneous inputs such as images, text, speech, clinical data, or biochemical graphs. Each expert in a MoME Transformer is designed to process a particular modality or type of information, and their outputs are aggregated via data-driven gating or routing mechanisms. These architectures allow for both within-modality specialization and cross-modality fusion, supporting diverse downstream tasks that require reasoning over more than one data source.
1. Core Principles of MoME Transformer Architectures
MoME Transformers extend the standard self-attention Transformer design by incorporating multiple expert subnetworks; each expert is specialized for a particular input modality or processing regime. The key components are:
- Modality-Specific Encoders/Experts: Each modality (e.g., vision, text, speech, MRI types, proteins) has a dedicated expert, typically implemented as a two-layer MLP or an adapter.
- Shared or Cross-Modality Layers: Certain layers (often multi-head self-attention) are shared across modalities to promote cross-modal information exchange and representation alignment.
- Gating/Expert Routing: Inputs are dynamically (soft, probabilistic) or statically (hard, deterministic) assigned to experts via a learned gating function or a fixed routing schedule. This mechanism can be conditioned on clinical features, modality tags, or semantic embeddings.
- Fusion and Output Aggregation: The outputs of the experts are fused by the gating weights—a weighted sum or per-token expert selection—yielding a joint representation for downstream reasoning.
This structure accommodates modality heterogeneity, prevents modeling interference between tasks, and supports specialization, scaling, and data efficiency (Bao et al., 2021, Shen et al., 2024, Raza et al., 17 Jun 2025, Chen et al., 27 Jan 2025, Lou et al., 15 Jan 2026, Li et al., 8 Sep 2025, Chopra et al., 10 Jun 2025, Luo et al., 2024, Li et al., 27 Nov 2025, Lin et al., 2024).
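The components above can be sketched as a single mixture-of-modality-experts FFN layer with hard routing by modality tag, in the spirit of VLMo's design. This is an illustrative sketch, not any paper's implementation; `two_layer_mlp`, `mome_ffn_layer`, and the parameter layout are names introduced here for exposition.

```python
import numpy as np

def two_layer_mlp(x, W1, b1, W2, b2):
    # Standard Transformer FFN expert: Linear -> ReLU -> Linear.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def mome_ffn_layer(tokens, modality_tags, experts):
    """Route each token to the expert of its modality (hard routing).

    tokens:        (T, d) token embeddings, e.g. after a shared self-attention layer
    modality_tags: (T,) integer modality id per token (e.g. 0 = image, 1 = text)
    experts:       list of FFN parameter tuples, one per modality
    """
    out = np.empty_like(tokens)
    for m, params in enumerate(experts):
        mask = modality_tags == m
        if mask.any():
            out[mask] = two_layer_mlp(tokens[mask], *params)
    return out
```

In a full model this layer would sit after a shared multi-head self-attention sublayer, so cross-modal mixing happens in attention while the FFN path stays modality-specialized.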
2. Gating, Routing, and Fusion Mechanisms
MoME systems deploy a variety of gating techniques that determine how modality inputs are processed and fused:
- Soft Gating: Learnable gating functions assign continuous weights to each expert, typically via a softmax of gating logits computed from input (e.g., clinical data in NeuroMoE, text embeddings in MoME/MedMoE) (Raza et al., 17 Jun 2025, Shen et al., 2024, Chopra et al., 10 Jun 2025).
- Hard Routing: Tokens or sequences are assigned to a single expert per modality or per token via a deterministic mapping, as in sDREAMER and VLMo (Chen et al., 27 Jan 2025, Bao et al., 2021).
- Hierarchical or Instance-Level Gating: Some approaches, like MoME and MedMoE, use instance-level gating, where the gating vector is conditioned on sample-specific meta-data (e.g., sentence embeddings, report embeddings) (Shen et al., 2024, Chopra et al., 10 Jun 2025).
- Top-K/Top-1 Routing and Masked Experts: In large-scale models, such as MoST, MoMa, and MoE3D, only a small subset (K out of N) of experts is activated per token for computational efficiency (Lou et al., 15 Jan 2026, Lin et al., 2024, Li et al., 27 Nov 2025).
The fusion operation aggregates expert outputs, either as a convex combination (soft) or by direct selection (hard), yielding the final multi-modal embedding.
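The contrast between soft and hard fusion can be made concrete with a minimal sketch (function names are ours; the gate logits stand in for whatever conditioning signal a given model uses, e.g. clinical features or report embeddings):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_fusion(expert_outputs, gate_logits):
    # expert_outputs: (M, d) one vector per modality expert
    # gate_logits:    (M,) logits from the learned gating function
    g = softmax(gate_logits)      # continuous weights, sum to 1
    return g @ expert_outputs     # convex combination of expert outputs

def hard_fusion(expert_outputs, gate_logits):
    # Deterministic top-1 selection: only the argmax expert contributes.
    return expert_outputs[np.argmax(gate_logits)]
```

Soft fusion keeps all experts in the forward pass (differentiable, but costs M expert evaluations), while hard fusion evaluates one expert and is the basis of the sparse top-1/top-K schemes above.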
3. Architectural and Mathematical Specification
A generic mathematical schema for a MoME block is as follows:
- Encoding: For $M$ input modalities $x_1, \dots, x_M$, each $x_m$ is mapped to an embedding $h_m = E_m(x_m)$ by a modality-specific encoder $E_m$.
- Expert Processing: Each embedding is refined by its expert: $z_m = f_m(h_m)$.
- Gating Weights: Gating scores $g = (g_1, \dots, g_M)$ (softmax or indicator) are computed from conditioning features $c$ such as clinical data or instance-level embeddings: $g = \mathrm{softmax}(W_g c)$.
- Fusion: Fused vector $z = \sum_{m=1}^{M} g_m z_m$.
- Final Prediction: $\hat{y} = \mathrm{head}(z)$, trained with a supervised cross-entropy or task-specific loss.
Many instantiations layer these operations within the Transformer backbone, replacing or augmenting the canonical feed-forward layers with modality-partitioned or cross-modal expert modules (adapters, adapters+FFN, or MoE MLP heads), with residuals and LayerNorm applied in standard or Pre-LN fashion (Bao et al., 2021, Shen et al., 2024, Lin et al., 2024).
The following table summarizes the gating protocols of several recent models:
| Model | Gating/Routing | Fusion Operation |
|---|---|---|
| NeuroMoE | Softmax on clinical features | Weighted sum over experts |
| VLMo | Hard (layer+token) | Expert assigned deterministically |
| MoME | Instance-level soft, top-1 | Add selected adapter to FFN |
| MoE3D | Softmax + top-1 | One expert per superpoint |
| MoST | Modality mask + top-K + shared | Weighted sum + shared expert |
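The sparser end of this table (modality-masked top-K with a shared expert, as in MoST) can be sketched as follows. This is a simplified illustration under our own naming, not MoST's actual implementation; it assumes at least K experts are eligible for each modality:

```python
import numpy as np

def topk_masked_route(x, expert_fns, shared_fn, router_W, modality_mask, k=2):
    """Top-K routing with a modality mask and an always-on shared expert.

    x:             (d,) one token representation
    expert_fns:    list of N expert callables
    shared_fn:     shared expert applied to every token
    router_W:      (N, d) router weights producing one logit per expert
    modality_mask: (N,) boolean; experts ineligible for this modality are masked out
    """
    logits = router_W @ x
    logits = np.where(modality_mask, logits, -np.inf)  # restrict to eligible experts
    top = np.argsort(logits)[-k:]                      # indices of the top-K experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                       # renormalize over selected experts
    routed = sum(wi * expert_fns[i](x) for wi, i in zip(w, top))
    return routed + shared_fn(x)                       # add the shared-expert path
```

Only K of N experts are evaluated per token, which is where the FLOP savings of sparse MoME variants come from.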
4. Training Protocols and Loss Functions
MoME architectures are trained end-to-end using combinations of standard and auxiliary loss criteria:
- Task Losses: Cross-entropy for classification, focal loss (CAME-AB), binary cross-entropy (MoE3D), or causal LM loss (MoME, MoMa, MoST).
- Auxiliary Regularizers: Load-balancing (Lou et al., 15 Jan 2026), gating-balance (Raza et al., 17 Jun 2025), expert diversity (Li et al., 8 Sep 2025), and z-loss for stable routing (Li et al., 27 Nov 2025).
- Contrastive/Alignment Losses: Used in MedMoE, CAME-AB, and VLMo for aligning vision-language representations.
- Self/Teacher Distillation: sDREAMER employs self-distillation between mix-path and mono-modal experts for information transfer (Chen et al., 27 Jan 2025).
- Multi-Stage Pretraining: Stagewise (v-only, t-only, v+t) pretraining for VLMo, or cross-modal post-training and instruction-finetuning for MoST (Bao et al., 2021, Lou et al., 15 Jan 2026).
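As a concrete example of an auxiliary regularizer, the widely used Switch-Transformer-style load-balancing loss can be written in a few lines. Note the papers above each define their own variants; this is one common formulation, with names chosen here for illustration:

```python
import numpy as np

def load_balancing_loss(router_probs, expert_ids, num_experts):
    """Switch-style load-balancing auxiliary loss: N * sum_i f_i * p_i.

    router_probs: (T, N) softmax router probabilities per token
    expert_ids:   (T,) index of the expert each token was dispatched to
    num_experts:  N
    """
    T = router_probs.shape[0]
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_ids, minlength=num_experts) / T
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    # Equals 1 under perfectly uniform routing; grows as routing collapses
    # onto a few experts, so minimizing it encourages balanced expert usage.
    return num_experts * np.dot(f, p)
```

This term is typically added to the task loss with a small coefficient so that it shapes routing without dominating training.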
5. Empirical Results and Impact
Empirical studies demonstrate consistent advantages for MoME frameworks across domains:
- Medical Diagnostics: Substantial gains on clinical and imaging benchmarks, such as +10% accuracy for neurodegenerative disease classification (Raza et al., 17 Jun 2025), radiologist-level performance in breast cancer MRI (Luo et al., 2024), and improved sample efficiency in vision-language medical retrieval (Chopra et al., 10 Jun 2025).
- Vision-Language Modeling: MoME and VLMo achieve or exceed state-of-the-art on VQA, NLVR2, image-text retrieval, and document understanding without increasing inference cost (Bao et al., 2021, Shen et al., 2024).
- Multimodal Language Modeling: MoST (speech-text) and MoMa (text-image) yield large pretraining FLOP reductions (3–4×) and maintain or improve generalization on audio, text, and multimodal benchmarks (Lou et al., 15 Jan 2026, Lin et al., 2024). MoST sets new state-of-the-art on several spoken QA and audio language understanding tasks.
- 3D Understanding and Cross-modal Learning: MoE3D outperforms dense and prior fusion architectures by >6 mIoU on Multi3DRefer and achieves SOTA on 3D situated QA (Li et al., 27 Nov 2025).
6. Limitations, Extensions, and Open Directions
Identified limitations and research opportunities include:
- Dataset Size and Modality Completeness: Clinical datasets are often small or missing modalities; future work targets handling incomplete or partial modality observations (Raza et al., 17 Jun 2025, Luo et al., 2024).
- Routing Bottlenecks: Soft and hard gating introduce trade-offs between fine-grained adaptivity and inference complexity. Errors in auxiliary router training can degrade performance, particularly in mixture-of-depths (MoD) or autoregressive settings (Lin et al., 2024).
- Load-Balancing and Specialization: While explicit regularization can help, most systems rely on architectural constraints and training data mix for balanced expert usage. Dynamic or hierarchical expert architectures represent plausible future approaches (Bao et al., 2021, Shen et al., 2024).
- Scalability: Extending MoME to tens or hundreds of modalities, or introducing intra-layer or token-level gating, remains an active area of investigation for greater specialization and adaptability (Raza et al., 17 Jun 2025, Shen et al., 2024).
- Interpretability: Integrated Gradients, Shapley values, and attention maps have been used to provide local interpretability and modality attribution, especially in medical settings (Luo et al., 2024, Chopra et al., 10 Jun 2025).
7. Representative MoME Frameworks and Modalities Table
| Model | Target Domain | Modalities | Routing Type | Notable Innovations |
|---|---|---|---|---|
| NeuroMoE | Neurodiagnostic MRI | aMRI, DTI, fMRI, clinical | Softmax gating | Modality-driven personalized fusion |
| VLMo | Vision-Language, Retrieval | Images, Text | Deterministic (per-layer, per-token) | Staged pretraining, dual/fusion modes |
| MoME | Vision-Language LLMs | Multiple visual, text | Instance-level, top-1 | MoVE (vision), MoLE (language) |
| MoST | Speech, Text LLM | Speech, Text | Modality-masked Top-K | Modality-specific + shared experts |
| MedMoE | Medical VL understanding | X-ray, CT, MRI, US | Report/metadata-gated soft | Multi-scale, report-adaptive routing |
| MoMa | Early-fusion LM | Text, Image | Modality-grouped MoEs | Modality-aware sparse expert choice |
| MoE3D | 3D scene understanding | Geometric, Textual | Top-1 per token | Expert specialization for 3D cues |
| CAME-AB | Antibody binding prediction | Sequence, Structure | AMF + MoE block | Adaptive modality fusion + contrastive |
References
- (Raza et al., 17 Jun 2025) NeuroMoE: A Transformer-Based Mixture-of-Experts Framework for Multi-Modal Neurological Disorder Classification
- (Li et al., 8 Sep 2025) CAME-AB: Cross-Modality Attention with Mixture-of-Experts for Antibody Binding Site Prediction
- (Chen et al., 27 Jan 2025) sDREAMER: Self-distilled Mixture-of-Modality-Experts Transformer for Automatic Sleep Staging
- (Chopra et al., 10 Jun 2025) MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding
- (Lou et al., 15 Jan 2026) MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts
- (Shen et al., 2024) MoME: Mixture of Multimodal Experts for Generalist Multimodal LLMs
- (Luo et al., 2024) A Large Model for Non-invasive and Personalized Management of Breast Cancer from Multiparametric MRI
- (Bao et al., 2021) VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
- (Lin et al., 2024) MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- (Li et al., 27 Nov 2025) MoE3D: Mixture of Experts meets Multi-Modal 3D Understanding