MoMA for Multimodal Clinical Prediction
- MoMA (Mixture-of-Multimodal-Agents/Experts) is a modular framework that integrates heterogeneous clinical data using specialized encoders for text, images, tabular records, and molecular profiles.
- It employs a dynamic gating mechanism to route patient-specific information through expert pathways, achieving state-of-the-art performance and resilience to missing data.
- By translating non-text modalities into a unified language space, MoMA systems provide robust, data-adaptive predictions across diverse clinical settings.
MoMA for Multimodal Clinical Prediction refers to a class of architectural frameworks that harness mixtures of modality-specialized components—often realized as agents, experts, or adapters—to fuse heterogeneous clinical data (text, images, tabular records, molecular representations) for robust, data-adaptive clinical prediction. These frameworks unify representations from disjoint modalities and route patient-specific information through dedicated expert pathways or modular agent ensembles, achieving state-of-the-art predictive performance alongside practical resilience to incomplete or heterogeneous input patterns. MoMA-inspired systems address the growing need to extract actionable clinical insights from complex multimodal datasets, including electronic health records (EHR), radiology, pathology, molecular, and free-text clinical notes.
1. Fundamental Architectural Principles
The defining principle of MoMA (Mixture-of-Multimodal-Agents/Experts) architectures is the modular decomposition and flexible fusion of patient data streams:
- Specialized Encoders or Agents: Each modality—EHR timeseries, narrative text, medical images, molecular graphs—is mapped to a vectorial representation via domain-optimized encoders (e.g., ClinicalBERT for notes, DenseNet/CXR-LLAVA for images, ChemBERTa for SMILES).
- Mixture-of-Experts/Agents Routing: A gating (routing) mechanism—usually a learned neural network—dynamically selects and weights experts according to available modalities and patient context. Top-$k$ or sparse routing enables scalable integration and specialization.
- Unified or Common Language Space: For cross-modal alignment, several MoMA frameworks translate non-text modalities into text via LLM “specialist agents.” This textualization enables downstream LLMs or transformers to process and aggregate evidence uniformly.
- Flexible Fusion and Aggregation: The outputs of modality experts/agents are combined through a secondary fusion mechanism (e.g., shallow feedforward head, concatenation, token-based aggregation), producing a single downstream patient representation for prediction.
Architectures realize these principles variously through sequential LLM agents (Gao et al., 7 Aug 2025, Aparício et al., 26 Dec 2025, Zheng et al., 2024), end-to-end transformer-based MoE layers (Wang et al., 29 Aug 2025), or selective adapter-based fusion networks (Lee et al., 13 Mar 2025).
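The modular decomposition described above can be sketched end-to-end in a few lines. This is a toy illustration, not any paper's actual components: the encoder functions are trivial stand-ins for ClinicalBERT-style or tabular encoders, and the averaging fusion stands in for a learned fusion head.

```python
# Toy sketch of the MoMA decomposition: per-modality encoders, a registry
# that routes each observed modality to its encoder, and a fusion step.
# All names (encode_text, encode_labs, fuse) are illustrative assumptions.
from typing import Callable, Dict, List

Vector = List[float]

def encode_text(note: str) -> Vector:
    # Stand-in for a ClinicalBERT-style encoder: crude length statistics.
    return [float(len(note.split())), float(len(note))]

def encode_labs(labs: Dict[str, float]) -> Vector:
    # Stand-in for a tabular embedding layer over structured fields.
    return [labs.get("wbc", 0.0), labs.get("lactate", 0.0)]

ENCODERS: Dict[str, Callable] = {"text": encode_text, "labs": encode_labs}

def fuse(patient: Dict[str, object]) -> Vector:
    """Encode each observed modality, pad to a common width, and average."""
    embeddings = [ENCODERS[m](v) for m, v in patient.items() if m in ENCODERS]
    width = max(len(e) for e in embeddings)
    padded = [e + [0.0] * (width - len(e)) for e in embeddings]
    return [sum(col) / len(padded) for col in zip(*padded)]

patient = {"text": "elevated lactate, tachycardic", "labs": {"wbc": 14.2}}
print(fuse(patient))
```

In a real system the averaging step would be replaced by the gated mixture or agent aggregation mechanisms of Section 2.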
2. Core Methodologies
2.1 Modality-Specific Processing
Across MoMA systems, each input modality is independently processed:
- Text: Domain-specific transformer encoders (e.g., ClinicalBERT, Llama-3), after careful preprocessing of notes and reports.
- Images: Convolutional neural networks (DenseNet-121, CLIP-backbone), foundation vision models pre-trained on large clinical corpora.
- Tabular/Lab: Embedding layers for structured fields, sometimes summarized via LLM prompts.
- Molecular/Graphs: Specialized models (ChemBERTa) for molecules, including transformer-based SMILES embeddings.
Textualization is widely used to unify representations, where LLMs transform image, lab, molecular, or tabular results into narrative summaries (“schema-guided textualization”) (Gao et al., 7 Aug 2025, Aparício et al., 26 Dec 2025, Zheng et al., 2024).
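Schema-guided textualization of tabular data can be illustrated with a minimal sketch. The schema, reference ranges, and phrasing below are illustrative assumptions, not a published template from the cited works.

```python
# Hedged sketch of schema-guided textualization: convert structured lab
# values into a narrative string that a text-only LLM can consume.
# LAB_SCHEMA maps each field to (name, unit, normal range); all values here
# are illustrative placeholders.
LAB_SCHEMA = {
    "wbc":     ("white blood cell count", "x10^9/L", (4.0, 11.0)),
    "lactate": ("serum lactate",          "mmol/L",  (0.5, 2.2)),
}

def textualize_labs(labs: dict) -> str:
    parts = []
    for key, value in labs.items():
        name, unit, (lo, hi) = LAB_SCHEMA[key]
        flag = "low" if value < lo else "high" if value > hi else "normal"
        parts.append(f"{name}: {value} {unit} ({flag})")
    return "Laboratory findings - " + "; ".join(parts) + "."

print(textualize_labs({"wbc": 14.2, "lactate": 3.1}))
```

The resulting narrative string can then be concatenated with clinical notes and passed to a downstream LLM aggregator.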
2.2 Mixture Routing and Expert Fusion
| MoMA Variant | Routing Mechanism | Fusion Scheme | Specialization Principle |
|---|---|---|---|
| MoE-Health (Wang et al., 29 Aug 2025) | Top-$k$ gating MLP over modality embeddings | Scalar-weighted sum, top-$k$ experts | Expert per subset of modalities |
| MoMA (agents) (Gao et al., 7 Aug 2025) | LLM aggregation over specialist summaries | Unified text summary, LLM prediction | LLM per modality, LLM aggregator |
| MMCTOP (Aparício et al., 26 Dec 2025) | Drug/disease-conditioned sparse MoE | Top-2 expert fusion | Schema-aware narrative, transformer SMoE |
| M4Survive (Lee et al., 13 Mar 2025) | Selective state-space Mamba adapter | Sequential token aggregation | Adapter per modality, dynamic fusion |
| LIFTED (Zheng et al., 2024) | Intra- and cross-modality sparse MoE | Hierarchical weighted integration | SMoE per textified modality, meta fusion |
| MoMa (pretrain) (Lin et al., 2024) | Hierarchical, modality-aware token router | Expert group per modality, early fusion | MoE by token type (text/image) |
Key mathematical components include:
- Gating logits and top-$k$ expert selection: $g(x) = \mathrm{softmax}(W_g x)$, with only the $k$ largest gate values retained and renormalized.
- Weighted expert fusion: $z = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x)\, E_i(x)$.
- Sequential state-space fusion (Mamba): $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$.
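The gating and fusion computations above can be implemented directly. This is a pure-Python sketch with trivial stand-in experts; a production system would use batched tensor operations.

```python
# Softmax gating, top-k selection, and gate-weighted expert fusion,
# implemented as a minimal pure-Python sketch. Expert functions are toy
# stand-ins for modality-specialized networks.
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """g = softmax(W_g x); z = sum over top-k of g_i * E_i(x)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = softmax(logits)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    norm = sum(gates[i] for i in top)  # renormalize over selected experts
    return [sum(gates[i] / norm * experts[i](x)[d] for i in top)
            for d in range(len(x))]

experts = [lambda x: [2 * v for v in x],   # toy expert 0
           lambda x: [v + 1 for v in x],   # toy expert 1
           lambda x: [-v for v in x]]      # toy expert 2
W_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # illustrative gate weights
print(moe_forward([1.0, 2.0], experts, W_g, k=2))
```

With these weights the gate routes the input to experts 0 and 1 and suppresses expert 2, illustrating how sparsity keeps only the most relevant pathways active.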
2.3 Handling Heterogeneous and Missing Modalities
MoMA frameworks address practical clinical heterogeneity via:
- Learnable Missing Indicators: Per-modality embeddings replace encoder outputs for absent data, training the model to recognize missingness patterns (Wang et al., 29 Aug 2025).
- Sequential Skipping/Omission: Token/state-based architectures (Mamba, agent pipelines) naturally omit absent modalities during inference, processing only available sources (Lee et al., 13 Mar 2025).
- Flexible Prompt Engineering: Specialist LLM agents adapt prompt templates per patient for only the observed modalities (Gao et al., 7 Aug 2025).
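The learnable-missing-indicator strategy can be sketched as a simple substitution rule: when a modality is absent, a trainable per-modality vector replaces the encoder output, so downstream layers see an explicit, learned missingness signal. Names and dimensions below are illustrative, and the fixed vectors stand in for parameters that would be learned during training.

```python
# Sketch of learnable missing-modality indicators: substitute a trainable
# per-modality embedding for absent inputs instead of zero-filling.
DIM = 4
# One trainable vector per modality; fixed values here stand in for
# learned parameters.
missing_embeddings = {
    "image": [0.1] * DIM,
    "notes": [-0.1] * DIM,
}

def encode_or_indicate(modality, raw, encoder):
    if raw is None:  # modality absent for this patient
        return missing_embeddings[modality]
    return encoder(raw)

notes_encoder = lambda text: [float(len(text))] * DIM  # toy encoder
print(encode_or_indicate("notes", None, notes_encoder))      # indicator vector
print(encode_or_indicate("notes", "stable", notes_encoder))  # real encoding
```

Because the indicator vectors receive gradients, the model can learn that, e.g., a missing imaging study is itself weakly informative about patient acuity.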
3. Training Strategies and Loss Design
MoMA frameworks exploit supervised task-appropriate objectives, regularization for load balancing, and calibration enhancements:
- Task Loss: Binary cross-entropy for classification tasks (mortality, readmission, phase outcome), Cox partial-likelihood for survival risk (Wang et al., 29 Aug 2025, Aparício et al., 26 Dec 2025, Lee et al., 13 Mar 2025).
- Load-Balancing Regularizers: Prevent expert collapse with coefficient of variation or importance weighting, e.g., $\mathcal{L}_{\text{balance}} = \lambda \cdot \mathrm{CV}(\mathrm{Importance}(X))^2$ (Wang et al., 29 Aug 2025, Aparício et al., 26 Dec 2025).
- Consistency or Augmentation Losses: Inject random perturbations into input embeddings, encouraging robust representations; pairwise consistency objectives for multi-agent agreement (Zheng et al., 2024, Nguyen et al., 2022).
- Probability Calibration: Temperature scaling of logits to ensure output probabilities match empirical frequencies (Aparício et al., 26 Dec 2025).
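Temperature scaling, the calibration technique cited above, divides logits by a scalar $T$ fit on a validation set before applying the sigmoid. The grid-search fit below is a simple stand-in for the usual gradient-based (e.g., LBFGS) optimization; the data are synthetic.

```python
# Temperature scaling for binary probability calibration: fit a scalar T
# on held-out logits/labels by minimizing negative log-likelihood, then
# use sigmoid(z / T) at inference. Grid search stands in for LBFGS.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, T):
    """Mean binary negative log-likelihood at temperature T."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / T)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    grid = grid or [0.5 + 0.05 * i for i in range(50)]  # T in [0.5, 2.95]
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic validation set with overconfident errors -> fitted T exceeds 1,
# softening the output probabilities.
val_logits = [4.0, 3.5, -4.0, 3.8, -3.6, 0.2]
val_labels = [1, 0, 0, 1, 1, 0]
T = fit_temperature(val_logits, val_labels)
print(T, sigmoid(4.0 / T))
```

Because temperature scaling rescales all logits by one parameter, it changes confidence without changing the model's ranking, so discrimination metrics such as AUROC are unaffected.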
Optimization typically follows AdamW with low learning rates, sometimes leveraging mixed-precision or parameter-efficient fine-tuning (e.g., LoRA adapters for LLM predictors) (Gao et al., 7 Aug 2025).
4. Empirical Performance and Comparative Evaluation
MoMA systems report consistent advances against unimodal, naive multimodal, and classical fusion baselines, with thorough ablations:
- MoE-Health (Wang et al., 29 Aug 2025): On MIMIC-IV, achieves AUROC of 0.818 and F1 of 0.465 for in-hospital mortality prediction, surpassing prior fusion architectures (TriMF AUROC 0.806, F1 0.435).
- MoMA (LLM agents) (Gao et al., 7 Aug 2025): Macro-F1 of 0.834 (95% CI 0.806–0.861) for trauma severity, significant gains over LLaVA-Med and ClinicalBERT; improvements hold across sex/race subgroups.
- MMCTOP (Aparício et al., 26 Dec 2025): Phase II trial outcome prediction—precision 68.22% (vs. 61.20% HINT), AUC 58.74% (vs 52.7%).
- M4Survive (Lee et al., 13 Mar 2025): Survival C-index 81.27 ± 0.56, up by 5–6 points over previous state-of-the-art.
- LIFTED (Zheng et al., 2024): Phase III PR-AUC 88.3% (vs. 81.7% state-of-the-art baseline), F1 83.8% (vs 81.0%).
Ablation studies confirm that omitting modality-specialized experts, gating, or cross-modal fusion consistently reduces predictive performance or calibration quality. Notably, pretraining expert models for each modality combination is crucial: removing this specialization can degrade AUROC by up to 8.3 points (Wang et al., 29 Aug 2025).
5. Practical Implementation and Limitations
Implementation of MoMA frameworks involves:
- Instantiating or adapting specialist encoders (LLMs, CNNs, foundation models) for each clinical modality.
- Engineering prompts and pipelines for LLM-based agents, or constructing modular transformer/gating layers.
- Pretraining on samples stratified by observed modality combinations; fine-tuning classifier heads for target tasks.
- Scaling agent-based inference using batching and output caching to mitigate LLM-induced latency (Gao et al., 7 Aug 2025).
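The output-caching idea for agent-based inference can be sketched as memoization keyed on a hash of the specialist's prompt, so repeated summaries (common when templates and modalities recur) skip the expensive model call. `call_llm` below is a hypothetical stand-in for a real model endpoint.

```python
# Sketch of output caching to amortize LLM-agent latency: memoize each
# specialist call on a hash of its prompt. call_llm is a hypothetical
# stand-in that just counts invocations.
import hashlib

_cache = {}
calls = {"count": 0}

def call_llm(prompt: str) -> str:
    calls["count"] += 1  # pretend this is an expensive API call
    return f"summary({len(prompt)} chars)"

def cached_summarize(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

cached_summarize("CXR: bilateral infiltrates")
cached_summarize("CXR: bilateral infiltrates")  # cache hit, no second call
print(calls["count"])  # 1
```

In deployment the dictionary would typically be replaced by a persistent or size-bounded store (e.g., an LRU cache), and batching would group uncached prompts into a single model invocation.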
Limitations include:
- Scalability challenges as the number of modalities increases ($2^M$ expert combinations for $M$ modalities in naïve designs) (Wang et al., 29 Aug 2025).
- Potential fragility in LLM-generated summaries or prompt engineering, with risk of hallucination or suboptimal summarization (Gao et al., 7 Aug 2025).
- Nontrivial interpretability of dynamic gating decisions and fused latent spaces.
- Compute and memory overhead in large-scale agent or MoE deployments, though recent advances in parameter-efficient tuning and routing sparsity partially alleviate this (Lin et al., 2024).
6. Extensions, Open Directions, and Comparative Context
MoMA approaches enable rapid extension to new clinical modalities (e.g., genomics, time-series labs) by defining new agents, encoders, or prompt templates without end-to-end retraining (Gao et al., 7 Aug 2025, Aparício et al., 26 Dec 2025, Lee et al., 13 Mar 2025). Unified transformer-based alternatives (such as IRENE (Zhou et al., 2023)) offer holistic, attention-based fusion but may not provide the same modularity or adaptability to missingness.
Ongoing research seeks to:
- Learn modal-adaptive gating weights between agents instead of rigid concatenation (Gao et al., 7 Aug 2025).
- Integrate reinforcement learning for improved agent coordination.
- Employ retrieval-augmented or schema-grounded generation to better control LLM-based summaries (Aparício et al., 26 Dec 2025).
- Generalize dynamic adapter or MoE fusion strategies to streaming or temporal, real-time clinical scenarios (Lee et al., 13 Mar 2025, Nguyen et al., 2022).
- Optimize foundation model fusion for resource efficiency (e.g., quantized, sparse, or early-exit architectures) (Lin et al., 2024).
MoMA systems now constitute a central paradigm for state-of-the-art multimodal clinical prediction, combining modularity, data-adaptive routing, and scalable fusion to meet the demands of modern, heterogeneous healthcare data environments.