Multimodal Clinical Foundation Models
- Multimodal clinical foundation models (MCFMs) are large-scale neural networks that integrate heterogeneous clinical data—including imaging, EHR, genomics, and physiological signals—to generate robust, transferable representations.
- They employ pretraining strategies such as contrastive, masked, and generative objectives to learn modality-invariant features that facilitate rapid task-specific fine-tuning.
- MCFMs enable diverse clinical applications like diagnosis, prognosis, segmentation, and report generation by efficiently fusing multimodal data and demonstrating high performance on key benchmarks.
A multimodal clinical foundation model (MCFM) is a large-scale, pre-trained deep neural network designed to learn robust, transferable representations from heterogeneous clinical data spanning multiple modalities. MCFMs are trained on millions of samples—potentially across imaging, time-series, structured EHRs, free text, genomics, and physiological signals—using proxy, contrastive, and/or generative objectives that are agnostic to specific downstream tasks. After pretraining, these models can be rapidly adapted via fine-tuning or modular adaptation to a broad spectrum of clinical applications, including diagnosis, prognosis, prediction, segmentation, report generation, and multimodal reasoning. This article surveys the current landscape of multimodal clinical foundation models, covering a technical taxonomy, representative architectures, and emerging research themes.
1. Multimodal Data Modalities and Representations
Modern MCFMs are engineered to integrate the diverse and complex data modalities found in clinical environments:
- Medical images: 2D modalities (X-ray, dermoscopy, fundus, mammography, pathology tiles) and volumetric/3D data (CT, MRI, PET, ultrasound). Common preprocessing involves patchification (e.g., 16×16 patches for ViT) and linear embedding with positional encoding; a minimal embedding sketch follows this list. Volumes are split into cubes or treated as videos with temporal attention (Sun et al., 3 Dec 2024, Dai et al., 31 May 2025).
- EHR and clinical time-series: Structured features (demographics, labs, vitals) are normalized; irregular time series (e.g., ICU labs) are handled by aggregation, transformers, or GRUs; free-text notes are tokenized and embedded with language models (Yu et al., 20 Jul 2025, Chen et al., 2023).
- Genomics and omics: Sequences and polygenic risk scores (PRS) are embedded via MLPs, 1D transformers, or specialized encoders (e.g., BulkRNABert, Universal Cell Embedding), enabling patient-level genomic risk stratification (Amar et al., 24 Oct 2025, Song et al., 12 May 2025).
- Physiological waveforms: ECG/EEG data are segmented, embedded via 1D convolutions or transformers, and integrated as temporal tokens.
- Multimodal synthetic data: Emerging generative models (e.g., XGeM) enable any-to-any synthesis of clinically aligned multimodal data for data augmentation and system evaluation (Molino et al., 8 Jan 2025).
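The patch-and-embed step described above can be illustrated with a short sketch. The following is a minimal, generic ViT-style patch embedding for a single 2D image modality; the patch size, embedding dimension, and class name are illustrative assumptions rather than the configuration of any cited model.

```python
# Minimal sketch of ViT-style patchification and linear embedding for 2D
# medical images. Patch size, embedding dimension, and module names are
# illustrative, not taken from any specific cited model.
import torch
import torch.nn as nn

class PatchEmbed2D(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=1, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements non-overlapping 16x16 patch
        # extraction and linear projection in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable positional encoding, one vector per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, D)
        return x + self.pos_embed              # add positional encoding

tokens = PatchEmbed2D()(torch.randn(2, 1, 224, 224))   # shape: (2, 196, 768)
```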
2. Pretraining Objectives and Datasets
MCFMs employ large-scale, self-supervised pretraining to extract modality-invariant, semantically enriched representations:
- Contrastive learning: Cross-modal contrastive objectives align modalities by maximizing agreement between paired samples (e.g., image–text [CLIP/MedCLIP], clinical–imaging [MEDFORM]); a minimal loss sketch follows this list (Sun et al., 3 Dec 2024, Jung et al., 22 Jan 2025).
- Masked modeling: Masked image modeling (e.g., MAE, SimMIM), masked language modeling (MLM) for text, and cross-modal masking (e.g., jointly masking image patches and text tokens) (Sun et al., 3 Dec 2024, Farahani et al., 10 Mar 2025).
- Generative and hybrid proxies: Proxy tasks include image/patch/volume reconstruction (autoencoder, MAE, M3AE), segmentation proxies (mask prediction), and hybrid losses combining discriminative and restorative signals (e.g., DIRA, DAE) (Sun et al., 3 Dec 2024).
- Multitask pretraining: Large-scale integrated datasets (e.g., CLIMB, 4.51M samples, 22.9% multimodal cases) are used to pretrain universal encoders per modality, with task sampling balanced to avoid domination by abundant sources (Dai et al., 9 Mar 2025).
- Data sources: Pretraining leverages massive, unified, and/or domain-curated corpora, e.g., MedMD (16M multimodal pairs) (Wu et al., 2023), CLIMB (Dai et al., 9 Mar 2025), MIMIC-IV (Yu et al., 20 Jul 2025), and specialty datasets (MerMED-FM: 3.3M images, 7 modalities) (Zhou et al., 30 Jun 2025), PanDerm (2.1M dermatology images, 4 modalities) (Yan et al., 19 Oct 2024).
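The cross-modal contrastive objective referenced in the first bullet above can be written compactly as a symmetric InfoNCE loss. The sketch below assumes pre-computed, paired image and text embeddings and an illustrative temperature value; it mirrors CLIP-style training in spirit rather than any specific cited implementation.

```python
# Minimal sketch of a symmetric cross-modal contrastive (InfoNCE) objective
# in the spirit of CLIP-style pretraining. Encoder outputs, temperature, and
# batch construction are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of paired image/report samples."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Each image should match its paired text (row-wise) and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```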
3. Model Architectures and Fusion Mechanisms
Several architectural paradigms have been established for multimodal fusion and foundation modeling:
| Paradigm | Modality Encoders | Fusion and Output |
|---|---|---|
| Dual-encoder (CLIP) | Separate encoder per modality | Shared latent space via contrastive loss (Sun et al., 3 Dec 2024, Wu et al., 2023) |
| Multimodal Transformer | Shared/self-attention over all tokens | Deep interleaving via cross-modal attention (full/partial stacking) (Dai et al., 31 May 2025, Mohsin et al., 2 Oct 2025) |
| Q-Former/Adapters | Modality-specific adapters/LoRA | Gating, compression, parameter-efficient modality addition (Peng et al., 30 Jan 2025, Qazi et al., 14 Nov 2025) |
| Modular/Skill-based | Frozen foundation, plug-in modules | Task/modality-specific LoRA/MLP heads, resolution adapters (Qazi et al., 14 Nov 2025) |
| Memory-augmented SSL | Single ViT backbone (vision) | Memory bank for cross-modal negatives (Zhou et al., 30 Jun 2025) |
Fusion may occur at different levels:
- Early fusion: Concatenation of input channels/modalities (rare except for multi-contrast imaging).
- Joint/Intermediate fusion: Cross-attention layers allow inter-modality conditioning at intermediate network depths; a fusion sketch follows this list (Dai et al., 9 Mar 2025, Dai et al., 31 May 2025, Mohsin et al., 2 Oct 2025).
- Late fusion: Separate modality encoders; outputs fused via concatenation, weighted sum, gating networks, or ML classifiers (Yu et al., 20 Jul 2025, Song et al., 12 May 2025).
- Self-gating: Adaptive compression of multiple query streams for parameter-efficient continual learning (Peng et al., 30 Jan 2025).
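As a concrete illustration of joint/intermediate fusion, the minimal block below lets image tokens attend to EHR tokens via cross-attention. The dimensions, naming, and single-block structure are illustrative assumptions; production MCFMs typically interleave many such layers across the network depth.

```python
# Minimal sketch of joint/intermediate fusion: image tokens are conditioned
# on EHR tokens through a single cross-attention layer. Dimensions and names
# are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusionBlock(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                 nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, img_tokens, ehr_tokens):
        # Image tokens query the EHR token sequence (cross-attention).
        q, kv = self.norm_q(img_tokens), self.norm_kv(ehr_tokens)
        attn_out, _ = self.cross_attn(q, kv, kv)
        x = img_tokens + attn_out          # residual connection
        return x + self.mlp(x)             # position-wise feed-forward

fused = CrossModalFusionBlock()(torch.randn(2, 196, 768), torch.randn(2, 32, 768))
```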
4. Clinical Applications and Evaluation
MCFMs achieve strong results across a spectrum of clinically relevant tasks and benchmarks:
- Classification: Disease classification, multi-class and rare-condition prediction, risk stratification, and multi-task screening (e.g., MerMED-FM achieves AUROC 0.988 on OCT, 0.951 on ultrasound, 0.943 on CT, 0.894 on fundus, and 0.931 on skin imaging) (Zhou et al., 30 Jun 2025, Yan et al., 19 Oct 2024).
- Segmentation and localization: Organ, tumor, and lesion segmentation (e.g., MedSAM; MTS-UNET reaches Dice ≈84% for glioma; Citrus-V achieves Dice up to 92% for dermoscopy, +5–40 points over prior expert models) (Farahani et al., 10 Mar 2025, Wang et al., 23 Sep 2025, Dai et al., 31 May 2025).
- Prognosis and survival: Late fusion of FM-derived embeddings yields C-indices up to 0.795 on TCGA survival prediction, with consistent improvements in cancer and syndrome prognosis; see the metric sketch after this list for Dice and C-index definitions (Song et al., 12 May 2025, Peng et al., 30 Jan 2025).
- Multimodal reasoning: Visual question answering (VQA), medical report generation, and chain-of-thought clinical inference; models like Citrus-V and EVLF-FM provide grounded, stepwise decision outputs and rationales (Wang et al., 23 Sep 2025, Bai et al., 29 Sep 2025).
- Data synthesis: Generative models (e.g., XGeM) synthesize heterogeneous outputs conditioned on arbitrary modality subsets, enabling augmentation and anonymization (Molino et al., 8 Jan 2025).
- Treatment planning and robotics: Endoscopic and robotic video understanding and detection, dose planning, and intraoperative guidance via joint reasoning over video, EHR, and knowledge graphs (Sun et al., 3 Dec 2024).
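Two of the metrics quoted above, the Dice coefficient and the concordance index (C-index), can be computed directly from their standard definitions. The sketch below is a minimal reference implementation (including an O(n²) pairwise C-index), not the evaluation code of any cited study.

```python
# Minimal sketches of two evaluation metrics used above: the Dice coefficient
# for segmentation and a pairwise concordance index (C-index) for survival
# prediction. Both follow the standard textbook formulas.
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """pred, target: binary masks of identical shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

def concordance_index(times, risks, events):
    """Fraction of comparable pairs where the higher predicted risk fails earlier."""
    num, den = 0.0, 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable if subject i had an observed event
            # strictly before subject j's time.
            if events[i] and times[i] < times[j]:
                den += 1
                if risks[i] > risks[j]:
                    num += 1
                elif risks[i] == risks[j]:
                    num += 0.5
    return num / den if den > 0 else float("nan")

print(dice_score(np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])))  # ~0.667
```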
5. Adaptation, Modularization, and Continual Learning
To address evolving clinical needs and data distributions, recent MCFMs implement flexible, modular adaptation and lifelong learning strategies:
- Modular adaptation: Lightweight LoRA adapters, MLP heads, or plug-in modules extend a frozen foundation backbone, supporting new modalities (PET, genomics, EHR), tasks (prognosis, segmentation, detection), or clinical domains (MAFM³ modular adapters yield +5% Dice for PET+CT); a minimal adapter sketch follows this list (Qazi et al., 14 Nov 2025).
- Continual learning: Parameter-efficient addition of modalities without catastrophic forgetting, keeping Q-Former parameters fixed and learning only new adapter weights (CREMA: C-index gains of 0.042–0.081 from added modalities) (Peng et al., 30 Jan 2025).
- Data efficiency: Foundation models demonstrate strong label efficiency, achieving near state-of-the-art with ≤10% of fine-tuning labels (PanDerm, MerMED-FM) (Yan et al., 19 Oct 2024, Zhou et al., 30 Jun 2025).
- Robustness to missing modalities: Full-modality masking (MM-DINOv2) and modular dropout confer resilience to incomplete records, crucial for real-world deployment (Scholz et al., 8 Sep 2025).
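As a minimal illustration of the adapter-based strategies above, the sketch below wraps a frozen linear layer of a backbone with a LoRA-style low-rank update. The rank, scaling, and class name are generic assumptions and not the MAFM³ or CREMA implementation.

```python
# Minimal sketch of a LoRA-style low-rank adapter wrapping a frozen linear
# layer of a foundation backbone. Generic illustration only.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen backbone output plus a trainable low-rank update.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))
# Only ~12k adapter parameters train, versus ~590k frozen base parameters.
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```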
6. Technical Challenges and Open Directions
Critical challenges persist for deploying and further advancing multimodal clinical foundation models:
- Data heterogeneity and incompleteness: Multi-institutional data silos, missing modalities, and non-standardized formats require advanced domain adaptation and dynamic modality completion (Sun et al., 3 Dec 2024, Chen et al., 2023).
- Interpretability and bias: Visual grounding, chain-of-thought reasoning, and attribution methods (attention maps, SHAP values, pixel-level localization) are required for regulatory compliance, clinician trust, and bias detection (Bai et al., 29 Sep 2025, Yu et al., 20 Jul 2025).
- Scalability and sustainability: Training and inference costs remain prohibitive for large models; compression, efficient adapters, and hardware-efficient designs are active research areas (Sun et al., 3 Dec 2024, Qazi et al., 14 Nov 2025).
- Privacy, governance, and regulation: Differential privacy, federated learning, audit trails, and data provenance tools are being integrated to comply with evolving healthcare regulations (Mohsin et al., 2 Oct 2025).
- Generalization and transfer: Zero- and few-shot transfer capabilities are increasingly demonstrated (CLIMB, MerMED-FM), but systematic benchmarking for cross-clinic, cross-population generalization remains an open need (Dai et al., 9 Mar 2025, Zhou et al., 30 Jun 2025).
7. Future Directions
Emerging trends and research opportunities include:
- Unified, specialty-agnostic assistants: Models such as QoQ-Med provide unified reasoning across 1D–3D clinical data, incorporating specialty-specific adaptation while retaining a global backbone (Dai et al., 31 May 2025).
- Multimodal generative modeling: Models like XGeM promise flexible, consistent, privacy-preserving synthesis for rare case augmentation, counterfactual analysis, and anonymization (Molino et al., 8 Jan 2025).
- Translational impact and integration: Interfacing MCFMs with electronic health records, radiology workstations, and point-of-care systems—ensuring prospective validation, regulatory auditing, and human-in-the-loop usage—remains a high-priority domain for translational research (Sun et al., 3 Dec 2024, Yu et al., 20 Jul 2025).
- Extension to new modalities: Integration of omics, wearable sensor data, robotics, audio, and dynamic monitoring will expand the clinical reach of these models (Mohsin et al., 2 Oct 2025, Amar et al., 24 Oct 2025).
- Adaptive, task-conditioned fusion: Development of dynamic fusion architectures that adjust to task complexity, available modalities, and clinical context is ongoing (Dai et al., 9 Mar 2025).
MCFMs are expected to catalyze the next wave of precision medicine and real-time, reliable, and interpretable AI decision support—assuming continued progress in technical, regulatory, data, and clinical integration domains.