Multimodal EHR Integration

Updated 21 November 2025
  • Multimodal EHR integration is the process of unifying heterogeneous clinical data—including structured entries, text narratives, images, time-series, and genomic features—to enhance precision medicine.
  • It employs tailored preprocessing pipelines and dedicated neural subnetworks (e.g., CNNs, Transformers, MoE) to encode and optimally fuse different modalities.
  • Empirical studies indicate that integrated multimodal models consistently outperform unimodal approaches, yielding higher accuracy and robustness in clinical decision support.

Multimodal Electronic Health Record (EHR) integration refers to the theoretical and practical framework for fusing heterogeneous clinical data types—spanning structured measurements, longitudinal time-series, unstructured clinical narratives, laboratory results, medical images, biologic signals, and genomic features—into unified representations suitable for downstream predictive modeling, clinical decision support, and healthcare analytics. Its central motivation derives from the recognition that single-modality models systematically underutilize the informational richness present in EHRs and fail to capture critical cross-modal dependencies, redundancies, and synergies necessary for precision medicine applications.

1. Modalities, Preprocessing, and Representation Learning

Multimodal EHR integration operates at the intersection of heterogeneous data streams. Canonical modalities include structured tabular data (demographics, vitals, labs, medications, diagnoses, procedures), unstructured text (clinical notes, discharge summaries, radiology reports), static features (age, sex, insurance, comorbidities), time-series (irregularly sampled measurements, visit intervals), and increasingly, imaging data (CT, MRI, X-ray), wearable device streams (physical activity, heart rate, sleep), and genomics (polygenic risk scores, variant carriers).
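To make this modality inventory concrete, the following is a minimal sketch of a per-patient multimodal record container; every field name (demographics, labs, notes, image, wearable, prs) and every shape is an illustrative assumption rather than a schema from any cited work.

```python
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class PatientRecord:
    """Illustrative (hypothetical) container for one patient's multimodal EHR data.
    Any modality may be absent (None); fusion code must handle this explicitly."""
    demographics: dict                          # static features: age, sex, insurance, ...
    labs: Optional[np.ndarray] = None           # irregular time-series, shape (T, num_lab_items)
    lab_times: Optional[np.ndarray] = None      # measurement timestamps in hours, shape (T,)
    notes: list = field(default_factory=list)   # clinical narratives as raw strings
    image: Optional[np.ndarray] = None          # e.g., a chest X-ray, shape (H, W)
    wearable: Optional[np.ndarray] = None       # daily summaries, shape (days, num_channels)
    prs: Optional[np.ndarray] = None            # polygenic risk scores, shape (num_scores,)

# Example: tabular data and notes are present, imaging and genomics are missing.
p = PatientRecord(
    demographics={"age": 67, "sex": "F"},
    labs=np.random.randn(48, 20),
    lab_times=np.arange(48, dtype=float),
    notes=["Admitted with shortness of breath ..."],
)
```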

Preprocessing these modalities demands tailored pipelines: for tabular fields, variable selection, missingness handling (e.g., MICE imputation, forward/backward fill, or learned token masking), normalization (z-score, min–max), and one-hot/categorical embedding are standard. Unstructured text undergoes tokenization (WordPiece, subword, domain-specific), section-wise concatenation, and possibly conversion of lab events to templated text (e.g., “ITEMID:id: value unit”) (Dao et al., 14 Nov 2025). Imaging streams are resized, intensity-normalized, and embedded via CNN backbones (Jiang et al., 2021). For wearables, clinical-grade aggregation protocols filter and summarize raw signals into multi-day time-indexed feature vectors (Wang et al., 26 Sep 2025). Genomic scores undergo standard QC (GWAS variant pruning, quantile normalization) before projection into neural-embedding spaces (Amar et al., 24 Oct 2025).
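As a concrete illustration of two of these steps, the sketch below applies z-score normalization with an explicit missingness mask to a lab matrix and renders a lab event as templated text following the “ITEMID:id: value unit” pattern quoted above; the helper names are assumptions of this sketch, not functions from the cited pipelines.

```python
import numpy as np

def zscore_with_mask(x: np.ndarray):
    """Z-score normalize a (T, F) lab matrix and return values plus a missingness mask.
    NaNs mark missing measurements; they are zero-filled after normalization so the
    mask (not an imputed value) carries the missingness signal to the model."""
    mask = ~np.isnan(x)                          # True where a value was observed
    mean = np.nanmean(x, axis=0)
    std = np.nanstd(x, axis=0) + 1e-8
    z = (x - mean) / std
    return np.where(mask, z, 0.0), mask.astype(np.float32)

def lab_event_to_text(itemid: int, value: float, unit: str) -> str:
    """Render a structured lab event as templated text for a text encoder,
    following the 'ITEMID:id: value unit' pattern described above."""
    return f"ITEMID:{itemid}: {value} {unit}"

labs = np.array([[5.2, np.nan], [4.9, 138.0], [np.nan, 141.0]])
values, mask = zscore_with_mask(labs)
print(lab_event_to_text(50971, 4.9, "mEq/L"))    # -> "ITEMID:50971: 4.9 mEq/L"
```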

Each modality is encoded with a dedicated neural subnetwork: feedforward layers (static or tabular), recurrent or Transformer-based encoders (time-series), pre-trained LLMs (clinical text), CNNs (images), and MLPs or specialized transformers for genomics and wearable time-series. Recent work maps all these streams into a shared embedding space of dimension d, either by direct linear projection or via more sophisticated (bi-)directional attention layers and cross-modal interaction modules (Dao et al., 14 Nov 2025, Wang et al., 29 Aug 2025, Cui et al., 20 Jan 2024).
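A minimal PyTorch sketch of this shared-embedding pattern: each modality has its own encoder, and every stream is mapped into a common dimension d. All module names, layer sizes, and the choice of a GRU for time-series are illustrative assumptions, not the architecture of any cited system.

```python
import torch
import torch.nn as nn

D = 128  # shared embedding dimension d

class TabularEncoder(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(), nn.Linear(256, D))
    def forward(self, x):                        # x: (B, n_features)
        return self.net(x)                       # -> (B, D)

class TimeSeriesEncoder(nn.Module):
    def __init__(self, n_channels: int):
        super().__init__()
        self.gru = nn.GRU(n_channels, D, batch_first=True)
    def forward(self, x):                        # x: (B, T, n_channels)
        _, h = self.gru(x)
        return h[-1]                             # last hidden state -> (B, D)

class TextProjector(nn.Module):
    """Projects pre-computed LLM note embeddings (e.g., 768-dim) into the shared space."""
    def __init__(self, llm_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(llm_dim, D)
    def forward(self, x):                        # x: (B, llm_dim)
        return self.proj(x)

tab, ts, txt = TabularEncoder(40), TimeSeriesEncoder(20), TextProjector()
z_tab = tab(torch.randn(4, 40))
z_ts  = ts(torch.randn(4, 48, 20))
z_txt = txt(torch.randn(4, 768))
print(z_tab.shape, z_ts.shape, z_txt.shape)      # each torch.Size([4, 128])
```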

2. Fusion Strategies: Taxonomy and Model Designs

The architecture-level integration of modalities follows a taxonomy based on the stage at which fusion occurs (Mohsen et al., 2022); a minimal sketch contrasting the three stages appears after the list:

  • Early fusion (feature-level): Modalities are encoded into feature vectors which are concatenated and fed to an MLP or other classifier (Bagheri et al., 2020, Mohsen et al., 2022). This approach is easy to implement and effective for tabular+text or tabular+image, but often fails to learn cross-modal dependencies when modalities are semantically asynchronous or present at different temporal resolutions.
  • Intermediate fusion (joint-representation): Each modality is first projected through a dedicated network; a subsequent “fusion” module (attention pooling, bilinear pooling, cross-modal transformers, or self-attention) synthesizes these representations into a joint hidden vector (Yang et al., 2021, Dao et al., 14 Nov 2025, Phan et al., 17 Jul 2024, Cui et al., 19 Feb 2024). Cross-attention is favored for capturing residual dependencies between commensurate modalities (e.g., code-to-note, lab-to-text, tabular-to-image).
  • Late fusion (decision-level): Separate models produce logit or probability vectors for each modality, which are then combined via voting, learnable gates, or ensemble meta-models. This is prevalent in situations with modality-missingness and when independent training is easier or clinically motivated (Wang et al., 29 Aug 2025, Wang et al., 26 Sep 2025).
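Assuming two per-modality embeddings of dimension d (as produced in Section 1), the sketch below contrasts the three fusion stages in PyTorch; it is a schematic illustration under those assumptions, not a reproduction of any cited model.

```python
import torch
import torch.nn as nn

D = 128
z_a, z_b = torch.randn(4, D), torch.randn(4, D)   # two modality embeddings, batch of 4

# Early fusion: concatenate features, classify with a single MLP head.
early_head = nn.Sequential(nn.Linear(2 * D, 64), nn.ReLU(), nn.Linear(64, 1))
logit_early = early_head(torch.cat([z_a, z_b], dim=-1))

# Intermediate fusion: cross-attention lets modality A query modality B
# before a joint representation is classified.
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
q, kv = z_a.unsqueeze(1), z_b.unsqueeze(1)        # (B, 1, D) sequences of length 1
fused, _ = cross_attn(query=q, key=kv, value=kv)
logit_mid = nn.Linear(D, 1)(fused.squeeze(1))

# Late fusion: independent per-modality classifiers, combined at the decision level.
head_a, head_b = nn.Linear(D, 1), nn.Linear(D, 1)
w = torch.sigmoid(nn.Parameter(torch.zeros(1)))   # learnable mixing gate (illustrative)
logit_late = w * head_a(z_a) + (1 - w) * head_b(z_b)

print(logit_early.shape, logit_mid.shape, logit_late.shape)   # each torch.Size([4, 1])
```

In practice, intermediate fusion attends over full token sequences rather than length-1 sequences, and the late-fusion gate is trained jointly with (or independently of) the per-modality heads depending on the missingness pattern.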

State-of-the-art designs now favor dynamic, modality-adaptive fusion mechanisms. Mixture-of-Experts (MoE) frameworks use a learned gating function to select and weight predictions from expert subnetworks, each specialized to a pattern of modality presence (Wang et al., 29 Aug 2025). Disentangled transformation modules explicitly separate shared and specific subspaces (e.g., via mutual information penalties) (Phan et al., 17 Jul 2024). Neural architecture search (NAS) approaches such as AutoFM automate the selection of optimal fusion motifs, confirming the empirical superiority of cross-modal attention followed by attentive aggregation (Cui et al., 20 Jan 2024).
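A hedged sketch of the top-k mixture-of-experts routing just described, in the spirit of (but not reproducing) the MoE frameworks cited above: a gating network scores experts on the fused representation R, the top-k scores are renormalized, and the selected experts' predictions are mixed.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Illustrative top-k mixture-of-experts over a fused representation R."""
    def __init__(self, d: int, n_experts: int = 4, k: int = 2, n_out: int = 1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_out))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d, n_experts)
        self.k = k

    def forward(self, r):                              # r: (B, d)
        scores = self.gate(r)                          # (B, n_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)     # renormalize over selected experts
        expert_out = torch.stack([e(r) for e in self.experts], dim=1)   # (B, E, n_out)
        selected = torch.gather(
            expert_out, 1, topk_idx.unsqueeze(-1).expand(-1, -1, expert_out.size(-1))
        )                                              # (B, k, n_out)
        return (weights.unsqueeze(-1) * selected).sum(dim=1)            # (B, n_out)

moe = TopKMoE(d=128)
y_hat = moe(torch.randn(4, 128))                       # (4, 1) mixed prediction
```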

3. Major Model Classes and Mathematical Formalisms

Multimodal EHR integration models instantiate a variety of neural architectures and loss functions, unified by their objective to learn feature representations that optimize predictive performance and sometimes interpretability or uncertainty calibration.

Key model archetypes:

  • MAG-style attention gating: Modality-adaptive gating selects a “main” stream and modulates it by auxiliary representations via a learned displacement, scaled according to inter-modal norm ratios and a trainable scalar (Yang et al., 2021, Koo, 22 Jan 2024); a minimal sketch of this gate follows the list.
  • Multimodal transformers and cross-attentional networks: Deep fusion via self- and cross-attention is now dominant, allowing information from, for instance, LLM-encoded clinical text to condition and contextually gate time-series or lab inputs (Dao et al., 14 Nov 2025, Phan et al., 17 Jul 2024, Olaimat et al., 31 Jan 2025, Liao et al., 26 Jun 2024).
  • Domain-informed modules: Medical code-centric models use hierarchical coding ontologies and contrastive objectives to align structured codes, demographics, and free text, enforcing both semantic consistency and hierarchical regularization (Koo, 22 Jan 2024).
  • Knowledge-augmented and RAG frameworks: Integration with external biomedical knowledge—retrieval of similar-patient cases, or grounding of extracted patient entities in professional knowledge graphs (e.g., PubMed, UMLS, PrimeKG)—can regularize predictions, reduce hallucinations, and improve interpretability (Datta et al., 4 Aug 2025, Zhu et al., 10 Feb 2024, Zhu et al., 27 May 2024).
  • Uncertainty-aware fusion using belief functions: Explicit evidential neural fusion quantifies and propagates modality-specific ignorance and inter-modal evidence conflicts and combines them using Dempster–Shafer theory, yielding calibrated and robust ICU risk stratification (Ruan et al., 8 Jan 2025).
  • Image-EHR-specific integration: Approaches interleave mid-level EHR-guided spatial attention into ResNet CNNs, and deploy multi-head Gated Multimodal Units (GMU) for joint fusion across learned subspaces, with demonstrable diagnostic accuracy improvements (Jiang et al., 2021, Mohsen et al., 2022).
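The sketch below gives a minimal reading of the MAG-style gate referenced in the first bullet above: a gated displacement H is computed from an auxiliary embedding and added to the main embedding, scaled by the inter-modal norm ratio and a trainable scalar. It follows the textual description rather than any specific published implementation, and the sigmoid gating layer is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class MAGGate(nn.Module):
    """Modality-adaptive gating: M = E_main + alpha * H, with alpha capped by the
    norm ratio times a trainable scalar beta (an illustrative reading of the formula)."""
    def __init__(self, d: int):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)          # gating computed from main + auxiliary
        self.displace = nn.Linear(d, d)          # displacement from the auxiliary stream
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, e_main, e_aux):            # both (B, d)
        g = torch.sigmoid(self.gate(torch.cat([e_main, e_aux], dim=-1)))
        h = g * self.displace(e_aux)             # gated displacement H
        norm_ratio = e_main.norm(dim=-1, keepdim=True) / (h.norm(dim=-1, keepdim=True) + 1e-8)
        alpha = torch.clamp(norm_ratio * self.beta, max=1.0)
        return e_main + alpha * h                # M = E_main + alpha * H

mag = MAGGate(d=128)
m = mag(torch.randn(4, 128), torch.randn(4, 128))   # (4, 128) modulated main embedding
```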

Loss functions are selected according to task: binary or multi-label cross-entropy for risk or diagnosis classification, multi-label hinge for chronic disease prediction, mean-squared error for next-step regression objectives, contrastive losses for patient-level alignment, and composite objectives with uncertainty regularization (Dao et al., 14 Nov 2025, Ruan et al., 8 Jan 2025, Phan et al., 17 Jul 2024, Koo, 22 Jan 2024).
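To make the task-dependent objectives concrete, the sketch below combines a multi-label binary cross-entropy term with an InfoNCE-style contrastive term aligning text and structured embeddings of the same patient; the 0.1 weighting and the pairing scheme are illustrative assumptions, not the composite objective of any cited work.

```python
import torch
import torch.nn.functional as F

def multilabel_bce(logits, targets):
    """Multi-label diagnosis loss: independent sigmoid cross-entropy per label."""
    return F.binary_cross_entropy_with_logits(logits, targets)

def contrastive_alignment(z_text, z_struct, temperature: float = 0.07):
    """InfoNCE-style loss aligning each patient's text and structured embeddings."""
    z_text = F.normalize(z_text, dim=-1)
    z_struct = F.normalize(z_struct, dim=-1)
    logits = z_text @ z_struct.t() / temperature     # (B, B) similarity matrix
    labels = torch.arange(z_text.size(0))            # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

B, L, D = 8, 10, 128
loss = multilabel_bce(torch.randn(B, L), torch.randint(0, 2, (B, L)).float()) \
       + 0.1 * contrastive_alignment(torch.randn(B, D), torch.randn(B, D))
```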

4. Empirical Outcomes, Ablations, and Benchmark Comparisons

A pervasive empirical finding is that multimodal fusion consistently outperforms unimodal baselines by absolute margins of 2–12 points in accuracy, recall@k, AUROC, or F1, depending on the modality combination and the fusion strategy (Yang et al., 2021, Mohsen et al., 2022, Dao et al., 14 Nov 2025, Huang et al., 14 Aug 2025, Phan et al., 17 Jul 2024, Zhu et al., 27 May 2024, Wang et al., 29 Aug 2025). Notable results include:

  • CURENet achieves F1_macro = 0.855 (MIMIC-III) in 10-disease prediction, with “w/o TEXT” ablation dropping F1_macro from 0.855→0.663, verifying that narrative modalities are indispensable for peak clinical performance (Dao et al., 14 Nov 2025).
  • EMERGE and REALM attain AUROC 86.22–89.5 and significant robustness to data sparsity, demonstrating that RAG-enhanced integration of temporal, textual, and KG-derived features yields additive gains over baseline multimodal and knowledge-only approaches (Zhu et al., 10 Feb 2024, Zhu et al., 27 May 2024).
  • MoE-Health shows improved mortality AUROC 0.818 (full trimodal integration) versus 0.770 (EHR only), with graceful degradation when one or more modalities are missing (Wang et al., 29 Aug 2025).
  • MINGLE demonstrates a relative F1 gain of 11.8% over pure hypergraph approaches on MIMIC-III phenotyping—a result attributed to LLM-derived semantic infusions at both node and hyperedge levels (Cui et al., 19 Feb 2024).
  • MEDFuse achieves >0.90 macro-F1 on multi-label diagnosis, outperforming both text-only and lab-only backbones; ablation underscores the necessity of each fusion component (Phan et al., 17 Jul 2024).

Crucially, ablation studies uniformly reveal that text modalities (clinical notes, narratives) contribute the largest marginal improvements in multisource scenarios, while failure to properly fuse (e.g., by naive concatenation) leaves cross-modal signal untapped (Yang et al., 2021, Phan et al., 17 Jul 2024).

5. Challenges, Limitations, and Open Problems

Several substantial challenges delimit the state-of-the-art in multimodal EHR integration:

  • Modality-missingness: Real-world clinical data incompleteness is the norm. Approaches like top-k expert MoE (Wang et al., 29 Aug 2025) and explicit missing-indicator embeddings outperform imputation-centric baselines, but general missingness-robustness in high-dimensional regimes remains poorly characterized; a minimal sketch of the missing-indicator pattern follows this list.
  • Fusion complexity and overfitting: Increasing fusion sophistication demands larger datasets to avoid overfitting, especially for deep joint-encoding and high-dimensional cross-attentional models (Mohsen et al., 2022).
  • Computational cost: Models that cross-attend O(M²) modality pairs, operate on long sequences, or invoke large LLMs or KGs incur substantial runtime and memory burdens (Datta et al., 4 Aug 2025, Zhu et al., 27 May 2024).
  • Explainability and clinical trust: While attention heatmaps and natural language rationales improve interpretability (Datta et al., 4 Aug 2025, Zhu et al., 27 May 2024), clinically actionable explanations for fusion models, especially when knowledge-driven reasoning chains are incorporated, are not standardized.
  • Modality selection and scalability: Few systems adaptively select which modalities to fuse on a per-patient or per-task basis. NAS frameworks and MoE routing offer initial solutions but entail additional design and computation complexity (Wang et al., 29 Aug 2025, Cui et al., 20 Jan 2024).
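As a concrete illustration of the missing-indicator idea referenced in the first bullet above, the sketch below substitutes a learned "missing" token for an absent modality embedding and appends a 0/1 presence flag per modality; this is a generic pattern under our own naming, not the mechanism of any specific cited system.

```python
import torch
import torch.nn as nn

class MissingAwareCombiner(nn.Module):
    """Concatenates modality embeddings, substituting a learned token for absent
    modalities and appending a 0/1 presence indicator per modality."""
    def __init__(self, d: int, n_modalities: int):
        super().__init__()
        self.missing_tokens = nn.Parameter(torch.zeros(n_modalities, d))

    def forward(self, embeddings):               # list of (B, d) tensors or None
        # Infer batch size from any present modality (at least one must be present).
        batch = next(e.size(0) for e in embeddings if e is not None)
        parts, flags = [], []
        for i, e in enumerate(embeddings):
            if e is None:
                e = self.missing_tokens[i].expand(batch, -1)   # learned "missing" token
                flags.append(torch.zeros(batch, 1))
            else:
                flags.append(torch.ones(batch, 1))
            parts.append(e)
        return torch.cat(parts + flags, dim=-1)  # (B, n_modalities*d + n_modalities)

comb = MissingAwareCombiner(d=128, n_modalities=3)
out = comb([torch.randn(4, 128), None, torch.randn(4, 128)])   # second modality missing
print(out.shape)                                               # torch.Size([4, 387])
```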

A plausible implication is that further development should focus on adaptive, data- and context-driven fusion mechanisms, explainability modules, and modular pipelines capable of plug-and-play integration of new data streams.

6. Extensions: Emerging Modalities and Future Directions

Recent expansions of fusion frameworks underscore key emerging directions:

  • Integration with genomics: Foundation models can now project polygenic risk scores into the transformer attention space, allowing joint modeling of static genetic propensity and longitudinal EHR data. This has produced significant gains in early disease prediction (e.g., +0.025 AUROC for T2D 10-year risk) (Amar et al., 24 Oct 2025).
  • Wearable data streams: Coupling 180-day consumer wearable summaries with structured OMOP EHR features achieves up to a +12.2% improvement in AUROC for diabetes, laying the groundwork for holistic, person-centric trajectory modeling (Wang et al., 26 Sep 2025).
  • Automated fusion discovery: NAS methods such as AutoFM automatically determine both stream-specific encoders and fusion graph topology, outperforming hand-designed baselines and exposing generalizable architectural motifs (e.g., early cross-attention, attentive late aggregation) (Cui et al., 20 Jan 2024).
  • Uncertainty-quantified predictions: Belief function theory-based fusion provides calibrated, reliable ICU predictions, outperforming typical deterministic or softmax-calibrated models in Brier score and negative log-likelihood (Ruan et al., 8 Jan 2025); a minimal sketch of the combination rule follows this list.
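The belief-function entry above rests on Dempster–Shafer combination; below is a minimal, library-free sketch of Dempster's rule and the pignistic transform p(ω) = Σ_{A∋ω} m(A)/|A| over a two-outcome frame. The mass values are invented for illustration and do not come from the cited work.

```python
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Dempster's rule of combination for two mass functions whose focal elements
    are frozensets over a shared frame of discernment."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb                  # mass assigned to conflicting evidence
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

def pignistic(m: dict) -> dict:
    """Pignistic transform: p(w) = sum over focal sets A containing w of m(A)/|A|."""
    p = {}
    for a, w in m.items():
        for omega in a:
            p[omega] = p.get(omega, 0.0) + w / len(a)
    return p

frame = frozenset({"survive", "die"})
m_struct = {frozenset({"die"}): 0.6, frame: 0.4}                              # structured-data evidence
m_notes  = {frozenset({"die"}): 0.5, frozenset({"survive"}): 0.2, frame: 0.3}  # note-derived evidence
m_final = dempster_combine(m_struct, m_notes)
print(pignistic(m_final))    # calibrated probability per outcome
```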

Open research questions include: generalizing to multi-center and multi-national data, federating multimodal learning for privacy, incorporating richer environmental, economic, and social determinants, and developing real-time and streaming fusion mechanisms in clinical settings.


Summary table of integration strategies and their prototypical mathematical formulas:

| Fusion Strategy | Core Operation | Representative Work |
| --- | --- | --- |
| Early (feature-level) | $h = [\,f_E(x_{EHR}) \,\Vert\, f_I(x_{IMG})\,]$ | Mohsen et al., 2022; Bagheri et al., 2020 |
| Cross-modal attention (intermediate) | $h^{fusion} = \mathrm{Attn}(Q = Z_A,\, K = Z_B,\, V = Z_B)$ | Yang et al., 2021; Dao et al., 14 Nov 2025; Phan et al., 17 Jul 2024 |
| MAG-style gating | $M = E_{main} + \alpha H$ | Yang et al., 2021; Koo, 22 Jan 2024 |
| Knowledge-augmented fusion | $H^{concat} = [\,X_{struct} \,\Vert\, H^{KG} \,\Vert\, H^{reason}\,]$ | Datta et al., 4 Aug 2025; Zhu et al., 27 May 2024 |
| MoE (top-k gating; missing-robust) | $\hat{y} = \sum_{j \in \mathcal{T}_k} \bar{g}_j\, f_j(R)$ | Wang et al., 29 Aug 2025 |
| Belief-function fusion | $m_{final} = m^{(s)} \oplus m^{(n)}$, $\;p(\omega) = \sum_{A \ni \omega} m(A)/\lvert A\rvert$ | Ruan et al., 8 Jan 2025 |
| Hypergraph + LLM (codes + text graph) | Node: $[\mathbf{s}_v; \mathbf{c}_v]$; Edge: $\mathrm{MLP}_2([\ldots;\, \mathbf{H}_e])$ | Cui et al., 19 Feb 2024 |

By systematically extracting, encoding, and adaptively fusing heterogeneous clinical data, multimodal EHR integration enhances risk prediction, chronic disease management, and the clinical interpretability of complex patient states. Modular, explainable, and missingness-robust integration frameworks define the current frontier, with promising evidence for further improvements as more sophisticated, scalable, and auto-discovered fusion architectures, knowledge augmentation, and novel data modalities are incorporated.
