Uncertainty-Aware Foundation Models for Clinical Data

Published 5 Apr 2026 in cs.LG | (2604.04175v1)

Abstract: Healthcare foundation models have largely followed paradigms from natural language processing and computer vision, emphasizing large scale pretraining and deterministic representations over heterogeneous clinical data. However, clinical observations are inherently incomplete, reflecting sparse, irregular, and modality dependent measurements of an underlying physiologic state. In this work, we propose a framework for uncertainty aware foundation modeling that represents each patient not as a point embedding, but as a distribution over plausible latent states. By learning set valued representations and enforcing consistency across partial views of the same patient, the model captures what is invariantly inferable while explicitly encoding epistemic uncertainty. We integrate this formulation with multimodal encoders and scalable self supervised objectives, combining reconstruction, contrastive alignment, and distributional regularization. Across diverse clinical tasks, our approach improves predictive performance, robustness under missing data, and uncertainty calibration relative to strong baselines. These results suggest that modeling what is not observed rather than only what is constitutes a critical inductive bias for healthcare foundation models.


Authors (3)

Summary

The paper presents a novel framework that models patients as probability distributions to explicitly capture uncertainty from incomplete clinical data.
It employs multivariate Gaussian encoders with partial-view consistency and modality-specific aggregation to robustly fuse multimodal inputs.
The approach achieves superior predictive performance and calibration on clinical tasks, as demonstrated through experiments on the MIMIC dataset.

Uncertainty-Aware Foundation Models for Clinical Data: Distributional Representations for Incomplete, Multimodal Observations

Motivation and Conceptual Shift

This work challenges the capacity-centric paradigm of recent healthcare foundation models by foregrounding structural incompleteness as a primary obstacle to robust clinical AI. In current approaches, patient representations are typically deterministic (single point embeddings), ignoring inherent uncertainties induced by partial, irregular, and multimodal clinical data. The authors propose that, in clinical settings, the relevant inductive bias should be the explicit modeling of uncertainty—representing each patient as a distribution over plausible latent physiologic states, rather than a single vector fitted naively to sparse and heterogeneous evidence.

Methodological Framework

The model architecture builds latent set-valued representations. Given a partially observed multiview clinical input $x$, the encoder outputs a probability distribution $q_\theta(z | x)$ over an abstract latent state $z$. For tractability, the implementation uses multivariate Gaussians parameterized by modality-conditioned mean $\mu_\theta(x)$ and covariance $\Sigma_\theta(x)$, but this framework generalizes to more complex families.
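As a rough illustration of the Gaussian parameterization described above, the following minimal sketch maps a fused feature vector to a mean and a log-variance and draws latent samples by reparameterization. It assumes a diagonal covariance and plain linear heads. The names `gaussian_head`, `sample_latent`, and all weights are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_head(h, W_mu, b_mu, W_logvar, b_logvar):
    """Map a fused feature vector h to the parameters of a diagonal Gaussian.

    Returns the mean and the log-variance (log of the diagonal of Sigma).
    Assumption: diagonal covariance; the source only says "multivariate Gaussian".
    """
    mu = h @ W_mu + b_mu
    logvar = h @ W_logvar + b_logvar
    return mu, logvar

def sample_latent(mu, logvar, n_samples, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
h = rng.standard_normal(8)                      # hypothetical fused features
W_mu, W_logvar = rng.standard_normal((2, 8, 4)) * 0.1  # hypothetical weights
b_mu, b_logvar = np.zeros(4), np.zeros(4)

mu, logvar = gaussian_head(h, W_mu, b_mu, W_logvar, b_logvar)
z = sample_latent(mu, logvar, n_samples=16, rng=rng)
print(z.shape)  # (16, 4)
```

Predicting the log-variance rather than the variance keeps the covariance positive without an explicit constraint, which is a common design choice for this kind of head.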

Key methodological innovations include:

Partial view consistency: Training enforces that different incomplete observations of the same patient (e.g., distinct subsets of modalities, timepoints, or masked content) yield compatible posteriors in latent space. A symmetrized KL or Wasserstein divergence regularizes the distance between $q_\theta(z|x^{(1)})$ and $q_\theta(z|x^{(2)})$, promoting invariance to missingness patterns.
Multimodal encoders and aggregation: Each modality is encoded via a domain-appropriate module (transformers for sequential data, ViT for imaging, LLMs for text), with a permutation-invariant aggregator ($\mathcal{A}$) fusing available representations while handling arbitrary modality subsets and their availability masks at both train and test time.
Self-supervised learning objective: The pretraining loss combines reconstruction (masked token/patch/event prediction conditioned on $z$), partial-view consistency regularization, and latent regularization (e.g., KL-to-prior or shrinkage constraints for $\Sigma_\theta(x)$). An additional hybrid contrastive geometry term improves patient-level separability while preserving distributional calibration.
Uncertainty propagation and calibration: Downstream tasks use Monte Carlo marginalization over $q_\theta(z | x)$, with uncertainty measures (predictive entropy, posterior variance) available for abstention, risk assessment, or selective prediction.
Training with stochastic view sampling: To reflect natural sparsity and heterogeneity, training repeatedly samples random partial views from full records, forcing representations to remain stable across observation patterns and robust to real-world missingness regimes.
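The interaction between view sampling, permutation-invariant fusion, and the consistency penalty can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: posteriors are diagonal Gaussians, fusion is a masked mean, and the symmetrized KL is taken as the sum of both directions (some definitions halve it). The names `sample_view_mask`, `masked_mean_pool`, and `symmetric_kl_diag` are hypothetical.

```python
import numpy as np

def symmetric_kl_diag(mu1, var1, mu2, var2):
    """Symmetrized KL between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    Uses KL(p||q) = 0.5 * sum(log(var_q/var_p) + (var_p + (mu_p-mu_q)^2)/var_q - 1)
    and returns KL(p||q) + KL(q||p).
    """
    kl_12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl_21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1.0)
    return kl_12 + kl_21

def sample_view_mask(n_modalities, keep_prob, rng):
    """Random availability mask that always keeps at least one modality."""
    mask = rng.random(n_modalities) < keep_prob
    if not mask.any():
        mask[rng.integers(n_modalities)] = True
    return mask

def masked_mean_pool(embeddings, mask):
    """Permutation-invariant fusion: average only the available modality rows."""
    return embeddings[mask].mean(axis=0)

rng = np.random.default_rng(0)
emb = rng.standard_normal((3, 4))        # 3 hypothetical modality embeddings
mask_a = sample_view_mask(3, 0.5, rng)   # first partial view of the patient
mask_b = sample_view_mask(3, 0.5, rng)   # second partial view of the patient
mu_a = masked_mean_pool(emb, mask_a)
mu_b = masked_mean_pool(emb, mask_b)
var = np.ones(4)                         # fixed unit variances, for illustration only
penalty = symmetric_kl_diag(mu_a, var, mu_b, var)
print(penalty >= 0.0)
```

In a real training loop the variances would come from the encoder rather than being fixed, and the penalty would be added to the reconstruction and latent regularization terms with some weighting that the summary does not specify.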

Empirical Evaluation

Predictive Performance

Across binary and multiclass clinical prediction tasks (e.g., in-hospital mortality, readmission, risk estimation) on the MIMIC database, spanning structured EHR, multimodal records (notes, imaging), and physiological waveforms, the distributional model outperforms deterministic embeddings and strong baselines (masked autoencoders, contrastive learners, autoregressive models):

AUROC: 0.861 (distributional) vs. 0.846 (deterministic), 0.828 (contrastive), 0.835 (autoregressive)
AUPRC: 0.536 (distributional) vs. 0.517 (deterministic), 0.491 (contrastive)
Mean Squared Error: 1.61 (distributional) vs. 1.72–1.92 (others)
C-index for survival: 0.729 vs. ≤0.713

Robustness to Missingness

Test-time ablations progressively mask up to 75% of input modalities, reflecting the severe missingness common in real clinical deployments. While all models degrade, the distributional variant degrades substantially more slowly: at 75% missingness it retains 0.772 AUROC, compared to 0.743 (deterministic) and 0.705 (contrastive). This supports the claim that a consistent set-valued latent encoding of uncertainty provides a more stable substrate under systematic sparsity.

Calibration

Expected calibration error (ECE) and negative log-likelihood (NLL) show that the distributional approach yields better-calibrated predictions: ECE falls from 0.068 (deterministic) to 0.041 (distributional), a relative reduction of nearly 40%.
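For readers unfamiliar with the metric, the sketch below computes one common variant of ECE for binary predictions, using equal-width bins over the predicted positive-class probability. The binning scheme and the function name `expected_calibration_error` are assumptions, since the summary does not say which variant the paper uses. The last lines reproduce the relative reduction quoted above.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin-weighted gap between mean predicted probability
    and observed positive rate, with equal-width bins on [0, 1]."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins-1 (a probability of 1.0 lands in the last bin).
    bin_ids = np.digitize(probs, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of samples in this bin
    return ece

# Relative ECE reduction reported in the text: 0.068 -> 0.041.
relative_reduction = (0.068 - 0.041) / 0.068
print(round(relative_reduction, 3))  # 0.397
```

A model that predicts 0.8 for ten cases of which eight are positive has an ECE of about zero under this definition, while confident predictions that are always wrong give an ECE of 1.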

Ablations

Removing the partial-view consistency penalty diminishes both AUROC and calibration, confirming its necessity for cross-view robustness.
Removing the distributional component (using deterministic embeddings) reduces accuracy and increases miscalibration.
The contrastive geometry term contributes additional marginal gains for hard imbalanced classification regimes.

Representation Analysis

Posterior covariance ($\Sigma_\theta(x)$) strongly correlates with prediction difficulty and ambiguity, indicating that the uncertainty is not merely a side-effect of optimization but is semantically aligned with epistemic limits imposed by data incompleteness. Cross-view maximum mean discrepancy (MMD) analysis indicates that the latent geometry is robustly aligned across modalities and view patterns.

Implications and Future Directions

This framework operationalizes the modeling of epistemic uncertainty induced by partial clinical observation as a first-class representational property for healthcare foundation models. The approach directly decouples representation fidelity from observation density, allows handling of variable/unknown multimodal input availability at inference, and provides calibrated uncertainty estimates crucial for clinical decision support.

Theoretical implications:

Highlights the limitation of standard scaling paradigms that solely increase data/model size without addressing the fundamental underdetermination of clinical inference.
Argues for a representational hierarchy grounded on set-valued or distributional semantics, potentially informing future unsupervised/self-supervised representation learning beyond healthcare.
Enriches self-supervision, shifting from masked or contrastive reconstruction to explicit uncertainty-matching constraints across observation patterns.

Practical implications:

Provides a unified and principled mechanism for robust transfer and deployment over heterogeneously collected, incomplete clinical datasets—critical for EHR, biosignal, and multimodal health applications.
Facilitates selective prediction and abstention, potentially reducing harm from overconfidence in ambiguous or out-of-distribution cases.
Directly handles systematic missingness without ad-hoc imputation.
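Selective prediction of the kind mentioned above can be sketched by combining Monte Carlo marginalization over the latent posterior with an entropy threshold. This is a minimal toy, assuming a diagonal Gaussian posterior and a linear softmax classifier on the latent state. The names `mc_predict`, `predictive_entropy`, `predict_or_abstain`, the weights, and the threshold value are all hypothetical.

```python
import numpy as np

def mc_predict(mu, logvar, W, b, n_samples, rng):
    """Average softmax predictions over samples z ~ N(mu, diag(exp(logvar)))."""
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal((n_samples, mu.shape[0]))
    logits = z @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p = p / p.sum(axis=1, keepdims=True)
    return p.mean(axis=0)

def predictive_entropy(p):
    """Shannon entropy (in nats) of a predictive distribution."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def predict_or_abstain(p, max_entropy):
    """Return the predicted class index, or None when the entropy is too high."""
    if predictive_entropy(p) > max_entropy:
        return None
    return int(np.argmax(p))

rng = np.random.default_rng(0)
mu, logvar = np.zeros(4), np.zeros(4)    # hypothetical posterior parameters
W, b = rng.standard_normal((4, 2)), np.zeros(2)  # hypothetical classifier
p = mc_predict(mu, logvar, W, b, n_samples=256, rng=rng)
decision = predict_or_abstain(p, max_entropy=0.5)  # threshold chosen for illustration
```

In practice the threshold would be tuned on held-out data to trade coverage against error, a choice the summary leaves open.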

Challenges and Open Problems:

Extension beyond Gaussian posteriors to richer or implicit distributions for multimodal latent spaces.
Disentangling diverse sources of uncertainty: distinguishing epistemic uncertainty from label noise, temporal nonstationarity, and distributional shifts.
Scaling computationally to very high-dimensional, multimodal, or longitudinal clinical records with efficient inference.
Integrating uncertainty-aware representations into end-to-end clinical decision-making and reasoning systems, potentially through hierarchical or modular architectures.

Conclusion

This work provides a rigorous foundation for uncertainty-aware representation learning in clinical AI, demonstrating strong numerical gains in performance, robustness, and calibration over leading baselines. By treating clinical data as partial instantiations of an unobserved physiologic system and enforcing latent distributional consistency, it argues for a shift toward inductive biases that directly encode what is not known. This approach offers a robust pathway for developing future foundation models that are not only scalable but also epistemically aligned with the realities of complex, incomplete, and multimodal clinical data (2604.04175).
