Pathology Foundational Models (PFMs)

Updated 22 December 2025
  • Pathology Foundational Models (PFMs) are versatile AI systems based on vision transformer architectures that analyze gigapixel whole-slide images and clinical data.
  • They achieve state-of-the-art performance on tasks such as cancer subtyping, survival prognosis, and biomarker expression prediction.
  • PFMs face challenges including domain shifts and adversarial vulnerabilities while paving the way for integrated, multimodal medical AI platforms.

Pathology Foundational Models (PFMs) are a class of large-scale, general-purpose deep learning models designed to process and analyze whole-slide images (WSIs) and associated clinical data in the field of pathology. By leveraging self-supervised and weakly supervised learning on vast collections of pathology images—and in some cases, paired text or molecular information—PFMs have shifted the paradigm from narrowly scoped, task-specific AI to versatile, adaptable systems capable of addressing a broad array of diagnostic, prognostic, and biomarker quantification tasks. This emergence has catalyzed new benchmarks in computational pathology performance, highlighted ongoing challenges related to domain adaptation, robustness, and interpretability, and set the agenda for future integration into comprehensive medical AI platforms (Ochi et al., 31 Jul 2024).

1. Architectural Principles and Pretraining Strategies

PFMs are primarily based on vision transformer (ViT) architectures, often augmented with convolutional stems or custom slide-level aggregators to handle gigapixel tissue images. Model scale ranges from roughly 86 million parameters (ViT-Base backbones such as CTransPath) up to the billion-parameter range (e.g., Prov-GigaPath), with multimodal variants larger still. A defining methodological shift is the reliance on self-supervised objectives such as masked image modeling (MIM), where the goal is to recover masked visual tokens:

$$L_{\mathrm{MIM}} = -\sum_{t \in T} \log p(x_t \mid x_{-t})$$

—and contrastive learning, which encourages representations that are invariant to augmentations:

$$L_{\mathrm{CL}} = -\log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(h_i, h_k)/\tau)}$$

Some PFMs extend further, employing cross-modal contrastive loss for aligning vision and language (as in CONCH or PRISM):

$$L_{\mathrm{ITC}} = -\sum_{(v, t)} \log \sigma\left(f_v(v) \cdot f_t(t) / \tau\right)$$
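
To make the contrastive objectives concrete, the sketch below implements the generic InfoNCE form in PyTorch. It is an illustrative minimal version, not the exact loss of any named PFM; the batch size and embedding dimension in the example are arbitrary.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h_a: torch.Tensor, h_b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of paired embeddings: row i of h_a matches row i
    of h_b; all other rows in the batch serve as negatives."""
    h_a = F.normalize(h_a, dim=-1)
    h_b = F.normalize(h_b, dim=-1)
    logits = h_a @ h_b.t() / tau                      # (N, N) scaled similarities
    targets = torch.arange(h_a.size(0), device=h_a.device)
    return F.cross_entropy(logits, targets)

# Example: 32 patch embeddings (dim 768) from two augmented views
loss = info_nce_loss(torch.randn(32, 768), torch.randn(32, 768))
```

The same structure underlies the image-text alignment used by vision-language PFMs, with h_a and h_b produced by separate image and text encoders.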

The data pipeline involves pretraining on massive datasets (on the order of 100K to over a million WSIs, e.g., Mass-100K or Virchow), tiling the slides into patches (hundreds of millions of tiles in aggregate), and applying augmentation and stain normalization (e.g., the Reinhard and Macenko algorithms) to ensure stain and scanner robustness (Ochi et al., 31 Jul 2024).
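
As an illustration of the stain-normalization step, a minimal sketch of the Reinhard method mentioned above (matching per-channel LAB statistics to a reference tile) might look like the following; the scikit-image color conversions are assumed available.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(source_rgb: np.ndarray, target_rgb: np.ndarray) -> np.ndarray:
    """Reinhard color normalization: shift each LAB channel of the source tile
    to the mean/std of a reference tile. Inputs are float RGB arrays in [0, 1]."""
    src, tgt = rgb2lab(source_rgb), rgb2lab(target_rgb)
    out = np.empty_like(src)
    for c in range(3):  # L, a, b channels
        mu_s, sd_s = src[..., c].mean(), src[..., c].std()
        mu_t, sd_t = tgt[..., c].mean(), tgt[..., c].std()
        out[..., c] = (src[..., c] - mu_s) / (sd_s + 1e-8) * sd_t + mu_t
    return np.clip(lab2rgb(out), 0.0, 1.0)
```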

2. Downstream Applications and Performance Benchmarks

PFMs demonstrate strong transferability, enabling efficient fine-tuning on new tasks with limited annotated data. Key clinical applications include:

  • Disease Diagnosis and Subtyping: PFMs achieve AUCs of 0.95–0.97 on colorectal and breast cancer benchmarks, including >90% accuracy on rare cancer multi-class tasks.
  • Survival Prognosis: Slide-level PFM embeddings deliver C-indices of 0.72–0.78 for 5-year survival prediction in major cancer cohorts.
  • Biomarker Expression: Prediction tasks such as microsatellite instability (MSI) status and PD-L1 expression reach AUCs of 0.85–0.92.
  • IHC Scoring: Automated scoring of immunohistochemical markers yields Pearson correlations r = 0.80–0.88 with pathologist ground truth, reducing interobserver variability.
  • Segmentation/Object Detection: Fine-tuned PFMs (e.g. with U-Net or custom decoders) obtain mean Dice coefficients of 0.82–0.88 on mitosis/gland segmentation.
  • Image-to-Text and Report Generation: Multimodal models (e.g., PLIP, CONCH) report retrieval recall@1 values ~0.65 for WSI-to-report matching, facilitating high-fidelity automatic reporting.

These results frequently surpass the state-of-the-art achieved by dedicated, fully supervised CNNs or single-purpose architectures (Ochi et al., 31 Jul 2024).
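
The transferability claim above is often tested via linear probing: training a lightweight classifier on frozen PFM embeddings. A minimal sketch follows, with random arrays standing in for real slide-level embeddings and labels (the 768-dimensional embedding size is an assumption).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Stand-ins for frozen slide-level PFM embeddings and binary labels
# (e.g., tumor subtype); replace with real features in practice.
X_train, y_train = rng.standard_normal((200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.standard_normal((50, 768)), rng.integers(0, 2, 50)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"linear-probe AUC: {auc:.3f}")
```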

3. Robustness, Domain Shift, and Security

Clinical deployment challenges stem primarily from performance degradation under domain shift (variation in staining, scanner, or tissue processing), and from unexpected vulnerabilities to adversarial manipulations:

  • Domain shift can cause substantial accuracy loss; countermeasures include stain normalization, color and stain augmentation, and domain-adaptive fine-tuning.
  • Security risks: PFMs exhibit sensitivity to imperceptible adversarial perturbations; altering as little as 0.1% of WSI patches can degrade accuracy by up to 33%. Smaller models (e.g., CHIEF) tend to be more robust than larger ones under attack, and simple defenses such as uniform noise injection can partially restore performance toward clean-input levels (Liu et al., 30 May 2025). A minimal attack sketch follows this list.
  • Feature contamination by site-specific factors: Hospital or scanner bias can be removed with adversarial approaches (e.g. gradient reversal adapters), dramatically reducing domain-predictability while retaining or slightly improving diagnostic performance (Zhang et al., 20 Aug 2025).
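
For concreteness, a one-step FGSM perturbation against a patch classifier is sketched below, together with a uniform-noise preprocessing step. This is the textbook attack, offered only to illustrate the style of manipulation studied, not the specific method of Liu et al.

```python
import torch
import torch.nn as nn

def fgsm_patch_attack(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
                      eps: float = 2 / 255) -> torch.Tensor:
    """One-step FGSM: nudge input patches in the direction that increases the
    classification loss, under an imperceptible L-infinity budget eps."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# A defense in the spirit of uniform noise injection: perturb inputs
# with small random noise before inference.
def noise_defense(x: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    return (x + sigma * (2 * torch.rand_like(x) - 1)).clamp(0, 1)
```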

Evaluations using external datasets and new clinical sites are key to ensuring PFMs yield reliable predictions in diverse real-world scenarios.
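
The adversarial debiasing idea noted above can be illustrated with a gradient reversal layer: a site-prediction head trains normally, while reversed gradients push the backbone to discard site-predictive features. A minimal PyTorch sketch (the lambda value and head designs are assumptions):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam on the way back,
    so a site-prediction head trained on top pushes the backbone to *remove*
    site-predictive features."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)

# Usage sketch: feats -> grad_reverse(feats) -> site_head gives the debiasing
# loss, while the diagnostic head consumes feats directly; both losses are summed.
```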

4. Adaptation, Fine-Tuning, and Resource Efficiency

Adaptation strategies for PFMs span full fine-tuning, linear probing, and parameter-efficient methods such as LoRA. Recent work affirms the efficacy of LoRA, which freezes the pretrained weights and trains only low-rank update matrices, achieving accuracy improvements of 2–10 percentage points over conventional fine-tuning while incurring minimal computational overhead (≤1% extra parameters) (Lee et al., 21 Oct 2024).
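
A minimal sketch of the LoRA idea, wrapping a frozen linear layer with a trainable low-rank update (the rank and scaling values are illustrative defaults, not those of the cited study):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (B A x) * (alpha / r). Only A and B receive gradients."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

# Example: wrap a frozen 768->768 projection from a ViT block
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```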

For weak-label, whole-slide tasks (such as mutation prediction), dual-graph approaches like TAPFM decouple aggregator and PFM parameter updates, supporting stable, memory-efficient adaptation for WSI-level multiple instance learning on single GPUs (Kumar et al., 5 Jun 2025).
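
As a sketch of WSI-level multiple instance learning in general (not the specific TAPFM procedure), an attention-pooling aggregator over frozen patch embeddings might look like this:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Classic attention-based MIL pooling: score each patch embedding, form a
    softmax-weighted slide embedding, then classify at the slide level."""
    def __init__(self, dim: int = 768, hidden: int = 256, n_classes: int = 2):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, 1))
        self.head = nn.Linear(dim, n_classes)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (n_patches, dim) frozen PFM embeddings for one slide
        w = torch.softmax(self.attn(patches), dim=0)   # (n_patches, 1)
        slide_emb = (w * patches).sum(dim=0)           # (dim,)
        return self.head(slide_emb)                    # slide-level logits
```

Because only the small aggregator is trained while the PFM stays frozen, this style of adaptation fits comfortably on a single GPU.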

PFM resource requirements remain high, with top models consuming up to 35× more energy than streamlined task-specific (TS) models. Notably, while PFMs excel in data-scarce scenarios, their performance converges with (or is exceeded by) end-to-end TS models as data volumes increase, with TS architectures providing superior energy efficiency and robustness across scanners and rare morphologies (Mulliqi et al., 28 Feb 2025).

5. Multimodal and Multiscale Innovations

State-of-the-art PFMs are increasingly multimodal, integrating vision backbones with BERT-style text encoders, slide-level pathology report analysis, and (in emerging work) molecular and omics data. Multimodal pretraining, which contrastively aligns image, text, and molecular features, lets models like mSTAR and EXAONE Path 2.5 deliver improved classification, survival analysis, and report generation, especially on molecular or rare subtypes (Xu et al., 22 Jul 2024; Yun et al., 16 Dec 2025).

Architectural advances include multi-scale cross-attention, slide-level aggregation heads, and spatially-aware positional encoding to capture both patch-level detail and holistic tissue context.
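
One way to picture the multi-scale cross-attention mentioned above: tokens from a coarse magnification query tokens from a fine magnification, fusing patch-level detail with tissue-level context. A minimal sketch (dimensions and head count are illustrative):

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    """Tokens from one magnification (queries) attend to tokens from another
    (keys/values), fusing fine detail into the coarse representation."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        # coarse: (B, Nc, dim) low-mag tokens; fine: (B, Nf, dim) high-mag tokens
        fused, _ = self.attn(query=coarse, key=fine, value=fine)
        return self.norm(coarse + fused)  # residual connection
```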

Further, ensemble and adaptive fusion methods (e.g. AdaFusion) align and dynamically weight the outputs of multiple PFMs, outperforming individual models and enriching interpretability by exposing which PFM specializes in a given morphological scenario (Xiao et al., 7 Aug 2025).
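
A simplified version of such adaptive fusion, with a gating network producing per-input weights over several PFM embeddings, is sketched below; this is a generic illustration, not the published AdaFusion architecture.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Learned, input-dependent weighting of embeddings from several PFMs.
    The softmax gate weights expose which backbone dominates for a given
    input, providing an interpretability signal."""
    def __init__(self, dims: list[int], fused_dim: int = 512):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, fused_dim) for d in dims)
        self.gate = nn.Linear(fused_dim * len(dims), len(dims))

    def forward(self, embs: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
        z = [p(e) for p, e in zip(self.proj, embs)]                 # align dims
        w = torch.softmax(self.gate(torch.cat(z, dim=-1)), dim=-1)  # (B, K)
        fused = sum(w[:, k:k + 1] * z[k] for k in range(len(z)))
        return fused, w  # w can be inspected per input
```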

6. Limitations, Ethical Risks, and Future Outlook

Despite high average performance, PFMs face systemic limitations:

  • Spurious correlations and ethical risks: PFMs can encode non-biological technical features (stain, scanner, site), leaking sensitive information, reducing fairness, and causing diagnostic errors on out-of-distribution data (Lin et al., 24 Feb 2025; Kömen et al., 22 Jul 2025).
  • Failure modes: Architectural scaling does not guarantee robustness; fine-tuning large models on small datasets is unstable, and frozen PFM backbones can propagate institution-specific bias without careful mitigation (e.g., adversarial or federated adaptation).
  • Interpretability: Model outputs, especially from generative/transformer-based PFMs, can appear plausible but are susceptible to hallucinations.
  • Experimental caveats: Many reported results are derived from internal or retrospective datasets; real-world, multi-center validation under regulatory-grade conditions remains limited.

The future research agenda calls for generalist medical AI platforms that unite PFMs with other medical domain FMs (e.g., radiology, genomics), enabling real-time, multi-modal clinical decision support and precision medicine. Integration of explainable AI methods, scaling of federated/multi-institutional training, and development of lightweight, on-device PFMs for deployment in resource-constrained settings are identified as priorities (Ochi et al., 31 Jul 2024).

7. Comparative Summary of PFM Technology

| Model Type | Main Contributions | Characteristic Tasks |
|---|---|---|
| Vision-only (ViT) | Self-supervised / MIM pretraining on millions of WSIs | Tumor subtyping, segmentation |
| Vision-language | Contrastive image-text pretraining (CLIP, CoCa) | Report generation, retrieval |
| Multimodal | Joint vision, text, and omics integration | Molecular prediction, prognosis |
| Ensemble/fusion | Adaptive combination of multiple PFMs | Robust diagnosis, interpretability |

While PFMs have established new technical standards in computational pathology, their clinical adoption depends on ongoing work in security, equity, explainability, and resource efficiency, paralleled by systematic benchmarking (e.g., PathBench) to support objective model selection and validation (Ma et al., 26 May 2025).
