Slide-Level Foundation Models
- Slide-Level Foundation Models are neural architectures that aggregate thousands of patch features from whole-slide images into a unified representation, enabling robust predictions for biomarkers, mutations, and cancer subtypes.
- They employ diverse training paradigms (supervised, self-supervised, and multimodal pretraining) to learn efficiently from heterogeneous pathology data while minimizing the need for task-specific fine-tuning.
- Recent models leverage hierarchical vision transformers, attention-based MIL, and memory-efficient compression techniques to enhance data efficiency, domain adaptation, and overall clinical applicability.
A slide-level foundation model is a neural architecture in computational pathology designed to learn transferable representations that summarize the global histomorphological context of entire whole-slide images (WSIs). Unlike patch-level models, which encode small image regions independently, slide-level models explicitly aggregate local features across thousands of patches to produce a single vector per slide, enabling robust prediction of biomarker status, mutations, cancer subtype, and prognostic outcome with minimal fine-tuning. The modern landscape includes supervised, self-supervised, and multimodal (vision-language/omics) approaches, with architectures ranging from hierarchical vision transformers and attention-based MIL networks to efficient hybrid fusion schemes. Slide-level supervision and end-to-end training provide data-efficient learning and resilience to domain shift, facilitating state-of-the-art performance on diverse real-world pathology datasets.
1. Core Architecture and Principles of Slide-Level Foundation Models
Slide-level foundation models generally comprise two major components: a patch-level feature extractor (e.g., ViT, Swin transformer, CONCH, Virchow, UNI), and a slide aggregator that combines hundreds to tens of thousands of patch or tile embeddings into a single slide representation. Early approaches relied on simple pooling strategies (global mean, attention-based pooling), but recent models favor hierarchical transformers, perceiver structures, or gated multi-head attention for more expressive aggregation (Xiong et al., 5 Apr 2025).
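To make the aggregation step concrete, below is a minimal PyTorch sketch of gated attention pooling over a bag of patch embeddings, in the spirit of ABMIL-style aggregators; the dimensions and module names are illustrative rather than drawn from any specific model.

```python
import torch
import torch.nn as nn

class GatedAttentionPool(nn.Module):
    """Gated attention pooling (ABMIL-style sketch): aggregates an (N, D) bag
    of patch features into a single (D,) slide vector via learned weights."""

    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.V = nn.Linear(dim, hidden)   # content branch
        self.U = nn.Linear(dim, hidden)   # gating branch
        self.w = nn.Linear(hidden, 1)     # scalar attention score per patch

    def forward(self, patches: torch.Tensor):
        # patches: (num_patches, dim)
        scores = self.w(torch.tanh(self.V(patches)) * torch.sigmoid(self.U(patches)))
        alpha = torch.softmax(scores, dim=0)            # (num_patches, 1)
        slide_embedding = (alpha * patches).sum(dim=0)  # (dim,)
        return slide_embedding, alpha.squeeze(-1)

# Example: pool 5,000 frozen patch embeddings into one slide vector.
pool = GatedAttentionPool(dim=768)
bag = torch.randn(5000, 768)          # e.g., from a frozen ViT patch encoder
slide_vec, attn = pool(bag)
print(slide_vec.shape, attn.shape)    # torch.Size([768]) torch.Size([5000])
```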
EXAONE Path 2.0, for instance, utilizes a three-stage hierarchical ViT pipeline: (1) a standard ViT encodes 256×256 patches, (2) another ViT aggregates grids of patch embeddings into 1024×1024 regions, and (3) a lightweight ViT pools second-stage region tokens into a slide-wide [CLS] embedding for biomarker prediction (Pyeon et al., 9 Jul 2025). Each stage enables progressively coarser abstraction while facilitating end-to-end slide-level supervision.
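A toy sketch of this staged idea follows, assuming a two-level hierarchy (patch grids to region tokens to a slide [CLS] token); the depths, dimensions, and pooling choices are illustrative and do not reproduce the published EXAONE Path 2.0 configuration.

```python
import torch
import torch.nn as nn

class HierarchicalAggregator(nn.Module):
    """Toy two-stage hierarchy: patch embeddings -> region tokens -> slide
    [CLS]. Hyperparameters are placeholders, not published values."""

    def __init__(self, dim: int = 384, heads: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.region_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.slide_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))  # slide-level [CLS]

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, patches_per_region, dim)
        region_tokens = self.region_encoder(regions).mean(dim=1)  # (R, dim)
        tokens = torch.cat([self.cls, region_tokens.unsqueeze(0)], dim=1)
        return self.slide_encoder(tokens)[:, 0]  # slide embedding, (1, dim)

agg = HierarchicalAggregator()
out = agg(torch.randn(16, 64, 384))  # 16 regions of 64 patch embeddings each
print(out.shape)                     # torch.Size([1, 384])
```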
MIL-based models treat the slide as a bag of instances, learning an attention or mean-pooling scheme over patch features. More advanced models, such as PRISM, exploit Perceiver encoders to compress large bags (up to 100,000 tiles) with memory-efficient cross-attention, and utilize multimodal decoders for generative report synthesis (Shaikovski et al., 2024). Hybrid models like TCv2 employ shared multi-head attention pooling with dropout-regularized linear heads for multitask prediction (Nicke et al., 8 Jul 2025).
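The Perceiver-style compression can be sketched as a small set of learned latent queries that cross-attend to the tile bag, so memory scales with the number of latents rather than the bag size. This is a generic illustration of the mechanism, not the PRISM implementation:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Perceiver-style compression sketch: learned latents cross-attend to an
    arbitrarily large tile bag, decoupling memory from bag size."""

    def __init__(self, dim: int = 512, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, num_tiles, dim); num_tiles may reach ~100k
        q = self.latents.unsqueeze(0).expand(tiles.size(0), -1, -1)
        compressed, _ = self.attn(q, tiles, tiles)  # (batch, num_latents, dim)
        return compressed.mean(dim=1)               # pooled slide embedding

enc = LatentCrossAttention()
slide = enc(torch.randn(1, 20000, 512))  # 20k tiles -> one 512-d vector
print(slide.shape)                       # torch.Size([1, 512])
```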
2. Training Paradigms: Supervised, Self-Supervised, and Multimodal Pretraining
Training strategies for slide-level models range from fully supervised end-to-end multitask label prediction to large-scale self-supervised learning (SSL) and multimodal alignment via contrastive objectives. In EXAONE Path 2.0, direct slide-level supervision propagates clinically relevant gradients through the network, shaping patch and region representations without requiring patch-level annotation (Pyeon et al., 9 Jul 2025). The loss combines multi-task cross-entropy with local (patch/region) DINO self-distillation terms, balancing explicit label modeling with feature regularization.
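A hedged sketch of such a joint objective is shown below, combining per-task cross-entropy with a DINO-style self-distillation term; the loss weighting and temperatures are illustrative placeholders, not the published values.

```python
import torch
import torch.nn.functional as F

def combined_slide_loss(task_logits, task_labels, student_feats, teacher_feats,
                        dino_weight=0.5, temp_s=0.1, temp_t=0.04):
    """Sketch: multi-task cross-entropy on slide labels plus a DINO-style
    self-distillation term on local features. All weights are illustrative."""
    # Multi-task CE: one classification head per biomarker task.
    ce = sum(F.cross_entropy(logits, labels)
             for logits, labels in zip(task_logits, task_labels)) / len(task_logits)

    # DINO term: cross-entropy between sharpened teacher and student outputs.
    t = F.softmax(teacher_feats.detach() / temp_t, dim=-1)
    s = F.log_softmax(student_feats / temp_s, dim=-1)
    dino = -(t * s).sum(dim=-1).mean()

    return ce + dino_weight * dino

# Usage with synthetic tensors: 3 binary biomarker tasks, batch of 4 slides.
logits = [torch.randn(4, 2) for _ in range(3)]
labels = [torch.randint(0, 2, (4,)) for _ in range(3)]
loss = combined_slide_loss(logits, labels,
                           torch.randn(4, 256), torch.randn(4, 256))
```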
Self-supervised paradigms adapt natural-image techniques, including contrastive InfoNCE (COBRA, TITAN), masked image modeling (MAE, SimMIM, PathVQ), and teacher–student self-distillation (DINO, iBOT). Slide-level SSL objectives apply these losses after aggregation; for example, PathVQ utilizes multi-scale vector quantization and decoder-based masked modeling to compress spatial patch tokens and supervise region-level aggregators (Li et al., 9 Mar 2025). COBRA employs multi-FM tile augmentations with contrastive alignment via Mamba-2 SSD layers, enabling FM-agnostic patient-centric slide embeddings (Lenz et al., 2024).
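The slide-level InfoNCE recipe can be written compactly: embed two augmented views of the same slides, normalize, and treat matching pairs as positives. This is a generic sketch (temperature and batch size are illustrative), not tied to any one model's implementation:

```python
import torch
import torch.nn.functional as F

def slide_infonce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """InfoNCE over post-aggregation slide embeddings from two views.
    z1, z2: (batch, dim); positives lie on the diagonal of the logit matrix."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature     # (batch, batch) cosine similarities
    targets = torch.arange(z1.size(0))     # index of each positive pair
    return F.cross_entropy(logits, targets)

loss = slide_infonce(torch.randn(32, 512), torch.randn(32, 512))
```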
Multimodal models inject knowledge from clinical text, gene expression, or synthetic region captions using vision-language contrastive learning. For example, TITAN aligns slide features with both synthetic captions and natural pathology reports via CoCa loss, supporting zero-shot retrieval and report generation (Ding et al., 2024); mSTAR combines H&E patches, OCR’d text, and RNA-Seq features in a shared embedding space with CLIP-style contrastive and triplet losses (Xu et al., 2024); Threads uses molecular-driven contrastive pretraining to project slide, transcriptomic, and genomic features into a unified space (Vaidya et al., 28 Jan 2025).
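The shared ingredient across these models is a symmetric, CLIP-style contrastive loss between slide embeddings and text (or omics) embeddings. A generic sketch follows, with an illustrative temperature and without model-specific projection heads:

```python
import torch
import torch.nn.functional as F

def clip_style_alignment(slide_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style loss aligning slide and report/caption embeddings.
    A generic sketch of the cross-modal objective, not any model's exact loss."""
    s = F.normalize(slide_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.t() / temperature
    targets = torch.arange(s.size(0))
    # Average the slide->text and text->slide cross-entropies.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = clip_style_alignment(torch.randn(16, 512), torch.randn(16, 512))
```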
3. Data Efficiency, Scalability, and Diversity Considerations
Recent work demonstrates that diversity of WSIs across staining, scanner, and demographic sources is more crucial for generalization than sheer patch count. Athena, for instance, trains a 1.1B ViT-G/14 model with only 115M patches—orders of magnitude fewer than prior FMs—by maximizing slide-level diversity (24 countries, multiple scanners) within a balanced patch sampling protocol (Bosch et al., 13 Nov 2025). This approach yields near state-of-the-art performance across molecular and morphological tasks, supporting the hypothesis that broad coverage of real-world slide characteristics drives foundation model utility.
TCv2 leverages resource-efficient end-to-end multitask supervised learning on ~22k public WSIs, requiring ~15% of the GPU footprint of large SSL slide encoders while providing competitive or superior benchmarks (Nicke et al., 8 Jul 2025). PathoDuet demonstrates that cross-scale and cross-stain pretext modeling allows high data-efficiency, outperforming even large-scale generic ViT FMs in patch-level and slide-level classification by encoding microscopy-specific contextual knowledge (Hua et al., 2023).
4. Evaluation Methodologies and Benchmarking
Slide-level foundation models are evaluated via downstream transfer to clinical tasks: WSI classification (subtyping, diagnosis), biomarker prediction, survival analysis, zero/few-shot learning, and cross-modal retrieval. Standard metrics include AUROC, balanced accuracy, concordance index (C-index), F1 score, and retrieval mAP (Xiong et al., 5 Apr 2025). Linear probing and fine-tuning protocols remain dominant: the slide embedding is frozen and fed to logistic regression, Cox models, or more complex downstream heads.
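A minimal linear-probing sketch with scikit-learn, assuming precomputed frozen slide embeddings; the synthetic data and regularization strength below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Linear probe: frozen slide embeddings + logistic regression.
# Synthetic stand-ins for embeddings exported by a slide encoder.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 768)), rng.integers(0, 2, 400)
X_test, y_test = rng.normal(size=(100, 768)), rng.integers(0, 2, 100)

probe = LogisticRegression(max_iter=1000, C=1.0)  # C is a tunable choice
probe.fit(X_train, y_train)
auroc = roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])
print(f"linear-probe AUROC: {auroc:.3f}")
```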
AUROC comparisons across leading models (EXAONE Path 2.0, TITAN, PRISM, PathVQ, COBRA, Athena) show consistent state-of-the-art gains or competitive performance on diverse benchmarks (e.g., LUAD-EGFR, CRC-MSI, BRCA-TP53), with slide-level supervision yielding improved data efficiency and stronger generalization (Pyeon et al., 9 Jul 2025, Ding et al., 2024, Li et al., 9 Mar 2025, Lenz et al., 2024, Bosch et al., 13 Nov 2025). Fine-tuned PRISM and Threads models further show remarkable label-efficiency, matching or exceeding scratch-trained baselines with as little as 10–30% of the data (Shaikovski et al., 2024, Vaidya et al., 28 Jan 2025).
Fusion and collaborative distillation schemes such as FuseCPath optimize for ensemble robustness: aligning soft labels between heterogeneous slide FMs and patch-based networks via KL divergence yields monotonic improvements in AUROC, MSE, and C-index as the number of teachers increases (Yang et al., 31 Oct 2025).
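A sketch of multi-teacher soft-label distillation via KL divergence in the spirit of such schemes; the temperature and simple teacher averaging are illustrative choices, not FuseCPath's exact formulation:

```python
import torch
import torch.nn.functional as F

def multi_teacher_kl(student_logits, teacher_logits_list, temperature=2.0):
    """Average the teachers' softened class distributions and minimize the KL
    divergence from the student to that target. Temperature is illustrative."""
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional in distillation.
    return F.kl_div(student_logp, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# 8 slides, 4 classes, 3 heterogeneous teachers.
loss = multi_teacher_kl(torch.randn(8, 4),
                        [torch.randn(8, 4) for _ in range(3)])
```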
5. Interpretability, Explainability, and Downstream Adaptation
Most slide-level architectures support interpretability via attention maps or feature importance scores. TCv2’s shared multi-head attention pooling yields per-patch alpha weights, visualized as heatmaps over WSIs, matching biological ground truths in collagen-rich regions (Nicke et al., 8 Jul 2025). MIL-based strategies (CLAM, ABMIL) directly expose patch-level contributions to final slide predictions, facilitating clinical validation and exploration.
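Rendering such attention maps amounts to scattering per-patch weights back onto the slide's patch grid. A minimal NumPy sketch, with hypothetical coordinate bookkeeping:

```python
import numpy as np

def attention_heatmap(alpha, coords, grid_shape):
    """Project per-patch attention weights onto the slide grid for display.
    alpha: one weight per patch; coords: each patch's (row, col) grid cell.
    Names and layout are illustrative."""
    heatmap = np.zeros(grid_shape)
    for a, (r, c) in zip(alpha, coords):
        heatmap[r, c] = a
    return heatmap / (heatmap.max() + 1e-8)  # normalize to [0, 1] for display

alpha = np.random.rand(6)
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
print(attention_heatmap(alpha, coords, (3, 2)))
```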
Model-agnostic slide encoders (COBRA, TICON) enable compatibility with novel patch FMs at inference: any new feature extractor can be integrated without retraining the main slide encoder, supporting rapid deployment across evolving datasets and improving robustness to domain shift (Lenz et al., 2024, Belagali et al., 24 Dec 2025).
In few-shot or low-data regimes, lightweight pooling+MLP heads (SiMLP) provide comparable accuracy to complex MIL pipelines, with remarkable transfer stability and minimal overhead (Li et al., 28 Feb 2025).
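A sketch of such a pooling+MLP head, assuming frozen patch features; the layer sizes and dropout rate are illustrative:

```python
import torch
import torch.nn as nn

class PoolMLPHead(nn.Module):
    """Lightweight head of the kind SiMLP advocates: mean-pool frozen patch
    features, then classify with a small MLP. Sizes are placeholders."""

    def __init__(self, dim: int = 768, hidden: int = 256, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (num_patches, dim) -> logits: (num_classes,)
        return self.mlp(patches.mean(dim=0))

head = PoolMLPHead()
print(head(torch.randn(3000, 768)).shape)  # torch.Size([2])
```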
6. Limitations, Challenges, and Future Research Trajectories
Current limitations include memory and compute constraints for gigapixel slides, imperfect capture of rare histologic events due to reliance on coarse slide-level labels, and residual domain adaptation challenges (stain, scanner variability). Many end-to-end approaches still struggle to scale nonlinear aggregation to WSI-level contexts, motivating ongoing work in sparse transformers (LongNet, Mamba), region-guided dynamic sampling, and federated learning for privacy-preserving model expansion (Xiong et al., 5 Apr 2025, Pyeon et al., 9 Jul 2025).
Pathology-specific pretext tasks (multi-magnification alignment, stain-aware contrastive modeling, cross-scale positioning) are underexplored yet show promise in reducing reliance on massive labeled datasets and unlocking new modalities (IHC, spatial transcriptomics, radiology) (Hua et al., 2023, Xu et al., 2024). Unified multimodal slide models (CPath-Omni, Threads, mSTAR) point toward holistic clinical decision support combining image, omics, and text, supporting zero-shot diagnosis, retrieval, and report generation (Sun et al., 2024, Vaidya et al., 28 Jan 2025, Xu et al., 2024).
A plausible implication is that the coupling of hierarchical end-to-end supervision with multimodal alignment may represent the next phase of slide-level foundation model development, as models integrate broader clinical context and molecular phenotyping while retaining computational efficiency and interpretability.
7. Representative Models and Empirical Performances
| Model | Key methodology | Representative slide-level results (AUROC / bal. acc / C-index) | Training data (WSIs) |
|---|---|---|---|
| EXAONE Path 2.0 | Hierarchical ViT, slide-level supervision | 0.784 (10 biomarkers) | 37,195 |
| Athena | Diversity-focused ViT-G, DINOv2 SSL | Up to 0.957 (MSI), 0.9485 (IDC vs ILC) | 282,550 (115M patches total) |
| TCv2 | End-to-end Swin-T, multitask supervision | 0.95 (NSCLC), 0.84 (BRACS) | ~22,000 |
| TITAN | Multimodal ViT, CoCa V-L, retrieval | +8.4% bal. acc, +3.6% c-index | 335,645 |
| PRISM | Virchow+Perceiver, report alignment | 0.958 (BRCA), high label efficiency | 195,344 specimens |
| Threads | Molec.-contrastive, attention-MIL | 0.758 (mean), +6.3–9.9% over baseline | 47,171 WSI–omics pairs |
| PathVQ | Multi-scale VQ, SSL slide pretraining | 0.902–0.906 (BRACS, AUC) | 250k regions |
| SiMLP | Mean pooling+MLP, task-agnostic | +3.52% bal. acc (pan-cancer) | Various FMs |
| COBRA | Multi-FM, Mamba-2 Contr. SSL | 71.6% AUC (15 tasks, few-shot robust) | 3,048 |
| CPath-Omni | Unified LMM, CLIP+ViT, report/VQA | 82.8% bal. acc (TCGA), SOTA on 39/42 tasks | 85% WSI + 15% patch data |
Values are drawn directly from reported benchmarks; for full per-task results, see respective papers (Pyeon et al., 9 Jul 2025, Bosch et al., 13 Nov 2025, Nicke et al., 8 Jul 2025, Ding et al., 2024, Shaikovski et al., 2024, Vaidya et al., 28 Jan 2025, Li et al., 9 Mar 2025, Li et al., 28 Feb 2025, Lenz et al., 2024, Sun et al., 2024).
References
- EXAONE Path 2.0: Pathology Foundation Model with End-to-End Supervision (Pyeon et al., 9 Jul 2025)
- Diversity Over Scale: Whole-Slide Image Variety Enables H&E Foundation Model Training with Fewer Patches (Bosch et al., 13 Nov 2025)
- Tissue Concepts v2: a Supervised Foundation Model for whole slide images (Nicke et al., 8 Jul 2025)
- Multimodal Whole Slide Foundation Model for Pathology (TITAN) (Ding et al., 2024)
- PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology (Shaikovski et al., 2024)
- Molecular-driven Foundation Model for Oncologic Pathology (Threads) (Vaidya et al., 28 Jan 2025)
- PathVQ: Reforming Computational Pathology Foundation Model for Whole Slide Image Analysis via Vector Quantization (Li et al., 9 Mar 2025)
- Can We Simplify Slide-level Fine-tuning of Pathology Foundation Models? (SiMLP) (Li et al., 28 Feb 2025)
- Unsupervised Foundation Model-Agnostic Slide-Level Representation Learning (COBRA) (Lenz et al., 2024)
- CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology (Sun et al., 2024)
- A Survey of Pathology Foundation Model: Progress and Future Directions (Xiong et al., 5 Apr 2025)
- PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains (Hua et al., 2023)