EXAONE Path 2.0: Hierarchical Pathology Transformer
- EXAONE Path 2.0 is a pathology foundation model that employs a three-stage hierarchical transformer to produce biomarker-predictive representations from digital whole-slide images.
- It combines a HIPT backbone with FlashAttention 2.0, activation checkpointing, and CPU offloading to make training and inference on gigapixel-scale WSIs tractable.
- End-to-end slide-level supervision with curriculum learning on 37K annotated WSIs enhances clinical relevance and outperforms traditional SSL+MIL methods in biomarker classification.
EXAONE Path 2.0 is a pathology foundation model for digital whole-slide images (WSIs) utilizing end-to-end supervision and a hierarchical transformer architecture to produce biomarker-predictive representations. It addresses limitations of patch-level self-supervised learning (SSL) by injecting direct slide-level supervision throughout the encoder hierarchy, thus improving data efficiency and clinical relevance for biomarker classification tasks using a relatively small set of annotated WSIs (Pyeon et al., 9 Jul 2025).
1. Hierarchical Transformer Architecture
The core of EXAONE Path 2.0 is a three-stage Hierarchical Image Pyramid Transformer (HIPT) backbone. Each stage is implemented with Vision Transformer (ViT) blocks utilizing multi-head self-attention (MHSA). The stages are:
- Stage 1 (Patch encoder): Processes non-overlapping 256×256-pixel image patches at 20× WSI magnification. Self-supervised DINO-style pretraining is used in early curriculum phases.
- Stage 2 (Region encoder): Aggregates groups of patch tokens into 1024×1024-pixel regional tokens, processed with additional ViT layers.
- Stage 3 (Slide encoder): Integrates region-level representations into a global token capturing whole-slide features, using ViT blocks.
All stages share the same embedding dimension and number of attention heads, but differ in token sequence length according to their hierarchical scope (patches, regions, full slide).
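As an illustration of how the three stages nest, the following PyTorch sketch processes one slide's patches through patch, region, and slide encoders. The block depths, the 16-patches-per-region grouping, and the `ViTStage` module are simplifying assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn


class ViTStage(nn.Module):
    """A stack of standard Transformer encoder layers with a learnable [CLS] token."""

    def __init__(self, dim=768, depth=4, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):            # tokens: (B, N, dim)
        cls = self.cls.expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return out[:, 0], out[:, 1:]      # ([CLS] summary, per-token features)


class HierarchicalEncoder(nn.Module):
    """Stage 1: 256 px patches -> patch tokens; Stage 2: 4x4 patch groups
    (1024 px regions) -> region tokens; Stage 3: all region tokens -> one slide token."""

    def __init__(self, dim=768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT-style patchify
        self.stage1, self.stage2, self.stage3 = ViTStage(dim), ViTStage(dim), ViTStage(dim)

    def forward(self, patches):           # patches: (num_patches, 3, 256, 256) for one slide
        tok = self.patch_embed(patches).flatten(2).transpose(1, 2)       # (P, 256, dim)
        patch_feats, _ = self.stage1(tok)                                # (P, dim)
        # Assumes the number of patches is a multiple of 16 (one 1024 px region = 4x4 patches).
        regions = patch_feats.view(-1, 16, patch_feats.size(-1))
        region_feats, _ = self.stage2(regions)                           # (R, dim)
        slide_feat, _ = self.stage3(region_feats.unsqueeze(0))           # (1, dim)
        return patch_feats, region_feats, slide_feat
```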
Efficient Attention Mechanisms:
- FlashAttention 2.0: Employed to reduce the complexity and memory consumption typically associated with self-attention over large token sets.
- Activation Checkpointing and CPU Offloading: Intermediate activations from all transformer stages are stored on CPU and loaded back onto GPU only during back-propagation, optimizing GPU RAM usage for gigapixel-scale input data.
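In PyTorch, the same memory-saving pattern can be approximated with the fused `scaled_dot_product_attention` kernel (which dispatches to FlashAttention when the hardware and dtype allow it), `torch.utils.checkpoint` for activation checkpointing, and `torch.autograd.graph.save_on_cpu` for parking saved activations in CPU RAM until back-propagation. This is an illustrative sketch of the general pattern, not the model's actual training code; `qkv_proj` and `out_proj` are assumed linear layers.

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint
from torch.autograd.graph import save_on_cpu


def flash_mhsa(x, qkv_proj, out_proj, num_heads):
    """Multi-head self-attention via the fused SDPA kernel, which avoids
    materializing the full N x N attention matrix on large token sets."""
    B, N, D = x.shape
    qkv = qkv_proj(x).view(B, N, 3, num_heads, D // num_heads)
    q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: (B, heads, N, head_dim)
    attn = F.scaled_dot_product_attention(q, k, v)
    return out_proj(attn.transpose(1, 2).reshape(B, N, D))


def run_stage(blocks, tokens):
    """Run a stack of transformer blocks with activation checkpointing;
    tensors saved for backward are offloaded to (pinned) CPU memory."""
    with save_on_cpu(pin_memory=True):                   # CPU offloading of saved activations
        for block in blocks:
            tokens = checkpoint(block, tokens, use_reentrant=False)
    return tokens
```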
Slide-level Aggregation:
EXAONE Path 2.0 utilizes the CLAM aggregator for downstream biomarker classification. Patch features are weighted by attention scores from a small MLP, then summed:

$$z = \sum_i a_i\, h_i, \qquad a_i = \operatorname{softmax}_i\!\left(w^\top \tanh(V h_i)\right)$$

The aggregated slide vector $z$ is input to a linear classifier.
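A compact sketch of such an attention-MIL aggregator (the hidden size and classifier head are assumptions) could look like:

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Attention-based MIL aggregator: scores each patch embedding with a small MLP,
    softmax-normalizes the scores, and returns the weighted sum as the slide vector."""

    def __init__(self, dim=768, hidden=256, num_classes=2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, h):                         # h: (num_patches, dim) patch embeddings h_i
        a = torch.softmax(self.score(h), dim=0)   # (num_patches, 1) attention weights a_i
        z = (a * h).sum(dim=0)                    # slide-level feature vector z
        return self.classifier(z), a              # logits and weights (useful for heatmaps)
```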
2. End-to-End Supervision and Training Protocol
Unlike standard SSL+MIL two-stage pipelines, EXAONE Path 2.0 backpropagates slide-level cross-entropy loss through all three hierarchical encoder stages, directly tuning patch and region filters for biomarker-predictive features.
Training is conducted in two curriculum phases:
- Phase I: DINO-style self-supervised losses on patch and region encoders to learn robust low-level and mid-level features.
- Phase II: Slide-level cross-entropy loss across a 16-task output (33 cancer subtypes, 12 tissue classes, 10 molecular biomarkers), added to continued DINO losses at both patch and region levels.
The total mini-batch loss in Phase II is

$$\mathcal{L} = \mathcal{L}_{\mathrm{DINO}}^{\mathrm{patch}} + \mathcal{L}_{\mathrm{DINO}}^{\mathrm{region}} + \lambda\,\mathcal{L}_{\mathrm{CE}},$$

with the cross-entropy weight $\lambda$ chosen as $1.0$ for the optimal trade-off between representation robustness and biomarker signal.
Key term definitions:
- $h_i$: Patch-level embedding for patch $i$ (Stage 1 output)
- $a_i$: CLAM attention weight for patch $i$
- $z$: Slide-level feature vector
- $y_t$, $\hat{y}_t$: Ground-truth and predicted one-hot labels for classification task $t$
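A minimal sketch of how this Phase II objective could be assembled is shown below; the dict-keyed multi-task head and the per-task averaging of the cross-entropy term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

LAMBDA_CE = 1.0  # weight of the slide-level cross-entropy term (reported optimum)


def phase2_loss(dino_patch_loss, dino_region_loss, slide_logits, slide_labels):
    """Phase II objective: keep the patch- and region-level DINO terms and add
    slide-level cross-entropy, back-propagated through all three encoder stages.
    `slide_logits` / `slide_labels` are dicts keyed by task name (multi-task head)."""
    ce = sum(
        F.cross_entropy(slide_logits[task], slide_labels[task])
        for task in slide_labels
    ) / max(len(slide_labels), 1)
    return dino_patch_loss + dino_region_loss + LAMBDA_CE * ce
```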
3. Dataset Composition and Data Efficiency
EXAONE Path 2.0 is trained using 37,195 FFPE H&E-stained WSIs at 20× magnification, sourced from four institutions (KOR, USA1, USA2, CPTAC). The model is trained jointly across 16 tasks, spanning cancer subtyping (33 classes), tissue classification (12 organs), and 10 molecular biomarker tasks, yielding 144,450 image–label pairs post multi-task annotation.
Data Efficiency:
EXAONE Path 2.0 (≈180M parameters) achieves an average AUROC of $0.784$ across 10 biomarker tasks using only 37K WSIs. By contrast, SSL-pretrained models (e.g., DINOv2, MAE with MIL) require more than 100K WSIs to reach similar performance, highlighting the benefit of integrated end-to-end supervision in focusing representation learning on clinically meaningful signals.
4. Performance Across Biomarker Prediction Benchmarks
Performance is benchmarked against leading pathology classifiers. Table 1 presents AUROC scores on 10 heterogeneous biomarker tasks; EXAONE Path 2.0 ranks highest on average and leads the listed baselines on several individual tasks, including LUAD-EGFR, LUAD-KRAS, and RCC-BAP1.
| Task | TITAN | PRISM | CHIEF | Prov-GigaPath | UNI2-h | EXAONE Path 1.0 | EXAONE Path 2.0 |
|---|---|---|---|---|---|---|---|
| LUAD-TMB (USA1) | 0.690 | 0.645 | 0.650 | 0.674 | 0.669 | 0.692 | 0.664 |
| LUAD-EGFR (USA1) | 0.754 | 0.815 | 0.784 | 0.709 | 0.827 | 0.784 | 0.853 |
| LUAD-KRAS (USA2) | 0.541 | 0.623 | 0.468 | 0.511 | 0.469 | 0.527 | 0.645 |
| CRC-MSI (KOR) | 0.937 | 0.943 | 0.927 | 0.954 | 0.981 | 0.972 | 0.938 |
| BRCA-TP53 (CPTAC) | 0.788 | 0.842 | 0.788 | 0.739 | 0.808 | 0.766 | 0.757 |
| BRCA-PIK3CA (CPTAC) | 0.758 | 0.893 | 0.702 | 0.735 | 0.857 | 0.735 | 0.804 |
| RCC-PBRM1 (CPTAC) | 0.638 | 0.557 | 0.513 | 0.527 | 0.501 | 0.526 | 0.583 |
| RCC-BAP1 (CPTAC) | 0.719 | 0.769 | 0.731 | 0.697 | 0.716 | 0.719 | 0.807 |
| COAD-KRAS (CPTAC) | 0.764 | 0.744 | 0.699 | 0.815 | 0.943 | 0.767 | 0.912 |
| COAD-TP53 (CPTAC) | 0.889 | 0.816 | 0.701 | 0.712 | 0.783 | 0.819 | 0.875 |
| Average | 0.748 | 0.765 | 0.696 | 0.707 | 0.755 | 0.731 | 0.784 |
The radar plot in Figure 1 visualizes consistent, high performance across these tasks.
5. Analysis via Ablation and Controlled Experiments
Several controlled experiments elucidate design choices:
- End-to-End vs. Patch-Only Supervision: Removing slide-level cross-entropy in final training phases yields a 4-point average AUROC decrease (0.784 → 0.742), emphasizing the necessity of direct end-to-end supervision.
- Model Depth: Early-exit adaptation (using only Stage 1 + CLAM at inference) yields similar average AUROC (0.780) to full three-stage models, supporting robust patch features and reducing the need for heavy slide-level Transformer inference.
- Curriculum Learning: Omitting initial DINO warm-up (i.e., cold end-to-end training) leads to less stable optimization and slightly reduced performance (0.784 → 0.768).
- Parameterization: $\lambda = 1.0$ for the CE loss is confirmed optimal; lower values underweight the supervision signal, higher values overfit slide labels.
6. Clinical Pipeline Integration, Extensions, and Limitations
Integration: Only patch features and CLAM aggregation are used at inference—full-slide ViT encodings need not be computed—enabling deployment in resource-constrained digital pathology settings. High accuracy on clinically relevant biomarker tasks positions EXAONE Path 2.0 as an assistive tool for molecular triage, particularly where sequencing infrastructure is lacking.
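A hedged sketch of that lightweight inference path, assuming hypothetical `patch_encoder` (Stage 1, returning one embedding per tile) and `aggregator` (attention-MIL head) modules and a simple batching loop over pre-extracted tissue tiles:

```python
import torch


@torch.no_grad()
def predict_slide(patches, patch_encoder, aggregator, batch_size=256, device="cuda"):
    """Inference with only Stage 1 + attention aggregation: encode 256x256 tiles in
    batches, pool with the CLAM-style head, and return slide-level probabilities.
    `patches` is a (num_tiles, 3, 256, 256) tensor of tissue tiles at 20x."""
    feats = []
    for i in range(0, patches.size(0), batch_size):
        batch = patches[i:i + batch_size].to(device)
        feats.append(patch_encoder(batch).cpu())        # (B, dim) per batch, parked on CPU
    h = torch.cat(feats).to(device)                     # (num_tiles, dim)
    logits, attn = aggregator(h)                        # slide logits + per-tile weights
    return torch.softmax(logits, dim=-1), attn
```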
Extensibility:
- Segmentation & Morphometry: The patch encoder can be repurposed with pixel-level decoders for segmentation tasks.
- Multimodal Fusion: Stage 3 can be extended by cross-modal attention for integrating clinical metadata or radiology images.
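As an illustration only, such a fusion step might look like the following cross-attention module (entirely hypothetical; not part of the released model):

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Hypothetical Stage-3 extension: WSI region tokens attend to clinical or
    radiology embeddings via cross-attention before slide-level pooling."""

    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, region_tokens, context_tokens):
        # region_tokens: (B, R, dim) WSI region features; context_tokens: (B, C, dim)
        fused, _ = self.cross_attn(region_tokens, context_tokens, context_tokens)
        return self.norm(region_tokens + fused)          # residual + norm, standard fusion pattern
```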
Limitations and Prospective Directions:
- Compute & Memory: Training at full gigapixel resolution still requires multi-GPU clusters; scaling to higher magnifications (e.g., 40×) presents new challenges.
- Label Granularity: Slide-level labels are insufficient for localized training; integration of weak or point-level annotation may address sub-slide heterogeneity.
- Model Interpretability: Future work is needed on visual explainability (e.g., attention heatmaps validated by pathologists) for enhanced clinical adoption.
EXAONE Path 2.0 demonstrates that the combination of hierarchical transformer architectures, curriculum learning, memory-efficient attention, and end-to-end supervision produces a data-efficient pathology foundation model, advancing slide-level biomarker prediction across diverse indications (Pyeon et al., 9 Jul 2025).