PLUTO: Pathology Foundation Model
- PLUTO is a pathology foundation model family offering multi-scale, sample-efficient image representations across varied histopathology tasks.
- It employs specialized Vision Transformer architectures with flexible patch sizing and robust pre-training on extensive, multi-institutional datasets.
- PLUTO achieves competitive benchmark results, out-of-domain robustness, and efficient deployment for both clinical and research applications.
The PathoLogy Universal TransfOrmer (PLUTO) is a pathology foundation model family designed to provide multi-scale, robust, and sample-efficient image representations across diverse histopathology tasks. PLUTO models are pre-trained on large, multi-institutional whole slide image (WSI) corpora and employ transformer backbones with architectural designs tailored for flexibility, computational efficiency, and strong transfer learning across tissue, stain, and scanner domains. The PLUTO series covers both compact (PLUTO, PLUTO-4S) and frontier-scale (PLUTO-4G) implementations, each optimized for distinct deployment scenarios and research applications (Juyal et al., 2024, Padigela et al., 4 Nov 2025).
1. Model Architectures
PLUTO models utilize Vision Transformer (ViT) backbones, each with specific configurations to balance representation capacity and practical deployability.
- PLUTO: Employs FlexiViT-S (ViT-Small) with 12 transformer blocks, hidden dimension 384, 6 attention heads, and MLP dimension 1,536, totaling approximately 22 million parameters. FlexiViT enables variable patch sizes, facilitating inference-time selection of context length and computational throughput.
- PLUTO-4S: Extends the ViT-S family with the same parameter count (22 M), 12 transformer layers, hidden size 384, and 6 heads. Supports FlexiViT multi-scale deployment, allowing for patch size selection according to task granularity.
- PLUTO-4G: Operates at "frontier-scale" using a ViT-G/14 architecture (1.1 billion parameters, 40 transformer layers, hidden size 1,408, MLP dimension 5,632, 16 attention heads). Uses a fixed 14×14 patch size for stability and capacity, with 2D rotary positional embeddings (2D-RoPE) providing coordinate-wise relative encoding resilient to very long sequences. For a token at grid position $(x, y)$, the feature pairs assigned to the $x$-coordinate are rotated as

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(x\,\theta_i) & -\sin(x\,\theta_i) \\ \sin(x\,\theta_i) & \cos(x\,\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix},$$

and similarly for the $y$-axis. Register tokens are included to absorb outlier activations in very deep models (Padigela et al., 4 Nov 2025).
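As an illustrative sketch (not PathAI's implementation), the 2D-RoPE rotation described above can be written in NumPy by splitting the head dimension between the $x$- and $y$-coordinates and rotating feature pairs by position-proportional angles:

```python
import numpy as np

def rope_1d(q, pos, theta_base=10000.0):
    """Rotate feature pairs of q by angles proportional to pos (standard RoPE)."""
    d = q.shape[-1]                       # must be even
    i = np.arange(d // 2)
    theta = theta_base ** (-2.0 * i / d)  # per-pair rotation frequencies
    angle = pos * theta                   # shape (d//2,)
    q_even, q_odd = q[..., 0::2], q[..., 1::2]
    rot_even = q_even * np.cos(angle) - q_odd * np.sin(angle)
    rot_odd = q_even * np.sin(angle) + q_odd * np.cos(angle)
    out = np.empty_like(q)
    out[..., 0::2], out[..., 1::2] = rot_even, rot_odd
    return out

def rope_2d(q, x, y):
    """Apply RoPE to the first half of features with x, second half with y."""
    d = q.shape[-1]
    return np.concatenate([rope_1d(q[..., : d // 2], x),
                           rope_1d(q[..., d // 2 :], y)], axis=-1)

q = np.random.randn(64)           # one token's query vector (head dim 64, assumed)
q_rot = rope_2d(q, x=3.0, y=7.0)  # token at grid position (3, 7)
```

Because each pair is rotated rather than translated, the vector norm is preserved, and attention scores between two tokens depend only on their relative grid offsets.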
2. Pre-training Data and Objectives
PLUTO was pre-trained on 195 million image tiles drawn from 158,852 WSIs, spanning over 50 sites, 16 tissue groups, 28 disease categories, 11 scanner types, and diverse stain modalities (H&E FFPE/frozen, >100 IHC markers, special stains). PLUTO-4 models expanded the pre-training set to 551,164 WSIs from 137,144 patients and over 100 stains, with systematic coverage of >60 disease types and >10 scanner models.
All models use comprehensive preprocessing pipelines:
- Artifact/background region masking via PathAI’s ArtifactDetect CNN.
- Multi-resolution sampling at 40×, 20×, 10×, and 5× magnifications.
- Data augmentation employing two global crops (224×224 pixels), four local crops (96×96), random flips, color jitter, and injection of 4 million pathologist-delineated region samples (for diversity enrichment) (Juyal et al., 2024).
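The multi-crop sampling scheme above (two global 224×224 crops, four local 96×96 crops) can be sketched in a few lines; real pipelines add random flips and color jitter via torchvision-style transforms, and the helper names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(tile, size):
    """Crop a random size x size window from an H x W x C tile."""
    h, w = tile.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return tile[top : top + size, left : left + size]

def multi_crop(tile, n_global=2, global_size=224, n_local=4, local_size=96):
    """Two global and four local crops, mirroring the PLUTO augmentation recipe."""
    global_crops = [random_crop(tile, global_size) for _ in range(n_global)]
    local_crops = [random_crop(tile, local_size) for _ in range(n_local)]
    return global_crops, local_crops

tile = rng.random((512, 512, 3))  # a synthetic stand-in for a WSI tile
global_crops, local_crops = multi_crop(tile)
```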
The PLUTO pre-training loss is a composite of four self-supervised terms:

$$\mathcal{L} = \mathcal{L}_{\text{DINO}} + \mathcal{L}_{\text{iBOT}} + \mathcal{L}_{\text{KoLeo}} + \mathcal{L}_{\text{Fourier}}$$
The novel Fourier-band reconstruction loss targets both low- and high-frequency restoration in masked image regions:

$$\mathcal{L}_{\text{Fourier}} = \big\| M \odot \big(\mathcal{F}(\hat{x}) - \mathcal{F}(x)\big) \big\|^2 + \big\| (1 - M) \odot \big(\mathcal{F}(\hat{x}) - \mathcal{F}(x)\big) \big\|^2$$

where $\mathcal{F}$ is the discrete Fourier transform, $M$ is a low-pass frequency mask, $M \odot \mathcal{F}(x)$ retains the low-frequency band, and $(1 - M) \odot \mathcal{F}(x)$ the high-frequency band (Juyal et al., 2024).
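A minimal sketch of a Fourier-band reconstruction loss of this form, assuming a simple radial low-pass mask (the paper's exact masking and weighting may differ):

```python
import numpy as np

def lowpass_mask(shape, radius):
    """Binary mask keeping spatial frequencies with magnitude <= radius."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return (np.sqrt(fx**2 + fy**2) <= radius).astype(float)

def fourier_band_loss(pred, target, radius=0.1):
    """Sum of low- and high-frequency reconstruction errors in Fourier space."""
    M = lowpass_mask(pred.shape, radius)
    diff = np.fft.fft2(pred) - np.fft.fft2(target)
    low = np.mean(np.abs(M * diff) ** 2)
    high = np.mean(np.abs((1 - M) * diff) ** 2)
    return low + high

rng = np.random.default_rng(1)
x = rng.random((32, 32))
loss_same = fourier_band_loss(x, x)  # zero for a perfect reconstruction
loss_diff = fourier_band_loss(x, rng.random((32, 32)))
```

Splitting the error by band lets the two terms be weighted separately, so the model is penalized for missing fine texture (high band) as well as coarse structure (low band).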
For PLUTO-4, pre-training follows DINOv2’s teacher-student, multi-view siamese approach with symmetric cross-entropy, momentum-updated teacher network, and global/local crop augmentations. Precision is managed via bfloat16, with float32 for loss/center updates (Padigela et al., 4 Nov 2025).
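The momentum-teacher update and symmetric cross-entropy at the core of the DINOv2-style recipe can be sketched with plain arrays (simplified; real training operates on network parameters and centered/sharpened logits):

```python
import numpy as np

def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters track an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def symmetric_ce(p_t1, p_s2, p_t2, p_s1, eps=1e-12):
    """Symmetric cross-entropy between teacher and student view predictions."""
    ce = lambda p, q: -np.sum(p * np.log(q + eps))
    return 0.5 * (ce(p_t1, p_s2) + ce(p_t2, p_s1))

student = [np.zeros(4), np.ones(3)]
teacher = [np.ones(4), np.ones(3)]
teacher = ema_update(teacher, student)  # teacher drifts slowly toward student

p = np.full(4, 0.25)
loss = symmetric_ce(p, p, p, p)  # equals the entropy of a uniform distribution
```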
3. Task-Specific Adaptation and Downstream Heads
PLUTO backbones are used as frozen feature extractors, with lightweight adaptation heads for diverse histopathology tasks operating across the WSI pyramid:
- Slide-level Prediction (Level 1): Attention-based Multiple Instance Learning (MIL) head pools tile-level embeddings using non-linear attention (as in Ilse et al.):

$$z = \sum_{k=1}^{K} a_k h_k, \qquad a_k = \frac{\exp\{w^{\top}\tanh(V h_k)\}}{\sum_{j=1}^{K} \exp\{w^{\top}\tanh(V h_j)\}}$$

where $h_k$ are tile embeddings and $w$, $V$ are learned attention parameters.
- Tile Classification (Levels 2 and 3): Single linear or 2-layer MLP heads on the CLS token, optionally concatenated with mean-pooled or attention-pooled patch tokens.
- Instance Segmentation (Levels 2 and 3): Mask R-CNN+ViT-Adapter and Mask2Former+ViT-Adapter frameworks convert transformer features to a multi-scale FPN, supporting per-query mask prediction, class logits, and standard detection/segmentation losses (Juyal et al., 2024).
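The Ilse et al. attention pooling used by the MIL head can be sketched in a few lines; random weights stand in for the learned parameters $V$ and $w$:

```python
import numpy as np

def attention_mil_pool(H, V, w):
    """Pool K tile embeddings H (K x d) into one slide embedding via
    non-gated tanh attention (Ilse et al.)."""
    scores = np.tanh(H @ V.T) @ w      # (K,) unnormalized attention scores
    a = np.exp(scores - scores.max())  # softmax, numerically stabilized
    a = a / a.sum()
    return a @ H, a                    # weighted sum of tile embeddings

rng = np.random.default_rng(2)
K, d, L = 16, 384, 128                 # 16 tiles, ViT-S embedding dim 384
H = rng.standard_normal((K, d))
V = rng.standard_normal((L, d))
w = rng.standard_normal(L)
z, a = attention_mil_pool(H, V, w)
```

The attention weights `a` are nonnegative and sum to 1, so the slide embedding `z` is a convex combination of tile embeddings and the weights themselves are interpretable as tile relevance.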
For evaluation, linear probes and lightweight heads are trained on frozen representations, supporting patch-level, instance-level, and slide-level performance assessment (Padigela et al., 4 Nov 2025).
4. Experimental Evaluation and Benchmarks
PLUTO and PLUTO-4 models have been benchmarked on both public and proprietary pathology datasets, spanning multiple resolutions, tissue types, and task categories.
Selected Quantitative Results
| Task / Dataset | Adaptation Head | PLUTO / PLUTO-4 Metric | SOTA Baseline |
|---|---|---|---|
| NSCLC Subtyping (slide) | MIL | F1: 90.2 / AUROC: 94.0 | Meta-DINOv2 ViT-S: F1: 88.6 / AUROC: 92.0 |
| HER2 IHC (slide) | MIL | F1: 71.5 / AUROC: 89.5 | Meta-DINOv2 ViT-S: F1: 56.4 / AUROC: 83.4 |
| CRC-100K (tile, 9 cls) | Linear probe | Acc: 96.6, Bal Acc: 95.3 | ResNet50: 94.7 |
| GlaS (gland segm.) | Mask2Former | Dice: 91.2, IoU: 84.5 | U-Net: Dice: 85.5 |
| MoNuSAC (nucleus, Dice) | Linear probe (PLUTO-4G) | 70.4 | SOTA: 66.9–68.5 |
| Derm-2K (slide, F1) | Linear probe (PLUTO-4G) | 67.1 | H-Optimus-0: 62.8 |
All evaluations used frozen encoders; PLUTO-4G consistently outperformed PLUTO-4S, and both outperformed prior supervised and self-supervised models across all benchmarked tasks (Juyal et al., 2024, Padigela et al., 4 Nov 2025).
PLUTO models demonstrated out-of-domain (OOD) robustness, with improvements on external site distributions attributed to dataset diversity in both source institution and stain modality. Ablations showed that patch-size scaling trades throughput against minor accuracy changes, and that including the Fourier loss improved OOD tile-classification performance by approximately 1–2%.
5. Sample Efficiency, Generalization, and Scaling
Despite using 10×–100× fewer pretraining images or smaller parameter counts than contemporary pathology foundation models, the PLUTO family matches or surpasses state-of-the-art performance on OOD and multi-task benchmarks. This efficiency is primarily attributed to the extensive diversity of the pretraining dataset across sites, tissues, stains, and scanners (Juyal et al., 2024).
PLUTO-4G demonstrates scaling advantages in absolute performance, particularly for tasks requiring wide spatial context or fine-grained distinction, such as dermatopathology diagnosis (11% improvement over previous series) and spatial transcriptomics correlation. When high throughput and flexible deployment are required, PLUTO-4S (or base PLUTO) achieves strong results with reduced computational resource requirements (Padigela et al., 4 Nov 2025).
6. Computational Performance and Deployment
PLUTO models emphasize computational efficiency for real-world clinical and research pipelines. ViT-S backbones in PLUTO/PLUTO-4S achieve 2.5× to 15× higher tile throughput than larger ViT variants, with up to 2× additional throughput when operating at the largest supported patch size. Modest memory requirements (16–24 GB of GPU memory for typical inputs) and parameter counts (~22 M for PLUTO/PLUTO-4S) permit practical deployment on commodity hardware (Juyal et al., 2024).
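The throughput effect of patch-size selection follows directly from the token count: a 224×224 tile yields (224/p)² patch tokens at patch size p, and self-attention cost grows roughly with the square of that count. A quick illustration:

```python
# Token counts (excluding CLS/register tokens) for a 224x224 input at
# several FlexiViT-style patch sizes; attention cost scales ~ tokens^2.
def n_tokens(image_size=224, patch=16):
    return (image_size // patch) ** 2

for p in (8, 14, 16, 28, 32):
    print(f"patch {p:2d}: {n_tokens(patch=p):4d} tokens")
```

Moving from an 8-pixel patch (784 tokens) to a 32-pixel patch (49 tokens) cuts the sequence length 16-fold, which is the mechanism behind the inference-time throughput/context trade-off described above.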
PLUTO-4S supports multi-scale inference for tasks ranging from cell and region proposals to slide triage. PLUTO-4G, while more demanding, addresses large-scale biomarker quantification and advanced spatial omics applications. These models have been integrated in clinical and research products such as PathExplore, IHCExplore, TumorDetect, and AIM quantification suites (Padigela et al., 4 Nov 2025).
7. Limitations and Future Directions
PLUTO models, including PLUTO-4 variants, provide frozen backbone representations; task-specific adapters remain essential for robust deployment. Fine-tuning the last layers of the backbone, integrating hierarchical ViT architectures (e.g., Swin), and exploring full model parallelism (e.g., FSDP) represent architectural extension points.
Continued study of empirical scaling laws—balancing dataset diversity, volume, model size, and compute—is suggested as critical for further advances. Extensions to cross-modal pretraining (e.g., combining WSI data with genomics or transcriptomics), interpretable adaptation heads, uncertainty quantification, and support for new imaging modalities present active research avenues.
A plausible implication is that universal pathology-specific foundation embeddings such as those learned by PLUTO could underlie next-generation digital pathology analysis pipelines, reducing annotation burden and improving generalization across unseen site and technical confounders (Juyal et al., 2024, Padigela et al., 4 Nov 2025).