PLUTO: Universal Pathology Transformer Models

Updated 16 May 2026

PLUTO is a family of pathology foundation models that use Vision Transformers to extract multi-scale representations from whole-slide images.
The models include compact and giant variants, with architectures optimized for both real-time analyses and high-accuracy research applications.
They leverage multi-objective self-supervision and diverse, multi-institutional data to enable robust downstream tasks in digital pathology.

The PathoLogy Universal TransfOrmer (PLUTO) is a family of pathology foundation models designed to extract multi-scale representations from whole-slide images (WSIs) for diverse digital pathology tasks. PLUTO leverages Vision Transformer (ViT) architectures augmented for domain specificity and is pre-trained on large, multi-institutional pathology corpora via a combination of self-supervised objectives. The PLUTO line encompasses both a compact, efficient backbone (original PLUTO and PLUTO-4S) and a frontier-scale variant (PLUTO-4G), enabling scalable deployment from real-time analyses to state-of-the-art performance in data-intensive research settings (Padigela et al., 4 Nov 2025, Juyal et al., 2024).

1. Model Architecture and Variants

PLUTO models are built on the Vision Transformer (ViT) architecture, with adaptations for multi-scale pathology image inputs. The principal variants are PLUTO (ViT-S backbone with ≈22 M parameters) and PLUTO-4, which includes both PLUTO-4S (“Small”) and PLUTO-4G (“Giant”).
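As a rough sanity check on the ≈22 M figure, the sketch below counts the weights in the transformer blocks of a ViT-Small with the dimensions given for this backbone (12 layers, hidden dimension 384, MLP hidden size 1536). It assumes the vanilla ViT block layout (fused QKV projection, output projection, two-layer MLP, two LayerNorms, all with biases); PLUTO's exact layer composition is not specified here, so the function name `vit_block_params` and the breakdown are illustrative only.

```python
def vit_block_params(d: int, mlp_dim: int) -> int:
    """Parameter count of one vanilla ViT encoder block (assumed layout)."""
    qkv = 3 * d * d + 3 * d                          # fused query/key/value projection
    proj = d * d + d                                 # attention output projection
    mlp = d * mlp_dim + mlp_dim + mlp_dim * d + d    # two linear layers with biases
    norms = 2 * (2 * d)                              # two LayerNorms (scale and shift)
    return qkv + proj + mlp + norms


depth, d, mlp_dim = 12, 384, 1536                    # ViT-Small dimensions from the text
blocks_total = depth * vit_block_params(d, mlp_dim)
print(f"{blocks_total:,} parameters in the transformer blocks")
print(f"~{blocks_total / 1e6:.1f} M before embeddings and heads")
```

Under these assumptions the blocks account for roughly 21.3 M parameters, so embeddings and any heads contribute only a small remainder of the ≈22 M total.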

PLUTO and PLUTO-4S

Backbone: ViT-Small, 12 transformer layers, hidden dimension 384, 6 heads, MLP hidden size 1536.
FlexiViT extension: Permits patch-tokenization at multiple scales ($P\in\{8,16,32\}$ px) without retraining, facilitating the capture of diverse histological contexts.
Patch Embedding: For each patch $p_i$ from input image $X\in\mathbb{R}^{H\times W\times 3}$,

$z_i^0 = W_e\,\mathrm{vec}(p_i) + b_e,\quad z^0 = [z_{\mathrm{CLS}};\ z_1^0;\ z_2^0;\ \dots;\ z_N^0]$

where $W_e\in\mathbb{R}^{d\times 3P^2}$, $b_e\in\mathbb{R}^d$, and $z_{\mathrm{CLS}}\in\mathbb{R}^d$ is a learnable class token prepended to the $N$ patch tokens.

Positional Encoding: Learnable 1D (PLUTO) or 2D-RoPE (PLUTO-4S) embeddings; 2D-RoPE applies a rotary transform to each spatial dimension chunk of the query/key embedding.
Decoder: Lightweight transformer decoder for masked autoencoding and reconstruction objectives.
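To make the multi-scale tokenization concrete, the following sketch computes the sequence length and the effective patch-embedding shape for each supported patch size. The 224 px tile size and the helper names `patch_grid` and `sequence_length` are assumptions chosen for illustration; they are not taken from the PLUTO papers.

```python
from typing import Tuple


def patch_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Number of non-overlapping patches along each image axis."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch, width // patch


def sequence_length(height: int, width: int, patch: int) -> int:
    """Patch tokens plus one CLS token."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols + 1


HIDDEN_DIM = 384     # ViT-Small width from the text
IMAGE_SIZE = 224     # assumed tile size, for illustration only

for patch in (8, 16, 32):
    tokens = sequence_length(IMAGE_SIZE, IMAGE_SIZE, patch)
    embed_shape = (HIDDEN_DIM, 3 * patch * patch)    # effective shape of W_e at this P
    print(f"P={patch:>2}: {tokens:>4} tokens, W_e shape {embed_shape}")
```

Because self-attention cost grows quadratically with sequence length, moving from $P=32$ to $P=8$ on the same tile trades a much longer sequence for finer spatial detail, which is the context versus throughput trade-off that flexible patch sizes expose.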

PLUTO-4G

Backbone: ViT-G, ≈1.1 billion parameters, increased transformer depth, single patch size ($P=14$ px).
Positional Encoding: Absolute positional embedding (learned or 1D-RoPE), no FlexiViT.
Self-Attention Mechanism: Standard ViT attention,

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)\,V$

No multi-scale training: Compute focused on a single sequence length for maximal stability and representation power.
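The scaled dot-product attention used here can be written out in a few lines. This is a minimal single-head sketch in plain Python with toy inputs, intended only to show the arithmetic; it is not the PLUTO implementation, and production code would use a batched, multi-head tensor library routine.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q_row in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted sum of value rows, one output coordinate at a time.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


# Two tokens with 2-dimensional queries, keys, and values (toy numbers).
Q = [[1.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
```

The second query is all zeros, so both keys score equally and its output is the plain average of the two value rows; the first query aligns with the first key and its output is pulled toward the first value row.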

2. Self-Supervised Pretraining Objectives

PLUTO employs a multi-objective self-supervision pipeline:

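DINOv2-derived teacher–student contrastive loss: Two augmented crops $X_i, X_j$, with student (S) and teacher (T) networks. Teacher logits $t_i$ and student logits $s_j$ are converted to softened distributions via temperatures $\tau_t, \tau_s$ and a centering term $c$:

$P_T(X_i) = \mathrm{softmax}\left(\frac{t_i - c}{\tau_t}\right),\quad P_S(X_j) = \mathrm{softmax}\left(\frac{s_j}{\tau_s}\right)$

The cross-entropy loss for each positive pair is

$\mathcal{L}_{\mathrm{DINO}} = -\sum_k P_T^{(k)}(X_i)\,\log P_S^{(k)}(X_j)$

iBOT loss (PLUTO): Patch-level contrastive objective on local crops.
Masked Autoencoder (MAE) loss: $\mathcal{L}_{\mathrm{MAE}}$, pixel reconstruction over masked regions.
Fourier-domain reconstruction loss: Weighted difference in the DFT domain for low- and high-frequency image components,

$\mathcal{L}_{\mathrm{FT}} = \lambda_{\mathrm{low}}\,\mathcal{L}_{\mathrm{low}} + \lambda_{\mathrm{high}}\,\mathcal{L}_{\mathrm{high}}$

with weights $\lambda_{\mathrm{low}}$ (low frequency), $\lambda_{\mathrm{high}}$ (high frequency), where $\mathcal{L}_{\mathrm{low}}$ and $\mathcal{L}_{\mathrm{high}}$ denote the reconstruction errors on the corresponding DFT components.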

The composite objective enables robust representation learning across both global slide structure and fine cellular features.
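The teacher–student term can be sketched as follows, using the DINO-style formulation in which centered teacher logits and raw student logits are softened by separate temperatures and compared with cross-entropy. The function names and the default temperatures (0.04 and 0.1) are illustrative assumptions, not values reported for PLUTO, and the sketch covers a single pair of crops without the multi-crop averaging or the running update of the center.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Log of the temperature-softened softmax, computed stably."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def dino_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    center: Sequence[float],
    tau_t: float = 0.04,   # assumed teacher temperature
    tau_s: float = 0.1,    # assumed student temperature
) -> float:
    """Cross-entropy between the teacher and student distributions."""
    # The teacher is centered before sharpening; it acts as a fixed target.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(lp) for lp in log_softmax(centered, tau_t)]
    log_p_student = log_softmax(student_logits, tau_s)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))


zero_center = [0.0, 0.0]
agree = dino_loss([2.0, 0.0], [2.0, 0.0], zero_center)
disagree = dino_loss([2.0, 0.0], [0.0, 2.0], zero_center)
uniform = dino_loss([0.0, 0.0], [0.0, 0.0], zero_center)
```

The loss is small when the student ranks the same prototype highest as the teacher and large when it does not; with all-zero logits both distributions are uniform and the loss equals $\ln 2$ for two prototypes.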

3. Training Data and Protocol

PLUTO and PLUTO-4 are pretrained on large, multi-institutional WSI corpora:

Corpus	WSIs	Sites	Diseases	Stains	Tiles
PLUTO	158,852	>50	28	>100	195 million
PLUTO-4	551,164	>50 (5 continents)	>60	>100	640 million

Clinical scope: >40 organ systems, malignant, benign, inflammatory, and normal tissue; >100 stains (H&E, IHC, special/frozen); ~10 scanner models and 4 magnifications (0.25–2.0 μm/px).
Annotations: Several million pathologist-curated ROI annotations (labels discarded at pretraining, used for task diversity).
Sampling: Tiles at patch sizes 275–550 px; multi-crop regime (global crops for DINO, local crops for iBOT).
Distributed Training: PLUTO (22 M parameters) was trained on 64 NVIDIA A40 GPUs; PLUTO-4G uses large-scale distributed data-parallel training, scaling up to 32 GPUs and requiring ≥16 GB VRAM per GPU and a high-bandwidth interconnect.

4. Task-Specific Adaptation and Downstream Applications

PLUTO models are adapted for downstream tasks by attaching lightweight adaptation heads while freezing the backbone:

Slide-level prediction: Attention-based multiple instance learning (AdditiveMIL). Tile embeddings $h_k\in\mathbb{R}^{d}$ are attention-weighted,

$z = \sum_{k} a_k\,h_k,\quad a_k = \frac{\exp(s(h_k))}{\sum_{j}\exp(s(h_j))}$

where $s(\cdot)$ is a learned scoring function.

Tile classification: MLP on CLS token or concatenation with pooled patch tokens (CLS only, mean-pool, or attention-pool).
Instance segmentation: Mask R-CNN and Mask2Former on top of PLUTO (frozen) features; adaptation heads operate on resolution-appropriate patch sizes.
Segmentation and classification at cell, tissue, and slide levels: Flexible patch sizes optimize trade-offs between context and throughput.
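The attention-weighted pooling step used for slide-level prediction can be illustrated with a small sketch. It assumes frozen tile embeddings and a hypothetical linear scorer (`score_weights`) standing in for the learned attention network; it shows only the pooling, and the classifier applied on top is omitted.

```python
import math
from typing import List, Sequence


def attention_pool(
    tiles: Sequence[Sequence[float]],
    score_weights: Sequence[float],
) -> List[float]:
    """Softmax-weighted average of tile embeddings.

    score_weights is a hypothetical linear scorer; a real MIL head would
    learn a small network for this role.
    """
    scores = [sum(w * x for w, x in zip(score_weights, tile)) for tile in tiles]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tiles[0])
    # Slide embedding: attention-weighted sum over tiles, per coordinate.
    return [sum(a * tile[j] for a, tile in zip(weights, tiles)) for j in range(dim)]


tile_embeddings = [[1.0, 0.0], [3.0, 2.0]]                  # toy frozen-backbone outputs
mean_pooled = attention_pool(tile_embeddings, [0.0, 0.0])   # equal scores
focused = attention_pool(tile_embeddings, [5.0, 0.0])       # favors the second tile
```

With a zero scorer the result reduces to mean pooling; a scorer that prefers one tile moves the slide embedding toward that tile, which is how the head can focus on diagnostically relevant regions while the backbone stays frozen.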

5. Performance Benchmarks

PLUTO and PLUTO-4 achieve state-of-the-art performance across a spectrum of public and proprietary benchmarks.

Patch-level Classification

Balanced Accuracy ($\mathrm{BA}$), the mean of per-class recall over $C$ classes:

$\mathrm{BA} = \frac{1}{C}\sum_{c=1}^{C}\frac{\mathrm{TP}_c}{\mathrm{TP}_c+\mathrm{FN}_c}$

PLUTO-4G: 87.5–96.4% on MHIST, BreakHIS, BACH, PCAM (1–3 pp better than prior FMs).
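Balanced accuracy is the unweighted mean of per-class recall, which keeps a majority class from dominating the score. The sketch below is a minimal reference implementation with made-up labels; it is not the evaluation code behind the reported numbers.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Toy example: a classifier that always predicts the majority class.
labels = ["benign"] * 3 + ["malignant"]
preds = ["benign"] * 4
score = balanced_accuracy(labels, preds)
plain_accuracy = sum(t == p for t, p in zip(labels, preds)) / len(labels)
```

In this toy case plain accuracy is 0.75 while balanced accuracy is 0.5, because recall on the minority class is zero.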

Segmentation

Dice Score (predicted mask $A$, ground-truth mask $B$):

$\mathrm{Dice} = \frac{2\,|A\cap B|}{|A|+|B|}$

PLUTO-4G: Dice=70.4%/65.0% on MoNuSAC/ConSep, vs prior best ≈66.9%/64.2%.
PLUTO: GlaS gland segmentation, Dice 91.2% with Mask2Former; AJI improvements of +4–12 points over ResNet50 baselines.
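For reference, the Dice score on binary masks is twice the overlap divided by the total foreground area of both masks. The sketch below operates on flattened 0/1 masks; treating two empty masks as a perfect match is a convention assumed here, not something stated in the source.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice coefficient for two flattened binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0   # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


predicted = [1, 1, 0, 0]
ground_truth = [1, 0, 0, 0]
dice = dice_score(predicted, ground_truth)   # 2 * 1 / (2 + 1)
```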

Slide-level Prediction

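Macro-F1 (per-class precision $P_c$ and recall $R_c$ over $C$ classes):

$\mathrm{F1}_{\mathrm{macro}} = \frac{1}{C}\sum_{c=1}^{C}\frac{2\,P_c R_c}{P_c+R_c}$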

PLUTO-4G: Derm-2K macro-F1=0.671, PLUTO-3S=0.606, H-Optimus-0=0.628.
NSCLC subtyping: PLUTO F₁=90.2, AUROC=94.0 (in-domain); OOD F₁=86.1, AUROC=91.2.
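Macro-F1 averages the per-class F1 scores with equal weight per class. The following sketch uses the equivalent count form $2\,\mathrm{TP}/(2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$ for each class; the labels are toy values, and the function is a reference sketch, not the benchmark's evaluation script.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


truth = [0, 0, 1, 1]
predictions = [0, 1, 1, 1]
score = macro_f1(truth, predictions)   # class 0: F1 = 2/3, class 1: F1 = 4/5
```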

Tile and Cell/Tissue-level Classification

PLUTO (CRC-100K): Acc 96.6%, BalAcc 95.3% (exceeding ResNet50).
Cell classification (9-class): Macro-F₁ 0.789 vs 0.749 (CNN baseline).

Throughput and Deployment

PLUTO-4S: 2–4× faster inference, runs on modest hardware (1–2 GB VRAM), multi-scale input support via FlexiViT.
PLUTO-4G: State-of-the-art accuracy (absolute gains of 2–4 pp on large-context tasks), but requires extensive compute, larger model shards, and distributed training infrastructure.

6. Trade-offs, Data Diversity, and Limitations

Performance trade-off: PLUTO-4S sacrifices ~2–4 pp accuracy on large-context tasks for faster inference and ease of deployment. PLUTO-4G achieves new state-of-the-art results at the expense of compute and complexity.
Data diversity: Large, heterogeneous, multi-institutional pretraining data improves out-of-distribution (OOD) robustness compared to larger, single-site models.
Model compactness: The ≈22 M parameter backbone, combined with FlexiViT, yields dramatically faster inference relative to ViT-B/L/H backbones, facilitating real-time and slide-level deployment.
Limitations: Certain complex multi-class tissue classification benchmarks (e.g., 10-way IBD) may require substantial adaptation or fine-tuning; pretraining regimes and hyperparameters offer further room for optimization.
Future directions: Empirical scaling laws in pathology (dataset vs. compute vs. diversity), more efficient adaptation head architectures, few-shot adaptation pipelines, and broader integration with multi-modal or clinical data streams.

7. Broader Significance

PLUTO establishes a robust, efficient framework for digital pathology, demonstrating that targeted architectural adaptations, diverse training data, and composite self-supervision objectives enable generalizable, high-performance representations across pathology tasks. The PLUTO family bridges research and clinical scalability, supporting both high-throughput applications and advanced diagnostic or translational research. Ongoing development emphasizes data diversity, architectural scalability, and practical adaptability, indicating a trend toward universal pathology representation backbones for the field (Padigela et al., 4 Nov 2025, Juyal et al., 2024).

References

Padigela et al. PLUTO-4: Frontier Pathology Foundation Models (2025)

Juyal et al. PLUTO: Pathology-Universal Transformer (2024)
