ANTONI-α: Whole-Slide VLM in Pathology
- ANTONI-α is a whole-slide vision-language model designed for visual question answering (VQA) in pathology, integrating high-resolution tile encoding with multimodal fusion.
- It pairs the Virchow tile encoder with a single-layer cross-attention vision projector to map gigapixel images into the language model's input space efficiently.
- Trained on synthetic instruction data derived from a large-scale H&E archive, ANTONI-α achieves higher organ-identification and differential-diagnosis accuracy than prior models.
ANTONI-α is a whole-slide vision-language model (VLM) developed for visual question answering (VQA) in pathology at the slide level. It natively processes gigapixel-scale hematoxylin and eosin (H&E) stained whole-slide images (WSIs), aligning fine-grained visual context with clinically oriented text responses. Built from open components and trained entirely on publicly available synthetic instruction data, ANTONI-α establishes a reproducible baseline for generalizable, clinically relevant VLMs in digital pathology, surpassing prior models such as MedGemma on key diagnostic tasks (Moonemans et al., 19 Dec 2025).
1. Model Architecture
ANTONI-α integrates four principal modules:
- VIRCHOW (Tile Encoder): A Vision Transformer pretrained on clinical histopathology, providing 1280-dimensional embeddings per tile. Its parameters (~350M) remain frozen during instruction tuning.
- PRISM/CoCa (Slide Aggregator): Aggregates tile embeddings into 1 global and 512 local tokens (all 1280-dimensional); pretrained with image–text contrastive and generative captioning objectives. Its ~400M parameters are frozen during ANTONI-α fine-tuning.
- Vision Projector: A custom, single-layer cross-attention module in which 256 learnable query tokens attend to the 513 vision latents (1 global + 512 local). This single layer maps visual information into the LLM input space and is the principal trainable component in early-stage tuning (~2.5M parameters); a code sketch follows the data-flow summary below.
- MedGemma-4B-IT (LLM Decoder): A 4B-parameter transformer, instruction-tuned for medical applications, with a hidden size of 3072 and 32 layers. Its native SigLIP image pathway is removed; instead, the decoder receives the 256 vision tokens from the projector.
Data flow: WSI → VIRCHOW (tile embeddings) → PRISM (aggregation: 1 global + 512 locals) → Vision projector (cross-attention, 256 tokens) → MedGemma-4B-IT → text output.
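The cross-attention projector is simple enough to sketch directly. Below is a minimal, illustrative PyTorch module assuming the dimensions stated above (1280-d vision latents, 256 queries, 3072-d LLM hidden size); the head count and internal parameterization are assumptions, and the paper's ~2.5M trainable-parameter budget implies a leaner design than this generic layer.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Single-layer cross-attention projector (illustrative sketch).

    Maps the 513 frozen slide-level vision latents (1 global + 512 local,
    1280-d each) to 256 tokens in the LLM input space (3072-d).
    """

    def __init__(self, vision_dim=1280, llm_dim=3072, num_queries=256, num_heads=8):
        super().__init__()
        # 256 learnable query tokens: the only visual tokens the LLM ever sees.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project the 1280-d vision latents into the LLM hidden size.
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, vision_latents):  # (B, 513, 1280)
        b = vision_latents.size(0)
        kv = self.kv_proj(vision_latents)                # (B, 513, 3072)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)  # (B, 256, 3072)
        out, _ = self.cross_attn(q, kv, kv)              # queries attend to vision latents
        return self.norm(out)                            # (B, 256, 3072), prepended to the LLM input
```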
Key Design Choices:
- High-resolution tile embeddings without downsampling artifacts.
- Single-layer cross-modal alignment for efficient resource use.
- Two-stage finetuning: (1) projector warmup, (2) projector+LLM joint QLoRA adaptation.
2. Training Objectives and Loss Functions
ANTONI-α employs a standard next-token cross-entropy loss for instruction tuning:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, x_{\text{prompt}},\, x_{\text{user}},\, x_{\text{img}}\right)$$

where $x_{\text{prompt}}$, $x_{\text{user}}$, and $x_{\text{img}}$ are the prompt, user, and image tokens, and $y_t$ is the ground-truth assistant token at step $t$.
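In practice this loss is computed only over assistant tokens, with prompt, user, and image positions masked out. A minimal sketch, assuming PyTorch and the usual causal shift (the label-masking convention is an assumption, not taken from the paper):

```python
import torch.nn.functional as F

def instruction_tuning_loss(logits, labels, ignore_index=-100):
    """Next-token cross-entropy over assistant tokens only (sketch).

    logits: (B, T, V) decoder outputs; labels: (B, T) token ids in which
    prompt, user, and image positions are set to ignore_index so that only
    ground-truth assistant tokens contribute to the loss.
    """
    # Shift so that position t predicts token t+1 (standard causal LM setup).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,
    )
```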
PRISM Pretraining Objectives (Contextual, Frozen in ANTONI-α):
- Image–text contrastive loss (CLIP-style), a symmetric InfoNCE objective:

$$\mathcal{L}_{\text{con}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j}\exp(s_{ji}/\tau)}\right]$$

  where $s_{ij}$ is the similarity between slide embedding $i$ and report embedding $j$, $\tau$ is a learnable temperature, and $N$ is the batch size.
- Captioning loss: As above, cross-entropy over text targets.
These objectives establish robust multimodal alignment during pretraining of the aggregator; a sketch of the contrastive term follows.
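A compact implementation of the symmetric contrastive term (illustrative; PRISM's actual implementation and temperature handling may differ):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text InfoNCE loss (CLIP-style) on an (N, d) batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs lie on the diagonal; average both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```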
3. Training Data and Synthetic Instruction Generation
The primary training resource is HISTAI-Instruct, derived via a synthetic pipeline over the public HISTAI archive:
- WSIs Post-Filtering: 24,259 in total, of which 22,530 H&E slides were curated for tuning.
- Instruction–Response Pairs: 1,175,524 generated initially, filtered via an LLM-based judge to 1,118,691 high-quality instances.
- Conversational Categories:
- Advanced reasoning
- Clean report
- Detailed description
- Differential diagnosis
- Multi-turn conversation
- Negative reasoning
- Short VQA
Polysome Pipeline:
- Metadata (descriptions, ICD-10, conclusions) is injected into prompt templates.
- English instructions and corresponding responses are generated in all 7 categories.
- Instances are machine-translated into Dutch, French, German, Italian, Polish, and Spanish.
- “LLM-as-Judge” filtering is applied for adherence, factuality, and reasoning quality.
- Question variation increases factual coverage and robustness.
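The pipeline stages above can be condensed into a short sketch. All function names and signatures here are hypothetical stand-ins, not the Polysome API (see Section 7 for the actual toolkit):

```python
# Illustrative sketch of the instruction-generation stages described above.

CATEGORIES = [
    "advanced_reasoning", "clean_report", "detailed_description",
    "differential_diagnosis", "multi_turn_conversation",
    "negative_reasoning", "short_vqa",
]
TARGET_LANGUAGES = ["nl", "fr", "de", "it", "pl", "es"]  # machine-translation targets

def build_prompt(template: str, metadata: dict) -> str:
    """Inject WSI metadata (description, ICD-10 codes, conclusion) into a template."""
    return template.format(**metadata)

def generate_instances(metadata: dict, templates: dict, llm, translate, judge) -> list:
    """Generate, translate, and judge-filter instruction-response pairs for one WSI."""
    instances = []
    for category in CATEGORIES:
        english = llm(build_prompt(templates[category], metadata))
        candidates = [english] + [translate(english, lang) for lang in TARGET_LANGUAGES]
        # LLM-as-Judge: keep only instances passing adherence/factuality/reasoning checks.
        instances.extend(c for c in candidates if judge(c))
    return instances
```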
4. Training Regimen and Hyperparameters
Training runs on 8 NVIDIA H200 GPUs with FSDP and mixed bfloat16 precision.
- Optimizer: AdamW, weight decay 0.01.
- Learning Rate: Cosine schedule with 10% linear warmup.
- Batch Size: 16 sequences/GPU × 8 GPUs × 4 gradient accumulation steps = 512 effective batch.
- Two-Stage Protocol:
- Stage 1 (Vision projector warmup): Only the projector is trained; MedGemma is frozen. Data: multilingual "clean report" samples; 35 epochs.
- Stage 2 (Projector + LLM adapters): Rank-16 QLoRA adapters are applied to MedGemma and trained jointly with the projector. Data: English instructions across all categories; 21 epochs. A configuration sketch follows below.
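A hedged sketch of a Stage-2-style QLoRA setup using Hugging Face transformers and peft; the target modules, LoRA alpha, and quantization flags are assumptions rather than the authors' exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # QLoRA: 4-bit NF4 base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # matches the mixed bfloat16 setup
)
base = AutoModelForCausalLM.from_pretrained(
    "google/medgemma-4b-it", quantization_config=bnb
)
lora = LoraConfig(
    r=16,                                    # rank-16 adapters, as stated above
    lora_alpha=32,                           # illustrative; not given in the source
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)           # projector weights remain separately trainable
```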
Data-Scaling Variants: ANTONI-α was additionally trained on subsets of 2,000, 9,000, and 18,000 WSIs to characterize data-scaling behavior.
5. Evaluation and Performance Benchmarks
Evaluation benchmarks WSI-level VQA on 317 held-out samples; bracketed ranges denote confidence intervals:
| Model | Organ Score | Prec (%) | Rec (%) | F1 (%) | Diff. Diag Acc (%) |
|---|---|---|---|---|---|
| MedGemma-4B | 0.48 [0.43–0.53] | 71.4 [65.3–77.4] | 68.8 [62.7–74.9] | 70.1 [64.9–74.9] | 40.1 [34.7–45.4] |
| MedGemma-27B | 0.37 [0.32–0.42] | 85.5 [76.0–93.8] | 24.3 [18.7–30.1] | 37.9 [30.3–45.0] | 44.8 [39.1–50.2] |
| ANTONI-α (base, no fine-tuning) | 0.52 [0.47–0.58] | 60.3 [53.1–67.4] | 50.9 [44.3–57.4] | 55.2 [49.2–60.7] | 48.3 [42.9–53.6] |
| ANTONI-α (2k WSIs) | 0.66 [0.60–0.71] | 68.7 [63.6–73.7] | 99.5 [98.6–100] | 81.3 [77.6–84.7] | 52.7 [47.0–58.4] |
| ANTONI-α (9k WSIs) | 0.91 [0.88–0.94] | 70.9 [65.6–76.0] | 95.0 [91.8–97.7] | 81.2 [77.3–84.7] | 66.3 [60.9–71.3] |
| ANTONI-α (18k WSIs) | 0.91 [0.88–0.94] | 72.9 [67.5–78.2] | 91.3 [87.4–94.8] | 81.1 [77.1–84.7] | 68.5 [63.1–73.5] |
Key Empirical Findings:
- Native WSI feature encoding improves organ identification over the strongest MedGemma baseline (0.48 → 0.91) and differential diagnosis accuracy (44.8% → 68.5%).
- Organ score scales steeply with data: from 0.66 to 0.91 when increasing from 2,000 to 9,000 slides.
- Fine-tuned ANTONI-α surpasses both MedGemma-4B and MedGemma-27B on most metrics.
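The bracketed ranges are consistent with case-level bootstrap resampling, a common choice for slide-level evaluation, though the source does not state the exact procedure. A generic sketch:

```python
import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus (1 - alpha) bootstrap CI over resampled cases.

    y_true, y_pred: numpy arrays indexed per evaluation case.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample cases with replacement
        scores.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return metric_fn(y_true, y_pred), (lo, hi)
```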
6. Representative Model Output
A sample dialogue (paraphrased for brevity) on a dermatofibroma case demonstrates the model's clinical reasoning:
- User: “Which organ/tissue is shown on this slide?”
- ANTONI-α: “This is a section of skin, specifically dermis with a well-circumscribed fibrohistiocytic lesion (dermatofibroma).”
- User: “Is there a neoplasm present?”
- ANTONI-α: “Yes. The lesion is benign but neoplastic, composed of spindle-shaped fibroblasts and histiocytes without significant atypia or mitotic activity.”
- User: “Select most likely diagnosis: 1) Dermatofibroma, 2) Dermatofibrosarcoma protuberans, 3) Hypertrophic scar.”
- ANTONI-α: “1) Dermatofibroma. The lesion’s well-circumscription, absence of storiform pattern, and lack of deep infiltration support a benign fibrohistiocytic neoplasm.”
MedGemma, when presented with a downsampled thumbnail, misjudges lesion boundaries and cellular features, yielding an incorrect diagnosis.
7. Codebase and Resources
All core resources are fully open-source:
- Model code and weights: https://github.com/computationalpathologygroup/ANTONI-Alpha
- Polysome instruction-generation toolkit: https://github.com/computationalpathologygroup/Polysome
- HISTAI-Instruct dataset: https://huggingface.co/datasets/SaltySander/HISTAI-Instruct
These resources facilitate fully reproducible experimentation, from WSI preprocessing through instruction synthesis to model training and evaluation (Moonemans et al., 19 Dec 2025).