ANTONI-α: Whole-Slide VLM in Pathology

Updated 26 December 2025
  • ANTONI-α is a whole-slide vision-language model designed for visual question answering in pathology, integrating high-resolution tile encoding with multimodal fusion.
  • It leverages a ResNet-style tile encoder and a single-layer cross-attention vision projector to efficiently map gigapixel images into the language model's input space.
  • Trained on synthetic instructions over large-scale H&E data, ANTONI-α achieves superior organ identification and differential diagnosis accuracy compared to previous models.

ANTONI-α is a whole-slide vision-language model (VLM) developed to facilitate visual question answering (VQA) in pathology at the slide level. It natively processes gigapixel-scale hematoxylin and eosin (H&E) stained whole-slide images (WSIs), aligning detailed visual context with clinically oriented multimodal responses. Built from open components and trained entirely on publicly available synthetic instruction data, ANTONI-α establishes a reproducible benchmark for generalizable, clinically relevant VLMs in digital pathology, surpassing prior models such as MedGemma on key diagnostic tasks (Moonemans et al., 19 Dec 2025).

1. Model Architecture

ANTONI-α integrates four principal modules:

  • VIRCHOW (Tile Encoder): A ResNet-style convolutional neural network pretrained on clinical histopathology, providing 1280-dimensional embeddings per tile. Its parameters (~350M) remain frozen during instruction tuning.
  • PRISM/CoCa (Slide Aggregator): Aggregates 1 global and 512 local tokens (all 1280-dimensional) using image–text contrastive and generative captioning objectives. The aggregator (~400M parameters) is likewise frozen during ANTONI-α fine-tuning.
  • Vision Projector: A custom, single-layer cross-attention module with 256 learnable query tokens ($Q \in \mathbb{R}^{256 \times 3072}$) attending to 513 vision latents ($V \in \mathbb{R}^{513 \times 1280}$). This single layer maps visual information into the LLM input space and is the principal trainable component in early-stage tuning (~2.5M parameters).
  • MedGemma-4B-IT (LLM Decoder): A 4B-parameter transformer, instruction-tuned for medical applications, with a hidden size of 3072 and 32 layers. Its native SigLIP image pathway is bypassed; instead, the decoder receives the 256 vision tokens from the projector.

Data flow: WSI → VIRCHOW (tile embeddings) → PRISM (aggregation: 1 global + 512 locals) → Vision projector (cross-attention, 256 tokens) → MedGemma-4B-IT → text output.
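
As an illustration of this data flow, the vision projector can be sketched as a single cross-attention block over frozen slide latents. The PyTorch snippet below is a minimal, hypothetical sketch: module names, head count, and the exact attention parameterization are assumptions, and the parameter count will not match the reported ~2.5M.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Single-layer cross-attention projector (illustrative sketch only).

    Maps 513 frozen slide latents (1 global + 512 local, dim 1280) from
    VIRCHOW/PRISM to 256 tokens in the MedGemma hidden space (dim 3072).
    """
    def __init__(self, n_queries=256, vision_dim=1280, llm_dim=3072, n_heads=8):
        super().__init__()
        # 256 learnable query tokens, Q in R^{256 x 3072}
        self.queries = nn.Parameter(torch.randn(n_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vision_dim, llm_dim)   # lift latents to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, n_heads, batch_first=True)
        self.out = nn.Linear(llm_dim, llm_dim)

    def forward(self, slide_latents):                   # (B, 513, 1280) from PRISM
        kv = self.kv_proj(slide_latents)                # (B, 513, 3072)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                # (B, 256, 3072)
        return self.out(tokens)                         # vision tokens fed to the LLM decoder
```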

Key Design Choices:

  • High-resolution tile embeddings without downsampling artifacts.
  • Single-layer cross-modal alignment for efficient resource use.
  • Two-stage finetuning: (1) projector warmup, (2) projector+LLM joint QLoRA adaptation.

2. Training Objectives and Loss Functions

ANTONI-α employs a standard next-token cross-entropy loss for instruction tuning:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{T} \sum_{t=1}^{T} \log P\left(y_t \mid x_{1:t-1}\right)$$

where $x_{1:t-1}$ denotes the prompt, user, and image tokens preceding step $t$, and $y_t$ is the ground-truth assistant token at step $t$.
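
In practice the loss is applied only to assistant-token positions, with prompt, user, and image tokens masked out. A minimal PyTorch sketch, assuming the usual ignore-index labeling convention of instruction-tuning pipelines:

```python
import torch.nn.functional as F

def next_token_ce(logits, labels, ignore_index=-100):
    """Next-token cross-entropy over assistant tokens only.

    logits: (B, T, vocab) decoder outputs; labels: (B, T) target ids with
    prompt, user, and image positions set to ignore_index so they carry no loss.
    """
    logits = logits[:, :-1, :].contiguous()   # position t predicts token t+1
    labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=ignore_index,
    )
```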

PRISM Pretraining Objectives (for context; frozen in ANTONI-α):

  • Image–text contrastive loss (CLIP-style):

$$\mathcal{L}_{\mathrm{contrastive}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{B} \exp(\mathrm{sim}(v_i, t_j)/\tau)}$$

  • Captioning loss: As above, cross-entropy over text targets.

These objectives establish robust multimodal alignment during pretraining of the slide aggregator; they are not re-optimized during ANTONI-α instruction tuning.
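
For reference, the image–text contrastive term can be written in a few lines. The sketch below assumes in-batch negatives and cosine similarity, as in standard CLIP-style training; it is illustrative, not the PRISM implementation.

```python
import torch
import torch.nn.functional as F

def image_text_contrastive(v, t, tau=0.07):
    """CLIP-style image-to-text contrastive loss matching the formula above.

    v: (B, D) slide embeddings; t: (B, D) text embeddings; tau: temperature.
    """
    v = F.normalize(v, dim=-1)
    t = F.normalize(t, dim=-1)
    logits = v @ t.T / tau                    # (B, B) scaled cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)   # -log softmax over matching pairs
```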

3. Training Data and Synthetic Instruction Generation

The primary training resource is HISTAI-Instruct, derived via a synthetic pipeline over the public HISTAI archive:

  • WSIs after filtering: 24,259 in total, of which 22,530 were curated for H&E tuning.
  • Instruction–Response Pairs: 1,175,524 generated initially, filtered via an LLM-based judge to 1,118,691 high-quality instances.
  • Conversational Categories:
    • Advanced reasoning
    • Clean report
    • Detailed description
    • Differential diagnosis
    • Multi-turn conversation
    • Negative reasoning
    • Short VQA

Polysome Pipeline:

  1. Slide metadata (descriptions, ICD-10 codes, conclusions) is injected into prompt templates.
  2. English instructions and corresponding responses are generated for all seven categories.
  3. Instances are machine-translated into Dutch, French, German, Italian, Polish, and Spanish.
  4. “LLM-as-Judge” filtering is applied for instruction adherence, factuality, and reasoning quality.
  5. Questions are varied to increase factual coverage and robustness.
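
The pipeline can be summarized in pseudocode. Everything below is an illustrative sketch: the callables generate, translate, and judge are hypothetical placeholders for LLM-backed steps, and the quality threshold is not a reported value.

```python
CATEGORIES = [
    "advanced_reasoning", "clean_report", "detailed_description",
    "differential_diagnosis", "multi_turn", "negative_reasoning", "short_vqa",
]
LANGUAGES = ["nl", "fr", "de", "it", "pl", "es"]   # translations beyond English

def build_instructions(slide_metadata, generate, translate, judge, threshold=0.8):
    """Sketch of the per-slide HISTAI-Instruct synthesis loop (hypothetical API).

    slide_metadata: description, ICD-10 codes, and conclusion for one WSI.
    generate/translate/judge: LLM-backed callables supplied by the caller.
    """
    kept = []
    for category in CATEGORIES:
        prompt = f"[{category}] metadata: {slide_metadata}"        # template injection
        for pair in generate(prompt):                              # English pairs
            variants = [pair] + [translate(pair, lang) for lang in LANGUAGES]
            for variant in variants:
                if judge(variant) >= threshold:                    # adherence, factuality, reasoning
                    kept.append(variant)
    return kept
```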

4. Training Regimen and Hyperparameters

Training uses 8 NVIDIA H200 GPUs with FSDP and bfloat16 mixed precision.

  • Optimizer: AdamW, weight decay 0.01.
  • Learning Rate: Cosine schedule with 10% linear warmup.
  • Batch Size: 16 sequences/GPU × 8 GPUs × 4 gradient accumulation steps = 512 effective batch.
  • Two-Stage Protocol:
    • Stage 1 (Vision projector warmup): Only the projector is trained; MedGemma is frozen. Data: multilingual "clean report" samples; 35 epochs; learning rate $3\times10^{-4}$.
    • Stage 2 (Projector + LLM adapters): QLoRA (rank 16) is applied to MedGemma together with the projector. Data: English instructions across all categories; 21 epochs; learning rate $3\times10^{-5}$.
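
A configuration sketch of the two-stage protocol, assuming a Hugging Face PEFT-style setup; the LoRA target modules, alpha, and dropout below are assumptions rather than reported settings, and 4-bit base-model loading (as in full QLoRA) is omitted for brevity.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import get_cosine_schedule_with_warmup

def configure_stage(model, projector, stage, total_steps):
    """Two-stage optimizer/scheduler setup (illustrative, not the released code)."""
    if stage == 1:
        # Stage 1: train only the vision projector; the LLM stays frozen.
        for p in model.parameters():
            p.requires_grad = False
        params, lr = list(projector.parameters()), 3e-4
    else:
        # Stage 2: rank-16 LoRA adapters on the LLM plus the projector.
        lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"],   # assumed targets
                          task_type="CAUSAL_LM")
        model = get_peft_model(model, lora)
        params = [p for p in model.parameters() if p.requires_grad]
        params += list(projector.parameters())
        lr = 3e-5

    optimizer = torch.optim.AdamW(params, lr=lr, weight_decay=0.01)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),   # 10% linear warmup
        num_training_steps=total_steps,
    )
    return model, optimizer, scheduler
```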

Data-Scaling Variants: ANTONI-α models were additionally trained on WSI subsets of 2,000, 9,000, and 18,000 slides to probe data-scaling behavior.

5. Evaluation and Performance Benchmarks

Evaluation covers WSI-level VQA on 317 held-out samples:

| Model | Organ Score | Prec (%) | Rec (%) | F1 (%) | Diff. Diag. Acc (%) |
|---|---|---|---|---|---|
| MedGemma-4B | 0.48 [0.43–0.53] | 71.4 [65.3–77.4] | 68.8 [62.7–74.9] | 70.1 [64.9–74.9] | 40.1 [34.7–45.4] |
| MedGemma-27B | 0.37 [0.32–0.42] | 85.5 [76.0–93.8] | 24.3 [18.7–30.1] | 37.9 [30.3–45.0] | 44.8 [39.1–50.2] |
| ANTONI-α (base, no fine-tuning) | 0.52 [0.47–0.58] | 60.3 [53.1–67.4] | 50.9 [44.3–57.4] | 55.2 [49.2–60.7] | 48.3 [42.9–53.6] |
| ANTONI-α (2k WSIs) | 0.66 [0.60–0.71] | 68.7 [63.6–73.7] | 99.5 [98.6–100] | 81.3 [77.6–84.7] | 52.7 [47.0–58.4] |
| ANTONI-α (9k WSIs) | 0.91 [0.88–0.94] | 70.9 [65.6–76.0] | 95.0 [91.8–97.7] | 81.2 [77.3–84.7] | 66.3 [60.9–71.3] |
| ANTONI-α (18k WSIs) | 0.91 [0.88–0.94] | 72.9 [67.5–78.2] | 91.3 [87.4–94.8] | 81.1 [77.1–84.7] | 68.5 [63.1–73.5] |
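
The bracketed ranges read as per-metric confidence intervals over the 317 evaluation samples. The sketch below shows one common way such intervals are computed, assuming a percentile bootstrap over per-slide scores; the resampling scheme and the 95% level are assumptions, not reported details.

```python
import numpy as np

def bootstrap_ci(per_slide_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean metric over evaluation slides."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_slide_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (float(lo), float(hi))
```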

Key Empirical Findings:

  • Native WSI feature encoding improves organ identification (0.48 for MedGemma-4B → 0.91) and differential diagnosis accuracy (44.8% for MedGemma-27B → 68.5%) over prior models.
  • Organ score scales steeply with data: from 0.66 to 0.91 when increasing from 2,000 to 9,000 slides.
  • Fine-tuned ANTONI-α surpasses both MedGemma-4B and MedGemma-27B on most metrics.

6. Representative Model Output

A sample dialogue (paraphrased for brevity) on a dermatofibroma case demonstrates the model's clinical reasoning:

  • User: “Which organ/tissue is shown on this slide?”
    • ANTONI-α: “This is a section of skin, specifically dermis with a well-circumscribed fibrohistiocytic lesion (dermatofibroma).”
  • User: “Is there a neoplasm present?”
    • ANTONI-α: “Yes. The lesion is benign but neoplastic, composed of spindle-shaped fibroblasts and histiocytes without significant atypia or mitotic activity.”
  • User: “Select most likely diagnosis: 1) Dermatofibroma, 2) Dermatofibrosarcoma protuberans, 3) Hypertrophic scar.”
    • ANTONI-α: “1) Dermatofibroma. The lesion’s sharp circumscription, absence of a storiform pattern, and lack of deep infiltration support a benign fibrohistiocytic neoplasm.”

When presented with a downsampled thumbnail of the same slide, MedGemma confuses lesion boundaries and cellular features, yielding an incorrect diagnosis.

7. Codebase and Resources

All core resources are released as open source.

These resources facilitate fully reproducible experimentation, from WSI preprocessing through instruction synthesis to model training and evaluation (Moonemans et al., 19 Dec 2025).

References

  • Moonemans et al., 19 Dec 2025.
