HISTAI-Instruct: WSI Vision-Language Dataset
- HISTAI-Instruct is an open, large-scale multi-turn instruction–response dataset that supports robust WSI vision–language modeling in digital pathology.
- It employs a modular pipeline with LLM-driven data generation, quality filtering, and multilingual diversification to ensure reproducibility and clinical relevance.
- The dataset benchmarks diagnostic reasoning and VQA performance, offering a methodological blueprint for advancing medical AI systems.
HISTAI-Instruct is an open, large-scale, multi-turn instruction–response dataset for whole-slide image (WSI) vision–language modeling in digital pathology. Developed as part of the HISTAI project (Moonemans et al., 19 Dec 2025), HISTAI-Instruct is designed to enable and rigorously assess instruction-following medical AI systems in WSI-level visual question answering, diagnostic reasoning, and report generation. Serving as both a dataset and a methodological blueprint, HISTAI-Instruct is referenced alongside technical advances in automatic dataset generation (Polysome (Moonemans et al., 19 Dec 2025)), spatial grounding in pathology (Quilt-Instruct (Seyfioglu et al., 2023)), and general paradigms for instruction encoding and history-aware reasoning (Guhur et al., 2022, Jeon et al., 23 Jul 2024).
1. Motivation and Scope
HISTAI-Instruct addresses the need for reproducible, clinically relevant, and richly annotated instruction-tuning corpora in computational pathology. Previous vision–language models for medical slides either focused on small crops, lacked end-to-end report generation, or depended on private, non-reproducible clinical data. HISTAI-Instruct is the first large-scale, public multi-turn dataset of its kind, comprising more than 1.1 million instruction–response pairs spanning 24,259 H&E WSIs from the multicentric HISTAI archive (Moonemans et al., 19 Dec 2025). It incorporates raw image evidence, structured textual metadata, and linguistically diverse dialogue acts, facilitating instruction tuning for both generalist and medical LLMs.
HISTAI-Instruct is structured to support seven conversational competencies that reflect the core reasoning, reporting, and interactive skills demanded of digital pathology AI:
| Category | Example Instruction | Example Response |
|---|---|---|
| Short VQA | Which organ is this slide from? | This slide depicts skin tissue. |
| Neoplasm detection | Is there a neoplasm present? [Yes/No] | Yes. Nesting basaloid cells... |
| Differential diagnosis | Most likely diagnosis? {BCC, SCC, melanoma} | Basal cell carcinoma... |
| Clean report | Generate a JSON report with sections... | {'Microscopy':..., 'Diagnosis': ...} |
| Advanced reasoning | Explain features supporting diagnosis | Morphological evidence is... |
| Multi-turn conversations | Maintain context in successive turns | As previously noted, ... |
| Negative reasoning | Reject queries with insufficient evidence | Insufficient features are present. |
The dataset supports multilingual instruction tuning by translating all English pairs into six European languages (Moonemans et al., 19 Dec 2025).
2. Data Generation Pipeline: Polysome
Polysome is the modular pipeline underlying HISTAI-Instruct (Moonemans et al., 19 Dec 2025). Its key modules are:
- Data Ingestion: Ingests WSI-linked structured metadata, including microscopic descriptions, diagnostic conclusions, and patient demographics.
- Prompt-Driven LLM Generation: Templates wrap metadata into user–assistant–system conversational turns. Generation uses a conversational LLM (Gemma-3-27B-IT).
- Quality Filtering (LLM-as-Judge): Each instruction–response pair is scored for groundedness, factuality, and clarity. Sub-threshold outputs are discarded.
- Linguistic Diversification: High-frequency prompts and responses are paraphrased; English outputs are translated into Dutch, French, German, Italian, Polish, and Spanish.
- Final Assembly: The cleaned and diversified corpus is exported as HISTAI-Instruct.
This pipeline yields 1,175,524 high-quality multilingual pairs from an initial pool of 1,188,691. Report-level filters enforce one slide per report and require both a microscopic description and a diagnostic conclusion (Moonemans et al., 19 Dec 2025).
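A minimal, illustrative sketch of these Polysome stages is given below; the chat helper, prompt wording, and 0.8 judge threshold are placeholder assumptions for demonstration, not the released pipeline code.

```python
# Illustrative sketch of the Polysome stages (ingest -> generate -> judge -> keep).
# The chat() helper returns canned text so the sketch runs without model access.
from dataclasses import dataclass, field

@dataclass
class SlideRecord:
    slide_id: str
    microscopy: str            # microscopic description from the report
    conclusion: str            # diagnostic conclusion from the report
    demographics: dict = field(default_factory=dict)

def chat(model: str, system: str, user: str) -> str:
    # Placeholder for a chat-completion call (e.g., to Gemma-3-27B-IT).
    return "0.9" if "Rate" in system else '{"instruction": "...", "response": "..."}'

def generate_pair(rec: SlideRecord, category: str) -> dict:
    # Prompt-driven generation: wrap structured metadata into one
    # instruction-response pair for a given conversational category.
    user_prompt = (
        f"Category: {category}\n"
        f"Microscopy: {rec.microscopy}\nConclusion: {rec.conclusion}\n"
        "Write one grounded instruction-response pair as JSON."
    )
    raw = chat("gemma-3-27b-it", system="You write pathology Q&A.", user=user_prompt)
    return {"slide_id": rec.slide_id, "category": category, "raw": raw}

def judge_score(pair: dict) -> float:
    # LLM-as-Judge: score groundedness, factuality, and clarity in [0, 1].
    verdict = chat("gemma-3-27b-it",
                   system="Rate this Q&A pair from 0 to 1.",
                   user=pair["raw"])
    return float(verdict.strip())

def polysome(records, categories, threshold: float = 0.8) -> list:
    kept = []
    for rec in records:
        # Report-level filters: require microscopy and a diagnostic conclusion.
        if not (rec.microscopy and rec.conclusion):
            continue
        for cat in categories:
            pair = generate_pair(rec, cat)
            if judge_score(pair) >= threshold:  # discard sub-threshold outputs
                kept.append(pair)
    return kept  # downstream: paraphrasing and translation into six languages

if __name__ == "__main__":
    recs = [SlideRecord("HISTAI-0001",
                        "Nests of basaloid cells with peripheral palisading.",
                        "Basal cell carcinoma.")]
    print(len(polysome(recs, ["Short VQA", "Clean report"])))  # -> 2
```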
3. Data Composition and Structure
The base HISTAI archive comprises 112,801 WSIs from 47,000 cases, with 46,128 containing rich metadata (Moonemans et al., 19 Dec 2025). Downstream filtering yields 24,259 unique slides. Distribution among organs (approximate): skin 44%, breast 29%, colon 15%, plus lymph node, lung, and soft tissue.
Each HISTAI-Instruct instance is a three-part dialogue (an illustrative record is sketched after this list):
- System: Static tag (e.g., <slide image>)
- User: Instruction or question, templated per conversational category
- Assistant: Response, which may be concise (e.g., for VQA), multi-sentence (e.g., advanced reasoning), or structured (e.g., JSON report).
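An illustrative instance, under an assumed JSON-style field layout (the released schema may differ), is shown below; the slide ID, field names, and text are hypothetical.

```python
# One hypothetical HISTAI-Instruct record: a system tag, a templated user
# instruction, and a grounded assistant response. Field names are illustrative.
example_instance = {
    "slide_id": "HISTAI-0001",
    "language": "en",
    "category": "Differential diagnosis",
    "messages": [
        {"role": "system", "content": "<slide image>"},
        {"role": "user", "content": "Most likely diagnosis? {BCC, SCC, melanoma}"},
        {"role": "assistant",
         "content": "Basal cell carcinoma. Nests of basaloid cells with "
                    "peripheral palisading support this diagnosis."},
    ],
}
```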
Data are split into training/validation (22,530 cases, H&E only) and a held-out test set (317 cases, 16 organs). Free-text diagnostic rationales are post-processed by Gemini 2.5 Flash into discrete classes for evaluation (Moonemans et al., 19 Dec 2025).
4. Instruction Tuning Regimen and Model Integration
Instruction tuning using HISTAI-Instruct is demonstrated with ANTONI-α, an open VLM for WSI diagnosis and VQA (Moonemans et al., 19 Dec 2025):
- Vision Encoder: Virchow (tile-level, 1280-D) and PRISM CoCa (slide-level, 513 latent tokens).
- Vision–Language Bridge: A one-layer cross-attention projector with 256 learnable queries that maps slide features to the LLM hidden dimension (3072-D for MedGemma-4B-IT); outputs are injected as prepended <image> tokens (a minimal sketch follows this list).
- LLM Backbone: MedGemma-4B-IT.
- Training:
- Stage 1: Freeze LLM; train projector on “Clean report” instructions (35 epochs, LR=3e-4).
- Stage 2: QLoRA on all linear layers, projector + LLM, across all tasks (21 epochs, LR=3e-5).
- Batching: 8 × NVIDIA H200 GPUs, per-GPU batch size 16, gradient accumulation 4 (effective batch size 512).
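The sketch below illustrates such a one-layer cross-attention projector in PyTorch; the use of nn.MultiheadAttention, the head count, and the 1280-D input width are assumptions for demonstration rather than the released ANTONI-α implementation.

```python
# Minimal sketch of the vision-language bridge: 256 learnable queries attend
# over slide-level vision tokens and are emitted at the LLM hidden width.
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 3072,
                 num_queries: int = 256, num_heads: int = 8):
        super().__init__()
        # 256 learnable query vectors, defined directly in the LLM hidden dim.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Map slide-level vision features into the same width for attention.
        self.kv_proj = nn.Linear(vision_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim=llm_dim,
                                                num_heads=num_heads,
                                                batch_first=True)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, n_tokens, vision_dim), e.g. slide-level latents.
        kv = self.kv_proj(vision_tokens)
        q = self.queries.unsqueeze(0).expand(vision_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)
        return out  # (batch, 256, llm_dim), prepended to the LLM as <image> tokens

proj = CrossAttentionProjector()
slide_tokens = torch.randn(2, 513, 1280)   # 513 slide-level latent tokens
print(proj(slide_tokens).shape)            # torch.Size([2, 256, 3072])
```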
Losses use cross-entropy computed only over assistant tokens, in the standard masked instruction-tuning form:
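$$
\mathcal{L} = -\sum_{t \in \mathcal{A}} \log p_\theta\left(y_t \mid y_{<t},\, \mathbf{v},\, x\right),
$$

where $\mathcal{A}$ is the set of assistant-token positions, $\mathbf{v}$ the sequence of projected <image> tokens, $x$ the system and user tokens, and $\theta$ the trainable parameters (the projector in Stage 1; the projector plus QLoRA adapters in Stage 2). System and user positions are masked out of the loss.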
5. Quantitative Evaluation and Benchmarking
WSI-level VQA performance is evaluated on 317 held-out test cases (951 instruction–response pairs); precision, recall, and F1 refer to binary neoplasm detection, and Diff. Acc to differential-diagnosis accuracy:
| Model | Organ Score | Prec (%) | Rec (%) | F1 (%) | Diff. Acc (%) |
|---|---|---|---|---|---|
| Random | – | 68.77 | 50.00 | 57.90 | 26.89 |
| MedGemma-4B | 0.48 | 71.43 | 68.81 | 70.09 | 40.06 |
| MedGemma-27B | 0.37 | 85.48 | 24.31 | 37.86 | 44.79 |
| ANTONI-α (Base) | 0.52 | 60.33 | 50.92 | 55.22 | 48.26 |
| ANTONI-α (2k tuning cases) | 0.66 | 68.67 | 99.54 | 81.27 | 52.68 |
| ANTONI-α (9k) | 0.91 | 70.89 | 94.95 | 81.18 | 66.25 |
| ANTONI-α (18k) | 0.91 | 72.89 | 91.28 | 81.06 | 68.45 |
Scaling ablation reveals marked improvement in organ identification (0.66→0.91) and differential-diagnosis accuracy (~53%→~68%) as the tuning set grows (Moonemans et al., 19 Dec 2025). The fully tuned ANTONI-α surpasses both MedGemma baselines on organ identification, recall, F1, and differential-diagnosis accuracy, with MedGemma-27B retaining only the highest precision.
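As a point of reference, the sketch below shows how these metric types can be computed once free-text outputs have been mapped to discrete labels (the step the paper delegates to Gemini 2.5 Flash); the toy labels and scikit-learn calls are illustrative, not the paper's evaluation code.

```python
# Illustrative computation of the reported metric types from discretized
# predictions: binary neoplasm precision/recall/F1 and multi-class
# differential-diagnosis accuracy. Labels below are toy examples.
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

# Binary neoplasm detection (1 = neoplasm present).
neoplasm_true = [1, 1, 0, 1, 0, 1]
neoplasm_pred = [1, 1, 0, 0, 0, 1]
prec, rec, f1, _ = precision_recall_fscore_support(
    neoplasm_true, neoplasm_pred, average="binary")

# Differential diagnosis over a closed label set per case.
dx_true = ["BCC", "SCC", "melanoma", "BCC"]
dx_pred = ["BCC", "BCC", "melanoma", "BCC"]
diff_acc = accuracy_score(dx_true, dx_pred)

print(f"Neoplasm P/R/F1: {prec:.2%} / {rec:.2%} / {f1:.2%}")
print(f"Differential diagnosis accuracy: {diff_acc:.2%}")
```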
6. Methodological Context and Best Practices
HISTAI-Instruct’s approach integrates several best practices and methodological recommendations from the broader instruction-tuning and multi-modal learning literature:
- Automatic Quality Control: LLM-as-Judge filtering and paraphrastic diversification (Moonemans et al., 19 Dec 2025).
- Multi-field Prompt Templates: Systematic coverage of diagnostic competencies, as in Quilt-Instruct (Seyfioglu et al., 2023).
- Categorization and Consensus Filtering: Ensemble-Instruct highlights the gains from task-type stratification and multi-LM consensus ensembling to maximize data utility for instruction-tuned LMs (Lee et al., 2023).
- Spatio-Contextual Reasoning: Prior art emphasizes spatial grounding via cursor-location and multi-patch chains in histopathology (Seyfioglu et al., 2023).
Generalization principles advocated across this literature include explicit task-type splits, concise prompt templates, sample efficiency (quality over quantity), and multilingual diversification. These design principles support extending HISTAI-Instruct to other medical imaging modalities and clinical QA domains.
7. Research Implications and Prospects
HISTAI-Instruct demonstrates that scalable instruction-tuning datasets, generated via modular pipelines and quality-controlled by LLMs, enable open models to achieve high accuracy and robustness in complex medical VQA and diagnostic tasks. It establishes a public benchmark for whole-slide instruction following and provides a path for domain adaptation in both language models and vision–language models. By incorporating both short-form (VQA) and long-form (report, multi-turn) dialogue, the dataset supports comprehensive evaluation of clinical reasoning, spatial awareness, and context retention.
Potential future directions include integrating spatio-temporal cursor/annotation data directly into prompts, extending instruction-tuning to multi-modal continuous-reporting regimes, and developing more nuanced, uncertainty-aware evaluation metrics for real-world clinical deployment (Moonemans et al., 19 Dec 2025, Seyfioglu et al., 2023).