
HistGen Benchmark in Computational Pathology

Updated 2 December 2025
  • HistGen is a benchmark comprising thousands of WSI–report pairs from TCGA for automated report generation in computational pathology.
  • It standardizes evaluation using metrics like BLEU, METEOR, and ROUGE-L to compare vision-language models in clinical diagnostics.
  • The resource supports downstream tasks such as cancer subtyping and survival prediction, enhancing translational research and model transferability.

HistGen is a large-scale, open benchmark for automated histopathology whole-slide image (WSI) report generation and associated downstream clinical tasks. It provides a rigorously curated, richly annotated resource and standardized evaluation protocol for assessing report generation models that bridge vision and language in computational pathology. The benchmark supports development and comparison of algorithms targeting diagnostic report synthesis directly from gigapixel WSIs, a clinically relevant frontier with direct translational implications for cancer diagnosis, subtyping, and prognosis.

1. Dataset Composition and Construction

HistGen comprises 7,690–7,753 WSI–report pairs, depending on version, drawn from The Cancer Genome Atlas (TCGA) collection (Guo et al., 8 Mar 2024; Liu et al., 21 Jun 2025). Each WSI is linked at the case level to a fully extracted and cleaned clinical diagnostic report and is labeled with one of 30–32 cancer categories spanning major organ systems, including BRCA (breast), KIRC (renal), LUAD (lung), and rare subtypes with as few as 37 slides.

Data Splits

The canonical split consists of 80% training (6,152 slides), 10% validation (769), and 10% test (769), ensuring class-distribution preservation for robust and fair evaluation. The resource also supports downstream clinical tasks with similarly stratified splits.
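
A minimal sketch of such a class-stratified 80/10/10 split using scikit-learn; the slide IDs and labels below are toy stand-ins, not the actual TCGA manifest:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the TCGA slide IDs and their disease-category labels.
slide_ids = [f"TCGA-{i:04d}" for i in range(1000)]
labels = [i % 10 for i in range(1000)]  # pretend 10 disease classes

# 80/10/10 split, stratified on the disease label at each step so the
# class distribution is preserved across train/val/test.
train_ids, rest_ids, train_y, rest_y = train_test_split(
    slide_ids, labels, test_size=0.2, stratify=labels, random_state=0)
val_ids, test_ids, val_y, test_y = train_test_split(
    rest_ids, rest_y, test_size=0.5, stratify=rest_y, random_state=0)

print(len(train_ids), len(val_ids), len(test_ids))  # 800 100 100
```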

Annotation and Preprocessing

  • Source acquisition: Diagnostic reports are extracted from TCGA PDF records. Raw text is denoised and condensed using GPT-based summarization (GPT-4 in Guo et al., 8 Mar 2024).
  • WSI–report matching: Each report is mapped to a unique WSI using case IDs. Cases with ambiguous multiple mappings are excluded.
  • Patch extraction: WSIs at level 0 (uncompressed gigapixel resolution) are divided into N regions of S = 96 patches each; every 512×512-pixel patch is subsequently resized to 224×224 for feature extraction (see the sketch after this list).
  • Visual encoder pretraining: Patch encodings leverage a DINOv2-pretrained ViT-L, itself pretrained on ~60,000 WSIs from 60+ primary sites.
  • Normalization: Mean/variance normalization per patch is applied.
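
A minimal sketch of this tiling-and-normalization pipeline, assuming an in-memory slide array; real pipelines would read tiles lazily (e.g., via OpenSlide) and resize each 512×512 tile to 224×224 (e.g., with cv2 or PIL) before ViT-L feature extraction:

```python
import numpy as np

PATCH, REGION_SIZE = 512, 96  # tile side in pixels, patches per region

def normalize(patch: np.ndarray) -> np.ndarray:
    """Per-patch mean/variance normalization, as described above."""
    patch = patch.astype(np.float32)
    return (patch - patch.mean()) / (patch.std() + 1e-6)

def tile_slide(slide: np.ndarray):
    """Yield normalized 512x512 level-0 tiles in row-major order."""
    h, w = slide.shape[:2]
    for y in range(0, h - PATCH + 1, PATCH):
        for x in range(0, w - PATCH + 1, PATCH):
            yield normalize(slide[y:y + PATCH, x:x + PATCH])

def regions(patches, size=REGION_SIZE):
    """Group consecutive patches into regions of S = 96 for the encoder."""
    buf = []
    for p in patches:
        buf.append(p)
        if len(buf) == size:
            yield np.stack(buf)
            buf = []
    if buf:  # last, possibly partial, region
        yield np.stack(buf)

dummy = np.random.randint(0, 255, (1024, 1024, 3), dtype=np.uint8)
print(sum(1 for _ in regions(tile_slide(dummy))))  # 4 tiles -> 1 partial region
```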

2. Tasks and Evaluation Protocols

Primary Tasks

  1. Whole-Slide Image to Report Generation: Synthesize a free-text clinical diagnostic report given a WSI as input.
  2. Cancer Subtyping: Assign subtype labels on external datasets (UBC-OCEAN, Camelyon, TUPAC16).
  3. Survival Time Prediction: Predict patient survival time on select TCGA cohorts (BRCA, STAD, KIRC, KIRP, LUAD, COADREAD).

Report Generation Metrics

Evaluation focuses on the generated report's first 100 tokens (matching the truncated reference), using:

  • BLEU-n (1 ≤ n ≤ 4): Measures n-gram precision with a brevity penalty (a reference implementation sketch follows this metric list):

BP = \begin{cases} 1 & |C| > |R| \\ e^{(1 - |R|/|C|)} & |C| \leq |R| \end{cases}

p_n = \frac{\sum_{g \in n\text{-grams}} \min(\mathrm{count}_C(g), \mathrm{count}_R(g))}{\sum_{g \in n\text{-grams}} \mathrm{count}_C(g)}

\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)

  • METEOR: Fuses unigram recall, precision, and a fragmentation penalty:

F_{mean} = \frac{(1 - \mathrm{Pen})\,(10\,P\,R)}{R + 9\,P}

  • ROUGE-L: Longest Common Subsequence F-measure:

R_l = \frac{\mathrm{LCS}(C, R)}{|R|}, \quad P_l = \frac{\mathrm{LCS}(C, R)}{|C|}

\mathrm{ROUGE\text{-}L} = \frac{(1+\beta^2)\,P_l\,R_l}{R_l + \beta^2 P_l} \quad (\beta = 1)

  • Exact Entity Match Reward (fact_ENT): Measures clinical entity coverage via BioBERT-based NER (Liu et al., 21 Jun 2025).
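
To make the definitions concrete, a self-contained sketch of BLEU (uniform weights w_n = 1/N) and ROUGE-L exactly as given above; whitespace tokenization and truncation to 100 tokens are illustrative assumptions:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU with brevity penalty and uniform weights, per the formulas above."""
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        if clipped == 0:
            return 0.0  # any zero n-gram precision zeroes the geometric mean
        log_p += math.log(clipped / sum(cand.values())) / max_n
    return bp * math.exp(log_p)

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure (ROUGE-L) with beta = 1, per the formula above."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if candidate[i] == reference[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / n
    return (1 + beta**2) * p * r / (r + beta**2 * p)

# Only the first 100 tokens of each report are scored:
gen = "invasive ductal carcinoma of the breast".split()[:100]
ref = "invasive ductal carcinoma of the left breast".split()[:100]
print(round(bleu(gen, ref), 3), round(rouge_l(gen, ref), 3))
```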

Downstream Metrics

  • Cancer Subtyping: Accuracy, AUC.
  • Survival Prediction: Concordance index (c-index):

\mathrm{c\text{-}index} = \frac{1}{|\{(i,j) : i<j\}|} \sum_{i<j} \mathbb{I}\big(\hat{h}_i < \hat{h}_j \,\wedge\, t_i < t_j\big)
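
A minimal sketch implementing the pairwise count exactly as written above; censoring, which the full survival setting must handle, is omitted:

```python
from itertools import combinations

def c_index(h_hat, t):
    """Fraction of ordered pairs (i < j) whose predicted scores and observed
    survival times increase together, per the formula above."""
    pairs = list(combinations(range(len(t)), 2))
    hits = sum(h_hat[i] < h_hat[j] and t[i] < t[j] for i, j in pairs)
    return hits / len(pairs)

print(c_index([0.2, 0.5, 0.9], [14.0, 30.0, 55.0]))  # 1.0: perfectly concordant
```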

3. Baseline Models and SOTA Approaches

The baseline (R2GenCMN) and initial HistGen model architecture establish performance expectations for report generation on this benchmark (Guo et al., 8 Mar 2024):

Model       BLEU-1   BLEU-4   METEOR   ROUGE-L
R2GenCMN    0.381    0.157    0.164    0.313
HistGen     0.413    0.184    0.182    0.344

HistGen achieves an average absolute improvement of roughly 3 points across all core metrics over the best prior method.

PathGenIC, a multimodal in-context learning framework, sets the new state of the art (Liu et al., 21 Jun 2025), with:

  • BLEU-1: 0.431 (vs. 0.411 baseline)
  • BLEU-4: 0.196 (vs. 0.184 baseline)
  • METEOR: 0.197 (vs. 0.184 baseline)
  • ROUGE-L: 0.357 (vs. 0.344 baseline)
  • fact_ENT: 0.462 (vs. 0.445 baseline)

These gains extend to clinical entity coverage and hold across output lengths and disease classes.

4. Model Architectures Supported

HistGen (Local-Global Hierarchical Encoding, CMC)

  • Local-Global Hierarchical Encoder: WSIs are decomposed into regions and patches. Patch embeddings (via DINOv2 ViT-L) receive positional encodings and are processed by a region-level encoder E_l and then a slide-level encoder E_g. Global context is reintegrated into the region encodings, pooled, and provided as input to the decoder (see the sketch after this list).
  • Cross-Modal Context Module: An external memory C = \{C_1, \ldots, C_m\} \in \mathbb{R}^{m \times d} supports alignment and cross-attention between selected key visual patches and textual memory, informing generation.
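
A minimal PyTorch sketch of the local-global hierarchy; the depth, head count, mean-pooling choices, and concatenation-based fusion are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

D = 1024  # ViT-L patch-embedding width (assumption)

class LocalGlobalEncoder(nn.Module):
    """Region-level encoder E_l, slide-level encoder E_g over pooled region
    tokens, and reintegration of global context before slide-level pooling."""
    def __init__(self, d=D, heads=8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d, heads, batch_first=True)
        self.e_local = nn.TransformerEncoder(layer(), num_layers=2)   # E_l
        self.e_global = nn.TransformerEncoder(layer(), num_layers=2)  # E_g
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, x):                       # x: (regions, S=96, d)
        local = self.e_local(x)                 # per-region patch context
        region_tok = local.mean(dim=1)          # pool patches -> region tokens
        global_ctx = self.e_global(region_tok.unsqueeze(0)).squeeze(0)
        # Broadcast each region's global token back onto its patches.
        fused = self.fuse(torch.cat(
            [local, global_ctx.unsqueeze(1).expand_as(local)], dim=-1))
        return fused.mean(dim=(0, 1))           # slide embedding for the decoder

enc = LocalGlobalEncoder()
print(enc(torch.randn(4, 96, D)).shape)  # torch.Size([1024])
```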

PathGenIC (Multimodal In-Context Learning)

  • Visual Encoder: As above, but with m transformer blocks applied to query tokens and ViT-L patch features, projected to obtain the WSI embedding \mathbf{H}.
  • Prompt Assembly and In-Context Learning: Combinations of retrieved nearest-neighbor tokens/reports, class-guideline distilled texts (via GPT-4o), and feedback strings are provided to the pretrained vision–LLM Quilt-LLaVA.
  • Learning: Only the patch-to-token transformers and a LoRA adapter on the VLM are fine-tuned. All other parameters of the VLM are frozen.
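
A sketch of this parameter budget in plain PyTorch: freeze the whole VLM, then re-enable only adapter weights (assumed to carry "lora" in their parameter names, as in the peft library) and the patch-to-token transformer:

```python
import torch.nn as nn

def set_trainable(vlm: nn.Module, patch_to_token: nn.Module) -> None:
    # Freeze every VLM parameter except the LoRA adapter weights.
    for name, p in vlm.named_parameters():
        p.requires_grad = "lora" in name.lower()
    # The patch-to-token transformer blocks remain fully trainable.
    for p in patch_to_token.parameters():
        p.requires_grad = True

    trainable = sum(p.numel() for p in vlm.parameters() if p.requires_grad)
    total = sum(p.numel() for p in vlm.parameters())
    print(f"trainable VLM fraction: {trainable / max(total, 1):.2%}")
```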

5. Training and Inference Procedures

Data Handling

  • Patching: WSIs are tiled, normalized, and embedded.
  • Tokenization: Reports are tokenized and truncated to 512 tokens.
  • Batching: Training is performed with batch size 8 for 20 epochs using Adam with initial learning rate 1e-4 and cosine decay.
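
A minimal sketch of that optimizer configuration; the model is a placeholder, and the step count assumes ~7,690 training pairs at batch size 8:

```python
import torch

model = torch.nn.Linear(1024, 1024)          # stand-in for the report generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

EPOCHS, STEPS_PER_EPOCH = 20, 962            # ~7,690 train pairs / batch size 8
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS * STEPS_PER_EPOCH)

for step in range(EPOCHS * STEPS_PER_EPOCH):
    x = torch.randn(8, 1024)                 # dummy batch of 8
    loss = model(x).pow(2).mean()            # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                         # cosine decay from 1e-4 toward 0
```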

Inference Protocol

  1. Extract patch features \mathbf{F}_{patch} and encode them into the test-slide embedding \mathbf{H}_{test}.
  2. Retrieve in-context elements (nearest neighbors, class-guidelines, feedback).
  3. Assemble multimodal prompt.
  4. Autoregressively generate \mathbf{Y}_{gen}, evaluated on the first 100 tokens (a sketch of this protocol follows).
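
A sketch of steps 2–4; `retriever`, `guidelines`, `vlm`, and `tokenizer` are hypothetical stand-ins for the nearest-neighbor index, the GPT-4o-distilled class guidelines, Quilt-LLaVA, and its tokenizer, and none of these names come from the released code:

```python
def generate_report(wsi_embedding, retriever, guidelines, vlm, tokenizer,
                    k=3, max_eval_tokens=100):
    # Step 2: retrieve in-context elements for the test slide.
    neighbors = retriever.nearest(wsi_embedding, k=k)  # [(tokens, report), ...]
    guideline = guidelines.lookup(wsi_embedding)       # class-guideline text

    # Step 3: assemble the multimodal prompt (visual tokens + textual context).
    prompt = {
        "visual": [wsi_embedding] + [tok for tok, _ in neighbors],
        "text": [rep for _, rep in neighbors] + [guideline],
    }

    # Step 4: autoregressive generation; only the first 100 tokens are scored.
    report = vlm.generate(prompt)
    return tokenizer.decode(tokenizer.encode(report)[:max_eval_tokens])
```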

6. Experimental Outcomes and Robustness

Main Results

PathGenIC demonstrates state-of-the-art performance with significant metric improvements, particularly in BLEU, METEOR, ROUGE-L, and fact_ENT. Ablation studies indicate optimal performance with K = 3 retrieved neighbors. Each component (nearest-neighbor retrieval, category guidelines, and the feedback mechanism) incrementally enhances report quality, with the best results when all are combined.

Robustness

  • Sequence Length: BLEU-1 increases modestly up to 200 generated tokens; beyond that, metrics plateau or decline gently.
  • Disease Category: BLEU-1/4 metrics vary, with highest performance on well-represented diseases (e.g., low-grade glioma BLEU-1 ≈ 0.50) and lower performance on rare subtypes (<0.40).
  • Error Analysis: High-quality generations rarely miss clinical entities. Errors typically stem from omissions (e.g., missing "periaortic lymph nodes") rather than hallucinations.

7. Benchmark Availability and Extensions

  • Resources: Code, pre-trained models, and dataset scripts are available at https://github.com/dddavid4real/HistGen. Diagnostic report texts can be accessed via the TCGA GDC portal.
  • Extensibility: HistGen supports reproduction of reported results, fine-tuning on novel cohorts, and adaptation to additional pathology domains.
  • Transfer Learning: HistGen models, when fine-tuned for cancer subtyping or survival, yield superior accuracy/AUC and c-index compared to multiple-instance learning baselines, demonstrating transferability.

The HistGen benchmark has established itself as the principal standard for end-to-end WSI report generation in computational pathology, enabling robust comparison of architectures and fostering methodological progress in vision-LLMs for medical AI (Guo et al., 8 Mar 2024, Liu et al., 21 Jun 2025).
