JHH Dataset: Pancreatic Tumor CT Scan Annotations
- JHH dataset is a curated collection of 3,000 pancreatic tumor CT scans with expert per-voxel annotations validated against pathology reports.
- The dataset reveals that segmentation performance, measured by DSC, saturates after 1,500 real scans, indicating diminishing returns for added annotations.
- Synthetic tumor augmentation using DiffTumor significantly reduces the need for real annotations by achieving similar segmentation performance with fewer scans.
The JHH dataset refers to a collection of 3,000 expertly annotated pancreatic tumor CT scans, established as a proprietary resource for developing and benchmarking AI models for medical image segmentation. Derived over five years with per-voxel expert curation and validation against pathology reports, the dataset plays a foundational role in understanding data scaling laws in tumor segmentation and underpins the construction and strategy of the larger, multi-organ AbdomenAtlas 2.0 dataset (Chen et al., 16 Oct 2025).
1. Dataset Composition and Annotation Protocol
The JHH dataset consists of 3,000 high-resolution computed tomography (CT) scans, each annotated on a per-voxel basis for the presence and extent of pancreatic tumors. All annotations were performed by expert radiologists, with subsequent validation against pathology reports to ensure alignment with clinical findings. This process ensures that tumor boundaries closely reflect pathological ground truth, making the dataset a high-fidelity resource for supervised learning.
The dataset exclusively targets pancreatic tumors. Each scan includes precise binary or graded voxel masks indicating tumor presence, suitable for downstream use in pixel- or voxel-wise segmentation tasks.
2. Empirical Findings on Data Scaling and Model Performance
A key insight obtained from controlled experiments on the JHH dataset is the identification of rapid saturation in segmentation performance as measured by the Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD). Specifically, when training state-of-the-art segmentation networks, mean performance on in-distribution test sets (scans with characteristics similar to the training data) improves rapidly as the number of annotated training scans increases, but plateaus at approximately 1,500 real pancreatic tumor scans.
This plateau suggests that, beyond a critical dataset size, further annotation yields diminishing improvements in model generalizability and accuracy for within-distribution samples. The following table summarizes the scaling behavior:
| Real Training Scans | In-distribution DSC Gain |
|---|---|
| 500 | Rapid increase |
| 1,500 | Performance saturates |
| 3,000 | Marginal improvement |
3. Synthetic Tumor Augmentation Strategy
To address the annotation bottleneck and explore alternatives to labor-intensive expert labeling, the paper introduces synthetic tumor augmentation using a generative system called DiffTumor. Synthetic tumors are algorithmically generated and inserted into normal scans, producing corresponding segmentation masks. The generation process targets a controlled lesion size spectrum (4:2:1 ratio of small, medium, and large lesions), ensuring diversity across the augmented dataset.
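The 4:2:1 small/medium/large ratio amounts to weighted sampling over size classes. A sketch of one way to realize it in Python (the diameter bins in millimetres are illustrative assumptions, not values from the paper, and this is not the DiffTumor implementation):

```python
import random

# Hypothetical lesion-diameter bins (mm); placeholders, not the paper's values.
SIZE_BINS = {"small": (4, 10), "medium": (10, 20), "large": (20, 40)}
WEIGHTS = {"small": 4, "medium": 2, "large": 1}   # the 4:2:1 target ratio

def sample_lesion_sizes(n: int, rng: random.Random) -> list:
    """Draw n lesion diameters whose size classes follow the 4:2:1 ratio."""
    classes = rng.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=n)
    return [rng.uniform(*SIZE_BINS[c]) for c in classes]

rng = random.Random(0)
sizes = sample_lesion_sizes(7000, rng)
small = sum(s < 10 for s in sizes)    # expect ~4,000
large = sum(s >= 20 for s in sizes)   # expect ~1,000
print(small, large)
```

Each sampled diameter would then drive lesion synthesis and insertion into a normal scan, with the segmentation mask generated alongside.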
The experimental results show that incorporating synthetic tumors into the training set leads to a significant reduction in the required amount of real annotated data for equivalent segmentation performance. Specifically, a model trained with both real and synthetic data achieves in-distribution DSC comparable to a real-data-only model trained on 1,500 scans by using only 500 real scans. This finding demonstrates that synthetic data can dramatically steepen the data scaling curve:
| Training Paradigm | Real Scans Needed to Reach Plateau |
|---|---|
| Real data only | 1,500 |
| Real + synthetic augmentation | 500 |
Synthetic augmentation introduces diversity in lesion appearance, location, and size; because segmentation masks are generated alongside each synthetic lesion, the training set grows without any loss of annotation quality.
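The data-efficiency result above amounts to mixing a small real cohort with synthetically augmented control scans into one training pool. A schematic sketch (all identifiers and the `add_synthetic` callable are hypothetical; this is not the authors' pipeline):

```python
import random

def build_training_set(real_scans, control_scans, add_synthetic, rng):
    """Combine real annotated scans with controls carrying synthetic tumors.

    real_scans    : list of (ct_volume_id, expert_mask_id) pairs
    control_scans : list of tumor-free ct_volume_ids
    add_synthetic : callable pasting a synthetic lesion into a control scan,
                    returning (augmented_volume_id, generated_mask_id)
    """
    synthetic = [add_synthetic(c) for c in control_scans]
    training = list(real_scans) + synthetic
    rng.shuffle(training)  # interleave real and synthetic samples
    return training

# Toy usage: 500 real scans plus 1,000 synthetically augmented controls.
rng = random.Random(42)
real = [(f"real_{i}", f"mask_{i}") for i in range(500)]
controls = [f"ctrl_{i}" for i in range(1000)]
train = build_training_set(real, controls,
                           lambda c: (c + "_syn", c + "_synmask"), rng)
print(len(train))  # → 1500
```

The sketch only illustrates the mixing step; lesion synthesis itself is the role of the generative model (DiffTumor in the paper).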
4. Influence on Multiorgan AbdomenAtlas 2.0 Dataset Construction
The lessons from the JHH dataset have informed the methodology and rationale for the development of the AbdomenAtlas 2.0 dataset, which encompasses six abdominal organs (pancreas, liver, kidney, colon, esophagus, uterus) and includes over 10,000 CT scans with 15,130 voxel-wise annotated tumor instances and nearly 6,000 control scans. Annotation was conducted by 23 expert radiologists. Training on AbdomenAtlas 2.0 has yielded measurable performance gains in both in-distribution (+7% DSC) and out-of-distribution (+16% DSC) settings, compared to existing public resources.
The JHH experience establishes empirical scaling laws in real medical data tasks and demonstrates the essential role of data diversity—particularly for generalization across different institutions and acquisition protocols.
5. Implications for AI-Based Medical Image Analysis
The plateau of segmentation performance with increasing real data highlights the limited returns of exhaustive manual annotation for in-distribution generalization. However, for out-of-distribution scenarios (e.g., scans from different hospitals or imaging protocols), incremental data volume and diversity continue to yield improvements.
The use of synthetic data, validated via the JHH cohort, suggests that well-designed synthetic augmentation can substitute for a significant portion of expensive real annotation in training robust models. This strategy points toward a paradigm shift in dataset construction for medical imaging AI, promising efficiency without sacrificing clinical utility.
Further, the adoption of semi-automatic annotation and expert review pipelines (such as the SMART-Annotator approach, which combines AI pre-labelling with rapid human revision) reduces annotation time per scan from minutes to seconds. This could be critical for expanding datasets to new organs, modalities, or rare phenotypes.
6. Future Directions and Ongoing Expansion
Future efforts are oriented toward further increasing dataset diversity by expanding both JHH and AbdomenAtlas 2.0 to encompass scans from multiple centers and additional imaging protocols. This expansion is expected to improve the generalizability of models, especially in heterogeneous and rare cases.
The demonstrated efficiency of synthetic augmentation also provides a compelling case for broader adoption of generative data strategies in medical image analysis, with a focus on benchmarking the impact of real versus synthetic data across tumor types, modalities, and clinical endpoints.
A plausible implication is that a hybrid annotation paradigm—combining small, expertly curated core datasets with large volumes of procedurally generated data—may become the dominant approach for future large-scale medical imaging challenges.