Pan-Cancer Imaging Dataset
- Pan-cancer imaging datasets are large-scale, multi-institutional collections spanning diverse cancers with comprehensive organ and lesion annotations.
- They integrate multiple imaging modalities and standardized preprocessing to support AI benchmarking in diagnosis, segmentation, synthesis, and prognosis tasks.
- Detailed metadata and reproducible code pipelines facilitate robust comparative studies and generalizable model development across global populations.
Pan-cancer medical imaging datasets are large-scale, multi-institutional collections of digital scans, often annotated, that span diverse human cancers; the most advanced resources additionally integrate multi-modal data for benchmarking AI approaches in diagnosis, segmentation, prognosis, and synthesis tasks. These datasets provide organ-level and lesion-level annotations, cross-organ imaging coverage, standardized preprocessing workflows, and explicit metadata, supporting robust comparative studies and generalizable model development across global patient populations.
1. Dataset Composition and Cancer-Type Coverage
Pan-cancer imaging datasets are characterized by their scale, cancer diversity, and institutional breadth. The FLARE 2023 Challenge’s abdominal CT dataset, for instance, consists of 4,650 CT scans from over 40 centers, labeled for both abdominal organ segmentation (13 target regions) and “pan-cancer” lesions, meaning all solid tumors and metastases are annotated as a single class without per-type stratification (Ma et al., 2024). This design facilitates unbiased lesion segmentation and algorithmic generalization across heterogeneous cancer morphologies. Testing data included unseen institutions from Asia to enforce cross-center and cross-regional robustness.
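The single-class "pan-cancer" labeling scheme can be reproduced when starting from multi-type tumor annotations. A minimal sketch, assuming a NumPy label map in which specific integer IDs (here illustrative, not FLARE's actual codes) mark lesion types:

```python
import numpy as np

def collapse_to_pan_cancer_mask(label_map, lesion_ids):
    """Collapse per-type lesion labels into one binary 'pan-cancer' class,
    mirroring the FLARE 2023 single-class lesion convention.
    `lesion_ids` enumerates the label values that denote tumors (IDs are
    illustrative assumptions, not the dataset's documented codes)."""
    return np.isin(label_map, list(lesion_ids)).astype(np.uint8)

# toy label map: 0 = background, 1-13 = organs, 14/15 = two tumor types
labels = np.array([[0, 1, 14],
                   [15, 2, 0]])
mask = collapse_to_pan_cancer_mask(labels, lesion_ids=(14, 15))
print(mask)  # lesion voxels become 1, organs and background become 0
```

Merging all tumor types into one foreground class is what allows a single segmentation model to generalize across heterogeneous cancer morphologies without per-histology supervision.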
PETWB-REP extends this paradigm to multimodal functional imaging, aggregating 490 whole-body FDG PET/CT scans with clinical and report metadata. The cohort spans five major cancers—lung (34.29%), liver (10.00%), breast (3.47%), prostate (2.45%), ovarian (4.08%)—plus several other types with patient-level demographic, scan, and treatment parameters (Xue et al., 5 Nov 2025).
Other exemplars include the 5,720-patient, 14-cancer histopathology-omics repository from TCGA (Chen et al., 2021) and the 11-organ, fully paired CT/MR PMPBench resource for cross-modality synthesis (Chen et al., 22 Jan 2026).
2. Imaging Modalities, Acquisition, and Preprocessing
Modalities and acquisition protocols vary by dataset scope:
- Abdomen CT (FLARE 2023): Coverage spans all standard contrast phases—unenhanced, arterial, portal-venous, delayed. Scanner vendors include GE, Philips, Siemens, Toshiba. Heterogeneous slice thickness and in-plane resolution are preserved to maximize real-world diversity.
- Whole-body FDG PET/CT (PETWB-REP): PET acquisitions employ ¹⁸F-FDG with 3.70–5.55 MBq/kg dosing, a standardized 60-minute uptake period, Siemens Biograph 64 PET/CT scanners, and multiple bed positions. PET reconstructions use OSEM with Gaussian smoothing; the CT used for attenuation correction employs 120 kV, 170 mA, and 3 mm slices. PET and CT volumes are co-registered via B-spline methods; PET is converted to SUVbw, and CT is z-score normalized (Xue et al., 5 Nov 2025).
- Multi-phase CT/MRI (PMPBench): Each organ group comprises matched non-contrast and contrast-enhanced CT (CTC), or, for breast, dynamic contrast-enhanced MRI (DCE1–DCE3). Volumes are registered using a multi-stage Elastix-based pipeline to achieve sub-voxel alignment and harmonized to isotropic 1×1×1 mm³ voxels. CT intensities are min-max normalized after windowing to [–200, 300] HU; MR DCE is z-score normalized (Chen et al., 22 Jan 2026).
- Histology-Omics (PORPOISE): WSIs are formalin-fixed, paraffin-embedded, H&E-stained, and digitized as pyramidal TIFFs. Tiles of 256×256 pixels are extracted at 20× magnification. RNA-seq, mutation, and copy-number features are processed with gene selection based on alteration frequency and variability (Chen et al., 2021).
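The per-modality normalization steps above (CT windowing to [–200, 300] HU followed by min-max scaling, body-weight SUV conversion for PET, z-score normalization for MR/CT) can be sketched as follows. Function names are assumptions for illustration; the SUVbw formula additionally assumes tissue density of 1 g/mL so that Bq/mL ≈ Bq/g:

```python
import numpy as np

def window_minmax_ct(hu, lo=-200.0, hi=300.0):
    """Clip CT intensities to a HU window, then min-max scale to [0, 1]
    (PMPBench-style CT preprocessing)."""
    hu = np.clip(hu, lo, hi)
    return (hu - lo) / (hi - lo)

def suv_bw(activity_bq_ml, injected_dose_bq, body_weight_g):
    """Body-weight SUV: tissue activity concentration divided by
    injected dose per gram of body weight (assumes 1 g/mL density)."""
    return activity_bq_ml / (injected_dose_bq / body_weight_g)

def zscore(vol, eps=1e-8):
    """Z-score normalization, as applied to MR DCE and PETWB-REP CT."""
    return (vol - vol.mean()) / (vol.std() + eps)
```

For example, `window_minmax_ct` maps –500 HU to 0.0, 50 HU to 0.5, and 400 HU to 1.0, discarding contrast outside the abdominal soft-tissue window before scaling.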
File formats are standardized to NIfTI (.nii/.nii.gz) for scan volumes, with JSON or CSV metadata and label maps when annotated.
3. Annotation Protocols and Metadata
Annotation strategies differ according to modality and research purpose:
- FLARE 2023: Organs are delineated per RTOG consensus guidelines and Netter’s atlas. Lesions in tuning/testing sets are manually traced by senior radiologists using ITK-SNAP and MedSAM. Thirteen segmentation targets are included: liver, spleen, pancreas, stomach, gallbladder, esophagus, duodenum, aorta, inferior vena cava, and the left and right kidneys and adrenal glands (each side counted separately). Lesion annotation is a single binary mask for all abdominopelvic solid tumors, without stratification by histology (Ma et al., 2024).
- PETWB-REP: No manual segmentations or ROI masks are provided. Radiology reports, in both original Chinese and expert-validated English, include region-wise findings and impressions. The meta_data.csv supplies demographic and scan-level attributes such as age, sex, cancer type (ICD-style), body weight, injected FDG dose, uptake time, and scan timestamp (Xue et al., 5 Nov 2025).
- PMPBench: Inclusion requires complete CE/NCE images, verified phase labeling, and robust spatial correspondence. Reviews by radiologists and trained annotators verify anatomical pairing. No lesion ROI masks are given; the focus is on modality-paired whole-organ volumes (Chen et al., 22 Jan 2026).
- PORPOISE: Morphological attribution uses attention-based Multiple Instance Learning to score WSI patches, while molecular features are processed and interpreted via Integrated Gradients. Top markers are identified per disease and patient, and cell-type distributions in high-attention regions are quantified with HoverNet (Chen et al., 2021).
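Scan-level metadata of the PETWB-REP kind is convenient to analyze with pandas. The column names below are assumptions for illustration, not the documented `meta_data.csv` schema; the sketch derives the per-kilogram dose to check scans against the stated 3.70–5.55 MBq/kg protocol range:

```python
import pandas as pd

# toy stand-in for meta_data.csv (column names are assumptions)
meta = pd.DataFrame({
    "sub_ID": ["sub_001", "sub_002"],
    "cancer_type": ["lung", "liver"],
    "injected_dose_MBq": [259.0, 296.0],
    "body_weight_kg": [70.0, 80.0],
})

# derived dose per kilogram; protocol-conformant scans fall in 3.70-5.55
meta["dose_per_kg"] = meta["injected_dose_MBq"] / meta["body_weight_kg"]
in_protocol = meta["dose_per_kg"].between(3.70, 5.55)
print(meta[["sub_ID", "dose_per_kg"]], in_protocol.all())
```

Deriving such sanity checks from metadata is often the first step before radiomics or report-NLP experiments, since no ROI masks ship with the dataset.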
4. Dataset Organization, Access, and Licensing
Datasets are organized for reproducible research:
| Dataset | Data Structure | Access & License |
|---|---|---|
| FLARE 2023 | Per-case folders: image.nii.gz, label.nii.gz, metadata.json | CodaLab; research-use only; original licenses |
| PETWB-REP | Imaging_data/sub_ID/CT/PET; Non-imaging CSVs | Zenodo; public DOI |
| PMPBench | Modality/Organ/PatientID; NIfTI headers carry metadata | GitHub; CC BY-NC-ND 4.0; controlled by source agreements |
| PORPOISE | WSI pyramidal TIFF, omics tables, attention maps | Web interface & GitHub; TCGA open-use |
Licensing is typically restricted to research use; commercial redistribution is precluded for FLARE 2023 and PMPBench, whereas PETWB-REP is available via Zenodo’s open repository, and PORPOISE follows TCGA/GDC terms.
5. Benchmarking Tasks, Metrics, and Baselines
Evaluation methodologies are tailored to imaging tasks:
- Segmentation (FLARE 2023): Employs the Dice Similarity Coefficient (mean organ DSC = 92.3% ± 3.3; mean lesion DSC = 64.9% ± 27.4), Normalized Surface Dice (1 mm tolerance), instance metrics (Precision, Recall, F1), and Panoptic Quality (PQ) for lesion ensembles. Efficiency is assessed via per-case runtime and the area under the GPU-memory-versus-time curve (Ma et al., 2024).
- Radiomics (PETWB-REP): The suggested pipeline computes first-order voxel-intensity statistics (mean, variance, skewness), gray-level co-occurrence matrix (GLCM) texture features (contrast, energy, homogeneity), and shape descriptors once lesion masks become available (Xue et al., 5 Nov 2025).
- Synthesis (PMPBench): Defines 1→1, N→1, and 1→N translation settings, benchmarked by mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), LPIPS, FID, and KID. Baseline models include UNet-based HiNet, CycleGAN, and FlowMI, with FlowMI achieving superior PSNR and SSIM across settings (Chen et al., 22 Jan 2026).
- Prognosis (PORPOISE): Multimodal risk stratification evaluated by 5-fold cross-validation performance; model explanations validated by cross-referencing morphologic attention regions (e.g., tumor-infiltrating lymphocyte presence associating with favorable prognosis in 9/14 cancers) (Chen et al., 2021).
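Two of the headline metrics above are simple to state exactly: Dice for binary masks is 2|A∩B| / (|A| + |B|), and PSNR compares a synthesized volume against a reference via mean squared error. A minimal NumPy sketch (function names are assumptions; challenge leaderboards use their own official implementations):

```python
import numpy as np

def dice(pred, gt, eps=1e-8):
    """Dice Similarity Coefficient for binary masks: 2|A∩B| / (|A|+|B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def psnr(ref, syn, data_range=1.0):
    """Peak signal-to-noise ratio over the stated intensity range,
    as used for cross-modality synthesis benchmarking."""
    mse = np.mean((ref - syn) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)
```

For example, a prediction covering one of two foreground voxels, with one false positive, scores Dice = 2/3; the wide lesion-DSC standard deviation reported for FLARE 2023 (± 27.4) reflects how sharply this metric penalizes small or missed tumors.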
6. Codebases, Reproducibility, and Supplementary Data
Extensive code and containerized pipelines accompany pan-cancer datasets, enabling direct replication and baseline comparison. FLARE 2023 provides the top teams’ method repositories, including nnU-Net variants, transformer cascades, mean-teacher, and context-aware CutMix baselines (Python 3, PyTorch, Monai, Docker) (Ma et al., 2024). PETWB-REP authors recommend UNet for segmentation and transformer encoders for NLP on radiology reports (Xue et al., 5 Nov 2025). PMPBench supplies synthesis code (FlowMI, CycleGAN, HiNet) and curated data splits (Chen et al., 22 Jan 2026). PORPOISE centralizes end-to-end analysis pipelines (CLAM, Pathomic Fusion, HoverNet) and offers interactive visualization (Chen et al., 2021).
7. Applications, Limitations, and Future Directions
Pan-cancer imaging datasets underpin advancements in organ and lesion segmentation, multi-modal fusion, lesion conspicuity enhancement, synthetic contrast generation, prognostic modeling, and multi-source harmonization. However, limitations persist:
- Absence of per-tumor-type lesion annotation in CT datasets prevents histologic stratification.
- Lack of manual lesion masks in PETWB-REP and PMPBench restricts evaluation of direct lesion-localization tasks; motion and protocol heterogeneity introduce additional biases.
- Metadata incompleteness (e.g., patient demographics or slice thickness distributions) limits some statistical analyses.
- Expansion to further modalities (e.g., PET in PMPBench, multi-parametric MRI) and the incorporation of comprehensive lesion annotations and inter-annotator agreement statistics are identified as priorities for future releases.
Ongoing efforts are focused on integrating additional imaging modalities, enlarging cohorts, and enhancing annotation granularity to broaden clinical and algorithmic relevance. These resources collectively establish benchmarks for multi-organ, multi-modality AI research in oncologic imaging, facilitating robust cross-study comparisons and fostering innovation in biomedical image analysis.