
Pan-Cancer Medical Imaging Datasets

Updated 26 April 2026
  • Pan-cancer imaging datasets are curated collections of multi-modal medical images with expert annotations across various cancers, enabling robust AI model development.
  • They employ advanced preprocessing and registration methods, such as z-score normalization and B-spline registration, to ensure comprehensive and reproducible data preparation.
  • These datasets facilitate a range of applications—from segmentation and synthesis to survival prediction—while addressing challenges like class imbalance and diverse organ representation.

Pan-cancer medical imaging datasets are curated collections of radiological or digital pathology images encompassing multiple cancer types, organs, and modalities, typically with expert-verified labels or annotations. They are explicitly designed to support algorithmic development, benchmarking, and clinical translation for AI models that generalize across the malignancy spectrum, rather than being restricted to single-disease or organ-centric cohorts. These datasets address bottlenecks in AI-driven oncology, including class imbalance, organ diversity, annotation scarcity, and the need for robust domain adaptation. Recent pan-cancer datasets span modalities such as computed tomography (CT), positron emission tomography (PET)/CT fusion, whole-slide histopathology imaging (WSI), and multi-modal magnetic resonance imaging (MRI) with paired acquisition and cross-modality synthesis. Dataset design emphasizes comprehensive annotation protocols, harmonized imaging preprocessing, rigorous de-identification, and standards for reproducibility and accessibility.

1. Major Public Pan-Cancer Imaging Datasets

PETWB-REP: Multi-Cancer Whole-Body PET/CT and Reports

PETWB-REP consists of whole-body 18F-FDG PET/CT scans from 490 oncology patients (219 female, 271 male; mean age 60.98 ± 12.77 years) (Xue et al., 5 Nov 2025). The cohort covers 22 malignancy types, led by lung (34.29%), liver (10.00%), and cervical (7.76%) cancers. Each case includes PET and CT images in both DICOM and NIfTI formats, de-identified radiology reports in two languages (Chinese original and English translation), and structured metadata covering age, sex, cancer type, tracer dose, and uptake time. Imaging was acquired on a Siemens Biograph 64 PET/CT scanner with standardized protocols:

  • Radiotracer dose: 3.70–5.55 MBq/kg 18F-FDG
  • Uptake: 60 min
  • CT: 120 kV, 170 mA, 3.0 mm slices
  • PET: OSEM reconstruction, SUV computation ($\text{SUV}_{bw} = \frac{RC\,[\text{kBq/mL}] \times BW\,[\text{kg}]}{ID\,[\text{MBq}]}$)
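The SUV formula in the last bullet can be illustrated with a minimal sketch. It assumes that RC is the decay-corrected radioactivity concentration, BW the body weight, and ID the injected dose; the function name `suv_bw` and its parameter names are hypothetical and do not come from the PETWB-REP release.

```python
def suv_bw(activity_kbq_per_ml: float, body_weight_kg: float,
           injected_dose_mbq: float) -> float:
    """Body-weight SUV in g/mL.

    Assumes the activity concentration is already decay-corrected to the
    injection time (an assumption; the source does not spell this out).
    Units: kBq/mL * kg / MBq reduces to g/mL.
    """
    if injected_dose_mbq <= 0:
        raise ValueError("injected dose must be positive")
    return activity_kbq_per_ml * body_weight_kg / injected_dose_mbq


# Illustrative values only: a 70 kg patient dosed at 4.5 MBq/kg (315 MBq),
# with a voxel measuring 9 kBq/mL.
print(suv_bw(9.0, 70.0, 315.0))  # 2.0
```

The example dose of 4.5 MBq/kg sits inside the 3.70–5.55 MBq/kg range listed above; the activity value is invented for the arithmetic.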

The preprocessing pipeline comprises DICOM de-identification, conversion to NIfTI, CT z-score normalization, PET-to-SUV mapping, B-spline registration, and axial resampling. Reports are sectioned by anatomical region and modality and provide detailed lesion descriptors (size, SUVmax).

PMPBench: Paired Multi-Modal Pan-Cancer Benchmark

PMPBench enables medical image synthesis tasks across 11 organs and provides rigorously aligned paired datasets comprising both CT (non-contrast and contrast-enhanced pairs) and dynamic contrast-enhanced (DCE) MRI with three time-phases (Chen et al., 22 Jan 2026). The collection includes 2,642 subjects, with varying modal contributions per organ, e.g., 1,116 breast (MRI only), 598 kidney, 432 liver, 86 stomach, and so on. All paired volumes undergo rigid, affine, and B-spline deformable registration (Elastix; 1-mm isotropic resampling); imaging intensities are normalized per modality (Hounsfield windowing for CT, z-score for MRI).
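The per-modality intensity normalization described above can be sketched as follows. This is an illustration under assumptions, not the PMPBench code: the window center and width (40/400 HU, a common soft-tissue setting) and the function names `window_ct` and `zscore` are hypothetical, and the source does not state which window values were used.

```python
import numpy as np


def window_ct(hu, center=40.0, width=400.0):
    """Clip CT intensities to a Hounsfield window and rescale to [0, 1].

    center/width are illustrative defaults, not values from the source.
    """
    hu = np.asarray(hu, dtype=np.float32)
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)


def zscore(volume, mask=None, eps=1e-8):
    """Z-score an MRI volume, optionally using statistics from a boolean mask."""
    volume = np.asarray(volume, dtype=np.float32)
    if mask is not None:
        values = volume[np.asarray(mask, dtype=bool)]
    else:
        values = volume
    return (volume - values.mean()) / (values.std() + eps)


ct = window_ct([-1000.0, 40.0, 1000.0])   # air, soft tissue, dense bone
mri = zscore(np.arange(8.0).reshape(2, 2, 2))
```

Computing the z-score statistics inside a foreground mask is a common choice when large background regions would otherwise dominate the mean; whether PMPBench does so is not stated.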

Strict anatomical correspondence supports 1-to-1, N-to-1, and N-to-N image translation benchmarking. The dataset includes detailed partitioning: 70% train, 10% val, 20% test, plus a 5% test-mini subset for fast development cycles.
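A subject-level partition of this kind might be produced as in the sketch below. Two points are assumptions rather than facts from the source: that test-mini is drawn from the test split, and that its 5% is measured against the whole cohort. The function name `split_subjects` and the subject ID format are hypothetical.

```python
import random


def split_subjects(subject_ids, seed=0, train_frac=0.7, val_frac=0.1,
                   mini_frac=0.05):
    """Deterministically partition subject IDs into train/val/test/test_mini.

    Splitting by subject (not by image) keeps paired volumes from one
    patient in a single partition.
    """
    ids = sorted(subject_ids)            # fixed order before shuffling
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(train_frac * n)
    n_val = round(val_frac * n)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]         # remainder, about 20%
    # Assumption: test-mini is a subset of test sized as 5% of the cohort.
    test_mini = test[:min(len(test), round(mini_frac * n))]
    return {"train": train, "val": val, "test": test, "test_mini": test_mini}


splits = split_subjects([f"subj_{i:04d}" for i in range(1000)])
print({name: len(ids) for name, ids in splits.items()})
```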

FLARE 2023: Abdominal CT Organ and Pan-cancer Segmentation

FLARE 2023 is the largest open pan-cancer CT segmentation dataset, totaling 4,650 abdominal studies from over 40 centers across three continents (Ma et al., 2024). Organ (13 classes) and lesion segmentations encompass liver, pancreas, kidneys, spleen, stomach, and additional intra-abdominal sites, with annotated tumors including primary and metastatic disease. All images are provided in canonical NIfTI format, with associated metadata on patient demographics and scan protocol. Training splits include partially annotated (2,200) and fully unlabeled (1,800) cases; tuning and test splits comprise 100 and 400 fully annotated cases, respectively.

PASTA-Gen-30K: Synthetic Pan-Tumor Dataset

The PASTA synthetic dataset features 30,000 3D CT volumes with pixel-level masks for ten major organs and 15 lesion types (10 malignant, 5 benign) (Lei et al., 10 Feb 2025). Each synthetic study is generated by inserting parametrically modeled lesions (diversified in size, shape, density, and relational attributes) into healthy organ CTs, with in silico reports describing eight semantic features. The dataset is entirely synthetic, sidestepping privacy concerns; downloads and model code are open-access.

Pan-Cancer Histopathology WSI Datasets

WSI pretraining and evaluation for pan-cancer learning utilize collections such as TCGA-RCC (kidney, 659 slides), TCGA-NSCLC (lung, 3,064), USTC-EGFR (lung, 531), Endometrium (3,654), TCGA-EGFR (lung, 705), and BRCA-HER2 (breast, 279) (Wu et al., 2024). Slides are split into 256×256 patches at 20× magnification following tissue segmentation.
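The tiling step can be sketched as selecting non-overlapping 256×256 grid positions that contain enough tissue. This assumes a binary tissue mask already computed at the same resolution as the 20× image level; the `min_tissue` threshold and the function name `tile_coordinates` are hypothetical and not taken from the cited work.

```python
import numpy as np


def tile_coordinates(tissue_mask, patch=256, min_tissue=0.5):
    """Return (row, col) top-left corners of patches with enough tissue.

    tissue_mask: 2D boolean array, True where tissue was segmented.
    min_tissue: minimum tissue fraction per patch (illustrative threshold).
    """
    mask = np.asarray(tissue_mask, dtype=bool)
    height, width = mask.shape
    coords = []
    # Step over a non-overlapping grid; partial tiles at the border are skipped.
    for row in range(0, height - patch + 1, patch):
        for col in range(0, width - patch + 1, patch):
            if mask[row:row + patch, col:col + patch].mean() >= min_tissue:
                coords.append((row, col))
    return coords


demo_mask = np.zeros((512, 768), dtype=bool)
demo_mask[:256, :256] = True       # one fully covered tile
demo_mask[256:, 512:] = True       # another fully covered tile
print(tile_coordinates(demo_mask))
```

In practice the coordinates would then be passed to a slide reader to extract pixel data; that step is omitted here because it depends on the slide format and library.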

2. Imaging Modalities, Preprocessing, and Annotation Standards

Pan-cancer datasets encompass modalities including CT, PET/CT, DCE-MRI, and digital pathology WSI.

  • PETWB-REP: Provides both CT and PET in DICOM/NIfTI, with standardized SUV, rigid PET-to-CT registration, and removal of all personal health identifiers via dual-stage review (Xue et al., 5 Nov 2025).
  • PMPBench: Employs multimodal registration (Elastix), phase labeling, and per-organ cropping; a radiologist verifies the alignment of non-contrast/contrast-enhanced and DCE image pairs (Chen et al., 22 Jan 2026).
  • FLARE 2023: Adopts RTOG/Netter’s guidelines for manual organ contouring, deploys MedSAM for lesion proposal, and requires expert validation of test/tuning masks (Ma et al., 2024).
  • PASTA-Gen-30K: All segmentation masks, lesion and organ, are auto-generated during synthetic case creation ensuring pixel-perfect ground truth, with radiology reports matched by protocol (Lei et al., 10 Feb 2025).
  • WSI Collections: Slides are preprocessed via tissue segmentation and patch tiling; embeddings extracted via pre-trained feature encoders (DINO/PLIP).

Annotations in these datasets range from free-text clinical reports (PETWB-REP) and structured radiology attributes (PASTA, FLARE) to pixel-wise segmentation masks (FLARE, PASTA, WSI).

3. Benchmarking Tasks, Metrics, and Representative Baseline Methods

Segmentation and Synthesis

Benchmark tasks span organ segmentation, pan-cancer lesion segmentation, lesion detection, and image-to-image translation (e.g., non-contrast to contrast image synthesis):

  • FLARE 2023: Evaluated on Dice Similarity Coefficient (DSC), 95th-percentile Hausdorff Distance (HD95), and Normalized Surface Distance (NSD). The “aladdin5” cascaded nnU-Net framework achieved a mean DSC of 92.3% (organs) and 64.9% (lesions), with 8.6 s/case inference (Ma et al., 2024).
  • PMPBench: Tasks are defined as 1-to-1, N-to-1, and N-to-N translation, with image quality assessed via MSE, PSNR, SSIM, FID, LPIPS, and KID. Flow-matching architectures (FlowMI) outperformed GAN, diffusion, and transformer baselines across CT→CTC (non-contrast to contrast-enhanced CT) and DCE MRI tasks. Key results: FlowMI reached a PSNR of 24.47 dB and an SSIM of 78.5% for CT→CTC, with the lowest FID and LPIPS across tasks (Chen et al., 22 Jan 2026).
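Two of the metrics named above, DSC for segmentation and PSNR for synthesis, have compact definitions that can be sketched directly. The function names are hypothetical, the convention of returning 1.0 when both masks are empty is an assumption, and the challenge organizers' own evaluation code may differ in such edge cases.

```python
import numpy as np


def dice(pred, target):
    """Dice Similarity Coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to data_range."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)


print(dice([1, 1, 0, 0], [1, 0, 0, 0]))          # 2*1 / (2+1)
print(psnr(np.zeros((4, 4)), np.full((4, 4), 0.1)))
```

PSNR depends on the chosen `data_range`, so reported values are only comparable when the intensity scaling matches.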

Classification, Staging, Survival Prediction

  • PASTA: Supports 46 downstream tasks (segmentation, detection, staging, survival, report generation), reporting metrics such as DSC, accuracy, and AUC. Notably, PASTA improves mean DSC by up to 4.6% in full-data segmentation and 31.2% in few-shot regimes versus next-best models, with accuracy 0.954–0.970 and AUC 0.963–0.984 for plain-CT tumor classification (Lei et al., 10 Feb 2025).
  • PAMA: Slide-level WSI encoders are assessed on ACC, AUC, and F1-score for multi-class and multi-label classification, using standard 60%/10%/30% train/val/test splits (Wu et al., 2024).
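For the classification metrics listed above, a minimal binary sketch of accuracy and AUC is given below. It uses the pairwise-ranking definition of AUC (the probability that a positive case scores higher than a negative one, with ties counted as one half); the function names are hypothetical, and the multi-class and multi-label variants used by the cited works are not reproduced here.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and aligned")
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def roc_auc(labels, scores):
    """Binary ROC AUC via pairwise comparison of positive and negative scores.

    Quadratic in the number of cases; adequate for small evaluation sets.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```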

4. Data Access, Distribution, and Licensing

| Dataset | Access Link / License | Data Types |
|---|---|---|
| PETWB-REP | Zenodo (public) (Xue et al., 5 Nov 2025) | NIfTI/DICOM, CSV reports |
| PMPBench | https://github.com/YifanChen02/PMPBench | NIfTI, paired images |
| FLARE 2023 | https://codalab.lisn.upsaclay.fr/… | NIfTI, segmentation masks |
| PASTA-Gen-30K | https://huggingface.co/datasets/LWHYC… | NIfTI masks, CSV reports |
| PAMA (WSI) | https://github.com/WkEEn/PAMA | WSIs, metadata |

All datasets employ strict de-identification (dual-pass PHI removal in PETWB-REP; synthetic-only in PASTA), IRB or equivalent approval, and non-commercial/research-only licenses (typically CC BY-NC or CC BY-NC-ND). FLARE and PMPBench require web-based registration and acceptance of data use agreements for access.

5. Implications, Limitations, and Research Impact

Pan-cancer datasets drive the development of generalizable AI models by exposing algorithms to multi-organ, multi-cancer, and multi-institutional variation.

Advantages:

  • Enable evaluation of model robustness across cancer types and imaging domains.
  • Facilitate radiomics, deep learning, multi-modal fusion, and semi-supervised learning at scale.
  • Address annotation scarcity via synthetic data (PASTA), harmonizing class coverage and reproducibility.
  • Support emergent tasks (contrast synthesis, report structuring, pan-cancer survival analysis).

Limitations:

  • Some datasets (e.g., PETWB-REP) carry single-institution or single-scanner biases, lack explicit staging and laboratory fields, or provide no manual pixel-level lesion annotation (Xue et al., 5 Nov 2025).
  • Real-world datasets (FLARE 2023) often exhibit annotation sparsity for less common cancers or small lesions, and focus on specific anatomical regions (abdomen), necessitating expansion to thoracic or pelvic oncology (Ma et al., 2024).
  • PMPBench’s licensing restricts commercial applications (CC BY-NC-ND).

A plausible implication is that systematic inclusion of synthetic cases (e.g., PASTA-Gen-30K) may mitigate real-world annotation bottlenecks and enhance few-shot performance in rare tumor types (Lei et al., 10 Feb 2025).

Emergent areas within pan-cancer imaging datasets include the expansion to multi-omics and multi-modal structured reporting (combining imaging, text, molecular data), extension of annotation protocols to instance-level or molecular subtyping, and benchmarking of large vision-language or foundation models under pan-cancer regimes. Future releases are anticipated to increase granularity (instance segmentation, longitudinal follow-up), harmonize cross-institutional data directly, and support the integration of external validation cohorts for model generalizability assessment.

Continued development of rigorous annotation, accessibility protocols, and consensus benchmarks is central to reproducible, clinically translatable AI in oncologic imaging.
