AutoPET III FDG PET/CT Dataset
- The AutoPET III FDG Dataset is a large-scale, expertly annotated collection of whole-body FDG-PET/CT studies designed for benchmarking tumor segmentation algorithms.
- It comprises 1,014 co-registered studies spanning several clinical cohorts and is typically paired with standardized preprocessing and augmentation protocols to support robust deep learning models.
- Baseline results across varied architectures show substantial performance differences, making the dataset a central benchmark for advancing oncologic PET/CT imaging research.
The AutoPET III FDG Dataset is a large-scale, expert-annotated cohort of whole-body [18F]fluorodeoxyglucose positron emission tomography and computed tomography (FDG-PET/CT) studies, released to facilitate the development, validation, and benchmarking of deep learning algorithms for tumor lesion segmentation in oncologic PET/CT imaging. It serves as a standardized resource within the AutoPET III Grand Challenge, focusing on multi-tracer, multi-center generalization across variable clinical and acquisition environments. The dataset is widely referenced for both algorithmic innovation and comparative evaluation within the PET/CT computational imaging community (Chutani et al., 2024, Kalisch et al., 2024, Rokuss et al., 2024, Alloula et al., 2023, Wang et al., 2024, Guha et al., 6 Jan 2026, Ahamed, 2024, Liu et al., 2024, Heiliger et al., 2022, Hadlich et al., 2023, Ahamed et al., 2023).
1. Dataset Composition and Clinical Spectrum
The AutoPET III FDG subset comprises 1,014 co-registered whole-body FDG-PET/CT studies from 900 unique patients, primarily collected at University Hospital Tübingen (UKT) along with additional cases from LMU Munich (Alloula et al., 2023, Wang et al., 2024, Heiliger et al., 2022). The cohort is stratified into four principal clinical groups: malignant melanoma, lymphoma, lung cancer, and negative controls (no suspicious uptake), with approximately half the population harboring histologically confirmed malignancy and the remainder serving as negative or healthy controls (Kalisch et al., 2024, Wang et al., 2024, Ahamed, 2024, Ahamed et al., 2023, Heiliger et al., 2022).
Breakdown by cohort (representative figures, where available):
| Subgroup | Number of Studies |
|---|---|
| Lung cancer | 168 |
| Melanoma | 188 |
| Lymphoma | 145 |
| Negative controls | 513 |
Patient-level demographics (age, sex distribution), scanner model assignments, and distribution by clinical site are not systematically disclosed in most challenge reports. Sex counts are reported for the UKT cohort (570 male, 444 female; these totals match the number of studies rather than unique patients) (Wang et al., 2024). For detailed protocol variables (injection activity, uptake time, slice thickness), researchers are referred to the original FDG-PET-CT-Lesions dataset reference (Gatidis et al., 2022) and the TCIA data archive (Kalisch et al., 2024, Wang et al., 2024, Heiliger et al., 2022).
2. Imaging Acquisition and Protocol Variability
All studies include paired, spatially aligned FDG-PET and low-dose CT volumes, acquired in a clinical oncologic staging context (Heiliger et al., 2022, Alloula et al., 2023). Most reported FDG studies originate from UKT using a Siemens Biograph mCT scanner, with LMU scans included in later challenge editions (Wang et al., 2024, Alloula et al., 2023).
Critical protocol details, such as PET/CT vendor settings, injected FDG dose, uptake duration, voxel spacing, slice thickness, and reconstruction kernels, are largely unspecified in the challenge and methods papers and must instead be sourced from the dataset's public metadata (e.g., DOI: 10.7937/gkr0-xv29, Zenodo, or challenge releases) (Chutani et al., 2024, Kalisch et al., 2024, Wang et al., 2024, Alloula et al., 2023, Heiliger et al., 2022). Common clinical standards for FDG-PET acquisition can nevertheless be presumed: approximately 3–4 MBq/kg injected activity, ~60 min uptake time, and low-dose CT at around 120 kVp with variable tube current (Alloula et al., 2023).
Volumes are stored as DICOM at source and converted to NIfTI for deep learning workflows (Rokuss et al., 2024, Kalisch et al., 2024, Alloula et al., 2023). All cases are provided as paired PET and CT channels, typically configured for nnU-Net or equivalent volumetric neural network inputs (Kalisch et al., 2024, Chutani et al., 2024, Ahamed et al., 2023).
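As a minimal illustration of the DICOM-to-NIfTI conversion step using SimpleITK (the paths are hypothetical; the actual archive layout follows the TCIA release):

```python
import SimpleITK as sitk

def dicom_series_to_nifti(dicom_dir: str, out_path: str) -> None:
    """Read one DICOM series from a folder and write it as a single NIfTI volume."""
    reader = sitk.ImageSeriesReader()
    file_names = reader.GetGDCMSeriesFileNames(dicom_dir)  # sorted slice files
    reader.SetFileNames(file_names)
    volume = reader.Execute()          # 3D image; spacing/origin/direction preserved
    sitk.WriteImage(volume, out_path)  # .nii.gz keeps the geometry metadata

# Hypothetical paths for illustration only.
dicom_series_to_nifti("patient_0001/PET", "patient_0001_PET.nii.gz")
```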
3. Annotation Protocol, Ground Truth, and Dataset Format
Manual 3D segmentation of all FDG-avid tumor lesions constitutes the ground truth for algorithm benchmarking. Annotations were generally performed by domain-expert radiologists or nuclear medicine physicians, most frequently a single reader per center (e.g., 10 years’ experience at UKT, 8 years at LMU), though multi-reader annotation and inter-observer reliability metrics are not systematically reported (Chutani et al., 2024, Wang et al., 2024, Alloula et al., 2023).
Annotation was performed directly on hybrid PET/CT, targeting regions where FDG uptake exceeded the local physiological background and corresponded to morphological abnormality on CT. Anatomical labeling and exclusion criteria (e.g., physiological uptake in brain, myocardium, urinary system, inflammatory tissues) are described in the “FDG-PET-CT-Lesions” benchmark publications but often only referenced, not restated, in challenge technical reports (Alloula et al., 2023, Heiliger et al., 2022, Ahamed et al., 2023).
Mask labels are provided in NIfTI format, binary-coded (lesion=1, background=0), aligned voxel-wise to the input imaging (Rokuss et al., 2024, Kalisch et al., 2024, Hadlich et al., 2023). No granular breakdown of lesion count or volume by anatomical site is provided in most challenge documentation; aggregate statistics report ≈8,781 FDG lesions across 1,014 scans, mean ≈8.7 lesions/scan, with a heavy-tailed per-patient distribution (Ahamed, 2024).
4. Preprocessing, Data Partitioning, and Augmentation Strategies
Preprocessing
Standardized pipelines are essential for harmonizing heterogeneous multi-site, multi-vendor PET/CT data. Preprocessing steps, implemented primarily following nnU-Net v1 or v2 conventions, encompass the following (a minimal sketch follows the list):
- Spatial resampling of PET/CT volumes to a uniform (commonly isotropic 2 mm or challenge-specific full-res) grid, using trilinear (images)/nearest-neighbor (labels) interpolation (Chutani et al., 2024, Ahamed et al., 2023, Kalisch et al., 2024, Rokuss et al., 2024).
- Intensity normalization: z-score per volume or percentile clipping plus linear scaling for both PET (SUV units) and CT (HU, typically clipped [–1024, +1024]) (Chutani et al., 2024, Alloula et al., 2023, Ahamed et al., 2023). Some pipelines add min–max normalization or custom HU windows for CT, or non-zero voxel normalization (Liu et al., 2024).
- Foreground region cropping, typically leveraging TotalSegmentator for robust body cropping and background removal (Chutani et al., 2024, Rokuss et al., 2024, Hadlich et al., 2023).
- Patch extraction: 3D cubic or rectangular patches with edge lengths of roughly 96–224 voxels, sampled to ensure foreground class balance (Alloula et al., 2023, Ahamed et al., 2023, Chutani et al., 2024).
- PET and CT channels concatenated for two-channel network input (Kalisch et al., 2024, Ahamed et al., 2023, Rokuss et al., 2024).
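The resampling and normalization steps can be sketched as follows with SimpleITK and NumPy; the 2 mm isotropic target and the [-1024, 1024] HU window follow the conventions cited above, while the function names and per-volume z-score are illustrative choices:

```python
import numpy as np
import SimpleITK as sitk

def resample_to_spacing(img: sitk.Image, spacing=(2.0, 2.0, 2.0), is_label=False):
    """Resample a volume to a uniform grid: trilinear for images, NN for labels."""
    out_size = [int(round(sz * sp / ns))
                for sz, sp, ns in zip(img.GetSize(), img.GetSpacing(), spacing)]
    interp = sitk.sitkNearestNeighbor if is_label else sitk.sitkLinear
    return sitk.Resample(img, out_size, sitk.Transform(), interp,
                         img.GetOrigin(), spacing, img.GetDirection(),
                         0.0, img.GetPixelID())

def normalize_and_stack(ct: np.ndarray, pet_suv: np.ndarray) -> np.ndarray:
    """Clip CT to [-1024, 1024] HU, z-score both channels, stack for 2-channel input."""
    ct = np.clip(ct, -1024.0, 1024.0)
    ct = (ct - ct.mean()) / (ct.std() + 1e-8)
    pet = (pet_suv - pet_suv.mean()) / (pet_suv.std() + 1e-8)
    return np.stack([ct, pet], axis=0)  # (2, D, H, W), ready for nnU-Net-style nets
```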
Data Partitioning
Data splits for algorithm training and evaluation typically employ 5-fold cross-validation, stratified by patient, cancer type, and center to minimize information leakage and class imbalance (Kalisch et al., 2024, Ahamed et al., 2023, Heiliger et al., 2022). Validation and test splits follow fixed partitions released by the challenge organizers, with hidden test sets (e.g., 200 cases) reserved for final leaderboard evaluation (Chutani et al., 2024, Wang et al., 2024, Alloula et al., 2023). Per-fold statistics and access protocols are detailed in each challenge edition.
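A minimal scikit-learn sketch of patient-grouped, diagnosis-stratified 5-fold splitting; the metadata lists below are toy stand-ins for the real study table:

```python
from sklearn.model_selection import StratifiedGroupKFold

# Toy metadata standing in for the real study list (identifiers hypothetical).
cohorts = ["melanoma", "lymphoma", "lung", "negative"]
patient_ids = [f"pat_{i // 2:03d}" for i in range(40)]  # ~2 studies per patient
diagnoses = [cohorts[(i // 2) % 4] for i in range(40)]  # one diagnosis per patient
study_idx = list(range(40))

# Group by patient so no patient leaks across folds; stratify by diagnosis.
cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(
        cv.split(study_idx, y=diagnoses, groups=patient_ids)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val studies")
```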
Data Augmentation
Augmentation is critical for generalization under site and protocol variability. Default nnU-Net 3D augmentations are employed, typically including:
- Random spatial flips (x, y, z)
- Small- to moderate-angle random 3D rotations
- Elastic deformations
- Intensity scaling and gamma correction
- Gaussian noise, blurring, and random contrast/brightness shifts
- Misalignment augmentation, simulating PET–CT registration errors (Rokuss et al., 2024); see the sketch below
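As a hedged illustration of the misalignment augmentation, a random translation can be applied to the PET channel alone, leaving CT and labels on the original grid; the ±3-voxel magnitude and trilinear interpolation below are assumptions, not the published settings:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def misalign_pet(pet: np.ndarray, max_shift_vox: float = 3.0, rng=None) -> np.ndarray:
    """Simulate PET-CT registration error by randomly translating the PET volume.

    Only PET moves; the CT channel and label mask stay fixed, so the network
    learns to tolerate residual co-registration error.
    """
    rng = rng or np.random.default_rng()
    offsets = rng.uniform(-max_shift_vox, max_shift_vox, size=3)  # (dz, dy, dx)
    return nd_shift(pet, offsets, order=1, mode="nearest")        # trilinear resample
```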
Test-time augmentation (TTA) is used for robust inference, typically applying random flips or rotations at inference time and merging the resulting predictions, e.g., by averaging or via label-fusion algorithms such as STAPLE (Chutani et al., 2024).
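A minimal PyTorch sketch of flip-based TTA, averaging predictions over the eight axis-flip combinations (plain averaging stands in here for STAPLE-style fusion):

```python
import itertools
import torch

@torch.no_grad()
def tta_flip_predict(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Average model outputs over all 8 spatial flip combinations.

    x: (B, C, D, H, W) two-channel PET/CT input; each flipped prediction is
    un-flipped before averaging so all outputs align on the original grid.
    """
    spatial = (2, 3, 4)
    combos = [c for r in range(4) for c in itertools.combinations(spatial, r)]
    total = None
    for dims in combos:
        out = model(torch.flip(x, dims)) if dims else model(x)
        out = torch.flip(out, dims) if dims else out
        total = out if total is None else total + out
    return total / len(combos)
```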
5. Evaluation Metrics and Baseline Results
The principal quantitative benchmark is voxelwise overlap, measured by the Dice Similarity Coefficient (DSC):

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

where $P$ and $G$ are the predicted and ground-truth lesion masks, respectively. Additional metrics include (a reference implementation sketch follows the list):
- False positive volume (FPV): volumetric sum of predicted regions not overlapping ground truth
- False negative volume (FNV): volume of missed ground-truth lesions
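A minimal NumPy/SciPy implementation of the three metrics, assuming binary masks on a shared voxel grid and the component-wise FPV/FNV definitions stated above (the official challenge scoring code may differ in detail):

```python
import numpy as np
from scipy.ndimage import label

def challenge_metrics(pred: np.ndarray, gt: np.ndarray,
                      voxel_volume_ml: float) -> dict:
    """Dice plus component-wise false positive / false negative volumes."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    dice = 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)

    # FPV: total volume of predicted components touching no ground-truth voxel.
    pred_cc, n_pred = label(pred)
    fpv = sum((pred_cc == i).sum() for i in range(1, n_pred + 1)
              if not gt[pred_cc == i].any()) * voxel_volume_ml

    # FNV: total volume of ground-truth lesions entirely missed by the prediction.
    gt_cc, n_gt = label(gt)
    fnv = sum((gt_cc == i).sum() for i in range(1, n_gt + 1)
              if not pred[gt_cc == i].any()) * voxel_volume_ml

    return {"dice": float(dice), "fpv_ml": float(fpv), "fnv_ml": float(fnv)}
```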
Network performance varies with architecture, preprocessing, and augmentation strategy. Representative baseline and advanced model results on the FDG set:
| Method Overview | Dice (%) | FPV (ml) | FNV (ml) |
|---|---|---|---|
| nnU-Net (default, 3D fullres) (Kalisch et al., 2024) | 65.8 | 25.3 | 9.7 |
| nnU-Net + anatomical multi-label (Kalisch et al., 2024) | 76.9 | 3.8 | 6.9 |
| ResEncL multitask w/ misalignment aug (Rokuss et al., 2024) | 77.3 | 7.8 | 10.4 |
| 3D ResUNet + GenDiceFocal (Boij et al., 2024) | 60–64 | 4–8 | 7–12 |
| Swin Transformer UNet3D (Guha et al., 6 Jan 2026) | 88.0 | - | - |
| 3D Residual U-Net (AutoPET III winning team, test) (Chutani et al., 2024) | 96.3 | - | - |
Reported Dice performance ranges widely, reflecting differences in cross-validation vs. held-out test, inclusion of negative controls, anatomical supervision, and augmentation/loss strategies (Chutani et al., 2024, Kalisch et al., 2024, Rokuss et al., 2024, Guha et al., 6 Jan 2026, Ahamed, 2024, Ahamed et al., 2023, Hadlich et al., 2023). The top leaderboard models consistently outperform baseline nnU-Net.
6. Model Architectures and Postprocessing
The dataset is a standard benchmark for U-Net variants (nnU-Net, ResEncL-M, Residual U-Net) and transformer-based methods (SwinUNet3D, Swin UNETR), with experiments in:
- Multi-task learning and anatomical organ supervision, augmenting tumor masks with (TotalSegmentator-derived) physiological organ labels to reduce false positives (Kalisch et al., 2024, Rokuss et al., 2024).
- Patch-based vs. sliding-window inference, with ensemble merging via average, weighted, or STAPLE fusion (Ahamed et al., 2023, Chutani et al., 2024).
- Data-centric and model-centric robustness, including explicit misregistration augmentations (Rokuss et al., 2024).
- Sample attention boosting to adaptively weight difficult cases during training (Wang et al., 2024).
- Focal loss (α=0.25, γ=2.0) to handle severe class imbalance (Guha et al., 6 Jan 2026, Ahamed et al., 2023); a sketch follows this list.
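A minimal PyTorch sketch of a voxelwise binary focal loss with the cited α=0.25, γ=2.0 defaults; this is one common formulation, not necessarily the exact variant used by the cited teams:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor, target: torch.Tensor,
                      alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t), averaged over voxels.

    target: float tensor of {0, 1} voxel labels, same shape as logits.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)              # prob. of the true class
    alpha_t = alpha * target + (1 - alpha) * (1 - target)  # class-balance weight
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()   # bce == -log(p_t)
```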
Postprocessing protocols typically include resampling predicted masks back to the native grid, connected-component cleaning, and probability thresholding, sometimes combined with physiological thresholds (e.g., SUV cutoffs on PET) (Kalisch et al., 2024, Chutani et al., 2024); a minimal sketch is given below.
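A minimal sketch of probability thresholding plus connected-component cleaning with SciPy; the 0.5 cutoff and minimum-size floor are illustrative values, not challenge-mandated settings:

```python
import numpy as np
from scipy.ndimage import label

def postprocess(prob: np.ndarray, threshold: float = 0.5,
                min_voxels: int = 10) -> np.ndarray:
    """Threshold a probability map, then drop tiny connected components."""
    mask = prob >= threshold
    cc, n_components = label(mask)        # 6-connectivity by default in 3D
    for i in range(1, n_components + 1):
        component = cc == i
        if component.sum() < min_voxels:  # likely noise rather than a lesion
            mask[component] = False
    return mask.astype(np.uint8)
```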
7. Data Access, Licensing, and Usage
The AutoPET III master dataset (FDG + PSMA) is accessible for non-commercial research use under an institutional data use agreement via the challenge organizers. Data are distributed through the Zenodo repository (DOI: 10.5281/zenodo.10990932), The Cancer Imaging Archive (https://doi.org/10.7937/gkr0-xv29), and the AutoPET III challenge website (https://autopet.grand-challenge.org) (Chutani et al., 2024, Kalisch et al., 2024, Wang et al., 2024). Codebases, preprocessing scripts, and baseline models are hosted on open-source platforms, including:
- https://github.com/tanya-chutani-aira/autopetiii
- https://github.com/hakal104/autoPETIII
- https://github.com/MIC-DKFZ/autopet-3-submission
- https://github.com/anissa218/autopet_nnunet
Dataset organization conventionally follows the nnU-Net folder structure (imagesTr/, labelsTr/, imagesTs/, dataset.json), as illustrated below (Kalisch et al., 2024, Rokuss et al., 2024).
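For orientation, a conventional layout looks like this (the dataset name and case identifiers are illustrative; the _0000/_0001 suffixes index the CT and PET channels as declared in dataset.json):

```text
Dataset501_AutoPETIII/
├── dataset.json               # channel map ("0000": CT, "0001": PET), labels, counts
├── imagesTr/
│   ├── fdg_0001_0000.nii.gz   # CT channel
│   ├── fdg_0001_0001.nii.gz   # PET (SUV) channel
│   └── ...
├── labelsTr/
│   └── fdg_0001.nii.gz        # binary lesion mask (lesion=1, background=0)
└── imagesTs/                  # held-out test images, same naming scheme
```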
8. Limitations and Reporting Gaps
Several critical aspects remain insufficiently documented in challenge and methods reporting:
- Detailed demographics (age, sex), per-lesion statistics, and clinical site breakdowns are generally absent.
- Acquisition protocols (scanner/vendor parameters, injected activity, reconstruction kernel) are omitted in most technical summaries.
- Annotation methodology (exact software, consensus mechanism, inter-observer agreement) is not typically discussed, necessitating consultation of primary dataset publications for reproducibility (Alloula et al., 2023, Heiliger et al., 2022).
- Organ-by-organ lesion distributions and voxel-level histograms are infrequently reported, although lesion frequency distributions indicate a strong class imbalance (large numbers of negative controls, long-tailed lesion count per patient) (Ahamed, 2024).
These gaps underscore the importance of referencing the primary data release (TCIA, Zenodo) and Gatidis et al.’s original documentation for granular technical parameters.
The AutoPET III FDG Dataset has established itself as a critical benchmark for oncologic PET/CT lesion segmentation, supporting reproducible research and enabling robust, fair comparison of algorithmic advances in multi-tracer and multi-center imaging contexts (Chutani et al., 2024, Kalisch et al., 2024, Rokuss et al., 2024, Alloula et al., 2023, Wang et al., 2024, Guha et al., 6 Jan 2026, Ahamed, 2024, Liu et al., 2024, Heiliger et al., 2022, Hadlich et al., 2023, Ahamed et al., 2023).