AutoPET Challenge 2024 Overview

Updated 22 May 2026

AutoPET Challenge 2024 is an international initiative benchmarking fully automated lesion segmentation on whole-body PET/CT imaging, emphasizing cross-tracer and multi-center generalization.
The challenge evaluates segmentation performance using metrics like Dice Similarity Coefficient, False Negative Volume, and False Positive Volume to ensure robust clinical assessment.
Data-centric innovations, including reduced-amplitude augmentation and CT misalignment augmentation, improved Dice by 0.81 percentage points over the baseline, with similar gains for both tracers.

AutoPET Challenge 2024, formally the third autoPET challenge (autoPET3), is an international benchmarking initiative held at MICCAI 2024, designed to assess the generalization ability of fully automated lesion segmentation algorithms applied to whole-body PET/CT imaging across multiple tracers and clinical centers. The 2024 challenge expanded the complexity of the segmentation task by explicitly requiring algorithms to operate robustly across both 18F-FDG and PSMA tracers, with a test set incorporating unseen tracer-institution pairings, and introduced a dedicated category focused on data-centric optimization using a fixed baseline model. This challenge established new technical baselines for multi-tracer, multi-center lesion segmentation and provided the largest publicly available annotated PSMA PET/CT dataset to date (Dexl et al., 7 May 2026).

1. Challenge Structure and Dataset Composition

AutoPET 2024 comprised two award tracks: AC1 (“Best generalizing model”), permitting open choice of architecture and external data, and AC2 (“Data-centric excellence”), restricting participants to a fixed baseline model (nnU-Net/MONAI) but enabling arbitrary data handling and curation strategies (Dexl et al., 7 May 2026, Kovacs et al., 2024). The training data included:

1,014 [18F]-FDG PET/CT scans from University Hospital Tübingen (UKT)
597 [68Ga]/[18F]-PSMA PET/CT scans from LMU Munich

The test set (200 studies) was explicitly partitioned to assess compositional generalization, with in-domain splits (FDG/UKT, PSMA/LMU) and out-of-domain splits representing unseen tracer-center combinations (FDG/LMU, PSMA/UKT). Test set annotation was withheld to enable unbiased leaderboard evaluation.

All volumes comprised co-registered, whole-body 3D PET (Standardized Uptake Value) and CT (Hounsfield Units) data. Lesion masks, which serve as ground truth, were delineated by expert radiologists. The PSMA dataset is the largest publicly available expert-labeled PSMA PET/CT dataset to date, supporting research into tracer-specific generalization (Dexl et al., 7 May 2026).

2. Evaluation Protocol and Metrics

Assessment focused on detection and volumetric overlap of all tracer-avid tumor lesions, operationalized as a binary voxel-wise segmentation task. Primary metrics included:

Dice Similarity Coefficient (DSC):

$\mathrm{DSC} = \frac{2\,|G \cap P|}{|G| + |P|}$

where $G$ and $P$ denote ground-truth and predicted lesion masks, respectively.

False Negative Volume (FNV):

$\mathrm{FNV} = v \sum_i |G_i|\mathbb{1}(|G_i \cap P| = 0)$

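where $G_i$ is the $i$-th connected component of the ground-truth mask and $v$ is the physical volume of a single voxel. FNV sums the volumes of ground-truth lesions that the prediction misses entirely.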

False Positive Volume (FPV):

$\mathrm{FPV} = v \sum_l |P_l|\mathbb{1}(|P_l \cap G| = 0)$

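where $P_l$ is the $l$-th connected component of the predicted mask and $v$ is the voxel volume. FPV measures the volume of predicted components that overlap no ground-truth lesion (Dexl et al., 7 May 2026, Kovacs et al., 2024).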
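The three metrics can be sketched in a few lines of NumPy and SciPy. This is a minimal illustration, not the official evaluation code: the function names are hypothetical, `voxel_ml` stands for $v$ expressed in millilitres, and connected components use SciPy's default face connectivity, which is an assumption about the component definition.

```python
import numpy as np
from scipy import ndimage


def dice(gt: np.ndarray, pred: np.ndarray) -> float:
    """DSC = 2|G ∩ P| / (|G| + |P|) on binary masks."""
    gt, pred = gt.astype(bool), pred.astype(bool)
    denom = gt.sum() + pred.sum()
    if denom == 0:
        return float("nan")  # undefined when both masks are empty
    return float(2.0 * np.logical_and(gt, pred).sum() / denom)


def _unmatched_volume(source: np.ndarray, other: np.ndarray, voxel_ml: float) -> float:
    """Volume of connected components of `source` sharing no voxel with `other`."""
    source, other = source.astype(bool), other.astype(bool)
    labels, n = ndimage.label(source)  # face connectivity (assumed)
    if n == 0:
        return 0.0
    flat = labels.ravel()
    sizes = np.bincount(flat, minlength=n + 1)[1:]  # voxels per component
    overlap = np.bincount(flat, weights=other.ravel().astype(float),
                          minlength=n + 1)[1:]      # overlapping voxels per component
    return float(voxel_ml * sizes[overlap == 0].sum())


def false_negative_volume(gt, pred, voxel_ml):
    """Volume of ground-truth lesions that are missed entirely."""
    return _unmatched_volume(gt, pred, voxel_ml)


def false_positive_volume(gt, pred, voxel_ml):
    """Volume of predicted components that touch no ground-truth lesion."""
    return _unmatched_volume(pred, gt, voxel_ml)
```

Because FNV and FPV are symmetric in the roles of the two masks, both reduce to the same helper with the arguments swapped.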

Evaluation was performed only on lesion-positive cases to avoid inflation of DSC by trivial empty masks. Final rankings were determined by aggregate performance across all four test splits using multiple bootstrapped and alternative ranking schemes to confirm stability at the top of the leaderboard (Dexl et al., 7 May 2026).
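The idea of checking leaderboard stability by bootstrapping can be illustrated with a short sketch. It assumes, purely for illustration, that algorithms are ranked by mean per-case DSC; the challenge's actual aggregation across metrics and splits is not reproduced here, and all names are hypothetical.

```python
import random
from collections import Counter


def rank_by_mean(scores: dict[str, list[float]], case_ids: list[int]) -> tuple[str, ...]:
    """Order algorithms from best to worst by mean DSC over the given cases."""
    means = {
        name: sum(vals[i] for i in case_ids) / len(case_ids)
        for name, vals in scores.items()
    }
    return tuple(sorted(means, key=means.get, reverse=True))


def bootstrap_rankings(scores: dict[str, list[float]],
                       n_boot: int = 1000, seed: int = 0) -> Counter:
    """Resample cases with replacement and count how often each ordering occurs."""
    rng = random.Random(seed)
    n_cases = len(next(iter(scores.values())))
    counts: Counter = Counter()
    for _ in range(n_boot):
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        counts[rank_by_mean(scores, sample)] += 1
    return counts


# Toy per-case DSC values (made up for illustration only).
toy_scores = {
    "A": [0.8, 0.7, 0.9, 0.6],
    "B": [0.5, 0.4, 0.6, 0.3],
    "C": [0.1, 0.2, 0.1, 0.2],
}
ranking_counts = bootstrap_rankings(toy_scores, n_boot=200)
```

A ranking is considered stable when a single ordering accounts for nearly all bootstrap replicates, as it does in this toy case where one algorithm dominates on every case.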

3. Dominant Methodologies and Model Architectures

Most entries were based on nnU-Net or its derivatives, leveraging deep residual encoder UNets and multi-branch ensemble strategies (Dexl et al., 7 May 2026, Chutani et al., 2024). Common methodological elements included:

Input Handling: Multichannel concatenation of PET (SUV-converted, intensity-normalized) and CT (HU-normalized/cropped) (Kovacs et al., 2024, Chutani et al., 2024).
Patch-based Training: Sampling of contextually rich 3D patches, often with empirical balancing of lesion/background occurrence for class imbalance mitigation (Chutani et al., 2024, Mesbah et al., 2024).
Augmentation Pipelines: Affine transformations, elastic deformations, and intensity perturbations were standard. Data-centric tracks explored domain-specific augmentation, such as CT-only rigid misalignment (“misalDA”), to simulate clinical registration errors (Kovacs et al., 2024). In addition, generative data augmentation using diffusion-based synthetic lesion insertion (DiffTumor adaptation) was shown to provide up to +8.9% DSC in limited data regimes (Chan et al., 2024).
Inference-Time Optimization: Dynamic test-time augmentation (TTA) and ensembling, governed by scan-size-dependent schedulers, maximized performance within hard 5-minute runtime constraints per test case (Kovacs et al., 2024).

The most effective entries employed model ensembles (cross-validated fold models, sometimes fused by STAPLE), extensive TTA, and anatomical cropping via external tools such as TotalSegmentator (Chutani et al., 2024, Mesbah et al., 2024).
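A minimal sketch of mirror-based test-time augmentation combined with fold ensembling is shown here. It assumes each model is wrapped in a hypothetical callable `predict_fn` that maps a `(C, X, Y, Z)` array to a `(X, Y, Z)` lesion-probability map; it does not reproduce any participant's pipeline, and STAPLE fusion is replaced by simple averaging.

```python
import itertools
import numpy as np


def flip_tta_predict(predict_fn, volume: np.ndarray, max_passes: int = 8) -> np.ndarray:
    """Average lesion probabilities over mirrored copies of a (C, X, Y, Z) volume."""
    spatial_axes = (1, 2, 3)
    # All subsets of the spatial axes, identity first: 8 combinations in total.
    combos = [c for r in range(4) for c in itertools.combinations(spatial_axes, r)]
    combos = combos[:max(1, max_passes)]
    total = np.zeros(volume.shape[1:], dtype=np.float64)
    for axes in combos:
        flipped = np.flip(volume, axis=axes) if axes else volume
        prob = predict_fn(flipped)
        # The output has no channel axis, so spatial axes shift down by one.
        prob_axes = tuple(a - 1 for a in axes)
        # Undo the flip so every prediction is aligned before averaging.
        total += np.flip(prob, axis=prob_axes) if axes else prob
    return total / len(combos)


def ensemble_predict(predict_fns, volume: np.ndarray, max_passes: int = 8) -> np.ndarray:
    """Mean of the TTA-averaged probabilities of several models (e.g. CV folds)."""
    return np.mean([flip_tta_predict(f, volume, max_passes) for f in predict_fns], axis=0)
```

The `max_passes` argument gives a simple handle for trading accuracy against runtime, which matters under a per-case time limit.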

4. Data-Centric Innovations

The 2024 challenge explicitly benchmarked data handling and curation. The winning data-centric pipeline (“subtleDA + misalDA”) involved:

Minimalist Affine Augmentation: Reduction of transformation amplitudes to 10–20% of the nnU-Net default for realistic deformation (Kovacs et al., 2024).
Removal of Smoothing/Extreme Contrast: Gaussian smoothing and invert-gamma were omitted to preserve the visibility of tiny lesions (1–5 voxels).
CT-Only Misalignment Augmentation: With probability 0.1, the CT was rotated ($\theta \sim \mathrm{Uniform}(-5^\circ, +5^\circ)$) and shifted (offsets $t_x, t_y$ drawn from $[-2, 2]$ voxels) while PET and labels were held fixed, forcing robustness to inter-modality registration perturbations.
Dynamic Inference Scheduler: Test-time scheduling of ensemble/TTA permutations per scan, bounded by per-scan inference latency (≤5 min), allocating up to 170 seconds to multi-fold TTA and the remainder to I/O overhead.
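The CT-only misalignment step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the rotation is taken to be in-plane (first two array axes), the shift is sampled uniformly and continuously, linear interpolation with edge padding is used, and the function name is hypothetical.

```python
import numpy as np
from scipy import ndimage


def misalign_ct(ct: np.ndarray, rng: np.random.Generator, p: float = 0.1,
                max_deg: float = 5.0, max_shift: float = 2.0) -> np.ndarray:
    """With probability p, rotate and shift a 3D CT volume; otherwise return it as is."""
    if rng.random() >= p:
        return ct
    theta = rng.uniform(-max_deg, max_deg)               # degrees
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)  # voxels
    # In-plane rotation about the volume centre (axis choice is an assumption).
    out = ndimage.rotate(ct, theta, axes=(0, 1), reshape=False,
                         order=1, mode="nearest")
    # Translate along the first two axes only.
    return ndimage.shift(out, shift=(tx, ty, 0.0), order=1, mode="nearest")


def augment_sample(pet, ct, label, rng):
    """PET and label pass through untouched; only the CT is perturbed."""
    return pet, misalign_ct(ct, rng), label
```

Leaving PET and labels fixed is the point of the technique: the network sees a lesion whose anatomical context is slightly displaced, as can happen with imperfect PET/CT registration.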

This pipeline yielded a gain of 0.81 percentage points in Dice and reduced FNV by about 6 mL relative to the baseline, particularly benefiting the detection of punctate lesions. Gains were similar for both tracers, suggesting improved cross-domain generalization (Kovacs et al., 2024).
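One way to picture a per-scan inference scheduler is a small budget solver. The sketch assumes a linear runtime model (time proportional to the number of sliding-window patches) and a preference for more folds over more TTA passes when two plans cost the same; both are illustrative assumptions, and `Plan` and `schedule` are hypothetical names. Only the 170-second budget comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    n_folds: int  # number of cross-validation models to run
    n_tta: int    # forward passes per model (1 means no augmentation)


def schedule(n_patches: int, sec_per_patch: float, budget_s: float = 170.0,
             max_folds: int = 5, max_tta: int = 2) -> Plan:
    """Pick the largest fold/TTA configuration whose estimated runtime fits the budget."""
    pass_time = n_patches * sec_per_patch  # one model, one orientation
    feasible = [
        Plan(folds, tta)
        for folds in range(1, max_folds + 1)
        for tta in range(1, max_tta + 1)
        if folds * tta * pass_time <= budget_s
    ]
    if not feasible:
        return Plan(1, 1)  # always run at least a single pass
    # Maximise total forward passes; break ties in favour of more folds.
    return max(feasible, key=lambda p: (p.n_folds * p.n_tta, p.n_folds))
```

Under this model a small scan receives the full five-fold, two-pass configuration, while a very large scan falls back to a single forward pass.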

5. Quantitative Results and Benchmark Performance

The top-performing system in the main challenge (“LesionTracer A”) achieved, averaged over four compositional test splits (Dexl et al., 7 May 2026):

Metric	Winner (LesionTracer A)	Baseline (nnU-Net)
Mean DSC	0.66	~0.58
Mean FNV (mL)	3.18	8.21
Mean FPV (mL)	2.78	2.25

Key insights:

In-domain splits (tracer+center seen in training) yielded DSC ~0.70, comparable to human reader agreement levels.
Out-of-domain compositional splits suffered systematic volume overestimation (median +70%), attributed to false positives in previously unseen physiological uptake patterns. Failures to detect small or low-uptake lesions were also reported, especially in PSMA/LMU (Dexl et al., 7 May 2026).
Bootstrap resampling and alternative ranking kept the top-four algorithm ordering stable, indicating robust metric separation at the top of the leaderboard (Dexl et al., 7 May 2026).

6. Limitations, Patient Variability, and Future Focus

Linear mixed-effects modeling revealed that inter-patient heterogeneity explained ~61% of DSC variance, with only ~1.3% attributable to algorithm choice among top entries. Lesion characteristics—especially volume (<0.2 mL) and SUVmax (<4)—strongly affected detectability, with sub-60% sensitivity for the smallest/lowest-uptake lesions. These factors caused most residual performance variation (Dexl et al., 7 May 2026).

While in-domain, multi-tracer PET/CT segmentation appears to be nearing the clinical reader agreement ceiling, compositional generalization to new tracer–center pairings remains unresolved, with large-scale, harmonized dataset collection and domain-adaptive learning flagged as essential directions (Dexl et al., 7 May 2026).

7. Practical Recommendations and Implications

Based on systematic ablation and benchmarking across challenge entries, the following synthesis was recommended for future AutoPET-like segmentation pipelines (Kovacs et al., 2024, Dexl et al., 7 May 2026):

Start with a robust data augmentation (DA) baseline; avoid any transformation that blurs or inverts intensities, as this degrades performance on punctate lesions.
Incorporate photorealistic PET/CT misalignment augmentation, preferably by jittering CT only with magnitude selection grounded in clinical registration variance.
Adopt online (per-epoch) augmentation to maximize exposure; avoid pre-generation.
Schedule TTA/ensembles per case, adapting to scan size and hardware limits, with at most two TTA passes per model for efficiency.
Post-process with SUV-based masking (e.g., SUV < 1.0), particularly to mitigate false positives in regions of high physiological uptake.
Validate via 5-fold cross-validation stratified across tracers, reporting balanced and per-tracer Dice.
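The validation recommendation can be sketched in plain Python. The fold assignment deals cases round-robin within each tracer, and "balanced" Dice is interpreted here as the unweighted mean of the per-tracer means; that interpretation and the function names are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(tracers: list[str], k: int = 5, seed: int = 0) -> list[int]:
    """Assign each case a fold index so every fold has a similar tracer mix."""
    rng = random.Random(seed)
    by_tracer: dict[str, list[int]] = defaultdict(list)
    for idx, tracer in enumerate(tracers):
        by_tracer[tracer].append(idx)
    folds = [0] * len(tracers)
    for indices in by_tracer.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[idx] = pos % k  # deal cases round-robin within each tracer
    return folds


def dice_report(dice_scores: list[float], tracers: list[str]) -> dict[str, float]:
    """Per-tracer mean Dice plus their unweighted ('balanced') average."""
    groups: dict[str, list[float]] = defaultdict(list)
    for score, tracer in zip(dice_scores, tracers):
        groups[tracer].append(score)
    report = {t: sum(v) / len(v) for t, v in groups.items()}
    report["balanced"] = sum(report.values()) / len(report)
    return report
```

Reporting the balanced figure alongside per-tracer values keeps the larger FDG cohort from masking weaker PSMA performance in a pooled average.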

The field is now moving toward instance- and region-aware metrics, uncertainty quantification for triage, and interactive or semi-supervised workflows for compositional generalization, as highlighted by the forward-looking orientation of AutoPET IV (Huang et al., 2 Sep 2025).

References:

(Dexl et al., 7 May 2026) The autoPET3 Challenge -- Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization
(Kovacs et al., 2024) Data-Centric Strategies for Overcoming PET/CT Heterogeneity: Insights from the AutoPET III Lesion Segmentation Challenge
(Chutani et al., 2024) AutoPET III Challenge: Tumor Lesion Segmentation using ResEnc-Model Ensemble
(Mesbah et al., 2024) AutoPETIII: The Tracer Frontier. What Frontier?
(Ahamed, 2024) AutoPET Challenge III: Testing the Robustness of Generalized Dice Focal Loss trained 3D Residual UNet
(Chan et al., 2024) AutoPET Challenge: Tumour Synthesis for Data Augmentation
(Huang et al., 2 Sep 2025) autoPET IV challenge: Incorporating organ supervision and human guidance for lesion segmentation in PET/CT