Medical Segmentation Decathlon Benchmark
- Medical Segmentation Decathlon is a comprehensive benchmark uniting ten diverse, expert-annotated 3D segmentation tasks to evaluate and advance automated medical image segmentation.
- It standardizes preprocessing and evaluation metrics such as the Dice Similarity Coefficient and NSD, fostering reproducibility and unbiased comparisons across institutions.
- The initiative motivates paradigm shifts towards self-configuring deep learning models like nnU-Net, enabling parameter-free, robust segmentation that generalizes to unseen clinical tasks.
The Medical Segmentation Decathlon (MSD) is a foundational, large-scale benchmark designed to evaluate and drive progress in generalizable medical image segmentation across a wide spectrum of anatomies, imaging modalities, and clinical tasks. The initiative unified ten expert-annotated, open-access datasets and catalyzed the transition from single-task, hand-tuned pipelines towards robust, automated, and reproducible deep learning frameworks applicable "out-of-the-box" to new problems. The MSD challenge, its datasets, and its rankings have shaped both methodological and practical research trajectories in medical image analysis, including self-configuring segmentation frameworks, neural architecture search, foundation-model adaptation, and annotation-efficient learning.
1. Scope and Structure of the Medical Segmentation Decathlon
The MSD comprises ten semantically and anatomically diverse 3D segmentation tasks, each with expert-validated annotations, standardized preprocessing, and clearly specified evaluation criteria (Antonelli et al., 2021, Simpson et al., 2019). These tasks span brain tumors, cardiac chambers, abdominal organs, pelvic structures, vasculature, and major cancer sites, presenting a controlled spectrum of difficulty factors: small sample sizes (30–750 cases), pronounced class imbalance, multi-site and multi-vendor heterogeneity, and highly variable target scales. The data are provided in NIfTI format with rigorous anonymization and multi-stage preprocessing (alignment, normalization, bounding-box cropping).
| Task ID | Structure | Modality | Typical Labels | Cases (train + test) |
|---|---|---|---|---|
| 01 | Brain tumors | mp-MRI (4 seq) | Edema, necrosis, enhancing tumor | 750 |
| 02 | Left atrium | MRI | Blood pool | 30 |
| 03 | Liver / tumor | CT | Liver, tumor | 201 |
| 04 | Hippocampus | T1 MRI | Anterior, posterior | 394 |
| 05 | Prostate | T2 MRI + ADC | Peripheral & transition zones | 48 |
| 06 | Lung tumor | CT | Tumor | 96 |
| 07 | Pancreas / tumor | CT | Gland, tumor | 420 |
| 08 | Hepatic vessel/tumor | CT | Vessels, tumor | 443 |
| 09 | Spleen | CT | Spleen | 61 |
| 10 | Colon tumor | CT | Tumor | 190 |
All datasets are publicly accessible, fostering reproducibility and cross-institutional algorithm comparisons (Simpson et al., 2019, Chernenkiy et al., 2024).
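The multi-stage preprocessing mentioned above can be sketched in simplified 1-D form; the function names here are illustrative, not drawn from any MSD reference implementation:

```python
import statistics

def zscore_normalize(intensities):
    """Z-score intensity normalization, a typical MSD-style preprocessing step."""
    mu = statistics.fmean(intensities)
    sd = statistics.pstdev(intensities) or 1.0  # guard against constant input
    return [(v - mu) / sd for v in intensities]

def bounding_box_crop(volume, mask):
    """Crop a volume (1-D here, for brevity) to the extent of a foreground mask."""
    idx = [i for i, m in enumerate(mask) if m]
    lo, hi = min(idx), max(idx) + 1
    return volume[lo:hi]
```

Real pipelines apply the same ideas per axis to 3-D NIfTI arrays (loaded, e.g., via nibabel) after resampling to a common orientation and spacing.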
2. Evaluation Metrics and Protocol
Primary evaluation on MSD tasks centers on the 3D Dice Similarity Coefficient (DSC):

DSC(P, G) = 2|P ∩ G| / (|P| + |G|)

where P is the prediction mask and G the ground truth for a given class. Critical assessment is also performed using boundary-sensitive metrics, such as normalized surface distance (NSD)—the proportion of the surface within a user-defined tolerance—and 95th-percentile Hausdorff distance. A "significance ranking" procedure aggregates Wilcoxon signed-rank test outcomes across teams and test folds, producing overall and per-task algorithm rankings (Antonelli et al., 2021).
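A minimal sketch of the two headline metrics, on binary masks flattened to 1-D and surfaces given as point coordinates (a deliberate simplification of the true 3-D surface computation):

```python
def dice(pred, gt):
    """3D Dice Similarity Coefficient, 2|P∩G| / (|P|+|G|), on flattened binary masks."""
    inter = sum(p and g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 2.0 * inter / total if total else 1.0  # two empty masks agree perfectly

def nsd(surf_pred, surf_gt, tol):
    """Normalized Surface Distance: symmetric fraction of surface points lying
    within tolerance `tol` of the other surface (points as 1-D coordinates here)."""
    def frac_within(a, b):
        return sum(min(abs(p - q) for q in b) <= tol for p in a) / len(a)
    return 0.5 * (frac_within(surf_pred, surf_gt) + frac_within(surf_gt, surf_pred))
```

The significance-ranking step would then feed per-case metric values into pairwise Wilcoxon signed-rank tests (e.g. `scipy.stats.wilcoxon`) and aggregate wins into ranks.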
3. Impact on Segmentation Algorithm Development
The MSD exposed the limitations of hand-crafted or anatomy-specific pipelines and validated the effectiveness of robust, self-configuring deep learning systems. The highest-ranked submissions (predominantly U-Net variants such as nnU-Net) demonstrated that (a) consistent cross-task performance is a robust predictor of generalizability to unseen domains, and (b) fully automated, "parameter-free" architectures eliminate human-in-the-loop bias and overfitting tendencies (Antonelli et al., 2021). This was empirically confirmed by the longitudinal dominance of nnU-Net across 53 further anatomical tasks post-MSD (Antonelli et al., 2021).
Methodologically, the MSD stimulated and benchmarked multiple paradigm shifts:
- Automated hyperparameter inference and dynamic architecture adjustment (e.g., nnU-Net, plain U-Net, MPUnet) (Rippel et al., 2020, Perslev et al., 2019).
- Universal multi-dataset learning through implicit multi-scale feature routing (e.g., FIRENet) (Liu et al., 2020).
- Differentiable and one-shot neural architecture search (e.g., DiNTS, HyperSegNAS) to jointly optimize topology, memory footprint, and feature propagation across highly variable input resolutions (He et al., 2021, Peng et al., 2021).
- Transformer-based high-resolution segmentation models with persistent multi-resolution branches (e.g., HRSTNet) (Wei et al., 2022).
- Foundation models and interactive annotation strategies leveraging cross-domain transfer (e.g., SAM2-3dMed, SAM-generated pseudo labels) (Yang et al., 10 Oct 2025, Häkkinen et al., 2024, Shen et al., 2024).
4. Core Methodological Innovations and Benchmarks
Prominent frameworks and their MSD performance:
- nnU-Net: Self-configuring, 3D U-Net–based, task-agnostic system with automated patch sizing, data normalization, augmentation, and ensembling. Achieved highest median Dice across development and mystery datasets, and displayed low rank variance, confirming its robustness (Antonelli et al., 2021).
- FIRENet: 3D encoder-decoder with a fabric representation module that learns optimal sub-architecture weighting for each task, achieving simultaneous multi-task performance without per-task tuning (Liu et al., 2020).
- DiNTS: Flexible, differentiable NAS that searches both macro- and micro-level connections with explicit topology-loss and GPU memory constraints; led to best average Dice and NSD across all ten MSD tasks, particularly excelling on anatomically heterogeneous regions (He et al., 2021).
- HyperSegNAS: Introduced a topology-aware HyperNet for one-shot 3D NAS, efficiently discovering high-Dice architectures with extensive skip-connections and multi-scale fusions (Peng et al., 2021).
Additionally, high-resolution transformer models (HRSTNet), robust lightweight lattices (ROG), and annotation-efficient approaches (SAM2, pseudo-labeling) all use MSD for rigorous evaluation (Wei et al., 2022, Daza et al., 2021, Shen et al., 2024, Häkkinen et al., 2024).
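The self-configuration idea behind frameworks like nnU-Net can be illustrated with a toy heuristic. This is an assumption-laden simplification, not the actual nnU-Net planning rule: derive a patch size from the dataset's median image shape and shrink the largest axis until a fixed voxel (memory) budget is met:

```python
def plan_patch_size(median_shape, max_voxels=128 ** 3):
    """Toy self-configuration heuristic: halve the largest axis of the median
    image shape until the candidate patch fits the voxel budget."""
    patch = list(median_shape)
    while patch[0] * patch[1] * patch[2] > max_voxels:
        i = patch.index(max(patch))  # shrink the dominant axis first
        patch[i] = max(1, patch[i] // 2)
    return tuple(patch)
```

The real system couples this kind of inference to normalization scheme, network depth, batch size, and augmentation, all derived from a dataset "fingerprint" rather than manual tuning.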
5. Foundation Models, Annotation Efficiency, and Dataset Extensions
Recent directions emphasize scaling annotation impact, annotation efficiency, and leveraging computer vision foundation models for medical domains:
- SAM2 adaptation (SAM2-3dMed): Addressed anatomical continuity and boundary sharpness gaps between natural video foundation models and 3D medical images by introducing slice-relative position prediction and explicit boundary detection auxiliary losses, realizing superior Dice, IoU, HD95, and NSD on lung, spleen, and pancreas tasks (Yang et al., 10 Oct 2025).
- Interactive segmentation (SAM2, pseudo-labeling): Approaches such as zero-shot mask propagation via video-masklet tracking ("volume as video") yield near–state-of-the-art Dice on well-defined organs and drastically reduce annotation requirements compared to prior 3D IMIS systems (Shen et al., 2024). Weakly supervised U-Nets trained on SAM-generated pseudo labels with simple box prompts achieve Dice scores within 0.02–0.04 of fully supervised models (Häkkinen et al., 2024).
- Dataset expansion: The MSD format supported efforts to expand label scope (e.g., adding validated colon masks to tumor annotations in the colon dataset), raising benchmark Dice to 0.865 for tumor and 0.699 for colon with strong cross-validation robustness (Chernenkiy et al., 2024).
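The "volume as video" idea above reduces to a simple propagation loop. In the sketch below, `propagate` is an assumed callable standing in for SAM2's per-frame masklet tracking, not a real API:

```python
def propagate_through_volume(volume, seed_mask, propagate):
    """Treat axial slices as video frames: annotate one slice, then carry the
    mask slice-to-slice with a user-supplied 2-D propagation function."""
    masks = [seed_mask]
    for slice_ in volume[1:]:
        masks.append(propagate(slice_, masks[-1]))
    return masks
```

With SAM2, `propagate` would wrap the video predictor's per-frame update; the annotation saving comes from the fact that only the seed slice needs a manual mask or prompt.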
6. Community Impact, Generalizability, and Open Access
The MSD challenge and datasets established a new standard for open, unbiased, and reusable benchmarking in volumetric medical image segmentation. Empirically, algorithms that ranked consistently well across the MSD were shown to retain high performance on dozens of previously unseen segmentation tasks, including those with adversarial perturbations or rare class targets (Antonelli et al., 2021, Daza et al., 2021). The MSD catalyzed adoption of automated pipelines, democratized access to multimodal 3D data, and grounded the empirical assessment of emerging transfer learning and foundation model approaches.
The datasets—offered under the CC-BY-SA 4.0 license—remain available at http://medicaldecathlon.com/, with extensive documentation, metadata, and scripts for community extension and reuse (Simpson et al., 2019, Chernenkiy et al., 2024).
7. Limitations and Future Directions
Current state-of-the-art architectures still show degraded performance or poor boundary localization on small, irregularly shaped, or low-contrast targets (e.g., colon, vessels, hippocampus). The challenge format, by design, emphasizes volumetric overlap, and reported metrics may not completely reflect clinically relevant surface errors, motivating secondary evaluation using Hausdorff/NSD (Antonelli et al., 2021, Yang et al., 10 Oct 2025). Emerging questions include the adaptability of video-centric or vision-language foundation models, handling domain shift, and the principled combination of annotation-efficient and fully supervised approaches. Fully automated pipelines with guaranteed boundary fidelity, domain-aware adaptation, and efficient label utilization will define the next wave of MSD-driven research.
The Medical Segmentation Decathlon—by defining rigorous, multi-task benchmarks and fostering methodologically diverse, reproducible research—has become a central reference point for the development and evaluation of generalizable medical image segmentation strategies, and remains a preeminent incubator for innovation across architectures, annotation, and open science in biomedical imaging (Antonelli et al., 2021, Simpson et al., 2019).