
Medical Segmentation Decathlon (MSD)

Updated 21 November 2025
  • Medical Segmentation Decathlon (MSD) is a comprehensive benchmark comprising 10 independent 3D segmentation tasks with expert-annotated labels across various modalities.
  • It standardizes dataset organization and evaluation using metrics like Dice and NSD, ensuring reproducible cross-task performance analysis.
  • The benchmark has spurred advances in segmentation models, including U-Net derivatives, NAS-based, and transformer-driven architectures for robust clinical application.

The Medical Segmentation Decathlon (MSD) is a large-scale biomedical image segmentation benchmark designed to drive and evaluate the development of general-purpose algorithms across a diverse set of clinical tasks, imaging modalities, and anatomical structures. MSD comprises ten multidomain 3D segmentation challenges, each with expert-annotated labels, and has become a central resource for reproducible benchmarking, cross-task evaluation, and the development of universal medical AI segmentation models (Simpson et al., 2019, Antonelli et al., 2021).

1. Composition and Structure of the MSD

The Medical Segmentation Decathlon dataset consists of ten independent 3D segmentation tasks, each targeting distinct anatomical regions, imaging modalities, and clinical applications. All data are distributed in the NIfTI format with standardized directory structures (imagesTr/, labelsTr/, imagesTs/) and accompanying JSON descriptors (Simpson et al., 2019):
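The on-disk layout can be consumed in a few lines of Python. The following is a minimal sketch assuming a local copy of one task (the liver task and its file names are illustrative) and the nibabel package; the field accesses reflect the standard MSD dataset.json schema:

```python
import json
from pathlib import Path

import nibabel as nib  # pip install nibabel

# Illustrative: a local copy of one MSD task.
task_dir = Path("Task03_Liver")

# Every task ships a dataset.json describing modalities, labels, and splits.
meta = json.loads((task_dir / "dataset.json").read_text())
print(meta["modality"], meta["labels"], meta["numTraining"])

# Training cases are listed as relative image/label path pairs.
case = meta["training"][0]
img = nib.load(task_dir / case["image"])
lbl = nib.load(task_dir / case["label"])

# Voxel grid and spacing in mm; multi-modal tasks (e.g., brain tumour)
# stack modalities along a fourth image dimension, while labels stay 3D.
print(img.shape, img.header.get_zooms())
print(lbl.shape)
```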

| Task | Modality | Target Structures | # Volumes | Voxel Spacing / Notes |
|---|---|---|---|---|
| Brain Tumour | MRI (T1, T1-Gd, T2, FLAIR) | Glioma subregions (edema, enhancing tumor, necrosis) | 750 | Co-registered, 1 mm³ isotropic |
| Heart | MRI (3D) | Left atrium | 30 | 1.25 × 1.25 × 2.7 mm |
| Liver | CT (portal venous) | Liver, tumors | 201 | 0.5–1.0 mm in-plane, 0.45–6 mm slice |
| Hippocampus | MRI (T1 MPRAGE) | Hippocampal formation (head, body, tail) | 195 | 1 mm³ isotropic |
| Prostate | MRI (T2, ADC) | Prostate (transition, peripheral zones) | 48 | T2: 0.6 × 0.6 × 4 mm; ADC: 2 × 2 × 4 mm |
| Lung | CT (non-contrast) | Lung tumors | 96 | ~0.7 mm in-plane, <1.5 mm slice |
| Pancreas | CT (portal venous) | Parenchyma, pancreatic mass | 420 | ~0.7 mm in-plane, 2.5 mm slice |
| Hepatic Vessel | CT (portal venous) | Hepatic vessels (with/without tumors) | 443 | ~0.7–1.0 mm in-plane, 2.5–5 mm slice |
| Spleen | CT (portal venous) | Spleen | 61 | See hepatic vessel |
| Colon | CT (portal venous) | Colon tumors | 190 | ~0.7–1.0 mm in-plane, 1–7.5 mm slice |

All scans are de-identified, often resampled to consistent spatial resolution per task, and annotated by domain-expert radiologists or neuroscientists. Each task presents unique challenges in label imbalance, anatomical variability, or imaging protocol heterogeneity (Simpson et al., 2019, Antonelli et al., 2021, Chernenkiy et al., 31 Jul 2024).
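As a concrete illustration of per-task spacing harmonization, the sketch below resamples a volume to a fixed target spacing with SciPy. The target spacing and file path are illustrative assumptions; production pipelines (e.g., nnU-Net) derive per-task target spacings from dataset statistics:

```python
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def resample_to_spacing(nifti_path, target_spacing=(1.0, 1.0, 1.0), order=1):
    """Resample a 3D NIfTI volume to a fixed voxel spacing in mm.

    order=1 (trilinear) suits images; use order=0 for label maps so that
    integer class IDs are not interpolated.
    """
    img = nib.load(nifti_path)
    data = img.get_fdata(dtype=np.float32)
    spacing = img.header.get_zooms()[:3]              # current (x, y, z) in mm
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return zoom(data, factors, order=order)

# e.g. vol = resample_to_spacing("Task09_Spleen/imagesTr/spleen_2.nii.gz")
```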

2. Benchmark Philosophy and Challenge Setup

The MSD was explicitly conceived to move beyond single-task, single-modality segmentation benchmarks. Organizers hypothesized that a model that achieves strong, consistent performance across varied tasks would generalize to new, unseen challenges and thus defined the "decathlon" protocol: participants must submit a single, adaptable pipeline—fixed architecture, hyperparameters, preprocessing—for all ten tasks (Antonelli et al., 2021).

MSD tasks cover a deliberate axis of biomedical imaging difficulties:

  • Small sample sizes (heart: 30 volumes), strongly imbalanced labels (large organs vs. small lesions), multi-site/institutional variability, multi-modal inputs (multi-parametric MRI, CT), and tasks requiring precise small-structure segmentation (e.g., hippocampus, hepatic vessels).
  • The tasks are split into "development" and "mystery" phases, with mystery tasks revealed only after method freeze to prevent overfitting.

Licensing follows CC-BY-SA: all data are open for commercial and non-commercial use with attribution and share-alike requirements (Simpson et al., 2019).

3. Evaluation Metrics and Ranking Methodology

Primary performance metrics in MSD are Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD) (Simpson et al., 2019, Antonelli et al., 2021):

  • DSC:

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

where $P$ and $G$ are the predicted and ground-truth voxel sets.
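A direct NumPy translation of this definition:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary voxel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```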

  • NSD:

$$\mathrm{NSD}(P, G; \tau) = \frac{1}{2} \left( \frac{|\{p \in \partial P : d(p, \partial G) \le \tau\}|}{|\partial P|} + \frac{|\{g \in \partial G : d(g, \partial P) \le \tau\}|}{|\partial G|} \right)$$

with $\tau$ set per task (e.g., 1 mm), quantifying surface agreement.
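A voxel-based sketch of this formula using SciPy distance transforms (the official evaluation measures sub-voxel surface elements, so values will differ slightly; this version also assumes non-empty surfaces):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def nsd(pred, gt, spacing, tau=1.0):
    """Normalized Surface Dice at tolerance tau (mm), per the formula above.

    Boundaries are approximated as mask-minus-erosion; `spacing` is the
    per-axis voxel spacing in mm so anisotropic volumes are handled.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surf = pred & ~binary_erosion(pred)   # boundary of P
    gt_surf = gt & ~binary_erosion(gt)         # boundary of G

    # Distance (mm) from every voxel to the nearest surface voxel of the
    # other mask.
    dist_to_gt = distance_transform_edt(~gt_surf, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)

    frac_pred = (dist_to_gt[pred_surf] <= tau).mean()  # share of ∂P near ∂G
    frac_gt = (dist_to_pred[gt_surf] <= tau).mean()    # share of ∂G near ∂P
    return 0.5 * (frac_pred + frac_gt)
```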

Aggregate rankings use significance ranking: for each metric and region of interest, paired Wilcoxon signed-rank tests determine, for every algorithm, how many competitors it significantly outperforms (with no adjustment for multiple comparisons). The decathlon score then averages these significance ranks across tasks and metrics per submission (Antonelli et al., 2021). Ancillary metrics such as the 95th-percentile Hausdorff Distance (HD95) and Average Symmetric Surface Distance (ASD) are reported in many follow-up works.
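A sketch of the per-task significance ranking, assuming a matrix of per-case scores per algorithm; the significance level and one-sided test direction are illustrative assumptions rather than the challenge's exact configuration:

```python
import numpy as np
from scipy.stats import wilcoxon

def significance_ranks(scores, alpha=0.05):
    """Significance rank per algorithm on one task/ROI/metric.

    scores: array of shape (n_algorithms, n_cases) with per-case metric
    values (e.g., Dice). For each algorithm, count how many competitors it
    beats with a one-sided paired Wilcoxon signed-rank test; no correction
    for multiple comparisons, mirroring the MSD protocol.
    """
    n = scores.shape[0]
    ranks = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            _, p = wilcoxon(scores[i], scores[j], alternative="greater")
            if p < alpha:
                ranks[i] += 1
    return ranks  # averaged over tasks/metrics to form the decathlon score

# e.g. ranks = significance_ranks(np.random.rand(5, 30))
```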

4. Algorithmic Approaches

MSD has catalyzed a diverse array of segmentation architectures and universal-learning paradigms:

  • Fully convolutional and U-Net derivatives dominated the initial field, with adaptations including 3D convolutions, residual connections, instance normalization, and deep supervision (Antonelli et al., 2021, Rippel et al., 2020); a minimal sketch of such a block appears after this list.
  • AutoML and Neural Architecture Search (NAS): Automated search for task-specific or universal architectures is exemplified by DiNTS (differentiable 3D topology search), achieving the highest overall mean Dice and NSD on MSD at time of publication, surpassing highly engineered U-Net ensembles (He et al., 2021). Other NAS approaches include V-NAS (2D/3D/P3D cell search) (Zhu et al., 2019) and HyperSegNAS (one-shot supernet with meta-conditional reweighting) (Peng et al., 2021).
  • Transformer-based and hybrid models: Vision transformer (ViT), Swin Transformer, and hybrid CNN-Transformer encoders demonstrate clear gains over purely convolutional counterparts in global context capture and multi-organ performance. Swin UNETR and CATS-v2 outperform earlier state-of-the-art on multiple MSD tasks, particularly on challenging multi-organ CT cases (Tang et al., 2021, Li et al., 2023, Hatamizadeh et al., 2021).
  • Text-driven and foundation models: Recent works leverage CLIP-derived label embeddings and cross-modal controllers to build universal models that segment 25+ organs and multiple tumor types across the full MSD, with computational efficiency and generalizability to external datasets (Liu et al., 2023). The adaptation of vision-language and prompt-tuned image foundation models (SAM2-3dMed, MedSAM) further signals a shift toward scalable weakly and semi-supervised segmentation pipelines (Yang et al., 10 Oct 2025, Häkkinen et al., 30 Sep 2024).
  • Weak supervision: For settings where pixel-wise annotation is scarce, methods combining low-threshold CAMs from multiple backbone classifiers via AND/OR rules match or exceed prior weakly supervised state-of-the-art, outperforming more complex pipelines with explicit pixel-level losses (Ostrowski et al., 2023).
  • Robustness: The MSD also provides a testbed for adversarial sensitivity assessments and robust model development, e.g., the ROG lattice architecture, which, when adversarially trained, maintains segmentation performance under severe input perturbations (Daza et al., 2021).
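As referenced in the first bullet, the following is a minimal PyTorch sketch of the kind of residual, instance-normalized 3D block used by many U-Net derivative entries; it is a generic illustration, not the code of any particular submission:

```python
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Residual 3D conv block with instance normalization, in the style of
    the U-Net derivatives described above."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm1 = nn.InstanceNorm3d(out_ch, affine=True)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.norm2 = nn.InstanceNorm3d(out_ch, affine=True)
        self.act = nn.LeakyReLU(0.01, inplace=True)
        # 1x1x1 projection so the identity path matches the channel count.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        identity = self.skip(x)
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + identity)

# Deep supervision: an auxiliary 1x1x1 segmentation head at each decoder
# resolution, with down-weighted losses summed during training.
aux_head = nn.Conv3d(32, 3, kernel_size=1)  # e.g., 3 foreground classes
```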

5. Generalization, Automation, and Impact

Longitudinal studies confirm the core MSD hypothesis: models displaying robust, cross-task generalization in the decathlon setting consistently transfer to new clinical segmentation problems. The pipeline-automation approach pioneered by nnU-Net—automated data analysis, pipeline configuration, and cross-validated ensembling—set a field-wide baseline and is now standard in clinical benchmark participation (Antonelli et al., 2021). MSD-driven methods, especially nnU-Net and its descendants, have won the majority of large-scale 3D segmentation challenges in the years following MSD's release (Antonelli et al., 2021).

Universal multi-dataset architectures, such as FIRENet (fabric bottleneck with ASPP3D nodes), efficiently learn optimal multi-scale arrangements with no per-task tuning and support simultaneous inference across all ten MSD tasks (Liu et al., 2020). NAS-based models further automate architecture selection and resource allocation, adjusting to task-specific memory and compute constraints (He et al., 2021, Peng et al., 2021).

MSD's open-access dataset and transparent evaluation protocol have standardized large-scale comparative analysis in 3D medical image segmentation and facilitated fair, reproducible cross-paper comparison.

6. Extensions, Limitations, and Future Directions

Subsequent expansions of MSD, such as high-quality, radiologist-validated annotations for colon/colorectal structures (Chernenkiy et al., 31 Jul 2024), as well as multi-cohort generalization studies, have revealed persistent challenges: inter- and intra-patient anatomical variability, thin and complex boundaries (e.g., colon wall, vessels), annotation protocol differences, and class imbalance remain limiting factors for universal algorithmic approaches.

Current research trajectories include foundation-model and prompt-based adaptation (e.g., SAM2-3dMed, MedSAM), text-driven universal segmentation with label embeddings, weakly and semi-supervised learning, and adversarially robust architectures, as surveyed in Section 4.

The dataset's comprehensive annotation, cross-site diversity, and public licensing have made it foundational for the evaluation of universal and robust AI segmentation systems, with ripple effects into deployment, regulatory pathways, and automated clinical workflow integration (Simpson et al., 2019, Antonelli et al., 2021).

7. Representative Results and State-of-the-Art

The following table summarizes top results (mean Dice) per method on the MSD CT tasks (Test set, various years), illustrating ongoing gains in universal model capability:

| Method | Liver | Pancreas | Hepatic Vessel | Spleen | Colon | Lung | Average (CT) |
|---|---|---|---|---|---|---|---|
| nnU-Net | 95.75 | 67.21 | 69.12 | 97.43 | 58.33 | 73.97 | 77.89 |
| DiNTS | 95.35 | 68.19 | 68.13 | 96.98 | 59.21 | 74.75 | 77.93 |
| Swin UNETR | 95.35 | 70.71 | 68.95 | 96.99 | 59.45 | 76.60 | 78.68 |
| CLIP-Driven Universal Model | 95.42 | 72.59 | 71.51 | 97.27 | 63.14 | 80.01 | 80.99 |
| SAM2-3dMed (best single task) | — | 70.39 | — | 97.27 | — | 76.27 | — |

This table aggregates mean Dice from methods reporting per-class results on MSD test data. For detailed per-class, NSD, and HD95 results, see (Antonelli et al., 2021, Tang et al., 2021, He et al., 2021, Liu et al., 2023, Yang et al., 10 Oct 2025).

A plausible implication is that the combination of self-supervision, foundation model adaptation, and label prompt engineering is pushing universal medical segmentation models toward parity with, and often beyond, task-optimized pipelines.

