
Medical Segmentation Decathlon (MSD)

Updated 21 November 2025
  • Medical Segmentation Decathlon (MSD) is a comprehensive benchmark comprising 10 independent 3D segmentation tasks with expert-annotated labels across various modalities.
  • It standardizes dataset organization and evaluation using metrics like Dice and NSD, ensuring reproducible cross-task performance analysis.
  • The benchmark has spurred advances in segmentation models, including U-Net derivatives, NAS-based, and transformer-driven architectures for robust clinical application.

The Medical Segmentation Decathlon (MSD) is a large-scale biomedical image segmentation benchmark designed to drive and evaluate the development of general-purpose algorithms across a diverse set of clinical tasks, imaging modalities, and anatomical structures. MSD comprises ten multidomain 3D segmentation challenges, each with expert-annotated labels, and has become a central resource for reproducible benchmarking, cross-task evaluation, and the development of universal medical AI segmentation models (Simpson et al., 2019, Antonelli et al., 2021).

1. Composition and Structure of the MSD

The Medical Segmentation Decathlon dataset consists of ten independent 3D segmentation tasks, each targeting distinct anatomical regions, imaging modalities, and clinical applications. All data are distributed in the NIfTI format with standardized directory structures (imagesTr/, labelsTr/, imagesTs/) and accompanying JSON descriptors (Simpson et al., 2019):
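The on-disk layout can be consumed in a few lines of Python. The following is a minimal sketch assuming a local copy of one task (the liver task and its file names are illustrative) and the nibabel package; the field accesses reflect the standard MSD dataset.json schema:

```python
import json
from pathlib import Path

import nibabel as nib  # pip install nibabel

# Illustrative: a local copy of one MSD task.
task_dir = Path("Task03_Liver")

# Every task ships a dataset.json describing modalities, labels, and splits.
meta = json.loads((task_dir / "dataset.json").read_text())
print(meta["modality"], meta["labels"], meta["numTraining"])

# Training cases are listed as relative image/label path pairs.
case = meta["training"][0]
img = nib.load(task_dir / case["image"])
lbl = nib.load(task_dir / case["label"])

# Voxel grid and spacing in mm; multi-modal tasks (e.g., brain tumour)
# stack modalities along a fourth image dimension, while labels stay 3D.
print(img.shape, img.header.get_zooms())
print(lbl.shape)
```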

| Task | Modality | Target Structures | # Volumes | Voxel Spacing / Notes |
|---|---|---|---|---|
| Brain Tumour | MRI (T1, T1-Gd, T2, FLAIR) | Glioma subregions (edema, enhancing tumor, necrosis) | 750 | Co-registered, 1 mm³ isotropic |
| Heart | MRI (3D) | Left atrium | 30 | 1.25 × 1.25 × 2.7 mm |
| Liver | CT (portal venous) | Liver, tumors | 201 | 0.5–1.0 mm in-plane, 0.45–6 mm slice |
| Hippocampus | MRI (T1 MPRAGE) | Hippocampal formation (head, body, tail) | 195 | 1 mm³ isotropic |
| Prostate | MRI (T2, ADC) | Prostate (transition, peripheral zones) | 48 | T2: 0.6 × 0.6 × 4 mm; ADC: 2 × 2 × 4 mm |
| Lung | CT (non-contrast) | Lung tumors | 96 | ~0.7 mm in-plane, <1.5 mm slice |
| Pancreas | CT (portal venous) | Parenchyma, pancreatic mass | 420 | ~0.7 mm in-plane, 2.5 mm slice |
| Hepatic Vessel | CT (portal venous) | Hepatic vessels (with/without tumors) | 443 | ~0.7–1.0 mm in-plane, 2.5–5 mm slice |
| Spleen | CT (portal venous) | Spleen | 61 | See hepatic vessel |
| Colon | CT (portal venous) | Colon tumors | 190 | ~0.7–1.0 mm in-plane, 1–7.5 mm slice |

All scans are de-identified, often resampled to consistent spatial resolution per task, and annotated by domain-expert radiologists or neuroscientists. Each task presents unique challenges in label imbalance, anatomical variability, or imaging protocol heterogeneity (Simpson et al., 2019, Antonelli et al., 2021, Chernenkiy et al., 31 Jul 2024).
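As a concrete illustration of per-task spacing harmonization, the sketch below resamples a volume to a fixed target spacing with SciPy. The target spacing and file path are illustrative assumptions; production pipelines (e.g., nnU-Net) derive per-task target spacings from dataset statistics:

```python
import nibabel as nib
import numpy as np
from scipy.ndimage import zoom

def resample_to_spacing(nifti_path, target_spacing=(1.0, 1.0, 1.0), order=1):
    """Resample a 3D NIfTI volume to a fixed voxel spacing in mm.

    order=1 (trilinear) suits images; use order=0 for label maps so that
    integer class IDs are not interpolated.
    """
    img = nib.load(nifti_path)
    data = img.get_fdata(dtype=np.float32)
    spacing = img.header.get_zooms()[:3]              # current (x, y, z) in mm
    factors = [s / t for s, t in zip(spacing, target_spacing)]
    return zoom(data, factors, order=order)

# e.g. vol = resample_to_spacing("Task09_Spleen/imagesTr/spleen_2.nii.gz")
```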

2. Benchmark Philosophy and Challenge Setup

The MSD was explicitly conceived to move beyond single-task, single-modality segmentation benchmarks. Organizers hypothesized that a model that achieves strong, consistent performance across varied tasks would generalize to new, unseen challenges and thus defined the "decathlon" protocol: participants must submit a single, adaptable pipeline—fixed architecture, hyperparameters, preprocessing—for all ten tasks (Antonelli et al., 2021).

MSD tasks cover a deliberate axis of biomedical imaging difficulties:

  • Small sample sizes (heart: 30 volumes), strongly imbalanced labels (large organs vs. small lesions), multi-site/institutional variability, multi-modal inputs (multi-parametric MRI, CT), and tasks requiring precise small-structure segmentation (e.g., hippocampus, hepatic vessels).
  • The tasks are split into "development" and "mystery" phases, with mystery tasks revealed only after method freeze to prevent overfitting.

Licensing follows CC-BY-SA: all data are open for commercial and non-commercial use with attribution and share-alike requirements (Simpson et al., 2019).

3. Evaluation Metrics and Ranking Methodology

Primary performance metrics in MSD are Dice Similarity Coefficient (DSC) and Normalized Surface Dice (NSD) (Simpson et al., 2019, Antonelli et al., 2021):

  • DSC:

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

where $P$ and $G$ are the predicted and ground-truth voxel sets.
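A direct NumPy translation of this definition:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient between two binary voxel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + eps)
```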

  • NSD:

$$\mathrm{NSD}(P, G; \tau) = \frac{1}{2} \left( \frac{|\{p \in \partial P : d(p, \partial G) \le \tau\}|}{|\partial P|} + \frac{|\{g \in \partial G : d(g, \partial P) \le \tau\}|}{|\partial G|} \right)$$

with $\tau$ set per task (e.g., 1 mm), quantifying surface agreement.
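A voxel-based sketch of this formula using SciPy distance transforms (the official evaluation measures sub-voxel surface elements, so values will differ slightly; this version also assumes non-empty surfaces):

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def nsd(pred, gt, spacing, tau=1.0):
    """Normalized Surface Dice at tolerance tau (mm), per the formula above.

    Boundaries are approximated as mask-minus-erosion; `spacing` is the
    per-axis voxel spacing in mm so anisotropic volumes are handled.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    pred_surf = pred & ~binary_erosion(pred)   # boundary of P
    gt_surf = gt & ~binary_erosion(gt)         # boundary of G

    # Distance (mm) from every voxel to the nearest surface voxel of the
    # other mask.
    dist_to_gt = distance_transform_edt(~gt_surf, sampling=spacing)
    dist_to_pred = distance_transform_edt(~pred_surf, sampling=spacing)

    frac_pred = (dist_to_gt[pred_surf] <= tau).mean()  # share of ∂P near ∂G
    frac_gt = (dist_to_pred[gt_surf] <= tau).mean()    # share of ∂G near ∂P
    return 0.5 * (frac_pred + frac_gt)
```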

Aggregate rankings use significance ranking: for each metric and region of interest, paired Wilcoxon signed-rank tests determine, for every algorithm, how many competitors it significantly outperforms (with no adjustment for multiple comparisons). The decathlon score then averages these significance ranks across tasks and metrics per submission (Antonelli et al., 2021). Ancillary metrics such as the 95th-percentile Hausdorff Distance (HD95) and Average Symmetric Surface Distance (ASD) are reported in many follow-up works.
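A sketch of the per-task significance ranking, assuming a matrix of per-case scores per algorithm; the significance level and one-sided test direction are illustrative assumptions rather than the challenge's exact configuration:

```python
import numpy as np
from scipy.stats import wilcoxon

def significance_ranks(scores, alpha=0.05):
    """Significance rank per algorithm on one task/ROI/metric.

    scores: array of shape (n_algorithms, n_cases) with per-case metric
    values (e.g., Dice). For each algorithm, count how many competitors it
    beats with a one-sided paired Wilcoxon signed-rank test; no correction
    for multiple comparisons, mirroring the MSD protocol.
    """
    n = scores.shape[0]
    ranks = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            _, p = wilcoxon(scores[i], scores[j], alternative="greater")
            if p < alpha:
                ranks[i] += 1
    return ranks  # averaged over tasks/metrics to form the decathlon score

# e.g. ranks = significance_ranks(np.random.rand(5, 30))
```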

4. Algorithmic Approaches

MSD has catalyzed a diverse array of segmentation architectures and universal-learning paradigms:

  • Fully convolutional and U-Net derivatives dominated the initial field, with adaptations including 3D convolutions, residual connections, instance normalization, and deep supervision (Antonelli et al., 2021, Rippel et al., 2020); a minimal sketch of such a block appears after this list.
  • AutoML and Neural Architecture Search (NAS): Automated search for task-specific or universal architectures is exemplified by DiNTS (differentiable 3D topology search), achieving the highest overall mean Dice and NSD on MSD at time of publication, surpassing highly engineered U-Net ensembles (He et al., 2021). Other NAS approaches include V-NAS (2D/3D/P3D cell search) (Zhu et al., 2019) and HyperSegNAS (one-shot supernet with meta-conditional reweighting) (Peng et al., 2021).
  • Transformer-based and hybrid models: Vision transformer (ViT), Swin Transformer, and hybrid CNN-Transformer encoders demonstrate clear gains over purely convolutional counterparts in global context capture and multi-organ performance. Swin UNETR and CATS-v2 outperform earlier state-of-the-art on multiple MSD tasks, particularly on challenging multi-organ CT cases (Tang et al., 2021, Li et al., 2023, Hatamizadeh et al., 2021).
  • Text-driven and foundation models: Recent works leverage CLIP-derived label embeddings and cross-modal controllers to build universal models that segment 25+ organs and multiple tumor types across the full MSD, with computational efficiency and generalizability to external datasets (Liu et al., 2023). The adaptation of vision-language and prompt-tuned image foundation models (SAM2-3dMed, MedSAM) further signals a shift toward scalable weakly and semi-supervised segmentation pipelines (Yang et al., 10 Oct 2025, Häkkinen et al., 30 Sep 2024).
  • Weak supervision: For settings where pixel-wise annotation is scarce, methods combining low-threshold CAMs from multiple backbone classifiers via AND/OR rules match or exceed prior weakly supervised state-of-the-art, outperforming more complex pipelines with explicit pixel-level losses (Ostrowski et al., 2023).
  • Robustness: The MSD also provides a testbed for adversarial sensitivity assessments and robust model development, e.g., the ROG lattice architecture, which, when adversarially trained, maintains segmentation performance under severe input perturbations (Daza et al., 2021).
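As referenced in the first bullet, the following is a minimal PyTorch sketch of the kind of residual, instance-normalized 3D block used by many U-Net derivative entries; it is a generic illustration, not the code of any particular submission:

```python
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Residual 3D conv block with instance normalization, in the style of
    the U-Net derivatives described above."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm1 = nn.InstanceNorm3d(out_ch, affine=True)
        self.conv2 = nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1)
        self.norm2 = nn.InstanceNorm3d(out_ch, affine=True)
        self.act = nn.LeakyReLU(0.01, inplace=True)
        # 1x1x1 projection so the identity path matches the channel count.
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv3d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        identity = self.skip(x)
        out = self.act(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return self.act(out + identity)

# Deep supervision: an auxiliary 1x1x1 segmentation head at each decoder
# resolution, with down-weighted losses summed during training.
aux_head = nn.Conv3d(32, 3, kernel_size=1)  # e.g., 3 foreground classes
```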

5. Generalization, Automation, and Impact

Longitudinal studies confirm the core MSD hypothesis: models displaying robust, cross-task generalization in the decathlon setting consistently transfer to new clinical segmentation problems. The pipeline-automation approach pioneered by nnU-Net—automated data analysis, pipeline configuration, and cross-validated ensembling—set a field-wide baseline and is now standard in clinical benchmark participation (Antonelli et al., 2021). MSD-driven methods, especially nnU-Net and its descendants, have won the majority of large-scale 3D segmentation challenges in the years following MSD's release (Antonelli et al., 2021).

Universal multi-dataset architectures, such as FIRENet (fabric bottleneck with ASPP3D nodes), efficiently learn optimal multi-scale arrangements with no per-task tuning and support simultaneous inference across all ten MSD tasks (Liu et al., 2020). NAS-based models further automate architecture selection and resource allocation, adjusting to task-specific memory and compute constraints (He et al., 2021, Peng et al., 2021).

MSD's open-access dataset and transparent evaluation protocol have standardized large-scale comparative analysis in 3D medical image segmentation and facilitated fair, reproducible cross-paper comparison.

6. Extensions, Limitations, and Future Directions

Subsequent expansions of MSD, such as high-quality, radiologist-validated annotations for colon/colorectal structures (Chernenkiy et al., 31 Jul 2024), as well as multi-cohort generalization studies, have revealed persistent challenges: inter- and intra-patient anatomical variability, thin and complex boundaries (e.g., colon wall, vessels), annotation protocol differences, and class imbalance remain limiting factors for universal algorithmic approaches.

Current research trajectories include foundation-model and prompt-based adaptation (e.g., SAM2-3dMed, MedSAM), text-driven universal segmentation with label embeddings, weakly and semi-supervised learning, and adversarially robust architectures, as surveyed in Section 4.

The dataset's comprehensive annotation, cross-site diversity, and public licensing have made it foundational for the evaluation of universal and robust AI segmentation systems, with ripple effects into deployment, regulatory pathways, and automated clinical workflow integration (Simpson et al., 2019, Antonelli et al., 2021).

7. Representative Results and State-of-the-Art

The following table summarizes top results (mean Dice) per method on the MSD CT tasks (Test set, various years), illustrating ongoing gains in universal model capability:

| Method | Liver | Pancreas | Hepatic Vessel | Spleen | Colon | Lung | Average (CT) |
|---|---|---|---|---|---|---|---|
| nnU-Net | 95.75 | 67.21 | 69.12 | 97.43 | 58.33 | 73.97 | 77.89 |
| DiNTS | 95.35 | 68.19 | 68.13 | 96.98 | 59.21 | 74.75 | 77.93 |
| Swin UNETR | 95.35 | 70.71 | 68.95 | 96.99 | 59.45 | 76.60 | 78.68 |
| CLIP-Driven Universal Model | 95.42 | 72.59 | 71.51 | 97.27 | 63.14 | 80.01 | 80.99 |
| SAM2-3dMed (best single task) | — | 70.39 | — | 97.27 | — | 76.27 | — |

This table aggregates mean Dice from methods reporting per-class results on MSD test data. For detailed per-class, NSD, and HD95 results, see (Antonelli et al., 2021, Tang et al., 2021, He et al., 2021, Liu et al., 2023, Yang et al., 10 Oct 2025).

A plausible implication is that the combination of self-supervision, foundation model adaptation, and label prompt engineering is pushing universal medical segmentation models toward parity with, and often beyond, task-optimized pipelines.

