MedMNIST Benchmark Suite
- MedMNIST is a multi-modal benchmark suite composed of standardized 2D and 3D biomedical image datasets that enables rapid prototyping and reproducible research.
- The suite includes datasets spanning multiple imaging modalities and task types, including binary, multi-class, multi-label, and ordinal-regression classification.
- Its uniform data curation, extensive benchmarking protocols, and open-source accessibility foster innovations in AutoML, robustness assessment, and clinical applications.
MedMNIST is a standardized, multi-modal benchmark suite of lightweight biomedical image datasets designed for rapid prototyping, educational use, systematic benchmarking of automated machine learning (AutoML), and the study of methodological advances in medical image analysis. Initiated with the aim of providing accessible, highly reproducible, and computationally inexpensive resources, MedMNIST has evolved through multiple versions to encompass an expanding array of medical imaging tasks, modalities, data scales, and technical challenges relevant to both the research and clinical communities.
1. Dataset Structure and Curation
MedMNIST comprises collections of 2D and 3D biomedical image datasets, each split into official train, validation, and test partitions to facilitate reproducibility and fair comparison. The initial release focused on ten pre-processed datasets resized to 28×28 pixels (Yang et al., 2020). MedMNIST v2 expanded to twelve 2D sets (totaling >700,000 images) and six 3D sets (∼10,000 volumes), all standardized to either 28×28 (2D) or 28×28×28 (3D) and packaged as NumPy arrays (Yang et al., 2021). Cubic spline interpolation and RGB conversion (for legacy grayscale sources) ensured uniformity in spatial and channel dimensions.
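A minimal NumPy sketch of this packaging convention may make the layout concrete: each dataset is a single .npz archive with per-split image and label arrays, 2D images stored as (N, 28, 28, C) uint8 tensors. The split key names and shapes below mirror the published convention; the tiny random arrays are placeholders, not real data.

```python
import io
import numpy as np

# Simulate the MedMNIST .npz layout: official train/val/test splits,
# images as (N, 28, 28, C) uint8, labels as (N, 1) class indices.
rng = np.random.default_rng(0)

sizes = {"train": 90, "val": 10, "test": 7}  # scaled-down split sizes

archive = {}
for split, n in sizes.items():
    archive[f"{split}_images"] = rng.integers(
        0, 256, size=(n, 28, 28, 3), dtype=np.uint8
    )
    # Nine classes, PathMNIST-style.
    archive[f"{split}_labels"] = rng.integers(0, 9, size=(n, 1), dtype=np.uint8)

buf = io.BytesIO()
np.savez_compressed(buf, **archive)  # same on-disk format as a real .npz file
buf.seek(0)
data = np.load(buf)

print(sorted(data.files))
print(data["train_images"].shape, data["train_labels"].shape)
```

Loading any split is then a single `np.load` plus a key lookup, which is what keeps the barrier to experimentation so low.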
The datasets span major radiological and histopathological imaging modalities:
Dataset Name | Modality/Anatomy | Task Type(s)
---|---|---
PathMNIST | Colon pathology (microscopy) | Multi-class
ChestMNIST | Chest X-ray | Binary & multi-label
PneumoniaMNIST | Chest X-ray | Binary
DermaMNIST | Dermatoscopy | Multi-class
OCTMNIST | Retinal OCT | Multi-class
RetinaMNIST | Retinal fundus photography | Ordinal regression
BloodMNIST | Blood smear (microscopy) | Multi-class
BreastMNIST | Breast ultrasound | Binary
OrganMNIST_(A/C/S) | Abdominal CT (axial/coronal/sagittal) | Multi-class
TissueMNIST | Kidney cortex (microscopy) | Multi-class
Sample sizes vary widely: PathMNIST has ∼90,000 training samples, while BreastMNIST has only a few hundred. This scale heterogeneity supports prototyping and evaluation across both low- and high-data regimes (Yang et al., 2020, Yang et al., 2021).
2. Task Diversity and Output Spaces
MedMNIST supports a range of supervised classification paradigms, reflecting diverse clinical problems (Yang et al., 2020, Yang et al., 2021):
- Binary Classification: e.g., PneumoniaMNIST (pneumonia vs. normal), BreastMNIST (malignant vs. normal/benign).
- Multi-Class Classification: e.g., PathMNIST (9 classes), DermaMNIST (7), OrganMNIST (11 per plane).
- Ordinal Regression: e.g., RetinaMNIST (5-level diabetic retinopathy scale).
- Multi-Label Classification: e.g., ChestMNIST (14 potential findings).
This diversity requires algorithms to handle distinct loss functions and output structures, and enables benchmarking across realistic clinical prediction scenarios.
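The four output spaces can be made concrete with toy label arrays and, for the multi-class case, a softmax cross-entropy computed by hand. All values below are illustrative and not drawn from the real datasets; only the shapes and encodings follow the task definitions above.

```python
import numpy as np

# Binary (PneumoniaMNIST-style): one 0/1 label per image.
y_binary = np.array([0, 1, 1, 0])

# Multi-class (PathMNIST-style, 9 classes): one integer class index per image,
# trained with softmax cross-entropy.
y_multiclass = np.array([3, 0, 8, 5])

# Ordinal regression (RetinaMNIST-style, 5 ordered grades): integer grades
# where the distance between labels is meaningful.
y_ordinal = np.array([0, 2, 4, 1])

# Multi-label (ChestMNIST-style, 14 findings): a binary vector per image,
# trained with an independent sigmoid + BCE per finding.
y_multilabel = np.zeros((4, 14), dtype=np.int64)
y_multilabel[0, [1, 5]] = 1  # image 0 shows findings 1 and 5
y_multilabel[2, 13] = 1      # image 2 shows finding 13

# Softmax cross-entropy for one multi-class logit vector:
logits = np.array([0.1, 2.0, -1.0, 0.0, 0.3, 0.2, 0.0, 0.0, 0.5])
log_probs = logits - np.log(np.exp(logits).sum())
loss = -log_probs[y_multiclass[0]]
print(float(loss))
```

The key practical difference is the loss pairing: a single softmax over mutually exclusive classes versus independent per-label sigmoids when findings can co-occur, as in ChestMNIST.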
3. Evolution and Extensions: MedMNIST+, MedMNIST-C, and Related Benchmarks
MedMNIST+ extends MedMNIST v2, providing the same core datasets at four resolutions (28×28, 64×64, 128×128, 224×224), exposing models’ sensitivity to input quality and supporting high-resolution research (Doerrich et al., 24 Apr 2024). MedMNIST-C adapts ImageNet-C-style corruption robustness evaluation to medical images with domain- and modality-specific corruptions (five main corruption families, five severity levels), measuring resilience to digital artifacts, noise, blur, color variation, and clinically encountered acquisition errors (Salvo et al., 25 Jun 2024).
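A minimal sketch of a MedMNIST-C-style corruption illustrates the severity mechanism: a perturbation whose strength grows with a 1-5 level. The sigma schedule below is a made-up stand-in, not the published one, and Gaussian noise is just one of the corruption families.

```python
import numpy as np

# Illustrative severity schedule (not the published MedMNIST-C values).
SIGMAS = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}

def gaussian_noise(image: np.ndarray, severity: int, seed: int = 0) -> np.ndarray:
    """Corrupt a float image in [0, 1] with severity-scaled additive noise."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, SIGMAS[severity], size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep the result a valid image

clean = np.full((28, 28), 0.5)
for s in (1, 3, 5):
    corrupted = gaussian_noise(clean, severity=s)
    print(s, float(np.abs(corrupted - clean).mean()))
```

Evaluating a fixed model across such severity levels, corruption by corruption, is what yields the resilience profiles that MedMNIST-C reports.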
This extensible ecosystem allows for standardized comparison, transparency, and reproducibility across modalities, resolutions, and perturbation regimes.
4. Benchmarking Protocols, Baseline Models, and Evaluation Metrics
Extensive benchmarking protocols define recommended splits, augmentations, and reporting metrics (Yang et al., 2020, Yang et al., 2021, Doerrich et al., 24 Apr 2024, Wu et al., 24 Jan 2025). State-of-the-art CNNs (e.g., ResNet-18/50, VGG16, EfficientNet), Vision Transformers (ViT and its derivatives), AutoML pipelines (auto-sklearn, AutoKeras, Google AutoML Vision), and even quantum circuits have been evaluated.
Performance is typically assessed with:
- Accuracy (ACC): the fraction of correctly classified samples, averaged over labels for multi-label tasks.
- Area Under the ROC Curve (AUC): a threshold-free measure of ranking quality that is robust to class imbalance.
- Balanced error and relative corruption error: robustness metrics used by MedMNIST-C to summarize degradation across corruption types and severity levels.
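The two headline metrics are straightforward to implement for the binary case. The sketch below computes ACC from thresholded scores and AUC via the Mann-Whitney rank formulation, which equals the area under the ROC curve (ties counted as half-correct); this is a from-scratch illustration, not the official evaluation script.

```python
import numpy as np

def accuracy(y_true: np.ndarray, scores: np.ndarray, threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    return float(((scores >= threshold).astype(int) == y_true).mean())

def auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Probability a random positive outscores a random negative."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

y = np.array([0, 0, 1, 1, 1])
s = np.array([0.1, 0.6, 0.4, 0.8, 0.9])
print(accuracy(y, s))  # one negative above threshold, one positive below
print(auc(y, s))
```

Reporting both matters: ACC depends on the chosen threshold and class balance, while AUC summarizes ranking quality across all thresholds.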
Carefully controlled ablation studies, linear probing, and few-shot learning protocols enable unbiased assessment of model and modality interactions (Doerrich et al., 24 Apr 2024, Wu et al., 24 Jan 2025).
5. Applications in Methodological and Applied Research
MedMNIST has been employed for:
- AutoML benchmarking: Enabling uniform comparison of black-box and open-source AutoML pipelines across diverse modalities and task types (Yang et al., 2020, Yang et al., 2021).
- Label-efficient learning: Demonstrating improvements through advanced deep supervision and multi-level attention (e.g., LSANet) under scarce annotation regimes (Jiang et al., 2022).
- Adversarial robustness and anomaly detection: Providing realistic medical benchmarks for adversarial example detection, with geometry-based metrics (density, coverage), dataset copyright technologies (DataCook), and post-hoc OOD detection strategies (Venkatesh et al., 2022, Shang et al., 26 Mar 2024, Lotfi et al., 17 Feb 2025).
- Foundation and self-supervised models: Evaluating transfer learning and pre-training strategies (ViTs, DINO, PMC-CLIP, SimCLR, BYOL) for medical imaging robustness, cross-domain generalizability, and multi-domain representation quality (Wu et al., 24 Jan 2025, Bundele et al., 26 Dec 2024, Lin et al., 2023).
- Quantum machine learning and generative modeling: Establishing QML baselines for low-dimensional, resource-constrained tasks; validating quantum-classical hybrid architectures for data synthesis and augmentation (Singh et al., 18 Feb 2025, Chen et al., 30 Mar 2025).
- Optimal transport and manifold learning: Advancing distance-based embedding (e.g., Hellinger–Kantorovich UOT, Nyström-approximated Wasserstein matrices) for improved dimensionality reduction and clustering on medical data (Rana et al., 23 Sep 2025).
A notable property is the consistent use of simple, lightweight image formats, which reduces computational barriers, accelerates experimentation, and enables direct evaluation of model and algorithmic improvements without confounding from data preprocessing and scaling.
6. Accessibility, Licensing, and Community Adoption
MedMNIST and its extensions are freely available under Creative Commons licenses. All relevant codebases, APIs, evaluation scripts, and baselines are hosted on public repositories (https://medmnist.com/, https://medmnist.github.io/). The commitment to open-source distribution and rigorous dataset documentation has catalyzed widespread adoption across biomedical imaging, machine learning, AutoML, and quantum computing communities (Yang et al., 2020, Yang et al., 2021, Doerrich et al., 24 Apr 2024).
All users are required to cite both MedMNIST v1 and v2 as well as the original source datasets.
7. Impact and Future Prospects
MedMNIST has shaped the landscape of medical image analysis benchmarks by:
- Accelerating rapid prototyping and reproducible research, particularly for resource-constrained applications and educational settings.
- Enabling systematic investigation of robustness, generalization, and algorithmic design on clinically diverse, large-scale datasets—without prohibitive compute requirements.
- Serving as a practical substrate for methodological innovation in AutoML, self-supervised learning, quantum ML, generative modeling, optimal transport, and robust out-of-distribution detection.
A plausible implication is that MedMNIST—and its continuing evolution—will remain a central reference in biomedical AI for the foreseeable future, particularly as further extensions integrate larger, more heterogeneous datasets, higher-resolution images, synthetic data, and curated corruptions reflecting clinical realities. The standardized evaluation framework, with controlled splits, metrics, and data modalities, is likely to inform future regulatory and translational benchmarks as AI transitions more deeply into clinical workflows.