MedBank-100k: Multi-Task Segmentation Dataset
- MedBank-100k is a comprehensive dataset offering 122,594 image-mask pairs across 7 medical imaging modalities for multi-task segmentation research.
- It employs standardized preprocessing—including frame filtering and aspect-ratio checks—to harmonize heterogeneous annotation formats for robust evaluation.
- Benchmarking reveals SAMed-2 achieving superior Dice scores compared to baselines, underscoring the dataset’s value for reproducible, multi-domain segmentation studies.
MedBank-100k is a comprehensive, large-scale medical image segmentation dataset curated for benchmarking and training multi-task foundation models. Comprising 122,594 frame–mask pairs sourced from publicly released datasets, it covers seven principal medical imaging modalities and includes 21 distinct segmentation tasks. Designed to support robust evaluation of segmentation architectures such as SAMed-2, MedBank-100k addresses the complexity of heterogeneous sources and annotation formats through standardized preprocessing, enabling large-scale, multi-domain segmentation research (Yan et al., 4 Jul 2025).
1. Composition and Modalities
MedBank-100k consists of 122,594 images, each paired with a corresponding segmentation mask. The dataset amalgamates data from a diverse range of public medical segmentation corpora, though the number of unique patients and 3D volumes is not specified. The modalities and task distribution are as follows:
| Modality | # Tasks | # Images |
|---|---|---|
| Fundus | 1 | 559 |
| Dermoscopy | 1 | 2,621 |
| X-Ray | 1 | 23,822 |
| CT | 10 | 34,521 |
| MR | 6 | 19,522 |
| Colonoscopy | 1 | 3,838 |
| Echocardiography | 1 | 1,800 |
| Others | – | 35,911 |
The relative proportion of each modality $m$ is defined as $p_m = N_m / N$, where $N_m$ is the image count for that modality and $N = 122{,}594$ is the total; e.g., $p_{\text{CT}} = 34{,}521 / 122{,}594 \approx 28.2\%$ and $p_{\text{X-Ray}} = 23{,}822 / 122{,}594 \approx 19.4\%$.
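The per-modality proportions follow directly from the counts in the composition table; a minimal sketch of the computation (counts taken verbatim from the table above):

```python
# Per-modality image counts from the MedBank-100k composition table.
counts = {
    "Fundus": 559, "Dermoscopy": 2621, "X-Ray": 23822, "CT": 34521,
    "MR": 19522, "Colonoscopy": 3838, "Echocardiography": 1800,
    "Others": 35911,
}

total = sum(counts.values())  # 122,594 images in total
proportions = {m: n / total for m, n in counts.items()}  # p_m = N_m / N
```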
The dataset covers a broad range of anatomical structures and lesion types, with multi-class masks separated by class during preprocessing, resulting in binarized masks per channel. However, no explicit class counts per task or class distributions are provided.
2. Data Partitioning and Splits
The dataset is divided at the image level into 90% training and 10% test splits:
- Training set: $0.9 \times 122{,}594 \approx 110{,}334$ images
- Test set: $122{,}594 - 110{,}334 = 12{,}260$ images
No separate validation set is mentioned, nor are k-fold cross-validation partitions present. External zero-shot evaluations are performed on 10 distinct datasets, inheriting their respective splits rather than those of MedBank-100k itself.
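An image-level 90/10 partition of this kind can be sketched as follows; the seeded shuffle is an assumption for reproducibility, not a documented detail of the dataset:

```python
import random

def split_images(image_ids, train_frac=0.9, seed=0):
    """Shuffle image IDs with a fixed seed and split them at the image level."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]

train_ids, test_ids = split_images(range(122_594))
```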
3. Annotation Sources and Protocols
All segmentation masks originate from existing public datasets (e.g., MS-Decathlon, ISIC, Drishti-GS), with no new manual annotation conducted for MedBank-100k. The paper does not specify annotation guidelines, detailed protocols, or measures of inter-observer agreement such as Cohen’s $\kappa$. There is no harmonization of annotation standards across tasks and modalities, and no reported quality control procedures.
A plausible implication is that heterogeneity in annotation may introduce variable mask quality and class definitions across modalities and tasks.
4. Preprocessing and Data Standardization
To manage the heterogeneity of input sources, MedBank-100k undergoes four documented preprocessing procedures:
- Video data: Drop frames where the segmentation mask sums to zero (i.e., no object labeled).
- 2D images: Randomly shuffled; temporal or volumetric order is preserved for video sequences and 3D slice stacks.
- Aspect-ratio filter: Remove images where the shorter edge is less than half the length of the longer edge, mitigating distortion from resizing.
- Multi-class separation: Any mask with multiple classes is split into one binary mask per class channel.
Details concerning intensity normalization protocols, voxel-wise standardization, histogram equalization, or additional noise-handling procedures are not specified beyond reference to “standardized and normalized” images.
The only explicit exclusion criteria are the zero-label rule and aspect-ratio filtering.
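Three of the four documented steps (zero-label filtering, aspect-ratio filtering, and multi-class separation) can be sketched as simple predicates and transforms; this is an illustrative reconstruction, not the repository's actual implementation:

```python
import numpy as np

def keep_frame(mask):
    """Zero-label rule: drop frames whose mask labels no pixels at all."""
    return mask.sum() > 0

def keep_aspect(height, width):
    """Aspect-ratio filter: the shorter edge must be at least half the longer edge."""
    return min(height, width) >= max(height, width) / 2

def split_classes(mask):
    """Separate a multi-class mask into one binary mask per class channel
    (background label 0 is excluded)."""
    classes = [c for c in np.unique(mask) if c != 0]
    return {int(c): (mask == c).astype(np.uint8) for c in classes}
```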
5. Benchmarking and Evaluation Framework
MedBank-100k primarily serves as a benchmarking substrate for segmentation architectures, including the SAMed-2 selective memory model. The Dice Similarity Coefficient (DSC) is used as the sole metric for quantitative evaluation across all internal and external benchmarks.
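For reference, the DSC between two binary masks is $2|A \cap B| / (|A| + |B|)$; a minimal sketch (the epsilon smoothing term is a common convention, assumed here rather than taken from the paper):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice Similarity Coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```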
Average DSC is reported both across all 21 internal tasks and on the 10 external zero-shot tasks. The external zero-shot evaluation yields the following averages:
| Model | Avg DSC |
|---|---|
| SAMed-2 | 0.6938 |
| MedSAM-2 | 0.5796 |
| SAM2 | 0.4375 |
| MedSAM | 0.6277 |
| SAM | 0.5958 |
| U-Net | 0.6879 |
These results indicate superior zero-shot multi-task performance for SAMed-2 (0.6938) relative to all compared baselines, with U-Net (0.6879) the closest competitor.
6. Code Availability and Implementation
All scripts and pre-trained checkpoints for MedBank-100k, as well as for training and inference using SAMed-2, are publicly available at https://github.com/ZhilingYan/Medical-SAM-Bench. The repository includes utilities for downloading the dataset, implementing the four-step preprocessing workflow, executing split partitioning, and integrating MedBank-100k into PyTorch pipelines.
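Integration into a PyTorch pipeline amounts to exposing image/mask pairs through the `__len__`/`__getitem__` protocol that `torch.utils.data.DataLoader` expects. The sketch below assumes a hypothetical `images/`–`masks/` directory layout; the class name and layout are illustrative, not taken from the repository:

```python
from pathlib import Path

class MedBankPairs:
    """Minimal image/mask pair index implementing the __len__/__getitem__
    protocol consumed by torch.utils.data.DataLoader.

    Assumes a hypothetical layout: root/images/<id>.png, root/masks/<id>.png.
    """

    def __init__(self, root, ids):
        self.root = Path(root)
        self.ids = list(ids)

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        name = f"{self.ids[i]}.png"
        # Return paths here; a real pipeline would load and transform the arrays.
        return self.root / "images" / name, self.root / "masks" / name
```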
A plausible implication is that MedBank-100k enables reproducible research and extensible benchmarking across a range of segmentation models and modalities.
7. Significance and Limitations
MedBank-100k provides a scale and diversity of medical segmentation data well-suited for multi-modal, multi-task learning, addressing major challenges in continual learning and noisy annotation environments. Its assembly facilitates comparison across foundation models and classic CNNs for medical image segmentation.
However, limitations include absent reporting of patient-level statistics, lack of annotation harmonization, missing per-task class counts, and unspecified intensity normalization protocols. These gaps may affect generalizability and consistency but are partly offset by the extensive public codebase and clear benchmarking methodology (Yan et al., 4 Jul 2025).
MedBank-100k thus represents a significant resource for multi-domain segmentation research, with potential for extension and systematic analysis contingent upon future improvements in annotation and statistical reporting.