MIDOG 2025 Challenge
- The MIDOG 2025 Challenge is a multi-track benchmark for robust mitosis detection and classification in histopathological imagery, addressing domain shift.
- Submitted methods employ deep ensembles, transformer models, and domain adaptation techniques to mitigate variability from scanners, staining, and tissue differences.
- Performance metrics such as F1 scores above 0.75 and balanced accuracy up to 0.8871 indicate potential impact on clinical tumor grading and precision oncology.
The MIDOG 2025 Challenge is a multi-track benchmark for robust mitosis detection and classification in histopathological imagery under domain shift. It addresses the critical issue of variability induced by scanner, staining protocol, tissue, and species differences, aiming to establish scanner-agnostic and generalizable algorithms for mitotic figure recognition. The challenge comprises tasks ranging from mitotic event localization to binary subtyping of normal versus atypical mitotic figures, introducing substantial heterogeneity across datasets and demanding solutions resilient to class imbalance, morphological variability, and domain transfer effects.
1. Task Definition and Dataset Complexity
The MIDOG 2025 Challenge consists of several tracks focused on distinct but complementary objectives:
- Track 1: Mitosis localization and counting—identifying coordinates of mitotic figures on whole-slide images.
- Track 2: Subtyping mitotic figures as typical (NMF) or atypical (AMF), pushing the boundary from coarse to fine-grained classification.
- Domain Shift Emphasis: All tracks deliberately include data from multiple scanners, tumor types, and species, and present previously unseen scanner domains for final testing.
Datasets are typically composed of annotated regions, with ground truth labels for mitosis centroids and subtypes. For Track 2, images are cropped to 128×128 pixels at standardized resolutions (e.g., 0.25 μm/pixel) and class ratios may be highly imbalanced—atypical mitoses can constitute ~20% of cases. These conditions induce a significant domain adaptation challenge, exacerbated by inter-scanner and inter-species variability.
2. Algorithmic Frameworks
The challenge has promoted a variety of algorithmic strategies, many of which are multi-stage or hybrid:
- Two-Stage Detection and Refinement: Methods such as the fused detector and deep ensemble (DetectorRS + CNN ensemble) (Liang et al., 2021), hybrid segmentation/classification networks (Efficient-UNet + EfficientNet-B7) (Jahanifar et al., 2021), and cascade RCNNs (Razavi et al., 2021) localize candidates using high-recall detectors, then refine them with dedicated classifiers to improve precision.
- Segmentation-Based and Transformer Models: VM-UNet with Mamba blocks for segmentation and stain augmentation (Percannella et al., 28 Aug 2025), and vision transformer approaches such as DINOv3-H+, fine-tuned with LoRA for efficient parameter adaptation (Balezo et al., 28 Aug 2025).
- Classification Backbones: ConvNeXt V2 and ConvNeXtBase architectures are commonly deployed for binary mitosis subtyping, leveraging pretraining and cross-validation ensemble strategies (Yamagishi et al., 26 Aug 2025, Krauss et al., 28 Aug 2025).
- Knowledge Distillation and Multitask Learning: EMA-teacher distillation combined with attention-based feature alignment and style mixing (Atey et al., 28 Aug 2025), and multitask networks incorporating both classification and auxiliary segmentation heads to improve generalization (Percannella et al., 28 Aug 2025).
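The two-stage detect-then-refine pattern above can be illustrated with a minimal sketch (a toy example, not any specific entry's implementation): a permissive detector threshold preserves recall, and a dedicated second-stage classifier prunes false positives.

```python
import numpy as np

def refine_candidates(candidates, det_scores, clf_probs,
                      det_thresh=0.3, clf_thresh=0.5):
    """Two-stage filtering: a low detector threshold keeps recall high,
    then a dedicated classifier prunes false-positive candidates."""
    candidates = np.asarray(candidates, dtype=float)
    det_scores = np.asarray(det_scores, dtype=float)
    clf_probs = np.asarray(clf_probs, dtype=float)
    keep = (det_scores >= det_thresh) & (clf_probs >= clf_thresh)
    return candidates[keep]

# Toy candidates: (x, y) coordinates in pixels.
cands = [(10, 12), (40, 41), (90, 15)]
dets = [0.9, 0.35, 0.2]   # detector confidences
clfs = [0.8, 0.7, 0.9]    # refinement-classifier probabilities
print(refine_candidates(cands, dets, clfs))
# keeps the first two candidates; the third fails the detector threshold
```

The thresholds are illustrative; in practice both stages are tuned jointly on validation folds.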
The majority of approaches rely on deep ensembles, classifier heads with cross-entropy or focal loss, and test-time inference strategies such as majority voting or averaging predicted probabilities.
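The two common test-time combination strategies, probability averaging and majority voting, can be sketched as follows; `ensemble_predict` and its threshold are illustrative, not taken from any particular submission.

```python
import numpy as np

def ensemble_predict(member_probs, mode="average", thresh=0.5):
    """Combine per-model probabilities for binary mitosis subtyping.

    member_probs: (n_models, n_samples) array of P(atypical).
    Returns hard labels (0/1) per sample.
    """
    p = np.asarray(member_probs, dtype=float)
    if mode == "average":      # soft voting: average probabilities, then threshold
        return (p.mean(axis=0) >= thresh).astype(int)
    elif mode == "majority":   # hard voting: majority of per-member labels
        votes = (p >= thresh).astype(int)
        return (votes.sum(axis=0) > p.shape[0] / 2).astype(int)
    raise ValueError(mode)

probs = [[0.9, 0.4, 0.6],
         [0.8, 0.3, 0.4],
         [0.7, 0.6, 0.4]]
print(ensemble_predict(probs, "average"))   # [1 0 0]
print(ensemble_predict(probs, "majority"))  # [1 0 0]
```

The two modes agree here, but soft voting retains calibration information and often edges out hard voting when member models are correlated.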
3. Domain Adaptation Techniques
Domain adaptation is central due to scanner and lab variability:
- Stain Normalization: Classical algorithms (Macenko, Vahadane) standardize images relative to a reference, mitigating color-dependent domain shift (Liang et al., 2021, Jahanifar et al., 2021).
- GAN-Based Augmentation: CycleGAN, StarGAN, and residual Cycle-GAN models synthesize images in the style of all scanner domains, allowing networks to learn from unannotated domains through data transformation (Roy et al., 2021, Hussain et al., 2021).
- MixStyle and Fourier-Domain Mixing: MixStyle introduces style perturbations in feature maps, while Fourier mixing exchanges low-frequency information to achieve stain invariance (Atey et al., 28 Aug 2025, Aubreville et al., 2022).
- Gradient Reversal Layer in Domain-Adversarial Training: Applied to encourage feature independence from the scanner source (Aubreville et al., 2022).
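The Fourier-domain mixing mentioned above can be sketched on a single channel as follows; the band half-width parameter `beta` is an assumed hyperparameter for illustration. Low-frequency amplitudes (which carry stain/style statistics) are swapped between two images while the phase (which carries structure) is kept.

```python
import numpy as np

def fourier_mix(src, tgt, beta=0.1):
    """Swap the low-frequency amplitude spectrum of `src` with that of
    `tgt`, keeping src's phase. `beta` sets the half-width of the
    swapped central band as a fraction of image size."""
    fs = np.fft.fftshift(np.fft.fft2(src.astype(float)))
    ft = np.fft.fftshift(np.fft.fft2(tgt.astype(float)))
    amp_s, pha_s = np.abs(fs), np.angle(fs)
    amp_t = np.abs(ft)
    h, w = src.shape
    b = int(min(h, w) * beta)
    ch, cw = h // 2, w // 2
    # Replace the central (low-frequency) amplitude block.
    amp_s[ch - b:ch + b + 1, cw - b:cw + b + 1] = \
        amp_t[ch - b:ch + b + 1, cw - b:cw + b + 1]
    mixed = np.fft.ifft2(np.fft.ifftshift(amp_s * np.exp(1j * pha_s)))
    return np.real(mixed)

rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = rng.random((64, 64))
out = fourier_mix(a, b)
print(out.shape)  # (64, 64)
```

In training pipelines this is applied per RGB channel, with `tgt` drawn from a different scanner domain in the same batch.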
Stain augmentation generally involves decomposing images into stain and concentration matrices, perturbing these during training, and reconstructing in RGB via exponential transformations (e.g., (Percannella et al., 28 Aug 2025)).
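This decompose-perturb-reconstruct recipe can be sketched as follows. The H&E stain vectors and jitter ranges below are illustrative placeholders under a Beer-Lambert model, not fitted values from any challenge entry.

```python
import numpy as np

# Fixed H&E stain vectors (rows: hematoxylin, eosin) -- illustrative
# constants in the spirit of Macenko-style decomposition, not fitted ones.
HE = np.array([[0.65, 0.70, 0.29],
               [0.07, 0.99, 0.11]])
HE = HE / np.linalg.norm(HE, axis=1, keepdims=True)

def stain_augment(rgb, sigma1=0.2, sigma2=0.05, rng=None):
    """RGB -> optical density -> stain concentrations -> random jitter
    -> exponential reconstruction back to RGB (Beer-Lambert model)."""
    if rng is None:
        rng = np.random.default_rng()
    flat = rgb.reshape(-1, 3).astype(float)
    od = -np.log((flat + 1.0) / 256.0)             # optical density
    conc = od @ np.linalg.pinv(HE)                  # (n_pixels, 2) concentrations
    alpha = rng.uniform(1 - sigma1, 1 + sigma1, size=2)  # per-stain scale
    beta = rng.uniform(-sigma2, sigma2, size=2)          # per-stain shift
    od_new = (conc * alpha + beta) @ HE
    out = 256.0 * np.exp(-od_new) - 1.0             # exponential transform back
    return np.clip(out, 0, 255).reshape(rgb.shape).astype(np.uint8)

patch = np.full((8, 8, 3), 180, dtype=np.uint8)     # toy flat-colour patch
aug = stain_augment(patch, rng=np.random.default_rng(0))
print(aug.shape, aug.dtype)  # (8, 8, 3) uint8
```

Perturbing concentrations rather than raw RGB keeps augmented colors on physically plausible stain trajectories.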
4. Data Augmentation and Training Protocols
Comprehensive data augmentation is universally applied:
- Spatial and Color Augmentations: Transposition, rotation, scale, shift, color jitter across channels, cutout, gamma correction, and hue rotation (Lafarge et al., 2021).
- Stain-Specific Augmentation: Random alterations of stain matrices to simulate lab variability.
- Ensembling and Cross-Validation: Multiple cross-validation folds, averaging predictions, and majority vote ensembles are standard for robust estimation (Yamagishi et al., 26 Aug 2025).
Loss functions include cross-entropy, focal loss, Jaccard loss (segmentation), and Dice loss (for regularization via auxiliary heads).
Mixed precision training and gradient clipping are adopted for computational efficiency, with batch sizes typically ranging from 16 to 64 and convergence usually reached within 5–50 epochs.
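Among the losses listed above, the focal loss is the standard tool against the class imbalance between normal and atypical mitoses; a minimal binary sketch follows (the parameter defaults are the common choices from the focal-loss literature, not any entry's tuned settings).

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: the (1 - p_t)^gamma factor down-weights easy
    examples so rare atypical mitoses dominate the gradient.
    p = predicted P(class 1), y = true labels in {0, 1}."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    y = np.asarray(y, dtype=float)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    at = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-at * (1 - pt) ** gamma * np.log(pt)))

# A confidently-correct sample contributes far less than a hard one:
print(focal_loss([0.95], [1]))  # tiny loss for an easy positive
print(focal_loss([0.30], [1]))  # much larger loss for a hard positive
```

With gamma = 0 and alpha = 0.5 this reduces (up to scale) to plain cross-entropy, which makes the down-weighting effect easy to ablate.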
5. Performance Metrics and Benchmarking
The primary metric for localization tasks is the F1 score, calculated as

F1 = 2·TP / (2·TP + FP + FN),

where a prediction counts as a true positive if it falls within a fixed spatial proximity of a ground-truth annotation (e.g., <7.5 μm Euclidean distance). For classification tasks, balanced accuracy (BA) and ROC AUC are emphasized, with

BA = (TPR + TNR) / 2,

where TPR and TNR correspond to sensitivity and specificity.
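Both metrics can be sketched in a few lines, assuming 0.25 μm/pixel resolution and greedy nearest-neighbour matching of predictions to ground-truth centroids (the official evaluation may match differently).

```python
import numpy as np

def localization_f1(pred, gt, mpp=0.25, radius_um=7.5):
    """Greedily match predicted to ground-truth centroids within
    radius_um (7.5 um -> 30 px at 0.25 um/pixel), then
    F1 = 2*TP / (2*TP + FP + FN)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    r_px = radius_um / mpp
    unmatched = list(range(len(gt)))
    tp = 0
    for p in pred:
        if not unmatched:
            break
        d = np.linalg.norm(gt[unmatched] - p, axis=1)
        j = int(np.argmin(d))
        if d[j] <= r_px:
            tp += 1
            unmatched.pop(j)      # each GT matches at most one prediction
    fp = len(pred) - tp
    fn = len(gt) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

def balanced_accuracy(y_true, y_pred):
    """BA = (TPR + TNR) / 2, i.e. the mean of sensitivity and specificity."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)
    tnr = np.mean(y_pred[y_true == 0] == 0)
    return (tpr + tnr) / 2

gt = [(100, 100), (500, 500)]
pred = [(110, 95), (900, 900)]   # first within 30 px of a GT, second not
print(localization_f1(pred, gt))                       # 0.5
print(balanced_accuracy([1, 1, 0, 0], [1, 0, 0, 0]))   # 0.75
```

Note that BA, unlike plain accuracy, is unaffected by the ~80/20 class ratio of Track 2, which is why it is the headline classification metric.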
Headline results across MIDOG 2025 and precursor challenges are summarized as follows:
Model / Paper | Metric | Challenge/Test Set | Value |
---|---|---|---|
DetectorRS + Deep Ensemble (Liang et al., 2021) | F1 score | MIDOG preliminary | 0.7550 |
Efficient-UNet + EfficientNet-B7 (Jahanifar et al., 2021) | F1 score | MIDOG preliminary | 0.765 |
Cascade RCNN (Razavi et al., 2021) | F1 score | MIDOG preliminary | 0.7492 |
Mask-RCNN + Cycle-GAN (Roy et al., 2021) | F1 score | MIDOG preliminary | 0.7578 |
VM-UNet + Mamba + Stain Aug (Percannella et al., 28 Aug 2025) | F1 score | MIDOG++/Prelim | 0.754 |
ConvNeXt V2 Ensemble (Yamagishi et al., 26 Aug 2025) | BA (cross-val) | MIDOG25 Track 2 | 0.8314 |
DINOv3-H+ + LoRA (Balezo et al., 28 Aug 2025) | BA | MIDOG25 preliminary | 0.8871 |
MixStyle + CBAM + Distil (Atey et al., 28 Aug 2025) | BA | MIDOG25 preliminary | 0.8762 |
Deep Ensemble + RBR (Krauss et al., 28 Aug 2025) | BA | MIDOG25 preliminary | 0.8402 |
Multi-Task Learning (Percannella et al., 28 Aug 2025) | BA | MIDOG25 preliminary | 0.856 |
Approaches achieving F1 scores above 0.75 were considered expert-level for localization; BA ≈ 0.88 for classification demonstrated strong potential for clinical translation.
6. Current Limitations and Future Directions
Despite substantial progress, several limitations persist:
- Performance drops on unseen domains remain evident, especially for new scanners and tissue types absent from training data.
- Stain normalization and elaborate augmentation can sometimes fail to close the generalization gap, necessitating further advances in adversarial domain adaptation, meta-learning, and more sophisticated stain modeling.
- Rule-based refinement modules can improve specificity but may reduce sensitivity and overall balanced accuracy, indicating that integration of domain knowledge must be approached cautiously (Krauss et al., 28 Aug 2025).
- Scarcity of atypical mitoses and pronounced class imbalance require continual refinement of loss functions, sampling methods, and evaluation protocols.
- Emerging directions include end-to-end joint training of detector and classifier, further leveraging foundation models pretrained on massive datasets (e.g., DINOv3, ConvNeXt), and expanded self-/semi-supervised learning schemes to exploit unannotated sources (Balezo et al., 28 Aug 2025).
7. Clinical and Research Implications
The MIDOG 2025 challenge establishes rigorous benchmarks and open datasets for scanner-agnostic and domain-robust mitosis detection, with direct clinical relevance in tumor grading and prognostication. It has catalyzed adoption of multiple innovative approaches from multitask learning and knowledge distillation to advanced stain and style perturbation methods. The competitive performance of foundation models and Mamba-based segmentation architectures suggests that future algorithms will increasingly synergize large-scale pretraining, domain-aware adaptation, and efficient fine-tuning for reliable histopathological event recognition.
A plausible implication is that, as domain generalization techniques mature, automated mitosis assessment will become increasingly standardized across institutions and acquisition modalities, reducing inter-observer variability and supporting precision oncology. The continued development of modular, multi-task learning frameworks and parameter-efficient adaptation methods is poised to further improve diagnostic performance and scalability in real-world clinical environments.