Gland-Level Dice Score

Updated 20 November 2025

Gland-level Dice score is a metric that assesses instance segmentation accuracy of glandular structures by matching each predicted gland to its corresponding ground truth.
It employs area-based weighting and symmetric aggregation to penalize false positives, false negatives, and errors from merged or split gland instances.
This metric is crucial in clinical imaging, offering robust and interpretable evaluations for histopathological and radiological segmentation models.

Gland-level Dice score, also referred to as object-level Dice, is a quantitative metric for evaluating instance-level segmentation of glandular structures in biomedical images. It is used where detecting and delineating individual glands are essential, such as colorectal, prostate, parotid, and pituitary gland segmentation in histopathology and radiology. Unlike the classical pixel-level Dice coefficient, which treats the entire mask as a single foreground class, gland-level Dice scores separate gland instances: each predicted gland is matched to and scored against its best corresponding ground-truth gland, and vice versa. This two-way matching and aggregation is what gives the metric its discriminative power in instance segmentation tasks.

1. Mathematical Definition and Aggregation

The gland-level Dice score is an extension of the pixel-level Dice coefficient. For sets of pixels $G$ (ground-truth gland) and $S$ (segmented gland), the classical Dice is defined as:

$\mathrm{Dice}(G, S) = \frac{2 |G \cap S|}{|G| + |S|}$
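
The classical definition above can be sketched directly on pixel sets. This is a minimal illustration, assuming each gland is represented as a Python set of (row, column) coordinates; the function name `dice` and the convention of returning 0.0 when both sets are empty are choices made for this sketch, not part of the cited definition.

```python
def dice(g, s):
    """Dice coefficient between two pixel sets G and S."""
    g, s = set(g), set(s)
    total = len(g) + len(s)
    if total == 0:
        return 0.0  # convention assumed for this sketch only
    return 2.0 * len(g & s) / total


# Two 2x2 squares shifted by one column: 2 shared pixels, 4 + 4 pixels in total.
G = {(0, 0), (0, 1), (1, 0), (1, 1)}
S = {(0, 1), (0, 2), (1, 1), (1, 2)}
print(dice(G, S))  # 0.5
```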

For a collection of $n_G$ ground-truth glands $\{G_1, \ldots, G_{n_G}\}$ and $n_S$ segmented objects $\{S_1, \ldots, S_{n_S}\}$, the computation proceeds as follows (Sirinukunwattana et al., 2016, Chen et al., 2016):

Matching: Each segmented gland $S_j$ is matched to the ground-truth gland $G_i$ with which it has maximal pixel overlap, and vice versa.
Weighted Aggregation: For each $G_i$, compute the Dice score with its best matching $S^*(G_i)$, weighted by the fraction of the total ground-truth area $|G_i|/\sum_{k=1}^{n_G}|G_k|$. Similarly, for each $S_j$, compute Dice versus its best matching $G^*(S_j)$, weighted by $|S_j|/\sum_{l=1}^{n_S} |S_l|$.
Symmetrization and Averaging: The final gland-level Dice is given by

$\mathrm{Dice}_{\text{obj}} = \frac{1}{2}\left[\sum_{i=1}^{n_G} \gamma_i\,\mathrm{Dice}(G_i, S^*(G_i)) + \sum_{j=1}^{n_S} \sigma_j\,\mathrm{Dice}(G^*(S_j), S_j) \right]$

where $\gamma_i = |G_i| / \sum_{k=1}^{n_G} |G_k|$ and $\sigma_j = |S_j| / \sum_{l=1}^{n_S} |S_l|$.

This symmetric formulation penalizes both false negatives (missed glands) and false positives (spurious glands), and rewards accurate, size-proportionate matches. Unmatched glands (zero overlap) contribute zero to their respective sum while keeping their weight, so missed or spurious detections lower the score in proportion to their area. No hard overlap threshold is imposed: partial overlaps contribute proportionally to the final score, and large or complete misses are penalized accordingly.
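
The three steps (matching, weighted aggregation, symmetrization) can be sketched as follows. This is an illustrative implementation under stated assumptions, not the challenge's reference code: glands are Python sets of (row, column) pixels, the helper names `best_match`, `one_sided`, and `object_dice` are hypothetical, and the score is taken as 0.0 when a side has no pixels.

```python
def dice(g, s):
    total = len(g) + len(s)
    return 2.0 * len(g & s) / total if total else 0.0


def best_match(obj, candidates):
    """Candidate with maximal pixel overlap; an empty set if nothing overlaps."""
    best, best_overlap = set(), 0
    for cand in candidates:
        overlap = len(obj & cand)
        if overlap > best_overlap:
            best, best_overlap = cand, overlap
    return best


def one_sided(objects, candidates):
    """Area-weighted sum of Dice(object, best match) over `objects`."""
    total_area = sum(len(o) for o in objects)
    if total_area == 0:
        return 0.0  # convention assumed for this sketch
    return sum(
        len(o) / total_area * dice(o, best_match(o, candidates))
        for o in objects
    )


def object_dice(gt_glands, seg_glands):
    """Mean of the ground-truth-side and prediction-side weighted sums."""
    return 0.5 * (one_sided(gt_glands, seg_glands) + one_sided(seg_glands, gt_glands))


# One gland segmented perfectly, a second (smaller) gland missed entirely.
G1 = {(0, c) for c in range(4)}
G2 = {(2, c) for c in range(2)}
S1 = set(G1)
print(round(object_dice([G1, G2], [S1]), 4))  # 0.8333
```

In this example the missed gland keeps its weight of 2/6 on the ground-truth side while contributing a Dice of zero, so the ground-truth sum drops to 4/6 even though the only prediction is perfect.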

2. Implementation Protocols and Instance Matching

The computation of gland-level Dice involves specific post-processing and matching steps, optimized for histological and radiological images (Sirinukunwattana et al., 2016, Chen et al., 2016, Wang et al., 2021, Xie et al., 2020, Graham et al., 2018):

Post-processing: Segmentation maps are typically thresholded to obtain binary masks. Morphological clean-up (e.g., hole filling, small-object removal) is often applied.
Connected-Component Labeling: Binary masks are partitioned into discrete segmented objects using connectivity analysis (usually 4- or 8-connectivity in 2D, or their 3D equivalents).
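Instance Matching: Each ground-truth instance is matched to the predicted object with maximal overlap (intersection cardinality or IoU), and each predicted object is matched to a ground-truth instance in the same way. This basic rule is greedy and not strictly one-to-one, since several objects can select the same counterpart; some studies (e.g., Wang et al., 2021) employ the Hungarian algorithm to avoid duplicated assignments in ambiguous cases.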
Evaluation: Scores are computed per image, then typically averaged over all test images for reporting. In multi-gland images, per-image object Dice may be further summarized as dataset means, sometimes with standard deviations or confidence intervals.
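
The thresholding and connected-component steps can be sketched in plain Python as below. This is a minimal illustration assuming a 2D probability map stored as a list of rows; the names `threshold` and `label_components` and the 0.5 cut-off are assumptions of this sketch. In practice, library routines such as `scipy.ndimage.label` or `skimage.measure.label` are the usual choice.

```python
from collections import deque


def threshold(prob_map, t=0.5):
    """Binarize a 2D probability map (list of rows); t is an assumed cut-off."""
    return [[1 if p >= t else 0 for p in row] for row in prob_map]


def label_components(mask, connectivity=4):
    """Split a binary 2D mask into pixel sets, one set per connected component."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            component, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:
                pr, pc = queue.popleft()
                component.add((pr, pc))
                for dr, dc in offsets:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            components.append(component)
    return components


prob = [
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.6],
]
mask = threshold(prob)
print(len(label_components(mask, 4)), len(label_components(mask, 8)))  # 3 1
```

The same mask yields three objects under 4-connectivity and one under 8-connectivity because the foreground pixels touch only diagonally, which shows why the connectivity choice should be reported alongside the score.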

Unmatched objects (e.g., false positives with no counterpart or false negatives) have Dice=0, ensuring these critical errors are reflected in the aggregated metric.
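
The difference between maximal-overlap matching and a strictly one-to-one assignment can be sketched on a small overlap matrix. The exhaustive search below is a stand-in for the Hungarian algorithm that is practical only for a handful of objects (a real pipeline would more likely use a routine such as `scipy.optimize.linear_sum_assignment`); the function names and the overlap counts are hypothetical.

```python
from itertools import permutations


def greedy_matches(overlaps):
    """For each ground-truth row, the column with maximal overlap (None if no overlap)."""
    matches = []
    for row in overlaps:
        j = max(range(len(row)), key=lambda k: row[k], default=None)
        matches.append(j if j is not None and row[j] > 0 else None)
    return matches


def one_to_one_matches(overlaps):
    """Assignment maximizing total overlap, found by brute force over permutations."""
    n_gt = len(overlaps)
    n_seg = len(overlaps[0]) if n_gt else 0
    size = max(n_gt, n_seg)
    # Pad to a square matrix with zero overlap so every row gets exactly one column.
    padded = [[overlaps[i][j] if i < n_gt and j < n_seg else 0
               for j in range(size)] for i in range(size)]
    best = max(permutations(range(size)),
               key=lambda p: sum(padded[i][p[i]] for i in range(size)))
    return [best[i] if best[i] < n_seg and padded[i][best[i]] > 0 else None
            for i in range(n_gt)]


# Rows: ground-truth glands, columns: predicted objects (overlap in pixels).
overlaps = [[10, 8],
            [9, 0]]
print(greedy_matches(overlaps))      # [0, 0]
print(one_to_one_matches(overlaps))  # [1, 0]
```

Under the greedy rule both ground-truth glands select the same prediction, whereas the one-to-one assignment gives each gland a distinct partner at the cost of a smaller overlap for the first gland.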

3. Comparison with Pixel-Level Dice

Pixel-level Dice is computed as

$\mathrm{Dice}_{\text{pixel}} = \frac{2|\text{Prediction} \cap \text{GroundTruth}|}{|\text{Prediction}| + |\text{GroundTruth}|}$

This measure can be misleading in instance segmentation because large, correctly segmented glands can mask the omission or merging of smaller glands (Chen et al., 2016, Xie et al., 2020, Graham et al., 2018). The gland-level Dice, by contrast, is sensitive to:

Missed glands (FN): Glands present in ground truth but not segmented count as zero in the average, weighted by area.
Spurious glands (FP): Segmentations with no corresponding ground truth also contribute zero to the metric, mitigating oversegmentation.
Merges and Splits: Because every instance is scored against a single matched counterpart, merged predictions (one segmented object covering multiple glands) and splits (one gland divided into several predicted objects) are penalized through reduced per-instance Dice values.
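
The merge case can be illustrated with a small synthetic example. The sketch below assumes glands are sets of (row, column) pixels and uses hypothetical helpers `rect` and `object_dice`; two 4x4 glands separated by a one-pixel gap are covered by a single merged prediction.

```python
def dice(a, b):
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


def object_dice(gt, seg):
    """Symmetric, area-weighted object-level Dice with maximal-overlap matching."""
    def side(objs, cands):
        area = sum(len(o) for o in objs)
        if area == 0:
            return 0.0  # convention assumed for this sketch
        score = 0.0
        for o in objs:
            # A zero-overlap "match" scores Dice 0, so no special case is needed.
            match = max(cands, key=lambda c: len(o & c), default=set())
            score += len(o) / area * dice(o, match)
        return score
    return 0.5 * (side(gt, seg) + side(seg, gt))


def rect(r0, r1, c0, c1):
    """Pixels of the rectangle rows r0..r1-1, columns c0..c1-1."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


G1, G2 = rect(0, 4, 0, 4), rect(0, 4, 5, 9)   # two glands, 16 px each
merged = rect(0, 4, 0, 9)                      # one prediction covering both, 36 px

pixel_score = dice(G1 | G2, merged)
object_score = object_dice([G1, G2], [merged])
print(round(pixel_score, 3), round(object_score, 3))  # 0.941 0.615
```

Pixel-level Dice stays above 0.94 because only the four gap pixels are wrong, while the object-level score falls to about 0.62 since each gland is compared with a prediction more than twice its size.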

Because of these properties, gland-level Dice is regarded as more representative of clinically meaningful segmentation accuracy, particularly in tasks with substantial gland heterogeneity, clustering, or varying morphology (Chen et al., 2016, Wang et al., 2021).

4. Empirical Results and State-of-the-Art Performance

Gland-level Dice is an established primary metric in major segmentation challenges and publications. A summary of object-level Dice results across datasets and methods is provided below (collated from Chen et al., 2016, Wang et al., 2021, Xie et al., 2020, Graham et al., 2018, Jian et al., 2023):

Model / Study	Dataset	Gland-level Dice (%)
DCAN (CUMedVision2)	GlaS-A	89.74
DCAN (CUMedVision2)	GlaS-B	78.10
TA-Net	GlaS	90.2
PRS² (semi-supervised)	GlaS	90.6
MILD-Net	GlaS-A	91.3
MILD-Net	GlaS-B	83.6
DFGET (Graph-based)	GlaS	93.53

Gland-level Dice has also been extended to other glandular structures with appropriate modifications. For instance, whole-prostate and prostate central gland segmentation models report Dice in the 77–87% range in histological and MRI studies, while parotid and pituitary gland segmentation models have achieved values from 60% up to 89% depending on neural architecture, dataset size, and segmentation protocol (Motamed et al., 2019, Yakubu et al., 2025, Xu et al., 2022, Egger, 2013). For small or low-contrast glands such as the pituitary, semi-automatic or correction-aided pipelines often outperform fully automatic methods by as much as 10–15 percentage points.

5. Clinical and Practical Significance

The gland-level Dice metric is favored for its high sensitivity to clinically significant segmentation failures, particularly:

Detection accuracy: Penalizes undetected glands, reducing the risk of missing diagnostically relevant structures.
Instance delineation: Sensitive to gland merging/splitting and imprecise boundaries, aligning with pathologists’ criteria for valid segmentation (Chen et al., 2016, Graham et al., 2018).
Ranking segmentation algorithms: Provides clear discrimination among competing models in challenge settings, aiding peer benchmarking (Jian et al., 2023, Xie et al., 2020).

Clinical standards often cite gland-level (object-level) Dice values above 80% as acceptable for automated deployment in diagnosis and preoperative planning, especially in the prostate, parotid, and pituitary domains (Motamed et al., 2019, Yakubu et al., 2025).

6. Variants, Pitfalls, and Ongoing Developments

Several variants of the gland-level Dice have been proposed:

Averaging strategies: While the GlaS symmetric average is widely adopted, some studies compute per-gland Dice only over ground-truth objects, omitting the symmetric term over predictions (Wang et al., 2021).
Thresholding: Some protocols further filter predictions by imposing minimal overlap thresholds (e.g., IoU > 0.5) to remove trivial detections; however, GlaS-style metrics rely solely on the maximal-overlap assignment with zero score for unmatched entities.
Object weights: Area-based weighting, rather than uniform weighting, ensures that larger or more diagnostically relevant glands dominate the overall score. This is a key distinction for pathological images with wide gland size distributions.
Uncertainty and Quality Calibration: Methods including test-time augmentation-based uncertainty mapping optionally discard ambiguous glands to produce higher object-level Dice on the remaining structures, supporting practical clinical triage (Graham et al., 2018).
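
Two of the variants above, the ground-truth-only average and a minimum-overlap threshold, can be sketched as optional parameters. This is an assumption-laden illustration: the parameter names `symmetric` and `min_iou` are hypothetical, and scoring a below-threshold match as zero is one plausible reading of such protocols rather than a documented rule.

```python
def dice(a, b):
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 0.0


def iou(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def one_sided(objs, cands, min_iou=0.0):
    area = sum(len(o) for o in objs)
    if area == 0:
        return 0.0  # convention assumed for this sketch
    score = 0.0
    for o in objs:
        match = max(cands, key=lambda c: len(o & c), default=set())
        if iou(o, match) > min_iou:  # assumed rule: weak matches score zero
            score += len(o) / area * dice(o, match)
    return score


def object_dice(gt, seg, symmetric=True, min_iou=0.0):
    gt_term = one_sided(gt, seg, min_iou)
    if not symmetric:
        return gt_term  # ground-truth-only variant
    return 0.5 * (gt_term + one_sided(seg, gt, min_iou))


def row_pixels(r, c0, c1):
    return {(r, c) for c in range(c0, c1)}


gt = [row_pixels(0, 0, 10)]
with_false_positive = [row_pixels(0, 0, 10), row_pixels(5, 0, 10)]
partial = [row_pixels(0, 0, 4)]

print(object_dice(gt, with_false_positive))                   # 0.75
print(object_dice(gt, with_false_positive, symmetric=False))  # 1.0
print(round(object_dice(gt, partial), 3))                     # 0.571
print(object_dice(gt, partial, min_iou=0.5))                  # 0.0
```

The ground-truth-only variant does not see the spurious object at all, and the thresholded variant turns a partial match (IoU 0.4) into a complete miss, so scores from different variants are not directly comparable.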

Potential pitfalls include metric inflation via post-processing heuristics (e.g., overzealous small-object removal or test-time ensembling) and lack of robustness in highly heterogeneous domains, for which careful validation and cross-center studies are advised (Yakubu et al., 2025, Xu et al., 2022).

7. Example Calculation and Interpretive Summary

A toy example from the GlaS challenge (Sirinukunwattana et al., 2016) illustrates the calculation:

Ground truth glands: $G_1$ (100 px), $G_2$ (50 px)
Predicted glands: $S_1$ (90 px), $S_2$ (40 px)
Overlap matrix:
- $|G_1 \cap S_1| = 80$ px, $|G_1 \cap S_2| = 5$ px
- $|G_2 \cap S_2| = 30$ px
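Dice: $\mathrm{Dice}(G_1, S_1) = 160/190 \approx 0.842$, $\mathrm{Dice}(G_2, S_2) = 60/90 \approx 0.667$
Weighting: $\gamma_1 = 0.667$, $\gamma_2 = 0.333$; $\sigma_1 = 0.692$, $\sigma_2 = 0.308$
Aggregation: the ground-truth sum is $\approx 0.784$ and the prediction sum is $\approx 0.788$, so their mean yields object-level Dice $\approx 0.786$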
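
The arithmetic of this example can be checked with a few lines of Python working directly from the pixel counts. The variable names are illustrative, and the unlisted overlap $|G_2 \cap S_1|$ is assumed to be zero.

```python
def dice_from_counts(overlap, size_a, size_b):
    return 2.0 * overlap / (size_a + size_b)


gt_sizes = {"G1": 100, "G2": 50}
seg_sizes = {"S1": 90, "S2": 40}
# Overlaps in pixels; pairs not listed (here G2/S1) are assumed to be 0.
overlap = {("G1", "S1"): 80, ("G1", "S2"): 5, ("G2", "S2"): 30}


def ov(g, s):
    return overlap.get((g, s), 0)


gt_total = sum(gt_sizes.values())
seg_total = sum(seg_sizes.values())

# Ground-truth side: each G_i against its maximal-overlap prediction.
gt_term = 0.0
for g, size in gt_sizes.items():
    s = max(seg_sizes, key=lambda name: ov(g, name))
    gt_term += size / gt_total * dice_from_counts(ov(g, s), size, seg_sizes[s])

# Prediction side: each S_j against its maximal-overlap ground-truth gland.
seg_term = 0.0
for s, size in seg_sizes.items():
    g = max(gt_sizes, key=lambda name: ov(name, s))
    seg_term += size / seg_total * dice_from_counts(ov(g, s), gt_sizes[g], size)

object_dice = 0.5 * (gt_term + seg_term)
print(round(gt_term, 3), round(seg_term, 3), round(object_dice, 3))  # 0.784 0.788 0.786
```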

This calculation demonstrates how object-level Dice aggregates per-gland overlap and emphasizes larger structures through weighting; an unpaired gland, had there been one, would have contributed zero while keeping its weight. As such, the metric provides a robust, interpretable measure for both algorithm development and clinical validation across the anatomical and imaging spectrum (Sirinukunwattana et al., 2016, Chen et al., 2016, Jian et al., 2023).