Multi-class Dice Loss for Segmentation
- Multi-class Dice Loss is a loss function that measures and optimizes region overlap across multiple classes, making it particularly effective for imbalanced segmentation tasks.
- It employs a soft, differentiable surrogate of the Dice coefficient with weighting strategies such as uniform, inverse-frequency, and inverse-squared-frequency weighting to tackle class imbalance.
- Recent extensions including PM Dice, DSC++, and Wasserstein Dice enhance calibration, boundary precision, and adaptability to missing labels while accounting for semantic relationships.
Multi-class Dice loss is a fundamental class of loss functions for measuring and optimizing region overlap in semantic segmentation tasks with more than two classes, particularly in highly imbalanced contexts such as multi-organ or tumor segmentation in medical imaging. By directly optimizing a soft, differentiable surrogate of the Dice similarity coefficient averaged (or weighted) over classes, this loss family provides robust alternatives to voxel-wise cross-entropy, yielding superior training dynamics and final accuracy on tasks evaluated by region overlap scores. Recent developments extend the classic multi-class Dice to address class imbalance, inter-class semantic distances, instance- and pixel-level weighting, empty or missing labels, and network output calibration.
1. Standard Multi-class Dice Loss: Formulation and Rationale
For $C$ classes, with softmax outputs $p_{c,i}$ and one-hot ground truth $g_{c,i}$ at each voxel or pixel $i$, the continuous (differentiable) Dice loss per class $c$ is:

$$\mathcal{L}_{\text{Dice}}^{(c)} = 1 - \frac{2\sum_i p_{c,i}\, g_{c,i} + \epsilon}{\sum_i p_{c,i} + \sum_i g_{c,i} + \epsilon}$$

The standard multi-class Dice loss then averages this over classes:

$$\mathcal{L}_{\text{Dice}} = \frac{1}{C}\sum_{c=1}^{C} \mathcal{L}_{\text{Dice}}^{(c)}$$

Alternatively, per-class averaging can be replaced by summing numerator and denominator across all classes before forming the ratio. This Dice loss is robust to class imbalance because the overlap is normalized by region size, reducing domination by frequent classes (Shen et al., 2018, Eelbode et al., 2020, Kodym et al., 2018, Yeung et al., 2021).
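A minimal PyTorch sketch of this macro-averaged soft Dice loss follows; the function name, `eps` argument, and `(N, C, ...)` tensor layout are illustrative assumptions rather than a reference implementation from the cited papers:

```python
import torch

def multiclass_dice_loss(logits, target_onehot, eps=1e-5):
    """Macro-averaged soft Dice loss.

    logits:        (N, C, ...) raw network outputs
    target_onehot: (N, C, ...) one-hot ground truth
    eps:           smoothing constant for numerical stability
    """
    probs = torch.softmax(logits, dim=1)            # soft predictions p_{c,i}
    dims = (0,) + tuple(range(2, probs.ndim))       # sum over batch and spatial dims (batch-wise reduction)
    intersection = (probs * target_onehot).sum(dims)
    cardinality = probs.sum(dims) + target_onehot.sum(dims)
    dice_per_class = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice_per_class.mean()              # average over classes
```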
2. Weighting and Class Imbalance Strategies
Several weighting schemes have been proposed to further handle severe class imbalance, which can be critical in medical segmentation with rare anatomical structures or lesions:
- Uniform weighting ($w_c = 1$): All classes contribute equally; sufficient under modest imbalance with a small learning rate (Shen et al., 2018).
- Inverse-frequency ($w_c \propto 1/n_c$, where $n_c$ is the voxel count of class $c$): Each class is weighted inversely proportional to its voxel count, so rarer classes have more impact.
- Inverse-squared-frequency ($w_c \propto 1/n_c^2$): Further amplifies the impact of rare classes; effective only with high learning rates to avoid vanishing gradients for large classes.
- Generalised Dice Loss (GDL): Uses $w_c = 1/\left(\sum_i g_{c,i}\right)^2$ — the squared inverse of true-class cardinality — recomputed for each batch (Sudre et al., 2017).
The class weighting scheme interacts strongly with the optimizer's learning rate: aggressive weighting can under-train large structures unless the learning rate is increased, while uniform weighting is more stable at moderate rates. In abdominal CT, “simple” weighting with a high learning rate yields the best average Dice across organs (Shen et al., 2018). GDL, using inverse-squared-frequency, outperforms naïve Dice and cross-entropy at severe imbalance (Sudre et al., 2017).
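A minimal sketch of these weighting schemes applied to the soft Dice terms is shown below; the function names and normalization of the weights are illustrative assumptions:

```python
import torch

def dice_class_weights(target_onehot, scheme="uniform", eps=1e-5):
    """Compute per-class weights w_c from one-hot targets of shape (N, C, ...)."""
    dims = (0,) + tuple(range(2, target_onehot.ndim))
    n_c = target_onehot.sum(dims)                    # voxel count per class in the batch
    if scheme == "uniform":                          # w_c = 1
        w = torch.ones_like(n_c)
    elif scheme == "inverse":                        # w_c ∝ 1 / n_c
        w = 1.0 / (n_c + eps)
    elif scheme == "inverse_squared":                # GDL: w_c ∝ 1 / n_c^2
        w = 1.0 / (n_c ** 2 + eps)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()                               # normalize so weights sum to 1

def weighted_dice_loss(probs, target_onehot, weights, eps=1e-5):
    """Weighted soft Dice: 1 - (2 Σ_c w_c Σ_i p g + ε) / (Σ_c w_c Σ_i (p + g) + ε)."""
    dims = (0,) + tuple(range(2, probs.ndim))
    inter = (probs * target_onehot).sum(dims)
    card = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - (2.0 * (weights * inter).sum() + eps) / ((weights * card).sum() + eps)
```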
3. Extensions: Handling Small Structures, Difficult Pixels, and Missing Labels
To address not only class imbalance but also difficulty imbalance and edge cases:
- Pixel-wise Modulated Dice (PM Dice): Introduces a per-pixel modulator $m_i$, up-weighting hard-to-predict pixels in both numerator and denominator; a focusing exponent $\gamma$ controls the emphasis on low-confidence pixels. PM Dice consistently improves small-object and boundary Dice without expensive pixel-ranking (Hosseini, 17 Jun 2025); see the sketch after this list.
- Batch Soft Dice: Computes Dice over the entire batch, combining all instances and classes into global overlap for more stable gradients, particularly for small structures (Kodym et al., 2018).
- Adaptive t-vMF Dice: Replaces the cosine similarity in Dice with a t-vMF similarity, introduces a per-class concentration (compactness) parameter $\kappa_c$, and adapts it during training based on validation DSC, further increasing performance in imbalanced and multi-class problems (Kato et al., 2022).
- Missing/empty labels: The choice of reduction set (i.e., what is included in the sums) and the smoothing constant $\epsilon$ determine whether missing labels are ignored (the default) or the network is encouraged to output empty masks for missing classes. Batch-wise reductions with a large $\epsilon$ allow the network to learn "empty" outputs where needed; image-wise reductions with a small $\epsilon$ naturally ignore missing labels (Tilborghs et al., 2022).
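One plausible form of the per-pixel modulation in PM Dice is sketched below; the specific modulator $m_i = (1 - p_i)^\gamma$ on the true-class probability is an illustrative assumption, and the exact form used by Hosseini (17 Jun 2025) may differ:

```python
import torch

def pixel_modulated_dice_loss(probs, target_onehot, gamma=2.0, eps=1e-5):
    """Soft Dice with a per-pixel modulator m_i up-weighting low-confidence pixels.

    The modulator m_i = (1 - p_i)^gamma on the true-class probability is an
    illustrative choice; PM Dice's exact modulator may differ.
    probs, target_onehot: (N, C, ...) tensors.
    """
    # confidence on the true class at each pixel, broadcast over the class axis
    true_class_prob = (probs * target_onehot).sum(dim=1, keepdim=True)
    modulator = (1.0 - true_class_prob) ** gamma     # large where the pixel is hard
    dims = (0,) + tuple(range(2, probs.ndim))
    inter = (modulator * probs * target_onehot).sum(dims)
    card = (modulator * (probs + target_onehot)).sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (card + eps)).mean()
```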
4. Semantically Informed and Calibration-aware Generalizations
Advanced variants introduce semantic relationships and calibration robustness:
- Generalised Wasserstein Dice: Replaces simple overlap with a Wasserstein (optimal transport) distance in label space, enforcing higher penalties for semantically distant class misclassifications. The cost matrix must be constructed to reflect inter-class semantics; the loss is still fully differentiable (Fidon et al., 2017).
- DSC++ loss: Augments Dice by raising the false positive and false negative contributions to an exponent $\gamma$, which penalizes overconfident errors and improves calibration (Brier score, log-loss) without degrading region overlap. Empirically, DSC++ matches cross-entropy in calibration and surpasses vanilla Dice on small structures (Yeung et al., 2021); a sketch follows this list.
- Unified Focal Loss: Combines a focal cross-entropy with a focal Tversky/Dice term, allowing simultaneous tuning of class and difficulty imbalance handling. A single parameter $\delta$ controls the balance between false positives and false negatives, and a focal exponent $\gamma$ shapes the focus on hard examples. Unified Focal Loss outperforms vanilla Dice for under-represented classes, especially in 3D multi-class segmentation (Yeung et al., 2021).
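The sketch below illustrates the DSC++ idea of exponentiating the false positive and false negative terms; applying the exponent per voxel before summing is an assumption of this sketch, so consult Yeung et al. (2021) for the exact formulation:

```python
import torch

def dsc_pp_loss(probs, target_onehot, gamma=2.0, eps=1e-5):
    """DSC++-style loss sketch: FP/FN contributions raised to an exponent gamma.

    Whether the exponent is applied per voxel (as here) or to the summed terms
    is an assumption; see Yeung et al. (2021) for the exact form.
    probs, target_onehot: (N, C, ...) tensors.
    """
    dims = (0,) + tuple(range(2, probs.ndim))
    tp = (probs * target_onehot).sum(dims)
    fp = ((probs * (1.0 - target_onehot)) ** gamma).sum(dims)   # penalize confident false positives
    fn = (((1.0 - probs) * target_onehot) ** gamma).sum(dims)   # penalize confident false negatives
    dice_pp = (2.0 * tp + eps) / (2.0 * tp + fp + fn + eps)
    return 1.0 - dice_pp.mean()
```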
5. Implementation Details and Practical Considerations
- Numerical stability: Always include a small smoothing constant $\epsilon$ in both numerator and denominator to avoid division by zero, especially for rare or missing classes (Sudre et al., 2017, Tilborghs et al., 2022).
- Reduction dimension: Choose image-wise reduction when missing labels should be ignored, batch-wise reduction for empty-label learning (exploiting class presence elsewhere in the batch), and an "all-wise" reduction for a fully global Dice (Tilborghs et al., 2022, Kodym et al., 2018); see the sketch after this list.
- Optimizer and learning rate: For aggressive class balancing, increase initial learning rate (e.g., from $0.001$ to $0.01$) and monitor slow convergence for small classes (Shen et al., 2018, Sudre et al., 2017).
- Pseudo-code and frameworks: GDL, PM Dice, Wasserstein Dice, and DSC++ all have direct PyTorch/TF implementations; weights, modulations, and reductions can be efficiently vectorized (Sudre et al., 2017, Hosseini, 17 Jun 2025, Yeung et al., 2021, Fidon et al., 2017).
- Batch/patch sampling: Balance foreground/background distributions across the batch to guarantee that rare classes are present when computing the weights $w_c$, or use an inflated smoothing constant $\epsilon$ otherwise (Sudre et al., 2017, Tilborghs et al., 2022).
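A minimal sketch of the reduction-dimension choices discussed above is shown here; the reduction names (`"image"`, `"batch"`, `"all"`) are illustrative labels, not terminology from the cited papers:

```python
import torch

def soft_dice(probs, target_onehot, reduction="batch", eps=1e-5):
    """Soft Dice with a configurable reduction set.

    "image": per-image, per-class Dice, then averaged -> missing labels are
             effectively ignored (eps keeps empty terms near 1).
    "batch": sums pooled over the whole batch per class -> the network can
             learn to predict empty masks for classes absent in an image.
    "all":   a single global Dice over batch, classes, and voxels.
    """
    spatial = tuple(range(2, probs.ndim))
    if reduction == "image":
        dims = spatial                       # keep (N, C)
    elif reduction == "batch":
        dims = (0,) + spatial                # keep (C,)
    elif reduction == "all":
        dims = (0, 1) + spatial              # single scalar overlap
    else:
        raise ValueError(f"unknown reduction: {reduction}")
    inter = (probs * target_onehot).sum(dims)
    card = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (card + eps)).mean()
```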
6. Empirical Results and Guidelines
| Loss Variant | Main Effect | Imbalance Handling | Additional Benefits |
|---|---|---|---|
| Standard multi-class Dice | Direct region overlap, macro avg | Partial | Simple, widely adopted |
| Generalised Dice (GDL) | Inverse-squared weighting | Extreme | Best for severe rarities |
| Batch Soft Dice | Batch/global aggregation | Mild/strong | Stable, best for small labels |
| PM Dice | Per-pixel difficulty modulation | Class & difficulty | Improved boundaries, minimal cost |
| DSC++ | Penalizes overconfident errors | Class, calibration | Top NLL/Brier, superior on rare classes |
| Wasserstein Dice | Semantic class distances | Class & semantics | Meaningful errors, task prior |
| Unified Focal (Tversky) | Focal, FP/FN trade-off | Class & difficulty | Robust convergence, recall/precision control |
Empirical studies demonstrate that (i) metric-sensitive multi-class Dice–based losses consistently outperform cross-entropy (even with class weighting), (ii) GDL or "simple" weighted Dice best address severe foreground rarity, and (iii) PM Dice, DSC++ and focal-Tversky further improve accuracy on boundaries and small structures, and provide calibration and robustness gains (Shen et al., 2018, Sudre et al., 2017, Hosseini, 17 Jun 2025, Yeung et al., 2021, Yeung et al., 2021).
7. Current Directions and Limitations
Recent research emphasizes the importance of describing the precise reduction (batch-, image-, or class-wise) and the smoothing constant $\epsilon$ used in Dice implementations, as their interplay determines performance with missing or empty structures (Tilborghs et al., 2022). The main limitations of advanced schemes include the need for hyperparameter tuning (e.g., learning rate, focal exponents, or compactness parameters), possible overfitting or under-training of large classes when aggressively reweighting, and the necessity of defining task-specific semantic distance matrices (Wasserstein Dice). Nonetheless, multi-class Dice remains the dominant loss for direct overlap optimization in segmentation tasks with strong class imbalance, highly variable region sizes, and clinical or scientific relevance (Eelbode et al., 2020, Fidon et al., 2017).