
Weighted BCE + MCC Loss for Imbalanced Segmentation

Updated 21 September 2025
  • The composite loss integrates weighted BCE and differentiable MCC to tackle class imbalance by combining local error control with global confusion matrix insights.
  • It leverages pixel-level weighting and robust gradient formulations to optimize both foreground and background classifications in dense segmentation tasks.
  • Empirical results in skin lesion and retinal vessel segmentation show significant performance gains, with improvements up to 11.25% in the mean Jaccard index.

Weighted Binary Cross-Entropy (BCE) plus Matthews Correlation Coefficient (MCC) Loss is a composite objective function constructed to address class imbalance and provide balanced, rigorous optimization in binary and semantic segmentation tasks. This approach unites the localized, sample-wise error penalization of weighted BCE with the global, confusion-matrix-aware robustness of MCC. Its adoption is motivated by persistent challenges in domains such as medical image segmentation and imbalanced classification, where traditional loss functions often fail to reward balanced model behavior.

1. Mathematical Formulation and Motivation

Let $y_i \in \{0,1\}$ denote the ground truth and $p_i \in [0,1]$ the predicted probability for the $i$-th sample (or pixel, for dense tasks). The weighted binary cross-entropy (BCE) loss is given by

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{N} \sum_{i=1}^N \left[ w_1\, y_i \log(p_i) + w_0\, (1 - y_i)\log(1-p_i) \right]$$

where $w_1, w_0$ are class weights (usually $w_1 > w_0$ in minority-positive domains).

The differentiable Matthews Correlation Coefficient (MCC) loss is constructed from soft versions of confusion matrix entries:

$$\begin{aligned} \text{TP} &= \sum_i p_i\, y_i, & \text{TN} &= \sum_i (1-p_i)(1-y_i), \\ \text{FP} &= \sum_i p_i\,(1-y_i), & \text{FN} &= \sum_i (1-p_i)\, y_i \end{aligned}$$

The MCC is then

$$\text{MCC} = \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})} + \varepsilon}$$

where $\varepsilon$ is a small constant for numerical stability. The corresponding loss is

$$\mathcal{L}_{\text{MCC}} = 1 - \text{MCC}$$

The composite loss takes the form

$$\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{BCE}} + \lambda_2 \mathcal{L}_{\text{MCC}}$$

with $\lambda_1, \lambda_2 \geq 0$ controlling the trade-off. This construction leverages the pixel-level, class-sensitive adjustments of weighted BCE and the confusion-matrix-wide balancing of MCC.
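
The formulas above can be sketched directly in NumPy. This is a minimal illustration of the forward computation only (in a deep learning framework, the same expressions written with framework tensors are differentiable automatically); the weight and trade-off values are placeholder defaults, not recommendations from the literature.

```python
import numpy as np

def weighted_bce(p, y, w1=2.0, w0=1.0, eps=1e-7):
    """Weighted binary cross-entropy over flattened predictions."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(w1 * y * np.log(p) + w0 * (1 - y) * np.log(1 - p))

def soft_mcc(p, y, eps=1e-7):
    """Differentiable MCC built from soft confusion-matrix entries."""
    tp = np.sum(p * y)
    tn = np.sum((1 - p) * (1 - y))
    fp = np.sum(p * (1 - y))
    fn = np.sum((1 - p) * y)
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return (tp * tn - fp * fn) / denom

def composite_loss(p, y, lam1=0.5, lam2=0.5, w1=2.0, w0=1.0):
    """L_total = lam1 * L_BCE + lam2 * (1 - MCC)."""
    return lam1 * weighted_bce(p, y, w1, w0) + lam2 * (1.0 - soft_mcc(p, y))
```

A prediction that matches the mask closely should yield a lower composite loss than an inverted one, and a perfect prediction drives the soft MCC to 1.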

2. Addressing Class Imbalance and Limitations of Overlap-Based Losses

Standard BCE and overlap-based losses such as Dice loss or Jaccard index are known to be susceptible to class imbalance, favoring the majority class; Dice in particular ignores true negatives, so it assigns no credit to correctly classified background. MCC, by construction, incorporates all four entries of the confusion matrix (TP, TN, FP, FN) and is thus sensitive to both foreground and background classifications, providing a more nuanced penalization in highly imbalanced scenarios (Abhishek et al., 2020).

Empirical results in tasks such as skin lesion segmentation demonstrate that models trained with MCC-based losses outperform those trained with Dice loss, yielding improvements of up to 11.25% in mean Jaccard index on highly imbalanced datasets (Abhishek et al., 2020). The use of a weighted BCE component further alleviates imbalance by explicitly increasing the loss contribution from the minority class.
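The structural difference between the two metrics can be shown with hand-picked confusion-matrix counts (the counts are illustrative, not from any cited experiment): Dice depends only on TP, FP, and FN, so it is unchanged by the amount of correctly classified background, while MCC shifts as TN grows.

```python
import numpy as np

def dice(tp, fp, fn):
    """Dice coefficient: 2*TP / (2*TP + FP + FN); note TN is absent."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, tn, fp, fn, eps=1e-7):
    """MCC from hard confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps
    return num / den

# Identical TP/FP/FN; only the correctly classified background differs.
small_bg = (dice(8, 2, 2), mcc(8, 88, 2, 2))    # Dice = 0.8, MCC ~ 0.778
large_bg = (dice(8, 2, 2), mcc(8, 988, 2, 2))   # Dice = 0.8, MCC ~ 0.798
```

Dice is identical in both cases, whereas MCC reflects the extra true negatives, which is precisely the sensitivity the composite loss inherits.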

3. Composite Loss Construction and Implementation Considerations

Constructing a weighted BCE plus MCC loss involves several key considerations:

  • Trade-off Parameterization: The $\lambda_1$, $\lambda_2$ hyperparameters are tuned to balance pixel-wise accuracy against global balance. For example, in retinal vessel segmentation, equal weighting ($\lambda_1 = \lambda_2 = 0.5$) proved empirically effective, but domain-specific tuning is recommended (Guo et al., 15 Sep 2025).
  • Numerical Stability: The denominator of the MCC can approach zero if any confusion matrix term vanishes. It is standard to include an additive constant $\varepsilon$ (e.g., $10^{-7}$).
  • Gradient Calculation: Soft confusion matrix entries ensure that $\mathcal{L}_{\text{MCC}}$ is differentiable with respect to $p_i$, enabling backpropagation in modern deep learning frameworks.
  • Weighted BCE: Careful selection of $w_1, w_0$ or more sophisticated dynamic or sample-dependent weighting schemes (e.g., distance maps incorporating spatial context (Davari et al., 2021)) can further tailor the composite loss to the domain.
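
For the last point, one common static heuristic (not prescribed by the cited papers) is inverse-frequency "balanced" weighting, where each class weight is proportional to the reciprocal of its frequency in the ground truth:

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights w_c = N / (2 * N_c), so that the two
    classes contribute equally to the loss in expectation."""
    n = y.size
    n_pos = y.sum()
    n_neg = n - n_pos
    return n / (2 * n_pos), n / (2 * n_neg)  # (w1, w0)
```

With 10% foreground, this yields $w_1 = 5$ and $w_0 \approx 0.56$, amplifying minority-class errors roughly ninefold relative to the background.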

4. Empirical Outcomes and Use Cases

The composite loss has been evaluated in domains with pronounced imbalance:

  • Skin Lesion Segmentation: Training U-Nets with $\mathcal{L}_{\text{MCC}}$ or its combination with weighted BCE produced superior Jaccard index and sensitivity/specificity trade-offs relative to Dice loss training. On ISIC 2017, the MCC loss achieved a mean Jaccard index of $0.7518 \pm 0.0084$ vs. $0.6758 \pm 0.0095$ for Dice loss (Abhishek et al., 2020).
  • Retinal Vessel Segmentation (SA-UNetv2): Employing the weighted BCE plus MCC loss with a cross-scale spatial attention network yields MCC up to 81.27 and state-of-the-art F1/Jaccard scores on the DRIVE/STARE datasets, all in a model occupying 1.2 MB with sub-second CPU inference (Guo et al., 15 Sep 2025). This indicates practical compatibility even in constrained deployable systems.
  • Glacier Calving Front Segmentation: Using MCC as the stopping criterion (and improved distance-weighted BCE for the boundary) offers a 15% improvement in Dice coefficient compared to BCE-based early stopping, underscoring the generalizability of MCC-based criteria for rare-structure segmentation (Davari et al., 2021).

5. Extensions, Statistical Properties, and Stability

Rigorous studies have established the statistical properties of MCC, including asymptotic normality under standard conditions and the provision of asymptotic confidence intervals using the delta method and Fisher's zz transformation (Itaya et al., 21 May 2024, Tamura et al., 9 Mar 2025). This statistical regularity supports its use as a loss term and as an evaluation metric.
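
As a rough illustration of such interval estimates, the classical Fisher $z$ recipe for a correlation coefficient can be sketched as follows. This is a simplification: it assumes the textbook $1/\sqrt{n-3}$ standard error for a Pearson correlation carries over to MCC, whereas the cited works derive the variance rigorously via the delta method.

```python
import numpy as np

def mcc_confidence_interval(mcc_value, n, z_crit=1.96):
    """Approximate CI for MCC via Fisher's z transform, under the
    simplifying assumption SE(z) = 1/sqrt(n - 3) as for Pearson's r."""
    z = np.arctanh(mcc_value)          # Fisher z transform
    se = 1.0 / np.sqrt(n - 3)
    lo = np.tanh(z - z_crit * se)      # back-transform the endpoints
    hi = np.tanh(z + z_crit * se)
    return lo, hi
```

For an observed MCC of 0.8 on 103 samples this gives an interval of roughly (0.72, 0.86); the back-transform keeps the interval inside $(-1, 1)$ and makes it asymmetric around the point estimate, as expected for a bounded statistic.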

Several insights relevant for loss design arise:

  • Composite Loss Stability: While integrating LMCC\mathcal{L}_{\text{MCC}} with BCE, careful tuning is required to prevent instability, particularly under skewed distributions where the variance of MCC increases.
  • Connection to Multiclass and Multilabel Scenarios: Macro- and micro-averaged MCC extensions are available; these can in principle be used alongside weighted BCE variants or other polynomial expansions (such as Asymmetric Polynomial Loss (Huang et al., 2023)) to treat non-binary settings.
  • Loss Function Generalization: The theoretical framework in (Marchetti et al., 2023) formalizes the link between expected weighted confusion matrix entries and custom score-oriented losses, providing a rigorous basis for generalizing the composite loss structure beyond standard BCE and MCC.

6. Comparative Analysis and Deployment Scenarios

The strengths and potential limitations of the BCE plus MCC approach are summarized as follows:

| Aspect | Weighted BCE | MCC Loss | BCE + MCC Composite |
|---|---|---|---|
| Imbalance handling | Direct, via class weights | Inherent, via confusion matrix | Strong (complementary) |
| Penalizes TN errors | No (unless weighted) | Yes | Yes |
| Granularity | Local (sample-wise) | Global (whole-batch statistics) | Both |
| Optimization stability | High | Moderate (careful gradients needed) | Requires tuning |

A plausible implication is that the composite loss is especially effective in scenarios where both local accuracy and global class-balance must be maintained (e.g., dense segmentation of rare-foreground biomedical images, highly-imbalanced event detection, or resource-constrained environments where predictable performance is crucial).

7. Prospective Directions and Theoretical Foundations

Recent theoretical work (Marchetti et al., 2023, Tamura et al., 9 Mar 2025) situates the weighted BCE and MCC combination within a broader class of score-oriented loss functions. This perspective enables extensions such as:

  • Custom weighting schemes informed by spatiotemporal features (e.g., onset/offset or distance maps (Song, 20 Mar 2024, Davari et al., 2021)).
  • Dynamic adaptation of $\lambda_1, \lambda_2$ during training driven by metric feedback or uncertainty quantification.
  • Surrogate or smoothed differentiable approximations of MCC for enhanced gradient stability.
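
As a concrete (hypothetical) instance of the second direction, a simple epoch-driven schedule could emphasize the stable BCE gradients early and shift weight toward the MCC term as training matures; the ramp shape and endpoints below are illustrative choices, not values from the cited works.

```python
def lambda_schedule(epoch, total_epochs):
    """Hypothetical linear schedule for the composite-loss weights:
    lam2 ramps 0 -> 1 (MCC term phased in), lam1 tapers 1.0 -> 0.5."""
    t = epoch / max(total_epochs - 1, 1)  # training progress in [0, 1]
    lam2 = min(t, 1.0)
    lam1 = 1.0 - 0.5 * lam2
    return lam1, lam2
```

Metric-feedback variants would replace the epoch-based progress `t` with, e.g., the validation MCC itself, but the plumbing is identical.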

The composite loss framework thus accommodates both rigorous statistical properties (asymptotic inference for MCC, robust handling of pointwise and aggregate metrics) and pragmatic needs (deployability in constrained settings, empirical enhancement on critical datasets).


Weighted BCE plus MCC loss represents a principled synthesis of local error penalization and global balance, validated both in theory and application across segmentation, detection, and classification settings where class imbalance is acute and full confusion-matrix-aware optimization is required.
