Balanced Cross-Entropy Loss
- Balanced Cross-Entropy Loss is a family of loss functions that modify standard cross-entropy by re-weighting classes based on frequency to address imbalances.
- Variants such as balanced softmax and logit-adjusted cross-entropy integrate class counts and empirical priors into the loss computation to improve minority class performance.
- These methods enhance performance in tasks like object detection, medical segmentation, and incremental learning, but require careful hyperparameter tuning for optimal results.
Balanced cross-entropy loss (BCE) refers to a family of loss functions designed to address class imbalance in supervised learning by modifying the standard cross-entropy (CE) objective. BCE either re-weights the per-class or per-sample CE terms by explicit class-frequency–dependent weights, adjusts logits within the softmax, or modifies the normalization to promote fairer learning and improved accuracy, especially for minority classes. This class of losses includes classical weighting, logit-adjusted CE, balanced softmax, and several theoretically motivated surrogates.
1. Mathematical Foundations and Variants
The standard CE loss, for single-label classification with $K$ classes, is
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{k=1}^{K} y_k \log p_k,$$
where $y_k$ is the one-hot ground-truth indicator for class $k$ and $p_k$ is the predicted probability.
The simplest BCE loss introduces explicit class weights $w_k$:
$$\mathcal{L}_{\mathrm{wCE}} = -\sum_{k=1}^{K} w_k\, y_k \log p_k.$$
A canonical choice is $w_k \propto 1/n_k$, with $n_k$ the frequency of class $k$ in the training set. This formulation amplifies the contribution of rare classes and can be applied directly in object detection, semantic segmentation, and multi-class/multi-label classification settings (Phan et al., 2020, Hosseini et al., 2024).
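A minimal NumPy sketch of the weighted CE above (the function name and the normalization of the inverse-frequency weights are illustrative choices, not from the cited papers):

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_counts):
    """CE with per-class weights w_k proportional to 1/n_k, normalized to mean 1."""
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)  # w_k ∝ 1/n_k
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    return np.mean(weights[labels] * per_sample)
```

With counts of 90 vs. 10, a misclassified minority-class sample contributes roughly nine times the loss of an equally misclassified majority-class sample.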
A variant, the "balanced softmax cross-entropy," modifies the softmax activation as
$$\hat{p}_k = \frac{n_k\, e^{z_k}}{\sum_{j=1}^{K} n_j\, e^{z_j}},$$
where $n_j$ is the sample count for class $j$ in the training set. The corresponding loss is $\mathcal{L} = -\log \hat{p}_y$ for ground-truth class $y$. This approach directly accounts for label-frequency skew in the denominator, correcting for the mismatch between training and test class priors in class-incremental or imbalanced learning scenarios (Jodelet et al., 2021, Behnia et al., 2023).
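Multiplying $e^{z_j}$ by $n_j$ inside the softmax is equivalent to adding $\log n_j$ to each logit before a standard softmax, which is how a sketch is easiest to write (function name illustrative):

```python
import numpy as np

def balanced_softmax_ce(logits, labels, class_counts):
    """Balanced softmax CE: adding log(n_j) to logit j is equivalent to
    multiplying exp(z_j) by n_j inside the softmax."""
    adjusted = logits + np.log(np.asarray(class_counts, dtype=float))
    z = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform counts this reduces to standard CE; with skewed counts, a rare-class label incurs a larger loss for the same logits, pushing the model to compensate for the prior.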
Further generalizations use per-class logit adjustments: $z_k \mapsto z_k + \tau \log \pi_k$, with $\pi_k$ the empirical class prior and $\tau$ tuning the magnitude of the bias (Behnia et al., 2023). For an appropriately chosen $\tau$, the learned classifier geometry becomes well-balanced across all classes, both theoretically and empirically.
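The logit-adjusted form can be sketched the same way; $\tau = 0$ recovers standard CE, and $\tau = 1$ with $\pi_k \propto n_k$ matches the balanced softmax above up to a constant that cancels in the softmax (function name illustrative):

```python
import numpy as np

def logit_adjusted_ce(logits, labels, priors, tau=1.0):
    """CE after shifting each logit by tau * log(pi_k).
    tau=0 is standard CE; larger tau biases learning toward rare classes."""
    adjusted = logits + tau * np.log(np.asarray(priors, dtype=float))
    z = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```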
Several advanced surrogates, such as Generalized Logit-Adjusted (GLA) and Generalized Class-Aware (GCA) losses, expand the family to non-logistic cross-entropy forms and admit refined consistency guarantees for highly imbalanced regimes (Cortes et al., 30 Dec 2025).
2. Theoretical Properties and Consistency
Balanced cross-entropy losses are motivated by the balanced 0–1 loss
$$\ell_{\mathrm{bal}}(h, x, y) = \frac{1}{K\,\pi_y}\,\mathbf{1}\{h(x) \neq y\},$$
where $\pi_y$ is the prior of class $y$. This objective assigns equal weight to all classes, regardless of frequency, and its expectation is the class-balanced error.
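Concretely, the class-balanced error is just the mean of per-class error rates, which can diverge sharply from plain accuracy under imbalance (a small NumPy sketch with an illustrative function name):

```python
import numpy as np

def balanced_error(y_true, y_pred, num_classes):
    """Mean of per-class error rates: every class counts equally,
    regardless of how many samples it has."""
    per_class = [np.mean(y_pred[y_true == k] != k) for k in range(num_classes)]
    return float(np.mean(per_class))
```

A degenerate classifier that always predicts the majority class on a 90/10 split scores 10% standard error but 50% balanced error, which is exactly the failure mode these losses target.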
Direct minimization of the balanced 0–1 loss is intractable, so surrogate losses, especially convex variants of BCE, are used instead. Weighted CE and balanced-softmax surrogates are shown to be Bayes-consistent or $\mathcal{H}$-consistent under appropriate hypothesis-space assumptions. GLA losses, which shift logits based on class priors, and GCA losses, which re-weight examples by prior-dependent factors with margins, are consistent and admit excess risk bounds that scale with the smallest class prior $\pi_{\min}$ (Cortes et al., 30 Dec 2025).
The mutual information interpretation further demonstrates that, for non-uniform class distributions, a log-prior-corrected softmax preserves mutual information maximization properties of standard CE in the imbalanced regime (Qin et al., 2021).
3. Applications Across Learning Domains
Object Detection
BCE is widely used in object detection frameworks (e.g., SSD, YOLO, Faster R-CNN) to mitigate long-tail object class imbalance. For single-stage detectors, BCE is implemented via per-class weighting in the classification loss. On highly imbalanced driving datasets (BDD100K), BCE with up-weighted minority classes yielded an average recall gain of +18.5 points for rare classes, with only a minor reduction for the majority class (Phan et al., 2020).
A distinct variant, IoU-balanced BCE, re-weights each positive example’s classification loss by its IoU with the matched ground-truth object, tightly coupling classification scores to localization quality and producing substantial gains in AP and recall at high IoU thresholds (Wu et al., 2019).
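The IoU-based re-weighting can be sketched as below; the power-law form $w_i \propto \mathrm{IoU}_i^{\eta}$, the value of $\eta$, and the mean-one normalization are assumptions for illustration, not taken from Wu et al. (2019):

```python
import numpy as np

def iou_balanced_weights(ious, eta=1.5):
    """Weight each positive sample's classification-loss term by IoU^eta,
    normalized to mean 1 so the overall loss scale is preserved (assumed
    convention; eta is a tunable hyperparameter)."""
    w = np.asarray(ious, dtype=float) ** eta
    return w * (len(w) / w.sum())
```

Samples with tighter localization (higher IoU) then dominate the classification loss, coupling classification confidence to localization quality.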
Medical Image Segmentation
In dense prediction (pixel-level) tasks, standard BCE weights each class inversely by pixel frequency. However, naive weighting often under-performs unweighted CE, generating excessive false positives along boundaries and unstable gradients for extremely rare classes. The Dilated Balanced Cross-Entropy (DBCE) variant spreads class-based weighting into boundary regions via mask dilation, stabilizing the loss and matching or surpassing region-based losses (e.g., Dice+CE) in key metrics (mDice, mIoU) (Hosseini et al., 2024).
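The mask-dilation step that DBCE relies on can be done with plain NumPy shifts (a self-contained sketch of binary dilation with a 3x3 cross structuring element; in practice a library routine such as morphological dilation from an image-processing package would be used):

```python
import numpy as np

def dilate_mask(mask, iterations=1):
    """Binary dilation with a 3x3 cross structuring element: each pass
    turns on the 4-neighbors of every foreground pixel."""
    m = mask.astype(bool)
    for _ in range(iterations):
        p = np.pad(m, 1)  # pad with False so edges dilate correctly
        m = (p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1]
             | p[1:-1, :-2] | p[1:-1, 2:])
    return m
```

Class-based weights computed on the dilated masks then extend into a band around object boundaries rather than stopping at the exact contour, which is what stabilizes the gradients near boundaries.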
Incremental and Long-Tail Classification
Balanced softmax cross-entropy corrects for training/test prior mismatch in class-incremental learning. By incorporating per-class sample counts in the softmax activation, the test-time class uniformity is restored, eliminating bias towards new classes induced by imbalanced incremental learning. This approach yields substantial accuracy improvements over standard softmax CE, recovering tens of percentage points in top-1 accuracy on CIFAR100 and ImageNet-incremental benchmarks (Jodelet et al., 2021).
Multi-Label Learning
Distribution-Balanced Loss extends BCE to multi-label, long-tailed datasets by combining per-sample, per-class re-weighting (accounting for co-occurrence effects) and negative-tolerant regularization. This directly addresses the over-suppression of negatives and improves mAP, particularly for tail classes in datasets such as VOC and COCO (Wu et al., 2020).
4. Practical Considerations and Implementation
Balanced cross-entropy losses are practical and require minimal changes to existing architectures:
- For explicit weighting, modify the CE term with a static or dynamic per-class weighting scheme $w_k$.
- For balanced softmax, incorporate class-counts into the softmax normalization.
- Logit-adjustment methods require adding $\tau \log \pi_k$ to the logits prior to the softmax.
- For DBCE, perform mask dilation and compute per-pixel weights during training.
Careful tuning of weight hyperparameters (class weights $w_k$ for cost-sensitive BCE, $\tau$ for logit-adjusted BCE, smoothing/margin parameters for DBCE and distribution-balanced loss) is essential. Over-weighting rare classes can destabilize training or reduce head-class performance. For extreme imbalance, GCA and "effective number" weighting approaches may yield better robustness (Cortes et al., 30 Dec 2025, Phan et al., 2020, Wu et al., 2020).
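The "effective number" weighting mentioned above replaces raw inverse frequency $1/n_k$ with the inverse of the effective sample count $E_{n_k} = (1 - \beta^{n_k})/(1 - \beta)$, which tempers the up-weighting of very rare classes; a sketch (function name and the mean-preserving normalization are illustrative):

```python
import numpy as np

def effective_number_weights(class_counts, beta=0.999):
    """Class-balanced weights via effective numbers:
    E_n = (1 - beta^n) / (1 - beta), weights ∝ 1/E_n, normalized to sum K.
    beta -> 0 gives uniform weights; beta -> 1 approaches 1/n_k weighting."""
    n = np.asarray(class_counts, dtype=float)
    effective = (1.0 - beta ** n) / (1.0 - beta)
    w = 1.0 / effective
    return w * (len(w) / w.sum())
```

The single hyperparameter $\beta$ interpolates between no re-weighting and full inverse-frequency weighting, which is why this scheme is often more stable than raw $1/n_k$ under extreme imbalance.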
5. Empirical Performance and Observed Gains
Extensive empirical studies document the benefits and limitations of various BCE approaches:
| Task/Setting | Classic BCE Gain | Balanced Softmax / Logit-Adjusted | Notes |
|---|---|---|---|
| Object detection (BDD100K) (Phan et al., 2020) | +18.5 point mean recall (minorities), ≤1 pt loss (majority) | – | Simple rare-class upweighting highly effective |
| Medical segmentation (Hosseini et al., 2024) | Unstable, high false positives | DBCE: matches Dice+CE, stabilizes gradients | Mask dilation crucial for boundary handling |
| Incremental learning (Jodelet et al., 2021) | +18.4 pp (CIFAR100 IL baseline) | +2–20 pp across settings | Addresses catastrophic forgetting bias |
| Multi-class classification (Behnia et al., 2023, Cortes et al., 30 Dec 2025) | +5–8 pp balanced acc vs WCE | GLA best for moderate, GCA for extreme imbalance | ETF geometry attained with logit adjustment |
Balanced cross-entropy is robust, easily integrated, and computationally efficient (constant-factor overhead). Its efficacy, however, is context-dependent; in certain settings (medical segmentation), region-based losses or spatially-aware weighting can outperform vanilla BCE (Hosseini et al., 2024).
6. Methodological Extensions and Limitations
Weighted BCE’s main limitations are: need for per-dataset hyperparameter tuning, lack of adaptation to instance-level difficulty, and potential gradient instability for ultra-rare categories. Advanced losses such as GLA/GCA, distribution-balanced, and IoU-balanced BCE sharpen the error–label, classification–localization, or sample–class correspondence and exhibit improved theoretical or empirical properties (Cortes et al., 30 Dec 2025, Wu et al., 2019). Hybrid schemes incorporating negative-tolerant regularization or combining BCE with region-based losses can maximize rare-class recall without sacrificing overall accuracy (Wu et al., 2020, Hosseini et al., 2024).
Automated or meta-learned balancing (as in Meta-Balanced Softmax) offers further avenues to tune trade-offs dynamically (Jodelet et al., 2021). For non-i.i.d. data streams, periodic reestimation of priors or more adaptive schemes may become necessary.
7. Summary and Outlook
Balanced cross-entropy loss and its variants provide a rigorous, theoretically grounded, and empirically validated framework to mitigate class imbalance in modern deep learning. By modulating per-class contributions, BCE enhances fairness, minority class accuracy, and overall balanced error under a variety of regimes—object detection, incremental learning, segmentation, and multi-label settings. Continuing developments include logit-adjusted, softmax-adjusted, and spatially-localized extensions, each addressing specific structural limitations of basic BCE. Calibration of balancing strategies to dataset statistics and application requirements remains an active area of research (Phan et al., 2020, Hosseini et al., 2024, Cortes et al., 30 Dec 2025, Behnia et al., 2023, Jodelet et al., 2021, Qin et al., 2021, Wu et al., 2020).