
Class-Weighted Cross-Entropy Loss

Updated 25 February 2026
  • Class-weighted cross-entropy loss is a variant of standard cross-entropy that applies per-class weights to counteract class imbalance.
  • It employs static, adaptive, and structured weighting strategies to enhance gradient signals for minority or high-cost classes.
  • This loss function is used in object detection, medical imaging, and segmentation, with extensions like hierarchical and similarity-weighted variants.

Class-weighted cross-entropy loss is a modification of standard cross-entropy loss designed to address class imbalance in classification and segmentation tasks. By reweighting the loss contribution of each class, either globally or at the pixel/instance level, it strengthens the gradient signal for under-represented or high-cost classes. Numerous variants and extensions, ranging from simple inverse-frequency schemes to adaptive, spatially structured, hierarchical, and class-distance-based formulations, have been developed for different modalities and downstream objectives.

1. Mathematical Definition and Formulation

In multiclass classification with $C$ classes, softmax predictions $p(c_i)$, and one-hot ground truth $t_i$, the standard categorical cross-entropy is:

$$L_\mathrm{CE} = -\sum_{i=1}^C t_i \log p(c_i)$$

The class-weighted variant introduces per-class weights $w_{c_i}$:

$$L_\mathrm{WCE} = -\sum_{i=1}^C w_{c_i} t_i \log p(c_i)$$

where $w_{c_i}$ is typically a function of class frequency or domain-defined costs. For binary classification, this simplifies to the familiar positive/negative class weighting. Class weights can be constant (e.g., $w_c \propto 1/\mathrm{freq}(c)$), derived from the effective number of samples, or computed adaptively per iteration.
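As a concrete numerical sketch (probabilities and weights chosen purely for illustration), up-weighting a rare class directly scales the loss, and hence the gradient, for samples of that class:

```python
import numpy as np

def weighted_ce_sample(probs, target, class_weights):
    # Loss for a single sample: -w_c * log p(c) for the true class c
    return -class_weights[target] * np.log(probs[target])

probs = np.array([0.7, 0.2, 0.1])    # softmax output over C = 3 classes
weights = np.array([0.5, 1.0, 3.0])  # minority class 2 up-weighted

# Unweighted vs. weighted loss when the true class is the rare class 2:
unweighted = -np.log(probs[2])                    # ~2.303
weighted = weighted_ce_sample(probs, 2, weights)  # 3x larger: ~6.908
```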

A comprehensive theoretical treatment shows that weighted cross-entropy, for any positive weight vector $\omega_c$, minimizes the expected weighted error, as formalized in score-oriented loss theory (Marchetti et al., 2023). This framework includes the classical cost-sensitive and frequency-balanced losses as special cases, with explicit guarantees on expected error minimization in the weighted metric.

2. Weight Computation: Static, Adaptive, and Structured Approaches

The choice and computation of class weights are critical for practical effectiveness:

  • Static Inverse-Frequency: $w_c = N_\mathrm{All}/(N_\mathrm{Labels}\, N_c)$ for class $c$, ensuring that rare classes contribute more to the loss (Villar et al., 2023).
  • Effective Number of Samples: $w_c = (1-\beta)/(1-\beta^{n_c})$, with $\beta \to 1$ amplifying weights for small $n_c$ (Phan et al., 2020).
  • Heuristic Weights: manually assign $w_c = \alpha > 1$ for minority classes and $w_c = 1$ for majority/background (Phan et al., 2020).
  • Ordered Weighted Average (OWA): weights are reassigned each iteration to the classes with the currently largest loss contributions, governed by a linguistic quantifier function (Maldonado et al., 2023).
  • Recall-based Adaptive Weights: per-class weights are computed dynamically as $w_{c,t} = 1 - \mathrm{recall}_c$, focusing on classes with low recall and relaxing as recall improves (Tian et al., 2021).
  • Spatial and Geometric Weighting: pixel-wise or edge/boundary weights for segmentation or structure-aware applications (Hosseini et al., 2024, Guerrero-Pena et al., 2018, Shu, 9 Jul 2025).
  • Hierarchical Weights: when class labels are structured in a tree, class weights and level weights are combined, enabling cross-entropy over each parent-child softmax (Villar et al., 2023).

In many implementations, weights are normalized (for example, their mean is set to unity) to avoid loss scaling issues.
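The two static schemes above, plus mean-one normalization, can be sketched in a few lines of numpy (the class counts here are purely illustrative):

```python
import numpy as np

counts = np.array([900, 90, 10])  # illustrative per-class sample counts n_c
N_all, C = counts.sum(), len(counts)

# Static inverse-frequency: w_c = N_all / (C * n_c)
w_invfreq = N_all / (C * counts)

# Effective number of samples: w_c = (1 - beta) / (1 - beta**n_c)
beta = 0.999
w_eff = (1 - beta) / (1 - beta ** counts)

# Normalize so the mean weight is one, avoiding loss-scaling issues
w_invfreq = w_invfreq / w_invfreq.mean()
w_eff = w_eff / w_eff.mean()
```

In both schemes the rarest class receives the largest weight; the effective-number scheme grows the weights more slowly as counts shrink, which tends to be gentler on optimization.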

3. Extensions: Hierarchical, Distance, and Similarity-weighted Losses

Hierarchical Cross-Entropy

In hierarchical taxonomies, Villar et al. introduce the weighted hierarchical cross-entropy (WHXE), where class weights $W(c^{(h)})$ and level weights $\lambda(c^{(h)})$ are applied at all tree levels:

$$\mathcal{L}_\mathrm{WHXE} = -\sum_{h=0}^{H-1} W(c^{(h)})\, \lambda(c^{(h)}) \log p(c^{(h)} \mid c^{(h+1)})$$

This structure allows flexible classification in tree-structured domains, such as astrophysical transient taxonomies, and recovers flat class-weighted CE as the special case $H = 1$ (Villar et al., 2023).
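A minimal two-level sketch of this loss follows, with a hypothetical taxonomy (root → {A, B}, A → {A1, A2}), per-node softmaxes standing in for the network's heads, and the level weights $\lambda$ collapsed to one scalar per level for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def whxe_2level(logits_root, logits_under_A, true_path, W, lam):
    """true_path like ("A", "A1"); W: per-class weights; lam: per-level weights."""
    p_root = softmax(logits_root)                 # P(c^(1)) over {A, B}
    i = {"A": 0, "B": 1}[true_path[0]]
    loss = -W[true_path[0]] * lam[1] * np.log(p_root[i])
    if true_path[0] == "A" and len(true_path) > 1:
        p_leaf = softmax(logits_under_A)          # P(c^(0) | parent A) over {A1, A2}
        j = {"A1": 0, "A2": 1}[true_path[1]]
        loss += -W[true_path[1]] * lam[0] * np.log(p_leaf[j])
    return loss

W = {"A": 1.0, "B": 1.0, "A1": 2.0, "A2": 1.0}  # illustrative class weights
lam = {0: 1.0, 1: 0.5}                           # illustrative level weights
loss = whxe_2level(np.array([2.0, 0.0]), np.array([0.5, 0.5]), ("A", "A1"), W, lam)
```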

Class Distance-weighted Cross-Entropy

For ordinal tasks, class distance weighted cross-entropy (CDW-CE) penalizes errors according to their distance from the true class in label space:

$$L_\mathrm{CDW\text{-}CE}(\hat{y}, y) = -\sum_{i=0}^{N-1} |i - c|^\alpha \log(1 - \hat{y}_i)$$

where $c$ is the true class and $\alpha$ adjusts the sharpness of the penalty. This directly embeds ordinal structure: distant misclassifications are penalized more heavily than near ones. CDW-CE outperforms classical CE and the CORN, CO2, and HO2 ordinal losses on both accuracy and interpretability (CAM quality) in medical imaging (Polat et al., 2022, Polat et al., 2024).
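The formula above translates directly to numpy; note that the $|i - c|^\alpha$ factor vanishes at the true class, so the sum can safely run over all classes (example probabilities are illustrative):

```python
import numpy as np

def cdw_ce(probs, true_class, alpha=2.0):
    # probs: softmax output over N ordinal classes; true_class: integer c
    idx = np.arange(len(probs))
    penalties = np.abs(idx - true_class) ** alpha  # zero at the true class
    return -np.sum(penalties * np.log(1 - probs))

# Misplaced mass near the true class (1) vs. far from it:
near_miss = cdw_ce(np.array([0.1, 0.6, 0.2, 0.1]), true_class=1)
far_miss = cdw_ce(np.array([0.1, 0.1, 0.2, 0.6]), true_class=1)
```

Even though both predictions assign the same probability to the true class in a suitably reshuffled sense, the far miss incurs a much larger loss, which is exactly the ordinal behavior CDW-CE is designed to induce.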

Similarity-weighted Cross-Entropy

SimLoss generalizes class-weighted CE with a class similarity matrix $S$:

$$L_\mathrm{SimLoss} = -\frac{1}{N}\sum_{i=1}^N \log\left(\sum_{c=1}^C S_{y_i, c}\, p_i[c]\right)$$

This allows explicit modeling of semantic or ordinal relations: $S_{i,j}$ encodes the similarity between classes $i$ and $j$, constructed from domain knowledge or embedding similarity. SimLoss strictly generalizes cross-entropy, which is recovered with $S = I$ (Kobs et al., 2020).
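A short numpy sketch makes the $S = I$ special case and the effect of off-diagonal similarity visible (the similarity matrix here is a hypothetical example, not one from the paper):

```python
import numpy as np

def sim_loss(probs, targets, S):
    # probs: (batch, C) softmax outputs; S: (C, C) class-similarity matrix
    inner = (S[targets] * probs).sum(axis=1)  # sum_c S[y_i, c] * p_i[c]
    return -np.log(inner).mean()

probs = np.array([[0.7, 0.2, 0.1]])
targets = np.array([0])

standard = sim_loss(probs, targets, np.eye(3))  # S = I: equals -log(0.7)

# Off-diagonal similarity gives partial credit for near-miss mass:
S_sim = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])
relaxed = sim_loss(probs, targets, S_sim)       # lower: credit for class-1 mass
```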

4. Applications and Empirical Performance

Class-weighted cross-entropy and its variants have been systematically validated in domains suffering from severe label imbalance or needing specialized error costs:

  • Object Detection: Weighted CE, Focal Loss, and class-balanced loss improve minority-class recall (e.g., “Bike” recall from 19.1% to 49.1% or higher), with focal loss providing best overall recall when combined with effective sample weighting (Phan et al., 2020).
  • Medical and Biological Segmentation: Multiclass weighted losses using per-pixel class weights, and spatial/shape-aware schemes (DWM, SAW, DBCE) yield substantial improvements in boundary F1 and instance recall, outperforming focal and standard CE losses (Hosseini et al., 2024, Guerrero-Pena et al., 2018).
  • Semantic Segmentation: Fixed class-weighted CE can lead to excessive false positives for minority classes; recall-adaptive weighting rectifies over-emphasis and improves both mean accuracy and mean IoU vs. static weighting (Tian et al., 2021).
  • Ordinal/Hierarchical Classification: Class distance-weighted and hierarchical cross-entropy losses consistently provide superior scores such as Quadratic Weighted Kappa, macro-F1, and accuracy in both disease severity and astrophysics (Villar et al., 2023, Polat et al., 2022, Polat et al., 2024).

5. Implementation, Pseudocode, and Theoretical Guarantees

Implementation is straightforward: introduce a per-sample or per-pixel weight vector or matrix into the CE loss computation. Representative PyTorch/numpy-style pseudocode appears in several of the cited works (Villar et al., 2023, Maldonado et al., 2023, Hosseini et al., 2024). For instance, in the flat case:

import numpy as np

def weighted_cross_entropy(logits, targets, class_weights):
    # logits: (batch, C); targets: integer class indices, shape (batch,)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    pred_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    weights = class_weights[targets]  # shape: (batch,)
    log_p_true = np.log(pred_probs[np.arange(len(targets)), targets])
    return (-weights * log_p_true).mean()

Theoretical work demonstrates that, for any choice of positive weights, weighted CE loss is the unique continuous convex surrogate optimally minimizing the expected weighted error metric when the score is linear in entries of the confusion matrix (Marchetti et al., 2023).

6. Limitations, Best Practices, and Evolving Variants

  • Class-weight Selection: Overly large inverse frequency weights may destabilize optimization if rare classes have extremely low counts. Normalization and hyperparameter clipping are standard remedies.
  • Adaptive Weighting: Dynamic (epoch/batch-wise) weighting schemes—e.g., OWAdapt and recall-based losses—address shifting class difficulty, automatically focusing learning on underperforming classes (Maldonado et al., 2023, Tian et al., 2021).
  • Spatial/Structured Weighting: For segmentation, adding spatial structure via geometric or morphological priors substantially boosts precision, especially for small and complex objects (Hosseini et al., 2024, Guerrero-Pena et al., 2018, Shu, 9 Jul 2025).
  • Calibration Caveats: Hierarchical and per-level weighted losses may produce pseudo-probabilities requiring post-hoc calibration if probability estimates are required for downstream tasks (Villar et al., 2023).
  • Hybrid Schemes: Combining class-weighted CE with region-based or focus losses (Dice, Focal) is common in practice, particularly in medical image segmentation (Hosseini et al., 2024).

| Variant | Weight Source | Application Example |
|---|---|---|
| Static WCE | Inverse class frequency | Generic imbalance, object detection |
| DBCE, DWM, SAW | Morphological, pixel geometry | Medical image segmentation, cell analysis |
| Hierarchical CE | Taxonomy/level weights | Astrophysical transients |
| SimLoss | Class similarity matrix | Ordinal/semantic classification |
| CDW-CE | Class distance in label space | Disease severity, ordinal regression |
| OWAdapt | Per-class/batch adaptive | Imbalanced multiclass |

7. Connections to Implicit Geometry and Modern Training Regimes

Recent advances explore the effect of class-weighted and logit-adjusted cross-entropy variants on the learned classifier and embedding geometry. Weighted CE can yield only marginal improvements in class separation in overparameterized regimes. Multiplicative logit-scaling (e.g., label-dependent temperature, LDT) can enforce symmetric geometry (Simplex Equiangular Tight Frames), overcoming the limitations of standard weighting, especially for large models and extreme imbalance (Behnia et al., 2023).

A direct implication is that simple weighting strategies offer the most value at small to moderate scale and under modest imbalance, while advanced geometric or adaptive methods are warranted in challenging or high-performance settings.


References

  • "Hierarchical Cross-entropy Loss for Classification of Astrophysical Transients" (Villar et al., 2023)
  • "SimLoss: Class Similarities in Cross Entropy" (Kobs et al., 2020)
  • "Class Distance Weighted Cross-Entropy Loss for Ulcerative Colitis Severity Estimation" (Polat et al., 2022)
  • "Multiclass Weighted Loss for Instance Segmentation of Cluttered Cells" (Guerrero-Pena et al., 2018)
  • "Edge-Boundary-Texture Loss: A Tri-Class Generalization of Weighted Binary Cross-Entropy for Enhanced Edge Detection" (Shu, 9 Jul 2025)
  • "OWAdapt: An adaptive loss function for deep learning using OWA operators" (Maldonado et al., 2023)
  • "Resolving Class Imbalance in Object Detection with Weighted Cross Entropy Losses" (Phan et al., 2020)
  • "A comprehensive theoretical framework for the optimization of neural networks classification performance with respect to weighted metrics" (Marchetti et al., 2023)
  • "Dilated Balanced Cross Entropy Loss for Medical Image Segmentation" (Hosseini et al., 2024)
  • "Striking the Right Balance: Recall Loss for Semantic Segmentation" (Tian et al., 2021)
  • "Class Distance Weighted Cross Entropy Loss for Classification of Disease Severity" (Polat et al., 2024)
  • "On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data" (Behnia et al., 2023)
