
Class-Weighted Cross-Entropy Loss

Updated 13 March 2026
  • Class-weighted cross-entropy loss is a cost-sensitive loss function that assigns explicit weights to handle class imbalance and cost asymmetry in training.
  • It incorporates both static and dynamic weighting strategies, such as inverse-frequency and effective-number scaling, to tailor the optimization process for various applications.
  • Applied in segmentation, object detection, and ordinal regression, it improves performance metrics like recall and stability by aligning training with specific cost-sensitive objectives.

Class-weighted cross-entropy loss is a cost-sensitive modification of the standard cross-entropy objective, introduced to address class imbalance, cost asymmetry, and structured supervision in deep learning. It assigns explicit weights to class- or pixel-level error terms during training, shaping the optimization trajectory to align with weighted log-likelihood metrics or application-specific constraints. This approach is foundational in domains such as imbalanced classification, semantic/instance segmentation, object detection, and ordinal regression, and serves as the basis for a diverse array of contemporary weighting strategies and loss function variants.

1. Formal Definition and Variants

The general form of the class-weighted cross-entropy loss in a $d$-class classification scenario is

$$L_{\text{WCE}} = -\sum_{i=1}^n \sum_{c=1}^d w_c\,y_{i,c} \log p_{i,c}$$

where $y_{i,c}\in\{0,1\}$ is the one-hot label, $p_{i,c}$ is the predicted probability, and $w_c>0$ is the class weight (Marchetti et al., 2023). In the binary setting,

$$L_{\text{WCE}} = -\sum_{i=1}^n \left[w_1\,y_i\log p_i + w_0(1-y_i)\log (1-p_i)\right]$$

(Marchetti et al., 2023, Li et al., 2019).
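The multi-class definition can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the integer-label encoding (instead of explicit one-hot vectors), and the `eps` clipping are assumptions made for the example. With one-hot labels the inner sum over classes collapses to the single term of the true class, which is what the loop computes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """L_WCE = -sum_i w_{label_i} * log p_{i, label_i}.

    probs:         list of per-sample probability vectors
    labels:        list of integer class indices (hypothetical encoding)
    class_weights: list of positive per-class weights w_c
    eps:           clipping constant to avoid log(0) (an assumption)
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        loss -= class_weights[y] * math.log(max(p[y], eps))
    return loss

# Two samples, three classes; class 2 is weighted four times as heavily as class 0.
probs = [softmax([2.0, 0.5, -1.0]), softmax([0.1, 0.2, 3.0])]
labels = [0, 2]
weights = [1.0, 2.0, 4.0]
loss = weighted_cross_entropy(probs, labels, weights)
```

Setting all weights to 1 recovers the standard cross-entropy, and scaling all weights by a constant scales the loss by the same constant.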

Variants and extensions include:

  • Inverse-frequency weighting: $w_c \propto 1/N_c$, where $N_c$ is the number of samples of class $c$ (Marchetti et al., 2023, Phan et al., 2020).
  • Effective-number weighting: $w_c\propto (1-\beta)/(1-\beta^{N_c})$ with $\beta\in[0,1)$, tempering over-emphasis on tiny classes (Phan et al., 2020).
  • Pixel/region-level weights: $w(p)$ can vary per pixel to account for geometry or local complexity in segmentation (Guerrero-Pena et al., 2018).
  • Distance-aware weights: For ordinal or regression scenarios, $w_{c,i}=|i-c|^\alpha$, penalizing misclassifications farther from the true class more heavily (Polat et al., 2024, Polat et al., 2022).
  • Dynamic weighting: Weights are updated during training based on instantaneous recall or error (Tian et al., 2021, Maldonado et al., 2023).

2. Theoretical Foundations and Optimization Objectives

The class-weighted cross-entropy directly minimizes the negative of a weighted log-likelihood metric, $s_{\mathrm{WLL}}(\theta) = \sum_{i}\left[w_1\,y_i\log p_i + w_0(1-y_i)\log(1-p_i)\right]$, and is formally included in frameworks for score-oriented loss (SOL) optimization (Marchetti et al., 2023). This construction guarantees that minimizing $L_{\text{WCE}}$ maximizes the expected weighted metric if linear weights are used for the confusion-matrix entries, such as in cost- or imbalance-sensitive learning.

In the overparameterized regime, SGD with (possibly logit-adjusted) WCE induces metric alignment for under-represented classes, as established via the implicit-geometry analysis and cost-sensitive SVM abstraction (Behnia et al., 2023). Logit-adjusted reparameterizations (additive, scaling, label-dependent) provide alternative mechanisms to encode weights, yielding closed-form geometric solutions and symmetric classifier manifolds under appropriate tuning.

3. Weighting Strategies in Practice

Static Weight Assignment

  • Inverse-frequency: The most common approach, $w_c \propto 1/N_c$, equalizes aggregate per-class loss contributions (Marchetti et al., 2023, Phan et al., 2020).
  • Effective-number scaling: $w_c \propto (1-\beta)/(1-\beta^{N_c})$ further regularizes the impact of the rarest classes (Phan et al., 2020).
  • Cost-based: $w_c$ directly encodes application costs or utilities (Marchetti et al., 2023).
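The two frequency-based schemes above can be computed from label counts alone. The sketch below is illustrative: the function names are hypothetical, every class is assumed to appear at least once, and the rescaling to a mean weight of 1 is an assumption added here to keep the overall loss scale comparable to unweighted cross-entropy (the formulas above only fix the weights up to proportionality).

```python
from collections import Counter

def normalize_mean_one(weights):
    """Rescale weights so their mean is 1 (a normalization choice, not from the source)."""
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]

def inverse_frequency_weights(labels, num_classes):
    """w_c proportional to 1 / N_c. Assumes every class occurs at least once."""
    counts = Counter(labels)
    raw = [1.0 / counts[c] for c in range(num_classes)]
    return normalize_mean_one(raw)

def effective_number_weights(labels, num_classes, beta=0.999):
    """w_c proportional to (1 - beta) / (1 - beta ** N_c), with beta in [0, 1).

    Assumes every class occurs at least once (N_c >= 1).
    """
    counts = Counter(labels)
    raw = [(1.0 - beta) / (1.0 - beta ** counts[c]) for c in range(num_classes)]
    return normalize_mean_one(raw)

# Hypothetical 90/10 imbalanced binary label set.
labels = [0] * 90 + [1] * 10
w_inv = inverse_frequency_weights(labels, num_classes=2)
w_eff = effective_number_weights(labels, num_classes=2, beta=0.9)
```

With $\beta = 0$ the effective-number weights are uniform, and as $\beta \to 1$ they approach inverse-frequency weights, so $\beta$ interpolates between no reweighting and full reweighting.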

Dynamic/Adaptive Weights

  • Weights can be adaptively chosen based on per-batch recall, as in recall loss: $w_c=1-R_c$, where $R_c$ is the instantaneous recall of class $c$ (Tian et al., 2021).
  • Ordered Weighted Average (OWA) aggregation generalizes by sorting per-class batch loss terms and re-applying a fixed weight vector per step, focusing gradients on the hardest classes (Maldonado et al., 2023).
  • Geometry- or boundary-aware weighting introduces spatially varying pixel-level terms using morphology or distance transforms, as in DWM/SAW weighting for cell segmentation (Guerrero-Pena et al., 2018) and DBCE for medical segmentation (Hosseini et al., 2024).
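The first two dynamic schemes above need only batch-level statistics. The following sketch is an illustration under stated assumptions, not the cited authors' implementations: the function names are hypothetical, recall is computed from hard predictions on a single batch, classes absent from the batch are given weight 0 (they contribute no loss terms in that batch anyway), and the OWA step assumes losses are sorted in descending order before a fixed weight vector is applied.

```python
def recall_weights(predictions, labels, num_classes):
    """Per-batch weights w_c = 1 - R_c, with R_c the batch recall of class c."""
    hits = [0] * num_classes
    totals = [0] * num_classes
    for pred, y in zip(predictions, labels):
        totals[y] += 1
        if pred == y:
            hits[y] += 1
    weights = []
    for c in range(num_classes):
        if totals[c] == 0:
            # Assumption: a class absent from the batch gets weight 0.
            weights.append(0.0)
        else:
            weights.append(1.0 - hits[c] / totals[c])
    return weights

def owa_aggregate(per_class_losses, owa_weights):
    """Ordered weighted average: sort losses (largest first), then apply fixed weights."""
    ordered = sorted(per_class_losses, reverse=True)
    return sum(w * loss for w, loss in zip(owa_weights, ordered))

# Hypothetical batch: class 0 is always recovered, class 1 half the time.
batch_weights = recall_weights(predictions=[0, 0, 1, 0], labels=[0, 0, 1, 1], num_classes=3)

# Decreasing OWA weights put most of the mass on the hardest (largest-loss) class.
aggregated = owa_aggregate([0.2, 1.0, 0.5], [0.6, 0.3, 0.1])
```

Note that under $w_c = 1 - R_c$ a class with perfect batch recall receives zero weight for that step, so its samples temporarily stop contributing gradient.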

Structured and Hierarchical Weighting

  • Structured tasks benefit from loss formulations respecting class hierarchies. Hierarchical WCE sums per-level, per-node loss terms, enforcing both tree-based structure and class imbalance correction, allowing flexible inference at coarse/fine label resolutions (Villar et al., 2023).
  • For edge/boundary detection, class-weighted cross-entropy has been generalized to three classes (edge, boundary, texture) with adaptive class-size compensation and fixed balancing parameters (Shu, 2025).

4. Domain-specific Applications and Modifications

| Domain | Weighting modification | Notable outcome |
| --- | --- | --- |
| Imbalanced classification | Inverse-frequency, effective number, OWA | Improved macro-F1 and minority-class recall (Phan et al., 2020, Maldonado et al., 2023) |
| Segmentation | Pixel/geometry-aware, DWM/SAW | Sharper boundaries, higher F₁ for instances (Guerrero-Pena et al., 2018, Hosseini et al., 2024) |
| Ordinal regression | Distance-weighted (CDW-CE) | Lower MAE, higher QWK, more interpretable CAMs (Polat et al., 2024, Polat et al., 2022) |
| Object detection | Per-class, focal, effective number | +30–50% recall for rare classes, minor loss for majority classes (Phan et al., 2020) |
| Edge detection | Edge/boundary/texture tri-class | Gains in AP, sharper edges, robust weights (Shu, 2025) |
| Hierarchical classification | Tree-structured (WHXE) | Consistent multi-level prediction, 100% data usage (Villar et al., 2023) |

For segmentation, instance segmentation frameworks benefit from combining CE with per-pixel weights derived from distance transforms or shape analysis, yielding superior performance at boundaries and on small or touching objects (Guerrero-Pena et al., 2018). DBCE, which distributes class weights across dilated regions around small objects, closes the performance gap between weighted CE and overlap-based losses such as Dice (Hosseini et al., 2024).

Ordinal and severity estimation tasks require distance-weighted variants (CDW-CE), penalizing errors in proportion to label distance and yielding improved quadratically weighted kappa and mean absolute error relative to categorical CE and standard ordinal losses (Polat et al., 2024, Polat et al., 2022).
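A distance-weighted loss of this kind can be sketched as below. The exact functional form is an assumption here: the sketch penalizes probability mass placed on each wrong class $i$ through $-\log(1-p_i)$, scaled by $|i-c|^\alpha$ where $c$ is the true class, which is one common way to write CDW-CE. The cited papers should be consulted for the authors' precise definition. The function name and `eps` clipping are hypothetical.

```python
import math

def cdw_ce(probs, true_class, alpha=5.0, eps=1e-12):
    """Distance-weighted loss for one sample (assumed form):

        loss = -sum_i |i - c|^alpha * log(1 - p_i)

    The true class has distance 0 and therefore contributes nothing directly.
    """
    loss = 0.0
    for i, p in enumerate(probs):
        distance_weight = abs(i - true_class) ** alpha
        loss -= distance_weight * math.log(max(1.0 - p, eps))
    return loss

# Same amount of misplaced mass, placed on an adjacent class vs. a distant class.
near_miss = cdw_ce([0.10, 0.80, 0.05, 0.05], true_class=0)
far_miss = cdw_ce([0.10, 0.05, 0.05, 0.80], true_class=0)
```

Because the weight grows as a power of the label distance, the far miss is penalized far more heavily than the near miss, which is the behaviour the ordinal setting calls for.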

Object detection leverages balanced CE, focal loss, and effective number weighting for substantial recall improvements on long-tail and minority classes; all variants can be deployed via a simple per-class scalar multiplication in loss calculation (Phan et al., 2020).
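The "per-class scalar multiplication" can be illustrated with a class-weighted focal loss. This is a sketch assuming the usual focal form $-w_c\,(1-p_t)^\gamma \log p_t$, where $p_t$ is the probability assigned to the true class; the function name, the integer-label encoding, and the default `gamma` are assumptions for the example rather than settings taken from the cited work.

```python
import math

def class_weighted_focal_loss(probs, labels, class_weights, gamma=2.0, eps=1e-12):
    """Sum over samples of -w_y * (1 - p_y)^gamma * log(p_y).

    With gamma = 0 this reduces to class-weighted cross-entropy.
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        p_true = max(p[y], eps)
        loss -= class_weights[y] * (1.0 - p_true) ** gamma * math.log(p_true)
    return loss

# Hypothetical two-class example with a rare class 1 weighted three times as heavily.
probs = [[0.9, 0.1], [0.4, 0.6]]
labels = [0, 1]
weights = [1.0, 3.0]
focal = class_weighted_focal_loss(probs, labels, weights, gamma=2.0)
plain_wce = class_weighted_focal_loss(probs, labels, weights, gamma=0.0)
```

The class weight enters only as a scalar factor on each sample's term, so balanced CE, focal, and effective-number variants differ only in how `class_weights` and `gamma` are chosen.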

In high-precision edge extraction and grouping, the Edge–Boundary–Texture loss separates pixels into three semantically distinct categories, setting robust hyperparameters for each, with empirical gains in AP and edge sharpness across diverse datasets (Shu, 2025).

In hierarchical classification of astrophysical transients or similar taxonomies, per-node weighting and level-dependent scaling, as in weighted hierarchical cross-entropy, allow joint optimization of leaf and internal node errors, balanced for frequency and granularity (Villar et al., 2023).

5. Implementation, Gradient Properties, and Numerical Considerations

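For standard WCE with softmax probabilities $p_i = \mathrm{softmax}(z_i)$, the gradient with respect to the logits is

$$\frac{\partial L_{\text{WCE}}}{\partial z_{i,c}} = w_{c_i}\,(p_{i,c} - y_{i,c})$$

where $c_i$ is the true class of sample $i$ (the index with $y_{i,c_i}=1$). Because the one-hot label selects the single term $-w_{c_i}\log p_{i,c_i}$, every logit gradient of a sample is scaled by the weight of its true class. The loss is convex in the predicted probabilities (and in the logits under a softmax) and, for linear weighting, ensures proper optimization of the underlying weighted log-likelihood metric (Marchetti et al., 2023). Extreme weights can cause gradient explosion; routine remedies include weight normalization, adaptive optimizers, or damping schemes based on $\alpha$-roots of the weights or on the effective number of samples.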
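The logit gradient can be checked numerically. The sketch below assumes softmax probabilities and a single sample with an integer label; under that assumption the one-hot sum reduces to $-w_k \log p_k$ for the true class $k$, so the analytic gradient is the softmax residual $p_c - y_c$ scaled by the true-class weight $w_k$. All function names are hypothetical, and the step size `h` is an arbitrary choice for the finite-difference check.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def wce_loss(logits, label, class_weights):
    """Per-sample weighted cross-entropy: -w_label * log softmax(logits)[label]."""
    return -class_weights[label] * math.log(softmax(logits)[label])

def wce_grad(logits, label, class_weights):
    """Analytic gradient w.r.t. logits: w_label * (p_c - y_c) for each class c."""
    p = softmax(logits)
    w = class_weights[label]
    return [w * (p_c - (1.0 if c == label else 0.0)) for c, p_c in enumerate(p)]

def numeric_grad(logits, label, class_weights, h=1e-6):
    """Central finite differences, used only to validate the analytic form."""
    grads = []
    for c in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[c] += h
        down[c] -= h
        diff = wce_loss(up, label, class_weights) - wce_loss(down, label, class_weights)
        grads.append(diff / (2 * h))
    return grads

logits = [1.5, -0.3, 0.8]
weights = [1.0, 2.0, 4.0]
analytic = wce_grad(logits, label=2, class_weights=weights)
numeric = numeric_grad(logits, label=2, class_weights=weights)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Since the whole per-sample gradient is multiplied by one weight, a very large $w_k$ inflates every logit gradient of that sample, which is why normalizing or damping the weights helps stability.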

Drop-in usage is supported by all major frameworks and does not require architectural changes. In segmentation/structured prediction, pixel- or region-level weights are implemented via mask and morphological operations; dynamic approaches require only batch-level per-class statistics (Hosseini et al., 2024, Tian et al., 2021).

Weight selection is typically tuned via cross-validation on task-specific metrics (e.g., macro-F1, mean Dice, QWK, MAE). For ordinal variants, penalty exponents $\alpha = 5$–$7$ are effective; overly large exponents risk divergence. For DBCE, the radius parameter $r$ is tuned to the object scale of the domain (Hosseini et al., 2024). Hierarchical setups require per-node frequency tallying and tree construction (Villar et al., 2023).

6. Empirical Results, Comparative Outcomes, and Limitations

Representative experimental findings from recent studies:

  • Medical segmentation: DBCE outperforms plain BCE by roughly 3–8 points in mean Dice and IoU, rivaling Dice+CE hybrids while maintaining stable optimization and higher region fidelity (Hosseini et al., 2024).
  • Instance segmentation: Three-class geometric WCE yields F₁ increases of ~0.23 absolute over binary CE baselines (Guerrero-Pena et al., 2018).
  • Imbalanced classification: Effective-number weighting and OWA-based adaptive aggregation achieve macro-F1 gains of 2–7% and minimum class recall gains up to 11% over CE (Phan et al., 2020, Maldonado et al., 2023).
  • Ordinal regression: CDW-CE consistently surpasses categorical CE in QWK by 0.02–0.03 and reduces multi-class MAE (Polat et al., 2024, Polat et al., 2022).
  • Object detection: Balanced CE and effective-number weighting lift rare-class recall by over 30 points, with minimal decrease for the majority class (Phan et al., 2020).
  • Edge detection: EBT increases AP by 8–32% over WBCE, with negligible sensitivity to moderate hyperparameter changes (Shu, 2025).

While static class-weighting can improve recall, it may amplify false positives or over-correct for trivial small classes (Tian et al., 2021, Hosseini et al., 2024). Dynamic and region-aware strategies such as recall weighting, DBCE, and shape/geometry-aware formulations mitigate these pathologies by focusing training on hard, ambiguous, or morphologically complex regions or classes.

Limitations include the need for weight selection (requiring per-task tuning), instability under extreme ratios, and, for some implementations (e.g., DBCE), computational overhead from large-scale morphological operations. Overweighting very small classes can destabilize training or degrade precision; thus, hybrid tuning strategies and carefully normalized weights are recommended.

7. Extensions, Generalizations, and Open Directions

Class-weighted cross-entropy loss forms the substrate for a spectrum of contemporary cost-sensitive, structural, adaptive, and ordinal loss functions, including but not limited to focal loss, effective-number loss, OWA-based adaptive aggregations, distance- and geometry-aware weighted CE, and hierarchical/structured objectives (Phan et al., 2020, Maldonado et al., 2023, Polat et al., 2024, Villar et al., 2023, Guerrero-Pena et al., 2018). Its flexibility enables principled weighting in line with the target metric, structured supervision, robustness to label or class imbalance, and specialized domain constraints.

Current research continues to expand the theoretical underpinnings of implicit bias under weighted and logit-adjusted cross-entropy, especially in the overparameterized and deep regime, with recent work furnishing closed-form global minimizer characterizations and explicit geometric alignment criteria (Behnia et al., 2023). Empirical work explores its integration with hybrid region/overlap-based losses, self-supervised tasks, and learnable or curriculum-driven weighting strategies.

Class-weighted cross-entropy thus remains a foundational, rigorous, and extensible objective in contemporary deep learning for managing imbalance, cost sensitivity, and structural complexity in classification and segmentation.
