Class-Weighted Cross-Entropy Loss

Updated 13 March 2026

Class-weighted cross-entropy loss is a cost-sensitive loss function that assigns explicit weights to handle class imbalance and cost asymmetry in training.
It incorporates both static and dynamic weighting strategies, such as inverse-frequency and effective-number scaling, to tailor the optimization process for various applications.
Applied in segmentation, object detection, and ordinal regression, it improves metrics such as recall, as well as training stability, by aligning training with specific cost-sensitive objectives.

Class-weighted cross-entropy loss is a cost-sensitive modification of the standard cross-entropy objective, introduced to address class imbalance, cost asymmetry, and structured supervision in deep learning. It assigns explicit weights to class- or pixel-level error terms during training, shaping the optimization trajectory to align with weighted log-likelihood metrics or application-specific constraints. This approach is foundational in domains such as imbalanced classification, semantic/instance segmentation, object detection, and ordinal regression, and serves as the basis for a diverse array of contemporary weighting strategies and loss function variants.

1. Formal Definition and Variants

The general form for the class-weighted cross-entropy loss in a $d$-class classification scenario is

$L_{\text{WCE}} = -\sum_{i=1}^n \sum_{c=1}^d w_c\,y_{i,c} \log p_{i,c}$

where $y_{i,c}\in\{0,1\}$ is the one-hot label, $p_{i,c}$ is the predicted probability, and $w_c>0$ is the class weight (Marchetti et al., 2023). In the binary setting,

$L_{\text{WCE}} = -\sum_{i=1}^n [w_1\,y_i\log p_i + w_0(1-y_i)\log (1-p_i)]$

(Marchetti et al., 2023, Li et al., 2019).
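The two definitions above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, labels are passed as integer class indices rather than one-hot vectors, and probabilities are assumed to lie strictly between 0 and 1 (no clipping is applied).

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Multi-class WCE: sum over samples of -w_c * log p_{i,c} for the true class c."""
    total = 0.0
    for p_i, y_i in zip(probs, labels):
        # Only the true-class term survives the one-hot sum over c.
        total -= class_weights[y_i] * math.log(p_i[y_i])
    return total

def weighted_binary_cross_entropy(p_pos, labels, w1, w0):
    """Binary WCE: -[w1 * y * log p + w0 * (1 - y) * log(1 - p)] summed over samples."""
    total = 0.0
    for p, y in zip(p_pos, labels):
        total -= w1 * y * math.log(p) + w0 * (1 - y) * math.log(1.0 - p)
    return total

# Toy data (illustrative values only): three samples, two classes.
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
labels = [0, 1, 1]
weights = [1.0, 3.0]  # the positive class counts three times as much

loss_multi = weighted_cross_entropy(probs, labels, weights)
loss_binary = weighted_binary_cross_entropy(
    [p[1] for p in probs], labels, w1=weights[1], w0=weights[0]
)
```

With two classes the multi-class form reduces to the binary one, so both calls return the same value on this toy batch.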

Variants and extensions include:

Inverse-frequency weighting: $w_c \propto 1/N_c$, where $N_c$ is the number of samples of class $c$ (Marchetti et al., 2023, Phan et al., 2020).
Effective-number weighting: $w_c\propto (1-\beta)/(1-\beta^{N_c}),\ \beta\in[0,1)$, tempering over-emphasis on tiny classes (Phan et al., 2020).
Pixel/region-level weights: the weight $w_{i,c}$ can vary per pixel $i$ to account for geometry or local complexity in segmentation (Guerrero-Pena et al., 2018).
Distance-aware weights: For ordinal or regression scenarios, weights grow with label distance, e.g. $w_{c,k} \propto |c-k|^{\alpha}$ for true class $c$, candidate class $k$, and exponent $\alpha>0$, penalizing misclassifications farther from the true class more heavily (Polat et al., 2024, Polat et al., 2022).
Dynamic weighting: Weights are updated during training based on instantaneous recall or error (Tian et al., 2021, Maldonado et al., 2023).
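As a rough sketch of the two static schemes in this list, the following computes inverse-frequency and effective-number weights from per-class counts. The function names, the example counts, and the rescaling of the weights to an average of 1 are assumptions made for illustration; the formulas only fix the weights up to a proportionality constant.

```python
def inverse_frequency_weights(counts):
    """w_c proportional to 1 / N_c, rescaled so the weights average to 1."""
    raw = [1.0 / n for n in counts]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]

def effective_number_weights(counts, beta=0.999):
    """w_c proportional to (1 - beta) / (1 - beta**N_c), rescaled to average 1."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(raw) / sum(raw)
    return [r * scale for r in raw]

# Hypothetical long-tailed class counts.
counts = [9000, 900, 100]
w_inv = inverse_frequency_weights(counts)
w_eff = effective_number_weights(counts, beta=0.999)

# Ratio between the rarest and the most frequent class under each scheme.
ratio_inv = w_inv[2] / w_inv[0]
ratio_eff = w_eff[2] / w_eff[0]
```

On these counts the inverse-frequency ratio equals the count ratio of 90, while the effective-number ratio is considerably smaller, which is the tempering effect described above. Setting `beta=0.0` gives uniform weights.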

2. Theoretical Foundations and Optimization Objectives

The class-weighted cross-entropy directly minimizes the negative of a weighted log-likelihood metric, $\sum_{i=1}^n \sum_{c=1}^d w_c\,y_{i,c} \log p_{i,c} = -L_{\text{WCE}}$, and is formally included in frameworks for score-oriented loss (SOL) optimization (Marchetti et al., 2023). This construction guarantees that minimizing $L_{\text{WCE}}$ maximizes the expected weighted metric if linear weights are used for the confusion-matrix entries, such as in cost- or imbalance-sensitive learning.

In the overparameterized regime, SGD with (possibly logit-adjusted) WCE induces metric alignment for under-represented classes, as established via the implicit-geometry analysis and cost-sensitive SVM abstraction (Behnia et al., 2023). Logit-adjusted reparameterizations (additive, scaling, label-dependent) provide alternative mechanisms to encode weights, yielding closed-form geometric solutions and symmetric classifier manifolds under appropriate tuning.
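To make the idea of encoding class preferences in the logits concrete, here is a minimal sketch of a per-class multiplicative and additive adjustment applied before the softmax. The function names, the combined `scale * z + shift` form, and the choice of log-priors as the additive shift are assumptions for illustration; the exact parameterizations analysed in the cited work may differ.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def logit_adjusted_ce(logits, label, scale=None, shift=None):
    """Cross-entropy on adjusted logits scale_c * z_c + shift_c.

    With scale = 1 and shift = 0 this is plain cross-entropy.
    """
    d = len(logits)
    if scale is None:
        scale = [1.0] * d
    if shift is None:
        shift = [0.0] * d
    adjusted = [s * z + b for s, z, b in zip(scale, logits, shift)]
    return -log_softmax(adjusted)[label]

# Hypothetical class priors: class 1 is rare.
priors = [0.9, 0.1]
shift = [math.log(p) for p in priors]

logits = [0.0, 0.0]          # the model is undecided
plain = logit_adjusted_ce(logits, label=1)
adjusted = logit_adjusted_ce(logits, label=1, shift=shift)
```

With the additive log-prior shift, an undecided prediction on a rare-class sample is penalized more heavily than under plain cross-entropy, so the rare-class logit has to exceed the frequent-class logit by a margin before the loss becomes small.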

3. Weighting Strategies in Practice

Static Weight Assignment

Inverse-frequency: The most common approach, $w_c \propto 1/N_c$, equalizes aggregate per-class loss contributions (Marchetti et al., 2023, Phan et al., 2020).
Effective-number scaling: $w_c \propto (1-\beta)/(1-\beta^{N_c})$ further regularizes the impact of rarest classes (Phan et al., 2020).
Cost-based: weights $w_c$ directly encode application costs or utilities (Marchetti et al., 2023).
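The equalization property of inverse-frequency weights can be checked in one line. Writing the weight as $w_c = k/N_c$, where the constant $k$ is a symbol introduced here only for illustration, the total weight carried by class $c$ is:

```latex
\sum_{i:\,y_{i,c}=1} w_c \;=\; N_c \cdot \frac{k}{N_c} \;=\; k
\qquad \text{for every class } c .
```

Each class therefore carries the same total weight regardless of its frequency, and the aggregate loss contributions are equal whenever the average per-sample log-loss is similar across classes.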

Dynamic/Adaptive Weights

Weights can be adaptively chosen based on per-batch recall, as in recall loss: $w_c^{(t)} = 1 - R_c^{(t)}$, where $R_c^{(t)}$ is the instantaneous recall of class $c$ at training step $t$ (Tian et al., 2021).
Ordered Weighted Average (OWA) aggregation generalizes by sorting per-class batch loss terms and re-applying a fixed weight vector per step, focusing gradients on hardest classes (Maldonado et al., 2023).
Geometry- or boundary-aware weighting introduces spatially varying pixel-level terms using morphology or distance transforms, as in DWM/SAW weighting for cell segmentation (Guerrero-Pena et al., 2018) and DBCE for medical segmentation (Hosseini et al., 2024).
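The first two adaptive schemes in this list can be sketched with batch-level statistics only. The code below assumes recall-based weights of the form one minus the per-class batch recall, and an OWA step that sorts per-class losses in descending order before applying a fixed weight vector. The function names, the handling of classes absent from the batch, and the toy values are assumptions for illustration.

```python
def recall_weights(labels, preds, num_classes):
    """Per-class weight 1 - recall_c, computed on the current batch.

    Classes with no samples in the batch get weight 1.0 (an assumption).
    """
    hits = [0] * num_classes
    totals = [0] * num_classes
    for y, y_hat in zip(labels, preds):
        totals[y] += 1
        if y == y_hat:
            hits[y] += 1
    return [
        1.0 - hits[c] / totals[c] if totals[c] else 1.0
        for c in range(num_classes)
    ]

def owa_aggregate(per_class_losses, owa_weights):
    """Ordered weighted average: sort losses from largest to smallest,
    then take the weighted sum with a fixed, position-based weight vector."""
    ranked = sorted(per_class_losses, reverse=True)
    return sum(w * l for w, l in zip(owa_weights, ranked))

# Toy batch: class 0 is mostly right, class 2 is always missed.
labels = [0, 0, 0, 0, 1, 1, 2]
preds = [0, 0, 0, 1, 1, 0, 0]
w_recall = recall_weights(labels, preds, num_classes=3)

# Fixed OWA vector that puts most mass on the hardest class of the step.
owa_loss = owa_aggregate([0.2, 1.5, 0.7], owa_weights=[0.6, 0.3, 0.1])
```

Because the weights are recomputed at every step, a class that reaches perfect batch recall contributes nothing under this recall-based rule until it is misclassified again.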

Structured and Hierarchical Weighting

Structured tasks benefit from loss formulations respecting class hierarchies. Hierarchical WCE sums per-level, per-node loss terms, enforcing both tree-based structure and class imbalance correction, allowing flexible inference at coarse/fine label resolutions (Villar et al., 2023).
For edge/boundary detection, class-weighted cross-entropy has been generalized to three classes (edge, boundary, texture) with adaptive class-size compensation and fixed balancing parameters (Shu, 9 Jul 2025).

4. Domain-specific Applications and Modifications

Domain	Weighting Modification	Notable Outcome
Imbalanced classification	Inverse-frequency, effective number, OWA	Improved macro-F1, recall for minority (Phan et al., 2020, Maldonado et al., 2023)
Segmentation	Pixel/geometry-aware, DWM/SAW	Sharper boundaries, higher F₁ for instances (Guerrero-Pena et al., 2018, Hosseini et al., 2024)
Ordinal regression	Distance-weighted (CDW-CE)	Lower MAE, higher QWK, more interpretable CAMs (Polat et al., 2024, Polat et al., 2022)
Object detection	Per-class, focal, effective number	+30–50% recall for rare classes, minor loss for major (Phan et al., 2020)
Edge detection	Edge/Boundary/Texture tri-class	Gains in AP, sharper edges, robust weights (Shu, 9 Jul 2025)
Hierarchical classification	Tree-structured (WHXE)	Consistent multi-level prediction, 100% data usage (Villar et al., 2023)

For segmentation, instance segmentation frameworks benefit from combining CE with per-pixel weights derived from distance transforms or shape analysis, yielding superior performance at boundaries and on small or touching objects (Guerrero-Pena et al., 2018). DBCE, which distributes class weights across dilated regions around small objects, closes the performance gap between weighted CE and overlap-based losses such as Dice (Hosseini et al., 2024).

Ordinal and severity estimation tasks require distance-weighted variants (CDW-CE), penalizing errors in proportion to label distance and yielding improved quadratically weighted kappa and mean absolute error relative to categorical CE and standard ordinal losses (Polat et al., 2024, Polat et al., 2022).
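A minimal per-sample sketch of a class-distance-weighted loss follows. It assumes the form in which each non-true class $k$ contributes $-|k-c|^{\alpha}\log(1-p_k)$ for true class $c$; the function name, the default exponent, and the toy probabilities are illustrative assumptions, and non-true-class probabilities are assumed to be strictly below 1.

```python
import math

def cdw_ce(probs, true_class, alpha=2.0):
    """Distance-weighted loss for one sample.

    Each wrong class k is penalized by |k - c|**alpha * -log(1 - p_k),
    so probability mass placed far from the true class c costs more.
    """
    return -sum(
        abs(k - true_class) ** alpha * math.log(1.0 - p)
        for k, p in enumerate(probs)
        if k != true_class  # the true-class term has zero distance
    )

# Four ordered severity levels, true level 0 (illustrative values).
near_miss = [0.10, 0.80, 0.05, 0.05]  # mass on the adjacent level
far_miss = [0.10, 0.05, 0.05, 0.80]   # same mass on the farthest level

loss_near = cdw_ce(near_miss, true_class=0, alpha=2.0)
loss_far = cdw_ce(far_miss, true_class=0, alpha=2.0)
```

Both predictions assign the same probability to the true class, yet the far miss receives a much larger loss. With `alpha=0` the distance term vanishes and the two losses coincide.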

Object detection leverages balanced CE, focal loss, and effective number weighting for substantial recall improvements on long-tail and minority classes; all variants can be deployed via a simple per-class scalar multiplication in loss calculation (Phan et al., 2020).
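The per-class scalar multiplication mentioned above can be illustrated with a class-weighted focal term for a single prediction. This sketch assumes the common form $-w_c (1-p_t)^{\gamma} \log p_t$, where $p_t$ is the probability of the true class; the function name and the example values are hypothetical.

```python
import math

def weighted_focal_loss(probs, label, class_weights, gamma=2.0):
    """Class-weighted focal loss for one sample.

    gamma = 0 recovers class-weighted cross-entropy; larger gamma
    shrinks the loss of confidently correct predictions.
    """
    p_t = probs[label]
    return -class_weights[label] * (1.0 - p_t) ** gamma * math.log(p_t)

class_weights = [1.0, 4.0]  # illustrative: the rare class counts four times

# A hard rare-class example and an easy frequent-class example.
hard = weighted_focal_loss([0.7, 0.3], label=1, class_weights=class_weights)
easy = weighted_focal_loss([0.9, 0.1], label=0, class_weights=class_weights)

# The same easy example without the focusing term.
easy_wce = weighted_focal_loss([0.9, 0.1], 0, class_weights, gamma=0.0)
```

The class weight and the focusing factor are independent multipliers, so either can be switched off (weight 1 or `gamma=0`) without changing the rest of the loss computation.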

In high-precision edge extraction and grouping, the Edge–Boundary–Texture loss separates pixels into three semantically distinct categories, setting robust hyperparameters for each, with empirical gains in AP and edge sharpness across diverse datasets (Shu, 9 Jul 2025).

In hierarchical classification of astrophysical transients or similar taxonomies, per-node weighting and level-dependent scaling, as in weighted hierarchical cross-entropy, allow joint optimization of leaf and internal node errors, balanced for frequency and granularity (Villar et al., 2023).

5. Implementation, Gradient Properties, and Numerical Considerations

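For standard WCE with softmax probabilities $p_{i,k}$ computed from logits $z_{i,k}$, the gradient with respect to the logits is

$\frac{\partial L_{\text{WCE}}}{\partial z_{i,k}} = \Big(\sum_{c=1}^d w_c\,y_{i,c}\Big)\,(p_{i,k} - y_{i,k})$

that is, the unweighted softmax cross-entropy gradient scaled by the weight of the sample's true class.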

This structure is convex in the probability simplex and, for linear weighting, ensures proper optimization of the underlying weighted log-likelihood metric (Marchetti et al., 2023). Extreme weights can cause gradient explosion; routine remedies include weight normalization, adaptive optimizers, or damping schemes that take roots of the raw weights or use the effective number of samples.
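A finite-difference check is a quick way to confirm the logit gradient in an implementation. The sketch below assumes softmax outputs, for which the analytic gradient of the per-sample loss is the true-class weight times (probability minus one-hot label); all function names and numeric values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def wce_loss(logits, label, weights):
    """Per-sample weighted cross-entropy on softmax outputs."""
    return -weights[label] * math.log(softmax(logits)[label])

def wce_grad(logits, label, weights):
    """Analytic gradient w.r.t. the logits: w_label * (p_k - y_k)."""
    p = softmax(logits)
    return [
        weights[label] * (p_k - (1.0 if k == label else 0.0))
        for k, p_k in enumerate(p)
    ]

def numeric_grad(logits, label, weights, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[k] += eps
        down[k] -= eps
        grads.append(
            (wce_loss(up, label, weights) - wce_loss(down, label, weights)) / (2 * eps)
        )
    return grads

logits = [1.0, -0.5, 0.3]
weights = [1.0, 2.0, 5.0]
analytic = wce_grad(logits, 2, weights)
numeric = numeric_grad(logits, 2, weights)
max_abs_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Since the class weight multiplies the whole gradient vector, doubling a weight doubles the update for every sample of that class, which is why very large weights behave like a very large learning rate on rare-class samples.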

Drop-in usage is supported by all major frameworks and does not require architectural changes. In segmentation/structured prediction, pixel- or region-level weights are implemented via mask and morphological operations; dynamic approaches require only batch-level per-class statistics (Hosseini et al., 2024, Tian et al., 2021).

Weight selection is typically tuned via cross-validation on task-specific metrics (e.g., macro-F1, mean Dice, QWK, MAE). For ordinal variants, the penalty exponent is tuned over a moderate range; overly large exponents risk divergence. For DBCE, the radius parameter is tuned to the object scale of the domain (Hosseini et al., 2024). Hierarchical setups require per-node frequency tallying and tree construction (Villar et al., 2023).

6. Empirical Results, Comparative Outcomes, and Limitations

Representative experimental findings from recent studies:

Medical segmentation: DBCE outperforms plain BCE by roughly 3–8 points in mean Dice and IoU, rivaling Dice+CE hybrids while maintaining stable optimization and higher region fidelity (Hosseini et al., 2024).
Instance segmentation: Three-class geometric WCE yields F₁ increases of ~0.23 absolute over binary CE baselines (Guerrero-Pena et al., 2018).
Imbalanced classification: Effective-number weighting and OWA-based adaptive aggregation achieve macro-F1 gains of 2–7% and minimum class recall gains up to 11% over CE (Phan et al., 2020, Maldonado et al., 2023).
Ordinal regression: CDW-CE consistently surpasses categorical CE in QWK by 0.02–0.03 and reduces multi-class MAE (Polat et al., 2024, Polat et al., 2022).
Object detection: Balanced CE and effective-number weighting lift rare-class recall by over 30 points, with minimal decrease for the majority class (Phan et al., 2020).
Edge detection: EBT increases AP by 8–32% over WBCE, with negligible sensitivity to moderate hyperparameter changes (Shu, 9 Jul 2025).

While static class-weighting can improve recall, it may amplify false positives or over-correct for trivial small classes (Tian et al., 2021, Hosseini et al., 2024). Dynamic and region-aware strategies such as recall weighting, DBCE, and shape/geometry-aware formulations mitigate these pathologies by focusing training on hard, ambiguous, or morphologically complex regions or classes.

Limitations include the need for weight selection (requiring per-task tuning), instability under extreme ratios, and, for some implementations (e.g., DBCE), computational overhead from large-scale morphological operations. Overweighting very small classes can destabilize training or degrade precision; thus, hybrid tuning strategies and carefully normalized weights are recommended.
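One simple way to keep weights in a safe range is to cap the ratio between the largest and smallest weight and then rescale to an average of 1. This is a hypothetical helper shown only as an illustration of normalization; the capping rule and the `max_ratio` parameter are assumptions, not a method taken from the cited studies.

```python
def normalize_and_clip(weights, max_ratio=10.0):
    """Cap each weight at max_ratio times the smallest weight,
    then rescale so the weights average to 1."""
    floor = min(weights)
    capped = [min(w, floor * max_ratio) for w in weights]
    scale = len(capped) / sum(capped)
    return [w * scale for w in capped]

# Raw inverse-frequency-style weights with a 90:1 spread (illustrative).
raw = [1.0, 10.0, 90.0]
safe = normalize_and_clip(raw, max_ratio=10.0)
spread = max(safe) / min(safe)
```

Keeping the average at 1 leaves the overall loss scale, and hence the effective learning rate, roughly comparable to unweighted training, while the cap limits how strongly any single class can dominate a batch.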

7. Extensions, Generalizations, and Open Directions

Class-weighted cross-entropy loss forms the substrate for a spectrum of contemporary cost-sensitive, structural, adaptive, and ordinal loss functions, including but not limited to focal loss, effective-number loss, OWA-based adaptive aggregations, distance- and geometry-aware weighted CE, and hierarchical/structured objectives (Phan et al., 2020, Maldonado et al., 2023, Polat et al., 2024, Villar et al., 2023, Guerrero-Pena et al., 2018). Its flexibility enables principled weighting in line with the target metric, structured supervision, robustness to label or class imbalance, and specialized domain constraints.

Current research continues to expand the theoretical underpinnings of implicit bias under weighted and logit-adjusted cross-entropy, especially in the overparameterized and deep regime, with recent work furnishing closed-form global minimizer characterizations and explicit geometric alignment criteria (Behnia et al., 2023). Empirical work explores its integration with hybrid region/overlap-based losses, self-supervised tasks, and learnable or curriculum-driven weighting strategies.

Class-weighted cross-entropy thus remains a foundational, rigorous, and extensible objective in contemporary deep learning for managing imbalance, cost sensitivity, and structural complexity in classification and segmentation.