
Weighted Cross-Entropy (WCE)

Updated 17 January 2026
  • Weighted Cross-Entropy is a loss function that integrates weighting factors to address class imbalance, cost-sensitive errors, and ordinal misclassifications.
  • It employs strategies like class prior inversion and pairwise cost matrices to tailor loss penalties to specific domain requirements.
  • It has proven effective in boosting recall in imbalanced object detection, enhancing ordinal regression accuracy, and reducing error rates in multilingual ASR.

Weighted Cross-Entropy (WCE) is a fundamental loss refinement in statistical learning and deep neural network optimization, generalizing the standard cross-entropy by incorporating explicit weighting factors. This mechanism enables practitioners to address class imbalance, incorporate domain-specific cost structures, or penalize errors asymmetrically according to task structure or real-world priorities. WCE arises in a variety of contexts, including multi-class and binary classification, ordinal regression, imbalanced object detection, edge detection, and domain-adapted tasks such as low-resource language modeling and cost-sensitive decision problems.

1. Mathematical Formulation

For a dataset of $N$ examples and $K$ classes, the unweighted multi-class cross-entropy (CE) loss is given by

$$L_{\rm CE} = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log p_{i,k}$$

where $y_{i,k} \in \{0,1\}$ are one-hot labels and $p_{i,k}$ are predicted probabilities. In Weighted Cross-Entropy (WCE), each term is multiplied by a positive weight $w_k$ (or possibly $w_{i,k}$, for example-specific or label-pair-specific weighting), resulting in

$$L_{\rm WCE} = -\sum_{i=1}^N \sum_{k=1}^K w_k\, y_{i,k} \log p_{i,k}$$

Variants exist for the binary case, where weights can be tied directly to the positive and negative classes, and for specialized formulations such as the per-label-pair weights in the Real-World-Weight Cross-Entropy (RWWCE) and class-distance-dependent penalties in ordinal regression (Ho et al., 2020, Polat et al., 2024). Table 1 presents representative WCE formulations.

| Application Mode | Weight Definition | Remarks |
|---|---|---|
| Class imbalance correction | $w_k = 1/\pi_k$ | $\pi_k$: class prior |
| False positive/negative cost | $w_k$ or $w_{k,k'} = \mathrm{cost}$ | Pairwise cost matrix for mislabels |
| Ordinal class distance | $w_{y_i,k} = \lvert y_i - k\rvert^\alpha$ | $\alpha$: distance penalty exponent |
| Edge/boundary/texture assignment | $w_i$ by pixel spatial class | Edge detection with structure-aware loss |
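The WCE formula above can be sketched in a few lines of NumPy (a minimal framework-agnostic illustration, not tied to any of the cited implementations):

```python
import numpy as np

def weighted_cross_entropy(y_onehot, probs, class_weights):
    """Weighted multi-class cross-entropy over a batch.

    y_onehot:      (N, K) one-hot labels
    probs:         (N, K) predicted probabilities (rows sum to 1)
    class_weights: (K,) positive per-class weights w_k
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    return float(-np.sum(class_weights * y_onehot * np.log(probs + eps)))

# With all weights equal to 1, WCE reduces to the standard CE.
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(weighted_cross_entropy(y, p, np.ones(3)))              # standard CE
print(weighted_cross_entropy(y, p, np.array([1., 1., 5.])))  # upweight a rare class
```

In deep learning frameworks this is typically exposed directly, e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`.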

2. Motivations for Weighting

The primary motivation for WCE is to compensate for limitations of CE in non-ideal settings:

  • Class Imbalance: When some classes are underrepresented, WCE amplifies the loss for minority samples, driving stronger gradients for rare categories. This is widely used in object detection, medical segmentation, and speech recognition for low-resource languages (Phan et al., 2020, Piñeiro-Martín et al., 2024).
  • Cost-Sensitive Learning: Incorporating real-world cost information allows the loss to prioritize domain-relevant error modes—even at the expense of aggregate accuracy—by using cost matrices or marginal weights, as formalized in the RWWCE loss (Ho et al., 2020).
  • Ordinal Errors: In tasks with inherent class order (e.g., disease severity), WCE can penalize distant misclassifications more heavily by encoding a distance metric within the weight definition (Polat et al., 2024, Polat et al., 2022).
  • Metric Alignment: In some contexts, weights are tuned dynamically to align loss minimization with a target performance metric, such as $F_\beta$, by using distributional assumptions to modulate the penalty structure (Ramdhani, 2022).
  • Structural Precision: In pixel-level tasks such as edge detection, WCE is extended to incorporate spatial or context-aware weighting (e.g., edge vs. boundary vs. texture pixels) (Shu, 9 Jul 2025).

3. Weight Design Strategies

Weight selection is highly domain- and task-dependent. Common strategies include:

  • Class Prior Inversion: Set $w_k = 1/\pi_k$, where $\pi_k$ is the empirical or estimated prior of class $k$ (Phan et al., 2020, Piñeiro-Martín et al., 2024).
  • User-Defined/Cost-Driven Weights: Domain knowledge or economic considerations set $w_k$ or $w_{k,k'}$ as direct proxies for the marginal cost of false negatives and false positives (Ho et al., 2020).
  • Distance-Aware Weighting for Ordinality: Use $w_{y,k} = \lvert y - k\rvert^\alpha$ along the ordinal axis, with $\alpha$ tuned to prioritize distant errors (Polat et al., 2024, Polat et al., 2022).
  • Adaptive or Dynamic Weighting: Adjust weights on the fly during training to match performance targets or loss differentials between groups (e.g., dynamic ratio-based scheduling in multilingual ASR, $\beta$-optimal penalty in F-measure alignment) (Piñeiro-Martín et al., 2024, Ramdhani, 2022).
  • Structural and Contextual Weighting: Assign per-example or per-pixel weights to emphasize difficult contexts, ambiguous spatial regions, or rare morphological structures (Shu, 9 Jul 2025).

Empirical hyperparameter choices are typically guided by cross-validation, domain heuristics, or through explicit ablation studies.
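Two of the strategies above are simple enough to sketch directly. The snippet below (a minimal sketch; normalizing the inverted priors to mean 1 is one common convention, not mandated by the cited papers) computes class-prior-inversion weights and a pairwise ordinal distance matrix:

```python
import numpy as np

def prior_inversion_weights(labels, n_classes):
    """Class prior inversion: w_k = 1/pi_k, with pi_k the empirical prior.
    Assumes every class appears at least once in `labels`; the weights are
    normalized to mean 1 so the overall loss scale stays comparable to CE."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    priors = counts / counts.sum()
    w = 1.0 / priors
    return w / w.mean()

def ordinal_distance_weights(n_classes, alpha=2.0):
    """Pairwise matrix W[y, k] = |y - k|^alpha for ordinal tasks."""
    idx = np.arange(n_classes)
    return np.abs(idx[:, None] - idx[None, :]).astype(float) ** alpha
```

The diagonal of the distance matrix is zero, so a correct ordinal prediction contributes no distance penalty regardless of $\alpha$.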

4. Empirical Applications and Performance

4.1 Class-Imbalanced Object Detection

WCE and its variants (balanced CE, focal loss, class-balanced loss) yield substantial improvements in recall for rare object categories in datasets with severe long-tail distributions. For example, class-average recall for minority classes in BDD100K increased by 18–20 percentage points with class-balanced WCE compared to standard CE, while overall recall on dominant classes showed only minor tradeoff (Phan et al., 2020).

4.2 Ordinal Regression and Disease Severity

In disease severity grading with class-ordered categories, Class Distance Weighted CE (CDW-CE) consistently outperformed categorical CE and prominent ordinal losses. Quantitative gains included higher Quadratic Weighted Kappa, accuracy, F1, and lower mean absolute error. Additionally, latent feature quality (Silhouette Score) and clinical interpretability of model attention (CAMs) were improved with CDW-CE (Polat et al., 2024, Polat et al., 2022).
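CDW-CE penalizes probability mass assigned to classes far from the true ordinal label. A minimal single-example sketch of that idea follows (the exact formulation in Polat et al. may differ in details such as normalization or the margin variant):

```python
import numpy as np

def cdw_ce(probs, y_true, alpha=2.0):
    """Class-distance-weighted CE sketch for one example: mass placed on
    class k is penalized in proportion to |k - y_true|^alpha.
    probs: (K,) predicted probabilities; y_true: true ordinal class index."""
    eps = 1e-12  # numerical floor to avoid log(0)
    dist = np.abs(np.arange(len(probs)) - y_true).astype(float) ** alpha
    return float(-np.sum(dist * np.log(1.0 - probs + eps)))
```

Misclassifying into a distant grade costs more than into a neighboring one, which is exactly the behavior severity-grading tasks require.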

4.3 Cost-Sensitive Classification and Real-World Alignment

The RWWCE approach, directly encoding real-world costs into loss weights, halved domain-specific economic losses in binary digit detection and corrected high-marginal cost confusions in multiclass MNIST experiments, even while slightly sacrificing unweighted accuracy (Ho et al., 2020). The ability to fine-tune pairwise error impacts is unique to this cost-matrix-driven extension.

4.4 Multilingual Speech Recognition

Language-weighted cross-entropy enabled effective low-resource language integration in continual learning settings, achieving over 6% word error rate (WER) reduction for the low-resource language (Galician) without degrading performance on high-resource groups in multilingual ASR (Piñeiro-Martín et al., 2024). Both fixed schedules and ratio-based dynamic weights proved effective, especially in conjunction with targeted data augmentation.

4.5 Metric-Aligned Optimization

Incorporating target metric considerations—e.g., $F_\beta$—into the WCE framework via distribution-derived dynamic penalties improved $F_1$ scores (e.g., +14% in noisy IMDB sentiment, +1–28% on image/text datasets) and enhanced robustness to label noise. The penalty calculation flows from a probabilistic reformulation of $F_\beta$ and knee-curve optimization for $\beta_{\rm opt}$ (Ramdhani, 2022).

5. Theoretical and Practical Limitations

  • Metric Mismatch: WCE is not generally a tight surrogate for overlap-based set similarity metrics such as Dice or Jaccard. Both theoretical lower/upper bounds and empirical results demonstrate that no choice of pixelwise weights in WCE guarantees bounded-error approximation to these metrics, especially for small or sparse structures. Direct surrogates like soft-Dice or Lovász-softmax are provably superior for these evaluation settings (Bertels et al., 2019).
  • Weight Specification: Setting weights demands reliable cost, class frequency, or distance estimates. For cost-sensitive learning, inaccuracies in cost matrices may degrade practical utility (Ho et al., 2020).
  • Tuning Overhead: Weighting schemes with multiple hyperparameters (e.g., $\alpha$ in CDW-CE, $\beta$ in class-balanced loss) require cross-validation and may introduce instability at high penalty values (Polat et al., 2024, Phan et al., 2020).
  • Task-Specificity and Transferability: WCE presumes a meaningful and stable mapping from weights to loss impact. Domain shifts, label noise, or non-stationarity in class balances can challenge effectiveness.
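The metric-mismatch point is easy to make concrete: a soft-Dice surrogate optimizes overlap directly, whereas no pixelwise WCE weighting guarantees this. Below is one common soft-Dice formulation for binary masks (details vary across papers; this is an illustrative sketch, not the formulation from Bertels et al.):

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft-Dice loss for binary masks: 1 - (2*sum(P*T) + eps) / (sum(P) + sum(T) + eps).
    probs, targets: same-shape arrays with values in [0, 1]; eps avoids 0/0
    on empty masks."""
    inter = np.sum(probs * targets)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(targets) + eps))
```

Because the loss is a function of the global overlap ratio rather than a sum of per-pixel terms, small structures contribute to the gradient in proportion to their share of the overlap, not their pixel count.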

6. Extensions and Generalizations

  • Structure-Aware Weighting: Generalizations of WCE to higher granularity (e.g., multi-scale region, spatial context) provide further adaptability, exemplified by Edge-Boundary-Texture loss in edge detection—yielding stable, hyperparameter-insensitive improvements on structured prediction tasks (Shu, 9 Jul 2025).
  • Dynamic and Sample-Dependent Weighting: Adaptively modulating WCE penalties based on instantaneous loss ratios, predicted difficulty, or meta-metric optimization (e.g., searching for an optimal $F_\beta$ weighting per batch) broadens the impact of the paradigm (Piñeiro-Martín et al., 2024, Ramdhani, 2022).
  • Ordinal/Continuous Targets: WCE variants incorporating non-uniform, task-specific distances (e.g., class-distance, expert-informed) open the door to genuinely ordinal or even hybrid regression-classification objectives (Polat et al., 2024, Polat et al., 2022).
  • Multi-label and Structured Outputs: Extension to multilabel, multioutput, or graph-structured prediction requires tensorizing the cost or weight matrices, a non-trivial but natural generalization (Ho et al., 2020).

7. Guidelines and Recommendations

  • WCE is recommended for tasks with significant class imbalance or when certain error modes are more consequential.
  • For semantic segmentation evaluated by Dice/Jaccard, direct metric surrogates are favored over WCE.
  • In cost- or risk-aware domains, RWWCE or pairwise-weighted multiclass WCE should be used with carefully elicited cost matrices.
  • For ordinal targets, CDW-CE or similar distance-weighted losses are practical and yield demonstrable improvements in both predictive and interpretive performance.
  • Tune weight hyperparameters via stratified cross-validation, and monitor for stability especially at high imbalance or penalty regimes.
  • Incorporate data augmentation, dynamic weighting, or multi-detector fusion as required by the operating context.

WCE lies at the foundation of a broad family of loss adaptations essential for scalable, fair, and application-tailored learning. Its design space—spanning simple class-reweighting to fine-grained cost and distance modulation—underlies advances in robustness, interpretability, and domain-specific metric alignment across modern machine learning research.
