
Weighted Cross-Entropy (WCE)

Updated 17 January 2026
  • Weighted Cross-Entropy is a loss function that integrates weighting factors to address class imbalance, cost-sensitive errors, and ordinal misclassifications.
  • It employs strategies like class prior inversion and pairwise cost matrices to tailor loss penalties to specific domain requirements.
  • It has proven effective in boosting recall in imbalanced object detection, enhancing ordinal regression accuracy, and reducing error rates in multilingual ASR.

Weighted Cross-Entropy (WCE) is a fundamental loss refinement in statistical learning and deep neural network optimization, generalizing the standard cross-entropy by incorporating explicit weighting factors. This mechanism enables practitioners to address class imbalance, incorporate domain-specific cost structures, or penalize errors asymmetrically according to task structure or real-world priorities. WCE arises in a variety of contexts, including multi-class and binary classification, ordinal regression, imbalanced object detection, edge detection, and domain-adapted tasks such as low-resource language modeling and cost-sensitive decision problems.

1. Mathematical Formulation

For a dataset of $N$ examples and $K$ classes, the unweighted multi-class cross-entropy (CE) loss is given by

$$L_{\rm CE} = -\sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log p_{i,k}$$

where $y_{i,k} \in \{0,1\}$ are one-hot labels and $p_{i,k}$ are predicted probabilities. In Weighted Cross-Entropy (WCE), each term is multiplied by a positive weight $w_k$ (or possibly $w_{i,k}$, for example-specific or label-pair-specific weighting), resulting in

$$L_{\rm WCE} = -\sum_{i=1}^N \sum_{k=1}^K w_k\, y_{i,k} \log p_{i,k}$$

Variants exist for the binary case, where weights can be tied directly to the positive and negative classes, and for specialized formulations such as the per-label-pair weights in the Real-World-Weight Cross-Entropy (RWWCE) and class-distance-dependent penalties in ordinal regression (Ho et al., 2020, Polat et al., 2024). Table 1 presents representative WCE formulations.

| Application Mode | Weight Definition | Remarks |
|---|---|---|
| Class imbalance correction | $w_k = 1/\pi_k$ | $\pi_k$: class prior |
| False positive/negative cost | $w_k$ or $w_{k,k'} = \mathrm{cost}$ | Pairwise cost matrix for mislabels |
| Ordinal class distance | $w_{y_i,k} = \lvert y_i - k\rvert^\alpha$ | $\alpha$: distance penalty exponent |
| Edge/boundary/texture assignment | $w_i$ by pixel spatial class | Edge detection with structure-aware loss |
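The WCE formula above can be sketched in a few lines of NumPy (a minimal framework-agnostic illustration, not tied to any of the cited implementations):

```python
import numpy as np

def weighted_cross_entropy(y_onehot, probs, class_weights):
    """Weighted multi-class cross-entropy over a batch.

    y_onehot:      (N, K) one-hot labels
    probs:         (N, K) predicted probabilities (rows sum to 1)
    class_weights: (K,) positive per-class weights w_k
    """
    eps = 1e-12  # numerical floor to avoid log(0)
    return float(-np.sum(class_weights * y_onehot * np.log(probs + eps)))

# With all weights equal to 1, WCE reduces to the standard CE.
y = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(weighted_cross_entropy(y, p, np.ones(3)))              # standard CE
print(weighted_cross_entropy(y, p, np.array([1., 1., 5.])))  # upweight a rare class
```

In deep learning frameworks this is typically exposed directly, e.g. the `weight` argument of PyTorch's `CrossEntropyLoss`.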

2. Motivations for Weighting

The primary motivation for WCE is to compensate for limitations of CE in non-ideal settings:

  • Class Imbalance: When some classes are underrepresented, WCE amplifies the loss for minority samples, driving stronger gradients for rare categories. This is widely used in object detection, medical segmentation, and speech recognition for low-resource languages (Phan et al., 2020, Piñeiro-Martín et al., 2024).
  • Cost-Sensitive Learning: Incorporating real-world cost information allows the loss to prioritize domain-relevant error modes—even at the expense of aggregate accuracy—by using cost matrices or marginal weights, as formalized in the RWWCE loss (Ho et al., 2020).
  • Ordinal Errors: In tasks with inherent class order (e.g., disease severity), WCE can penalize distant misclassifications more heavily by encoding a distance metric within the weight definition (Polat et al., 2024, Polat et al., 2022).
  • Metric Alignment: In some contexts, weights are tuned dynamically to align loss minimization with a target performance metric, such as $F_\beta$, by using distributional assumptions to modulate the penalty structure (Ramdhani, 2022).
  • Structural Precision: In pixel-level tasks such as edge detection, WCE is extended to incorporate spatial or context-aware weighting (e.g., edge vs. boundary vs. texture pixels) (Shu, 9 Jul 2025).

3. Weight Design Strategies

Weight selection is highly domain- and task-dependent. Common strategies include:

  • Class Prior Inversion: Set $w_k = 1/\pi_k$, where $\pi_k$ is the empirical or estimated prior of class $k$ (Phan et al., 2020, Piñeiro-Martín et al., 2024).
  • User-Defined/Cost-Driven Weights: Domain knowledge or economic considerations set $w_k$ or $w_{k,k'}$ as direct proxies for the marginal cost of false negatives and false positives (Ho et al., 2020).
  • Distance-Aware Weighting for Ordinality: Use $w_{y,k} = \lvert y - k\rvert^\alpha$ along the ordinal axis, with $\alpha$ tuned to prioritize distant errors (Polat et al., 2024, Polat et al., 2022).
  • Adaptive or Dynamic Weighting: Adjust weights on the fly during training to match performance targets or loss differentials between groups (e.g., dynamic ratio-based scheduling in multilingual ASR, $\beta$-optimal penalty in F-measure alignment) (Piñeiro-Martín et al., 2024, Ramdhani, 2022).
  • Structural and Contextual Weighting: Assign per-example or per-pixel weights to emphasize difficult contexts, ambiguous spatial regions, or rare morphological structures (Shu, 9 Jul 2025).

Empirical hyperparameter choices are typically guided by cross-validation, domain heuristics, or through explicit ablation studies.
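Two of the strategies above are simple enough to sketch directly. The snippet below (a minimal sketch; normalizing the inverted priors to mean 1 is one common convention, not mandated by the cited papers) computes class-prior-inversion weights and a pairwise ordinal distance matrix:

```python
import numpy as np

def prior_inversion_weights(labels, n_classes):
    """Class prior inversion: w_k = 1/pi_k, with pi_k the empirical prior.
    Assumes every class appears at least once in `labels`; the weights are
    normalized to mean 1 so the overall loss scale stays comparable to CE."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    priors = counts / counts.sum()
    w = 1.0 / priors
    return w / w.mean()

def ordinal_distance_weights(n_classes, alpha=2.0):
    """Pairwise matrix W[y, k] = |y - k|^alpha for ordinal tasks."""
    idx = np.arange(n_classes)
    return np.abs(idx[:, None] - idx[None, :]).astype(float) ** alpha
```

The diagonal of the distance matrix is zero, so a correct ordinal prediction contributes no distance penalty regardless of $\alpha$.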

4. Empirical Applications and Performance

4.1 Class-Imbalanced Object Detection

WCE and its variants (balanced CE, focal loss, class-balanced loss) yield substantial improvements in recall for rare object categories in datasets with severe long-tail distributions. For example, class-average recall for minority classes in BDD100K increased by 18–20 percentage points with class-balanced WCE compared to standard CE, while overall recall on dominant classes showed only minor tradeoff (Phan et al., 2020).

4.2 Ordinal Regression and Disease Severity

In disease severity grading with class-ordered categories, Class Distance Weighted CE (CDW-CE) consistently outperformed categorical CE and prominent ordinal losses. Quantitative gains included higher Quadratic Weighted Kappa, accuracy, F1, and lower mean absolute error. Additionally, latent feature quality (Silhouette Score) and clinical interpretability of model attention (CAMs) were improved with CDW-CE (Polat et al., 2024, Polat et al., 2022).
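CDW-CE penalizes probability mass assigned to classes far from the true ordinal label. A minimal single-example sketch of that idea follows (the exact formulation in Polat et al. may differ in details such as normalization or the margin variant):

```python
import numpy as np

def cdw_ce(probs, y_true, alpha=2.0):
    """Class-distance-weighted CE sketch for one example: mass placed on
    class k is penalized in proportion to |k - y_true|^alpha.
    probs: (K,) predicted probabilities; y_true: true ordinal class index."""
    eps = 1e-12  # numerical floor to avoid log(0)
    dist = np.abs(np.arange(len(probs)) - y_true).astype(float) ** alpha
    return float(-np.sum(dist * np.log(1.0 - probs + eps)))
```

Misclassifying into a distant grade costs more than into a neighboring one, which is exactly the behavior severity-grading tasks require.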

4.3 Cost-Sensitive Classification and Real-World Alignment

The RWWCE approach, directly encoding real-world costs into loss weights, halved domain-specific economic losses in binary digit detection and corrected high-marginal cost confusions in multiclass MNIST experiments, even while slightly sacrificing unweighted accuracy (Ho et al., 2020). The ability to fine-tune pairwise error impacts is unique to this cost-matrix-driven extension.

4.4 Multilingual Speech Recognition

Language-weighted cross-entropy enabled effective low-resource language integration in continual learning settings, achieving over 6% word error rate (WER) reduction for the low-resource language (Galician) without degrading performance on high-resource groups in multilingual ASR (Piñeiro-Martín et al., 2024). Both fixed schedules and ratio-based dynamic weights proved effective, especially in conjunction with targeted data augmentation.

4.5 Metric-Aligned Optimization

Incorporating target metric considerations—e.g., $F_\beta$—into the WCE framework via distribution-derived dynamic penalties improved $F_1$ scores (e.g., +14% in noisy IMDB sentiment, +1–28% on image/text datasets) and enhanced robustness to label noise. The penalty calculation flows from a probabilistic reformulation of $F_\beta$ and knee-curve optimization for $\beta_{\rm opt}$ (Ramdhani, 2022).

5. Theoretical and Practical Limitations

  • Metric Mismatch: WCE is not generally a tight surrogate for overlap-based set similarity metrics such as Dice or Jaccard. Both theoretical lower/upper bounds and empirical results demonstrate that no choice of pixelwise weights in WCE guarantees bounded-error approximation to these metrics, especially for small or sparse structures. Direct surrogates like soft-Dice or Lovász-softmax are provably superior for these evaluation settings (Bertels et al., 2019).
  • Weight Specification: Setting weights demands reliable cost, class frequency, or distance estimates. For cost-sensitive learning, inaccuracies in cost matrices may degrade practical utility (Ho et al., 2020).
  • Tuning Overhead: Weighting schemes with multiple hyperparameters (e.g., $\alpha$ in CDW-CE, $\beta$ in class-balanced loss) require cross-validation and may introduce instability at high penalty values (Polat et al., 2024, Phan et al., 2020).
  • Task-Specificity and Transferability: WCE presumes a meaningful and stable mapping from weights to loss impact. Domain shifts, label noise, or non-stationarity in class balances can challenge effectiveness.
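The metric-mismatch point is easy to make concrete: a soft-Dice surrogate optimizes overlap directly, whereas no pixelwise WCE weighting guarantees this. Below is one common soft-Dice formulation for binary masks (details vary across papers; this is an illustrative sketch, not the formulation from Bertels et al.):

```python
import numpy as np

def soft_dice_loss(probs, targets, eps=1e-6):
    """Soft-Dice loss for binary masks: 1 - (2*sum(P*T) + eps) / (sum(P) + sum(T) + eps).
    probs, targets: same-shape arrays with values in [0, 1]; eps avoids 0/0
    on empty masks."""
    inter = np.sum(probs * targets)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(probs) + np.sum(targets) + eps))
```

Because the loss is a function of the global overlap ratio rather than a sum of per-pixel terms, small structures contribute to the gradient in proportion to their share of the overlap, not their pixel count.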

6. Extensions and Generalizations

  • Structure-Aware Weighting: Generalizations of WCE to higher granularity (e.g., multi-scale region, spatial context) provide further adaptability, exemplified by Edge-Boundary-Texture loss in edge detection—yielding stable, hyperparameter-insensitive improvements on structured prediction tasks (Shu, 9 Jul 2025).
  • Dynamic and Sample-Dependent Weighting: Adaptively modulating WCE penalties based on instantaneous loss ratios, predicted difficulty, or meta-metric optimization (e.g., searching for an optimal $F_\beta$ weighting per batch) broadens the impact of the paradigm (Piñeiro-Martín et al., 2024, Ramdhani, 2022).
  • Ordinal/Continuous Targets: WCE variants incorporating non-uniform, task-specific distances (e.g., class-distance, expert-informed) open the door to genuinely ordinal or even hybrid regression-classification objectives (Polat et al., 2024, Polat et al., 2022).
  • Multi-label and Structured Outputs: Extension to multilabel, multioutput, or graph-structured prediction requires tensorizing the cost or weight matrices, a non-trivial but natural generalization (Ho et al., 2020).

7. Guidelines and Recommendations

  • WCE is recommended for tasks with significant class imbalance or when certain error modes are more consequential.
  • For semantic segmentation evaluated by Dice/Jaccard, direct metric surrogates are favored over WCE.
  • In cost- or risk-aware domains, RWWCE or pairwise-weighted multiclass WCE should be used with carefully elicited cost matrices.
  • For ordinal targets, CDW-CE or similar distance-weighted losses are practical and yield demonstrable improvements in both predictive and interpretive performance.
  • Tune weight hyperparameters via stratified cross-validation, and monitor for stability especially at high imbalance or penalty regimes.
  • Incorporate data augmentation, dynamic weighting, or multi-detector fusion as required by the operating context.

WCE lies at the foundation of a broad family of loss adaptations essential for scalable, fair, and application-tailored learning. Its design space—spanning simple class-reweighting to fine-grained cost and distance modulation—underlies advances in robustness, interpretability, and domain-specific metric alignment across modern machine learning research.
