Class-Weighted Loss in Imbalanced Learning
- Class-weighted loss functions are methods that assign tailored penalties to each class to counteract data imbalance.
- They are widely used in neural networks, segmentation models, and graph neural networks to boost minority class detection.
- Key strategies include inverse-frequency weighting, focal loss, and dynamic adjustments, enabling precise risk calibration.
Class-weighted loss functions constitute a foundational suite of techniques in the theory and practice of imbalanced classification, providing direct control over class-conditional risk, calibrated error costs, and principled optimization of weighted metrics. They arise in settings ranging from neural network training to nonparametric classification, complement supervised and weakly supervised learning approaches, and have been generalized to convex risk formulations, metric-oriented surrogates, and robust optimization frameworks. The central idea is to modify the objective function so that each class (often weighted inversely to its training frequency, its modeled difficulty, or its cost) receives appropriate attention during parameter updates, counteracting the natural tendency of empirical risk minimization to favor the majority classes.
1. Mathematical Foundations and Canonical Forms
The archetypal class-weighted cross-entropy loss modifies the standard categorical or binary cross-entropy by introducing a vector of per-class weights scaling the penalty assigned to errors for each class. For multiclass problems, this yields:
$$\mathcal{L}_{\mathrm{WCE}} = -\sum_{c=1}^{C} w_c\, y_c \log \hat{y}_c,$$
where $y$ is a one-hot label and $\hat{y}$ is the softmax output. Class weights $w_c$ are typically derived as inverse-frequency ($w_c \propto 1/n_c$), median-frequency, or based on other statistics to normalize training influence (Phan et al., 2017, Le et al., 2020). For multi-label problems, class-weighted binary cross-entropy is used:
$$\mathcal{L}_{\mathrm{WBCE}} = -\sum_{c=1}^{C} \big[\, w_c\, y_c \log \hat{y}_c + (1 - y_c)\log(1 - \hat{y}_c)\,\big],$$
where $w_c$ is constructed from per-class positive frequencies (Cui, 15 Jul 2025).
Advanced formulations include the focal loss, where the weight is both class-dependent ($\alpha_c$) and instance-dependent via a focusing parameter $\gamma$:
$$\mathcal{L}_{\mathrm{focal}} = -\sum_{c=1}^{C} \alpha_c\, (1 - \hat{y}_c)^{\gamma}\, y_c \log \hat{y}_c$$
(Le et al., 2020, Phan et al., 2020).
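As a concrete illustration, the following is a minimal NumPy sketch of these three losses for a single example, assuming one-hot targets and softmax (or per-label sigmoid) probabilities; the function names and toy numbers are illustrative, not taken from the cited papers.

```python
import numpy as np

def weighted_ce(probs, one_hot, class_weights):
    """Class-weighted categorical cross-entropy for one example.

    probs         : softmax output, shape (C,)
    one_hot       : one-hot label, shape (C,)
    class_weights : per-class weights w_c, shape (C,)
    """
    eps = 1e-12
    return -np.sum(class_weights * one_hot * np.log(probs + eps))

def weighted_bce(probs, targets, pos_weights):
    """Class-weighted binary cross-entropy for multi-label targets.

    Only the positive term is re-weighted, mirroring weights built from
    per-class positive frequencies.
    """
    eps = 1e-12
    pos = pos_weights * targets * np.log(probs + eps)
    neg = (1.0 - targets) * np.log(1.0 - probs + eps)
    return -np.sum(pos + neg)

def focal_loss(probs, one_hot, alphas, gamma=2.0):
    """Focal loss: class-dependent alpha_c and instance-dependent (1 - p)^gamma term."""
    eps = 1e-12
    return -np.sum(alphas * (1.0 - probs) ** gamma * one_hot * np.log(probs + eps))

# Toy usage: 3 classes, the minority class (index 2) gets the largest weight.
p = np.array([0.7, 0.2, 0.1])
y = np.array([0.0, 0.0, 1.0])
w = np.array([0.5, 1.0, 3.0])
print(weighted_ce(p, y, w), focal_loss(p, y, w))
```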
2. Weight Construction Principles
Weight computation is crucial for balancing class representation. Modalities include:
- Inverse-frequency weighting: $w_c \propto 1/n_c$, where $n_c$ is the number of training samples in class $c$ (Phan et al., 2017, Le et al., 2020, Phan et al., 2020, Guerrero-Pena et al., 2018, Cui, 15 Jul 2025).
- Effective number of samples: $w_c \propto (1-\beta)/(1-\beta^{n_c})$ with $\beta \in [0,1)$, allowing softer adaptation for extreme imbalance (Phan et al., 2020); both this and inverse-frequency weighting are sketched in the code following this list.
- Difficulty-based weighting: CDB loss employs $w_c = d_c^{\tau}$, where the difficulty $d_c$ is computed per class (e.g., one minus the class-wise validation accuracy), optionally with a dynamically adjusted focusing exponent $\tau$ (Sinha et al., 2020).
- Geometry- or context-aware weights: For dense segmentation or instance tasks, weights may integrate spatial or temporal structure, boosting error margin on difficult or clinically relevant regions (Guerrero-Pena et al., 2018, Marchetti et al., 2023).
- Error/cost-sensitive weights: Classical cost matrices or domain-specific costs feed directly into $w_c$ for applications such as medical triage or resource allocation (Xu et al., 2018, Marchetti et al., 2023).
Normalization of weights (e.g., rescaling so that $\sum_c w_c = 1$, or so that the weights have mean $1$) is often performed to keep effective learning rates stable.
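The sketch below turns raw class counts into inverse-frequency and effective-number weights and normalizes them to mean one; the helper names and the mean-one normalization are assumptions for illustration, not prescriptions from the cited works.

```python
import numpy as np

def inverse_frequency_weights(class_counts):
    """w_c proportional to 1 / n_c, normalized to mean 1."""
    counts = np.asarray(class_counts, dtype=float)
    w = 1.0 / counts
    return w / w.mean()

def effective_number_weights(class_counts, beta=0.999):
    """w_c proportional to (1 - beta) / (1 - beta**n_c), the effective-number scheme."""
    counts = np.asarray(class_counts, dtype=float)
    w = (1.0 - beta) / (1.0 - np.power(beta, counts))
    return w / w.mean()

counts = [10000, 500, 25]                    # severe imbalance
print(inverse_frequency_weights(counts))     # minority class dominates the weights
print(effective_number_weights(counts))      # softer correction for the same counts
```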
3. Class-Weighted Loss in Neural Architectures
Class-weighted loss functions are implemented in a wide array of architectures:
- CNNs, DNNs, Transformers: Weighting terms are embedded directly in the cross-entropy or binary cross-entropy computation for classification and detection (Phan et al., 2017, Cui, 15 Jul 2025, Le et al., 2020); a framework-level sketch follows this list.
- U-Net and segmentation models: Spatially-varying weights address both class imbalance and geometric features at the pixel level, with losses such as weighted cross-entropy, weighted multi-class exponential (WME), and shape-aware maps (Guerrero-Pena et al., 2018, Nikkhah et al., 28 Jun 2025).
- Graph Neural Networks (GNNs): Meta-GCN utilizes meta-learned example weights computed by bilevel optimization to dynamically balance class contributions using a small meta-set (Mohammadizadeh et al., 24 Jun 2024).
- Ensemble methods: Weighted misclassification risk is minimized jointly over model scores and threshold, with cross-validation ensuring unbiased metric estimates (Xu et al., 2018).
- Bayesian models: Weighted-likelihood (power-likelihood) approaches embed class weights by raising per-sample likelihoods to powers and renormalizing to preserve sample size (Lazic, 23 Apr 2025).
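In deep-learning frameworks these weights are typically passed straight to the loss constructor. The PyTorch snippet below is a minimal sketch with placeholder counts and random tensors, using the standard weight argument of nn.CrossEntropyLoss and the pos_weight argument of nn.BCEWithLogitsLoss.

```python
import torch
import torch.nn as nn

# Placeholder class counts for a 3-class problem (assumed for illustration).
counts = torch.tensor([10000.0, 500.0, 25.0])
weights = 1.0 / counts
weights = weights / weights.mean()            # mean-one normalization

# Multiclass: per-class weights scale the cross-entropy of each target class.
ce = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 3)                    # batch of 8 examples, 3 classes
targets = torch.randint(0, 3, (8,))
loss_ce = ce(logits, targets)

# Multi-label: pos_weight up-weights positives of rare labels.
pos_weight = torch.tensor([1.0, 5.0, 40.0])
bce = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
multi_logits = torch.randn(8, 3)
multi_targets = torch.randint(0, 2, (8, 3)).float()
loss_bce = bce(multi_logits, multi_targets)
```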
4. Extensions: Ordinal, Robust, and Complementary-Label Formulations
Class-weighted losses are generalized beyond nominal classification:
- Ordinal regression: The class-distance weighted cross-entropy (CDW-CE) penalizes misclassifications proportionally to the absolute class distance raised to a tunable exponent $\alpha$ (Polat et al., 2022, Polat et al., 2 Dec 2024); a minimal sketch follows this list. For binary and multiclass ordinal targets, CDW-CE achieves state-of-the-art results on metrics including quadratic weighted kappa and macro-F1.
- Robust risk minimization: Weight uncertainty sets are considered; LCVaR and LHCVaR formulations minimize risk over a space of plausible class weights, providing convex dual optimization schemes for the worst-case class-conditional loss (Xu et al., 2020).
- Complementary-label learning: Weighted complementary-label loss (WCLL) adjusts empirical risk by nonuniform weights proportional to the inverse of the label prior, consistently improving minority-class prediction in weakly supervised multi-class settings (Wei et al., 2022).
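As a minimal sketch of the ordinal case, the function below implements the CDW-CE idea of weighting the penalty on each predicted class by its distance from the ground truth raised to $\alpha$; the function name is illustrative, and the exact formulation should be taken from Polat et al. (2022).

```python
import numpy as np

def cdw_ce(probs, true_class, alpha=2.0):
    """Class-Distance Weighted Cross-Entropy for a single example.

    probs      : softmax output over ordered classes, shape (C,)
    true_class : integer index of the ground-truth class
    alpha      : exponent controlling how sharply distant errors are penalized
    """
    eps = 1e-12
    classes = np.arange(len(probs))
    distance = np.abs(classes - true_class) ** alpha    # zero at the true class
    return -np.sum(distance * np.log(1.0 - probs + eps))

# Toy usage on a 4-level ordinal scale (e.g., severity grades 0-3), true class 2:
p_near_miss = np.array([0.1, 0.6, 0.3, 0.0])   # most mass adjacent to the true class
p_far_miss  = np.array([0.6, 0.1, 0.3, 0.0])   # most mass two classes away
print(cdw_ce(p_near_miss, 2), cdw_ce(p_far_miss, 2))   # far miss is penalized more
```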
5. Optimization, Surrogates, and Metric-Consistent Losses
A theoretical framework for constructing class-weighted losses that target arbitrary weighted metrics (e.g., weighted F1, precision, recall, value-weighted skill scores) is provided by score-oriented loss surrogates. By smoothing confusion-matrix entries with expectation over random thresholds and inserting class weights, a differentiable loss is obtained that tightens the match between training objectives and evaluation criteria (Marchetti et al., 2023). Classical cost-sensitive learning is recovered as a special case.
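A rough sketch of this idea for a weighted binary F1 target is given below; here a sigmoid-smoothed threshold stands in for the expectation over random thresholds used by Marchetti et al. (2023), and the weighting of the confusion-matrix entries is an illustrative choice rather than the paper's exact construction.

```python
import torch

def soft_weighted_f1_loss(probs, targets, pos_weight=2.0, tau=0.5, temperature=0.05):
    """Differentiable surrogate for a positively weighted binary F1 score.

    probs       : predicted probabilities of the positive class, shape (N,)
    targets     : binary ground truth in {0, 1}, shape (N,)
    pos_weight  : extra weight on positive-class entries of the confusion matrix
    tau         : decision threshold being smoothed
    temperature : smoothing scale (smaller values approach hard thresholding)
    """
    # Soft "predicted positive" indicator replaces the hard test probs > tau.
    soft_pred = torch.sigmoid((probs - tau) / temperature)

    tp = pos_weight * (soft_pred * targets).sum()
    fp = (soft_pred * (1.0 - targets)).sum()
    fn = pos_weight * ((1.0 - soft_pred) * targets).sum()

    soft_f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-12)
    return 1.0 - soft_f1       # minimizing this maximizes the smoothed weighted F1

# Toy usage: gradients flow through the smoothed confusion-matrix entries.
logits = torch.randn(16, requires_grad=True)
targets = torch.randint(0, 2, (16,)).float()
loss = soft_weighted_f1_loss(torch.sigmoid(logits), targets)
loss.backward()
```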
Meta-learning approaches train example weights to optimize a downstream validation or meta-set performance, subsuming static weighting as a degenerate scenario. Bilevel optimization is used for methods like Meta-GCN, which achieves superior macro-F1, AUC-ROC, and accuracy on severely imbalanced graph datasets (Mohammadizadeh et al., 24 Jun 2024).
6. Practical Guidelines, Trade-offs, and Empirical Results
Practical deployment of class-weighted loss requires attention to several aspects:
- Tuning and selection: Start with inverse-frequency or effective-number weighting, normalize carefully, and sweep the relevant hyperparameters (e.g., the class-distance exponent $\alpha$, the effective-number parameter $\beta$, and focusing exponents such as $\gamma$ or $\tau$) (Phan et al., 2020, Polat et al., 2022, Sinha et al., 2020).
- Performance trade-offs: Weighted loss improves minority-class recall and precision but can sacrifice majority-class accuracy. Dynamic weighting (difficulty-based or recall-based) can adapt the focus throughout training and mitigate precision collapse (Tian et al., 2021, Sinha et al., 2020); a sketch of a dynamic update appears after this list.
- Hybridization: Weighted loss can be combined with focal loss, multi-task objectives (boundary regression), augmentation, and ensembling for synergistic gains (Phan et al., 2017, Le et al., 2020, Marchetti et al., 2023).
- Empirical results: Across benchmarks (DCASE, BDD100K, HAM10000, LIMUC Mayo, ACDC, Cityscapes, Synthia, CIFAR, ImageNet-LT, EGTEA), weighted loss and its extensions dominate unweighted baselines on minority-class metrics, boundary accuracy, instance recall, and ordinal prediction. Tables and confusion-matrix analyses consistently demonstrate multi-point lifts in minority-class performance, macro-F1, and clinical discrimination (Phan et al., 2017, Polat et al., 2022, Guerrero-Pena et al., 2018, Nikkhah et al., 28 Jun 2025, Sinha et al., 2020, Cui, 15 Jul 2025, Tian et al., 2021).
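As a hedged sketch of the dynamic option mentioned above, the snippet below recomputes difficulty-based weights from per-class validation accuracy after each epoch, in the spirit of CDB-style weighting; the update rule, clipping, and names are illustrative rather than the exact recipe of Sinha et al. (2020).

```python
import numpy as np

def difficulty_weights(per_class_accuracy, tau=1.0):
    """w_c = (1 - a_c)^tau, normalized to mean 1: harder classes get larger weights."""
    acc = np.asarray(per_class_accuracy, dtype=float)
    w = (1.0 - acc) ** tau
    w = np.maximum(w, 1e-6)       # keep well-learned classes from vanishing entirely
    return w / w.mean()

# Example: after a validation pass, class 2 is poorly learned and gets up-weighted;
# the returned weights would then parameterize the weighted loss for the next epoch.
val_accuracy = [0.95, 0.80, 0.40]
print(difficulty_weights(val_accuracy, tau=1.0))
```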
7. Limitations, Open Questions, and Future Directions
Static class weights can over-correct, leading to overfitting or instability under extreme imbalance; dynamic strategies (difficulty-based, recall-based, meta-learned) offer more robustness but introduce additional complexity. Estimating appropriate cost structures remains domain-dependent. Efficient grid or greedy search over high-dimensional weight spaces remains open for multiclass problems (Khim et al., 2020). Extending uniform error guarantees to non-decomposable metrics and leveraging metric-consistent frameworks for direct optimization of clinically critical scores represent promising avenues (Marchetti et al., 2023, Xu et al., 2020).
Recent research focuses on integrating adaptive weighting schemes (dynamic, meta-learned, metric-driven) and on extending class-weighted loss to mixed supervision, hierarchical label structures, and rich output spaces. The landscape is unified by the principle of infusing explicit class significance, whether through frequencies, difficulty, or human-defined cost, directly into the objective, so as to produce robust, interpretable, and fair decision boundaries under challenging class distributions.