
Weighted Binary Cross-Entropy (WBCE)

Updated 15 April 2026
  • Weighted Binary Cross-Entropy (WBCE) is a loss function that introduces class-dependent weights to address imbalance in binary classification tasks.
  • It adapts penalties for positive and negative classes, with variants that incorporate adaptive, spatial, and temporal weighting for various applications.
  • Empirical evidence shows WBCE improves minority class recall and overall performance in domains like computer vision, medical imaging, and audio event detection.

Weighted Binary Cross-Entropy (WBCE) is a loss function widely used in binary and multilabel classification tasks, especially for addressing class imbalance, rare event prediction, and situations where certain errors are more consequential than others. WBCE generalizes the conventional binary cross-entropy by introducing class-dependent (or instance-dependent) weights, allowing flexible penalty adjustment for positive and negative labels. Extensive empirical and theoretical analysis has established both its effectiveness and its limitations, motivating a recent profusion of generalized, adaptive, and context-aware variants.

1. Mathematical Formulation and Core Properties

Consider a binary classification setting with labels $y_i \in \{0,1\}$, logits $z_i \in \mathbb{R}$, and model predictions $\hat y_i = \sigma(z_i) = 1/(1+e^{-z_i})$. The canonical WBCE loss takes the form:

$$L_{\mathrm{WBCE}} = -\sum_{i=1}^{m} \left[ \alpha\, y_i \log \hat y_i + (1-y_i) \log(1 - \hat y_i) \right]$$

where $\alpha > 0$ is a positive-class weighting parameter (sometimes denoted $\rho$ or $w_1$), as in Imbalance-XGBoost (Wang et al., 2019). The negative class (when $y_i = 0$) can optionally be weighted independently, yielding the general class-weighted form:

$$L_{\mathrm{WBCE\,(general)}} = -\sum_{i=1}^{m} \left[ w_1\, y_i \log \hat y_i + w_0\, (1-y_i) \log(1 - \hat y_i) \right]$$

Per-sample or per-instance weights can be readily incorporated, and the extension to multiclass or multilabel via per-class weighting is standard practice (Hosseini et al., 2024, Phan et al., 2020, Maldonado et al., 2023, Liu et al., 2017).
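As a concrete reference point, here is a minimal NumPy sketch of the general class-weighted form above (the function name and the eps clipping guard are illustrative choices):

```python
import numpy as np

def wbce(y_true, y_pred, w1=1.0, w0=1.0, eps=1e-12):
    """General class-weighted binary cross-entropy (summed over samples)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(w1 * y_true * np.log(y_pred)
                   + w0 * (1.0 - y_true) * np.log(1.0 - y_pred))

y = np.array([1, 0, 0, 1])
p = np.array([0.9, 0.2, 0.4, 0.6])
print(wbce(y, p, w1=3.0))  # upweight errors on the rare positive class
```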

Gradients and Hessians with respect to the logit $z_i$ are straightforward to derive and are required for second-order optimization, particularly in frameworks such as XGBoost:

$$\frac{\partial L_{\mathrm{WBCE}}}{\partial z_i} = (1-y_i)\,\hat y_i - \alpha\, y_i\,(1-\hat y_i), \qquad \frac{\partial^2 L_{\mathrm{WBCE}}}{\partial z_i^2} = \left[\alpha\, y_i + (1-y_i)\right]\hat y_i\,(1-\hat y_i)$$

(Wang et al., 2019)
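These derivatives drop directly into second-order boosting. Below is a minimal sketch of a custom objective following XGBoost's generic (preds, dtrain) interface; the factory wrapper and the example alpha value are illustrative assumptions, and this is not the Imbalance-XGBoost package API:

```python
import numpy as np
import xgboost as xgb

def make_wbce_objective(alpha):
    """Custom XGBoost objective implementing the WBCE gradient/Hessian above."""
    def wbce_obj(z, dtrain):
        y = dtrain.get_label()
        p = 1.0 / (1.0 + np.exp(-z))                    # sigmoid of the raw margin
        grad = (1.0 - y) * p - alpha * y * (1.0 - p)    # dL_i/dz_i
        hess = (alpha * y + (1.0 - y)) * p * (1.0 - p)  # d^2L_i/dz_i^2
        return grad, hess
    return wbce_obj

# Usage sketch: booster = xgb.train(params, dtrain, obj=make_wbce_objective(4.0))
```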

2. Motivation: Addressing Class Imbalance and Rare Events

In many applied classification domains (medical imaging, object detection, edge detection, sound event detection), label distributions are highly skewed. Standard BCE is biased toward the majority class, driving sub-optimal recall or precision for minority classes or rare boundaries. WBCE introduces explicit weights to counteract this tendency. Typical weighting strategies include:

  • Inverse frequency weighting: $w_c \propto 1/f_c$, where $f_c$ is the frequency of class $c$ (Hosseini et al., 2024, Phan et al., 2020).
  • Fixed a priori weights: Hyperparameters such as $\alpha$ (or $w_0$, $w_1$), set empirically or by cross-validation to emphasize positive or negative classes.
  • Task-driven or boundary-aware weights: Per-pixel or per-frame weights designed for edge proximity (Shu, 9 Jul 2025), onset/offset localization (Song, 2024), or perceptual asymmetry (Shu, 23 Jan 2025).
  • Adaptive/Effective number weighting: Derived to reflect effective sample sizes or penalize errors based on task-specific distributional metrics (Phan et al., 2020, Maldonado et al., 2023, Ramdhani, 2022).

WBCE has thus become a default "plug-in" for imbalanced tasks in both tabular and structured data regimes, with integration into frameworks such as XGBoost available via simple API switches (Wang et al., 2019).
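To make the inverse-frequency and effective-number strategies concrete, here is a small sketch of both weight computations (normalizing the weights to average 1 is an assumption of this sketch, not a prescription of the cited papers):

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes=2):
    """w_c proportional to 1/f_c, normalized here to average 1."""
    freq = np.bincount(labels, minlength=n_classes) / len(labels)
    w = 1.0 / freq
    return w / w.mean()

def effective_number_weights(counts, beta=0.999):
    """Class-balanced weights w_c = (1 - beta) / (1 - beta**n_c) (Cui et al.)."""
    counts = np.asarray(counts, dtype=float)
    w = (1.0 - beta) / (1.0 - beta ** counts)
    return w / w.mean()

labels = np.array([0] * 990 + [1] * 10)     # 99:1 imbalance
print(inverse_frequency_weights(labels))    # strongly upweights the rare positives
print(effective_number_weights([990, 10]))  # softer, saturating upweighting
```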

3. Variants and Generalizations

Several prominent generalizations of WBCE have been developed to overcome its limitations or tailor it to new modalities:

  • Adaptive/Batch-dynamic weighting:
    • Effective number of samples (Cui et al.): Weights are calculated as $w_c = (1-\beta)/(1-\beta^{n_c})$, where $n_c$ is the sample count for class $c$ and $\beta \in [0,1)$ (Phan et al., 2020).
    • OWAdapt (OWA operator): Applies an ordered weighting on class-level losses based on the hardest-to-fit class at each minibatch, using RIM or exponential quantifiers (Maldonado et al., 2023).
    • Dynamic $F_\beta$-driven weights: Batchwise estimation of the optimal $\beta$ for $F_\beta$, then using $\beta^2$ as a penalty for the BCE negative-class term (Ramdhani, 2022).
  • Spatial and temporal aware reweighting:
    • Dilated Balanced Cross-Entropy (DBCE): Weights in medical image segmentation are computed from morphological dilation of class masks, penalizing errors in object boundaries and ensuring weights do not explode for very small structures (Hosseini et al., 2024); a rough sketch of the idea appears after this list.
    • Onset/Offset WBCE (OWBCE): Weights are assigned to frames near event boundaries in sound event detection, using a convolved sinusoidal window to emphasize critical transition points (Song, 2024); a frame-weighting sketch follows the summary table below.
    • Edge-Boundary-Texture (EBT) loss: Generalizes WBCE to three pixel categories (edge, boundary, texture), each with distinct weights, providing sharper and more meaningful supervision for edge detection (Shu, 9 Jul 2025).
  • Perceptual and prediction-driven variants:
    • Symmetrization WBCE (SWBCE): Adds a prediction-driven weighted BCE to the conventional label-driven WBCE term, explicitly suppressing spurious high-confidence predictions on negatives (i.e., false positives) (Shu, 23 Jan 2025).
  • Multilabel/joint loss schemes:
    • Softmax+WBCE: For multilabel annotation, joint loss with both softmax cross-entropy for label co-occurrence and WBCE for label independence (Liu et al., 2017).
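As referenced in the DBCE bullet above, the following is a rough sketch of a dilation-based per-pixel weight map; the radius and band weight are illustrative assumptions, not the exact scheme of Hosseini et al. (2024):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilated_weight_map(mask, radius=3, band_weight=5.0):
    """Per-pixel weights emphasizing a dilated band around the foreground."""
    fg = mask.astype(bool)
    structure = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
    band = binary_dilation(fg, structure=structure) & ~fg  # boundary-adjacent pixels
    weights = np.ones(mask.shape, dtype=float)
    weights[band] = band_weight  # penalize mistakes near object boundaries more
    return weights
```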

A schematic summary:

| Variant | Weighting Type | Task Context |
|---|---|---|
| Classic WBCE | Static class weights | Tabular, vision, XGBoost |
| Adaptive (OWAdapt) | Batchwise, OWA | Any, especially imbalanced |
| Effective-number | Global, class-dependent | Object detection, vision |
| Dilated BCE (DBCE) | Pixelwise spatial | Medical segmentation |
| EBT | Edge/boundary/texture | Edge detection |
| Onset/Offset WBCE | Framewise, temporal | Sound event detection |
| SWBCE | Label + prediction side | Edge, boundary-centric |
| Softmax+WBCE | Multilabel, per-label | Annotation, retrieval |
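As referenced in the OWBCE bullet above, here is a rough sketch of transition-focused frame weighting; the half-sine window and the scaling are stand-in assumptions for the convolved sinusoidal window of Song (2024):

```python
import numpy as np

def onset_offset_weights(frame_labels, win_len=9, scale=2.0):
    """Per-frame weights boosted near label transitions (onsets/offsets)."""
    frame_labels = np.asarray(frame_labels, dtype=float)
    transitions = np.abs(np.diff(frame_labels, prepend=frame_labels[0]))
    window = np.sin(np.linspace(0.0, np.pi, win_len))  # half-sine bump
    boost = np.convolve(transitions, window, mode="same")
    return 1.0 + scale * np.clip(boost, 0.0, 1.0)

print(onset_offset_weights([0, 0, 1, 1, 1, 0, 0, 0]))  # peaks at onset/offset
```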

4. Empirical Performance and Task-Specific Insights

Binary and imbalanced tabular tasks:

  • WBCE in tree ensembles (Imbalance-XGBoost) improves performance on minority classes (e.g., Parkinson's detection) and is easily integrated via a weighting parameter (Wang et al., 2019).
  • For text data (IMDB sentiment, label noise), adaptive WBCE increases $F_\beta$ by ~14% over BCE (Ramdhani, 2022).

Object detection:

  • Balanced CE (WBCE) and "effective number" weighting yield substantial recall gains for rare classes: recall on BDD100K's minority object classes improves markedly over the original CE baseline with WBCE, and further with effective-number weighting (Phan et al., 2020).

Segmentation and boundary detection:

  • In medical segmentation, classical inverse-frequency WBCE can degrade performance via excessive false positives; DBCE, by dilating class masks, matches or exceeds Dice+CE (e.g., polyp segmentation Dice score: DBCE 87.38 vs. Dice+CE 87.06) (Hosseini et al., 2024).
  • For edge detection, WBCE is the baseline; EBT loss, which generalizes it, improves average precision across datasets while maintaining hyperparameter robustness (Shu, 9 Jul 2025). SWBCE further sharpens precision and recall by explicitly penalizing prediction-driven false positives, improving ODS and AP over WBCE on the BIPED2 dataset (Shu, 23 Jan 2025).

Audio and temporal event detection:

  • OWBCE in sound event detection yields improvements in event-F1 on both synthetic and real data, as well as in temporal localization, particularly for frames near onset/offset transitions (Song, 2024).

5. Implementation and Hyperparameter Selection

  • Parameterization: Most WBCE implementations require a positive-class weight $\alpha$ or an equivalent per-class/per-instance weighting scheme. Many frameworks expose this via a specific argument (e.g., imbalance_alpha in Imbalance-XGBoost) (Wang et al., 2019); a PyTorch sketch appears after this list.
  • Hyperparameters:
    • For standard WBCE, a fixed $\alpha$ set empirically or by cross-validation often suffices; inverse frequency or "effective number" weighting provides automatic scaling (Phan et al., 2020).
    • Adaptive/batch-dynamic variants need quantifier parameters (e.g., the OWA exponent in OWAdapt (Maldonado et al., 2023), or the knee-point range in $F_\beta$-based WBCE (Ramdhani, 2022)).
    • Boundary/edge-aware and prediction-driven approaches require balancing coefficients, radius/window sizes, and per-pixel calculation of distance or edge proximity (Hosseini et al., 2024, Shu, 9 Jul 2025, Shu, 23 Jan 2025).
  • Best practices:
    • Keep background/majority weights at unity, especially in detection (Phan et al., 2020).
    • When using complex spatial/temporal weighting, verify stability via ablation (e.g., DBCE and EBT show minimal sensitivity to moderate parameter changes) (Hosseini et al., 2024, Shu, 9 Jul 2025).
    • Monitor class-wise metrics (recall, F1, AP) to ensure rare-class improvements are not obtained at the expense of overall performance.
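In PyTorch, the standard entry point is the pos_weight argument of torch.nn.BCEWithLogitsLoss, which implements exactly the positive-class weighting $\alpha$ above; a minimal usage sketch follows (the weight value and tensor shapes are illustrative):

```python
import torch

# pos_weight multiplies the positive-class term of BCE-with-logits;
# 9.0 is an illustrative choice for a roughly 9:1 negative:positive ratio.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([9.0]))

logits = torch.randn(16, 1)                     # raw scores z_i from a model
targets = torch.randint(0, 2, (16, 1)).float()  # binary labels y_i
loss = criterion(logits, targets)
print(loss.item())
```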

6. Limitations and Theoretical Considerations

  • Overweighting rare classes: Naive inverse-frequency weighting can cause the model to sharply overfit minority classes, often dramatically increasing false positives (Hosseini et al., 2024, Phan et al., 2020).
  • Uniform negative-class treatment: Classic WBCE applies equal penalty to all negatives, regardless of boundary proximity or contextual difficulty; this motivates tri-class (EBT) and boundary-focused variants (Shu, 9 Jul 2025, Song, 2024).
  • Lack of performance-metric integration: Traditional WBCE is not directly aligned with task-level metrics such as $F_\beta$; metrics-informed adaptive weighting (as in (Ramdhani, 2022)) partially remedies this gap.
  • Symmetry of error penalization: Standard WBCE does not distinguish between perceptually asymmetric costs of different error types; this underlies the development of SWBCE and other prediction-driven losses (Shu, 23 Jan 2025).
  • Sensitivity to noisy or misaligned labels: WBCE is robust to moderate label noise (particularly in adaptive/batchwise schemes), but extreme noise or annotation ambiguity (e.g., in edge pixels) requires additional smoothing or robustification (Ramdhani, 2022, Shu, 23 Jan 2025).

7. Broader Impact and Future Directions

Weighted Binary Cross-Entropy and its variants are now foundational in binary and multilabel classification, segmentation, detection, and annotation tasks across computer vision, medical imaging, and audio processing. Future research is focused on integrating WBCE with adaptive, context-driven weights, direct metric optimization, and tighter alignment with human perceptual and operational priors.

Emerging directions include:

  • Joint metric-aligned and spatially-aware weighting (combining metric-driven, e.g., $F_\beta$-based, and region-based loss modulation)
  • Task-conditional or curriculum-driven reweighting schedules
  • Deeper integration of WBCE in continual learning, domain adaptation, and weakly supervised settings
  • Automated hyperparameter tuning via meta-learned or reinforcement-learned weight policies

These developments are enabled by transparent, modular loss design, for which WBCE remains the canonical starting point (Wang et al., 2019, Ramdhani, 2022, Shu, 9 Jul 2025).
