Weighted Softmax Cross-Entropy Loss

Updated 14 March 2026

Weighted softmax cross-entropy loss is a loss function that assigns class-specific weights to modify prediction errors, enabling better handling of imbalances and cost-sensitive scenarios.
It is mathematically derived as a Fenchel–Young loss from the Kullback–Leibler divergence, providing interpretable gradients and a weighted softmax operator.
Variants such as W-Softmax, DBCE, and re-weighted softmax (WSM) improve performance in classification, segmentation, and federated learning by enlarging decision margins, rebalancing spatial weights, and addressing client heterogeneity.

Weighted softmax cross-entropy loss generalizes the standard softmax cross-entropy by assigning each class or sample a weighting factor that scales the contribution of the corresponding prediction to the total loss. The weights can be used to handle class imbalance, cost-sensitivity, and heterogeneity in federated settings, or to explicitly enlarge angular decision margins. The formalization of weighted cross-entropy fits within frameworks based on $f$-divergences and score-oriented learning, yielding interpretable gradients and direct optimization of targeted weighted metrics. Variants have been proposed for applications ranging from image classification and segmentation to federated learning, each with its own way of computing and applying the weights.

1. Mathematical Formulation and Operator Foundation

Weighted softmax cross-entropy can be systematically derived as the Fenchel–Young loss generated by the Kullback–Leibler divergence with a non-uniform prior or reference vector $w \in \mathbb{R}^K$. For logits $z \in \mathbb{R}^K$, label $y$ (one-hot), and class weights $w_k > 0$, the loss reads

$L_w(z, y) = -\sum_{i=1}^K y_i\, \log\left(\frac{w_i\, e^{z_i}}{\sum_{j=1}^K w_j\, e^{z_j}}\right).$

This is equivalent to applying a weighted softmax operator

$p_i = \frac{w_i e^{z_i}}{\sum_j w_j e^{z_j}}$

and then computing standard cross-entropy between $y$ and $p$ (Roulet et al., 30 Jan 2025). The gradient with respect to $z$ is

$\nabla_z L_w(z,y) = p - y,$

mirroring the form for unweighted cross-entropy, but with the class weights tilting the soft prediction. The Fenchel–Young view grants closed-form expressions for the gradient and interprets the weighting as modifying the reference distribution in the $f$-divergence, yielding convexity and a natural generalization to a broad family of losses (Roulet et al., 30 Jan 2025).
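The operator and its gradient can be checked numerically. The following is a minimal sketch in plain Python, not code from the cited paper; the function names and the example values of `z` and `w` are illustrative assumptions. It relies on the identity $w_i e^{z_i} = e^{z_i + \log w_i}$, so the weights act as an additive shift of the logits.

```python
import math

def weighted_softmax(z, w):
    """Return p_i = w_i * exp(z_i) / sum_j w_j * exp(z_j) for positive weights w."""
    # Fold the weights into the logits: w_i * exp(z_i) = exp(z_i + log w_i).
    shifted = [zi + math.log(wi) for zi, wi in zip(z, w)]
    m = max(shifted)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_softmax_loss(z, w, label):
    """L_w(z, y) = -log p_label for a one-hot target at index `label`."""
    return -math.log(weighted_softmax(z, w)[label])

def loss_gradient(z, w, label):
    """Analytic gradient with respect to z: p - y."""
    p = weighted_softmax(z, w)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

# Illustrative values (assumptions, not taken from the source).
z = [2.0, 0.5, -1.0]
w = [1.0, 3.0, 5.0]
p = weighted_softmax(z, w)
grad = loss_gradient(z, w, label=1)

# Central finite differences as a sanity check on the p - y formula.
eps = 1e-6
numeric = []
for k in range(len(z)):
    z_plus = list(z)
    z_plus[k] += eps
    z_minus = list(z)
    z_minus[k] -= eps
    diff = weighted_softmax_loss(z_plus, w, 1) - weighted_softmax_loss(z_minus, w, 1)
    numeric.append(diff / (2 * eps))
```

With uniform weights the operator reduces to the ordinary softmax, which is a quick way to validate an implementation against an existing one.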

2. Class-Weighted Cross-Entropy and Theoretical Guarantees

In supervised learning, weighted softmax cross-entropy is formulated for a data batch $\{(x_i, c_i)\}_{i=1}^N$ with $K$ classes and preassigned class weights $w_k > 0$. The average loss becomes

$L_w(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K w_k y_{i,k} \log p_{i,k},$

where $p_{i,k} = \operatorname{softmax}(z_i)_k$ is the $k$-th softmax output for the logit vector $z_i$ of example $i$, and $y_{i,k}$ is the one-hot target (Marchetti et al., 2023, Behnia et al., 2023).

This loss is precisely the differentiable surrogate induced by a linear weighted metric $s(\mathrm{CM}) = -(FP + FN)$ on the confusion matrix, as formalized in the score-oriented learning (wSOL) framework (Marchetti et al., 2023). If the evaluation score is linear in confusion matrix entries and weights depend only on the true class, then the class-weighted loss is consistent for maximizing the expected weighted score. If the score is analytic and non-linear (such as $F_1$), the weighted loss aligns up to higher-order terms.

The gradient with respect to the logits $z_{i,k}$ is

$\frac{\partial L_w}{\partial z_{i,k}} = \frac{1}{N}\left(w_{c(i)}\, p_{i,k} - w_k\, y_{i,k}\right),$

where $c(i)$ is the ground-truth class for example $i$ (Marchetti et al., 2023). Empirically, weighted cross-entropy accelerates threshold-calibration and improves performance in imbalanced or cost-sensitive scenarios without sacrificing convergence speed.
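As a rough illustration of the batch loss and its gradient, the sketch below implements both in plain Python. The function names, the toy logits, and the weight values are assumptions made for the example. Because the loss is a batch mean, each gradient entry carries a factor $1/N$, and because $y_{i,k}$ is one-hot, the term $w_k y_{i,k}$ is nonzero only at $k = c(i)$.

```python
import math

def softmax(z):
    """Numerically stable softmax over one logit vector."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def class_weighted_ce(logits, labels, class_weights):
    """Batch mean of -w_{c(i)} * log p_{i, c(i)}."""
    total = 0.0
    for z, c in zip(logits, labels):
        total += -class_weights[c] * math.log(softmax(z)[c])
    return total / len(logits)

def class_weighted_ce_grad(logits, labels, class_weights):
    """Gradient of the mean loss: (w_{c(i)} * p_{i,k} - w_k * y_{i,k}) / N."""
    n = len(logits)
    grads = []
    for z, c in zip(logits, labels):
        p = softmax(z)
        w_c = class_weights[c]
        # y_{i,k} is one-hot, so w_k * y_{i,k} equals w_c at k == c and 0 elsewhere.
        grads.append([(w_c * p[k] - (w_c if k == c else 0.0)) / n
                      for k in range(len(z))])
    return grads

# Toy batch (illustrative values only).
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
labels = [0, 2]
class_weights = [0.5, 1.0, 4.0]

loss = class_weighted_ce(logits, labels, class_weights)
grads = class_weighted_ce_grad(logits, labels, class_weights)
```

Each row of `grads` sums to zero, as for unweighted cross-entropy; the class weight only rescales the whole per-example gradient by $w_{c(i)}$.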

3. Weight Selection Strategies and Pitfalls

Selection of weights $w_k$ determines the learning bias. Standard tactics include:

Inverse frequency weighting: $w_k \propto 1/n_k$, with $n_k$ the number of samples in class $k$.
Cost-matrix design: $w_k \propto C_{k,\cdot} + C_{\cdot,k}$, with $C$ the cost matrix.
Prevalence-adjusted balancing: $w_k = N/(K n_k)$, with $N$ the total number of samples and $K$ the number of classes.
Meta-learned weights: $w_k$ as trainable parameters optimized for a downstream metric.
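The first three rules in the list above can be sketched in a few lines of plain Python. The helper names and the toy label counts are assumptions for illustration, and $C_{k,\cdot} + C_{\cdot,k}$ is read here as the row sum plus the column sum of class $k$, which is an interpretation of the notation rather than something the source spells out. Every class is assumed to have at least one sample.

```python
from collections import Counter

def class_counts(labels, num_classes):
    """Number of samples n_k per class index 0..num_classes-1."""
    counts = Counter(labels)
    return [counts.get(k, 0) for k in range(num_classes)]

def inverse_frequency_weights(counts):
    """w_k proportional to 1 / n_k; scaled to sum to 1 (the scale is a free choice)."""
    inv = [1.0 / n for n in counts]  # assumes n_k > 0 for every class
    total = sum(inv)
    return [v / total for v in inv]

def balanced_weights(counts):
    """Prevalence-adjusted balancing: w_k = N / (K * n_k)."""
    n_total = sum(counts)
    k = len(counts)
    return [n_total / (k * n) for n in counts]

def cost_matrix_weights(cost):
    """Unnormalized w_k = (row sum of class k) + (column sum of class k)."""
    k = len(cost)
    return [sum(cost[i]) + sum(cost[j][i] for j in range(k)) for i in range(k)]

# Toy long-tailed label set (illustrative only).
labels = [0] * 90 + [1] * 9 + [2] * 1
counts = class_counts(labels, 3)
w_inv = inverse_frequency_weights(counts)
w_bal = balanced_weights(counts)

# Toy cost matrix with zero diagonal (illustrative only).
cost = [[0.0, 1.0, 5.0],
        [1.0, 0.0, 1.0],
        [10.0, 1.0, 0.0]]
w_cost = cost_matrix_weights(cost)
```

With the balanced rule, every class contributes the same total weight $w_k n_k = N/K$, so the weights sum to $N$ when taken over all samples.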

For segmentation tasks, per-pixel weighting can be spatially varying. The Dilated Balanced Cross-Entropy (DBCE) approach computes local weights via mask dilation, so that dilated object regions and their boundaries receive upweighted loss contributions, addressing the pitfalls of extreme inverse frequency reweighting that would otherwise amplify gradient noise and bias boundary predictions (Hosseini et al., 2024).

Excessively large weights may destabilize training via exploding gradients; normalizing the weights (e.g., so that $\sum_k w_k = K$) or clipping them is standard.
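A minimal sketch of both safeguards follows. The function names, the raw weight values, and the cap of `10.0` are hypothetical choices for the example, not values recommended by the cited works.

```python
def normalize_to_num_classes(weights):
    """Rescale positive weights so that sum_k w_k = K, preserving their ratios."""
    k = len(weights)
    total = sum(weights)
    return [w * k / total for w in weights]

def clip_weights(weights, max_weight):
    """Cap each weight at max_weight (a hypothetical tuning knob)."""
    return [min(w, max_weight) for w in weights]

# Raw weights with a roughly 90x spread (illustrative values).
raw = [0.37, 3.7, 33.3]

normalized = normalize_to_num_classes(raw)
# Clip first, then renormalize so the mean weight is again 1.
clipped = normalize_to_num_classes(clip_weights(raw, max_weight=10.0))
```

Normalizing to $\sum_k w_k = K$ keeps the average weight at 1, so the loss stays on a scale comparable to unweighted cross-entropy and the learning rate does not need retuning for that reason alone.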

4. Geometric and Optimization Implications

Weighted softmax cross-entropy induces a cost-sensitive SVM geometry in the unconstrained features model (UFM) abstraction. The global minima of such parameterizations admit closed-form expressions for classifier norms and angles as functions of the class weights and label imbalances (Behnia et al., 2023). Specifically:

Assigning higher $w_k$ to minority classes increases their classifier vector norms and correspondingly enlarges SVM margins for those classes.
The explicit margin scaling is $w_c \cdot \mathrm{margin}_c \geq 1$, so classes with higher weights carve out wider angular regions and stronger feature separation.
Despite these theoretical properties, empirical results show that weighted softmax cross-entropy (WSCE) achieves only small gains in balanced accuracy compared with logit-adjusted alternatives (CDT/LDT), which provide more flexible geometric control under label imbalance.

Thus, while class weighting does tilt the optimization toward underrepresented classes, it offers coarse control and can slow convergence if weights are too imbalanced.

5. Specialized Weighted Softmax Variants

Negative-Focused Weights-biased Softmax (W-Softmax)

W-Softmax, proposed by Li and Wang, injects a class-focused bias $\alpha\,w_c$ (for hyperparameter $\alpha$) into the negative class weights for each sample, geometrically enlarging the angular decision margin and promoting intra-class compactness and inter-class separation (Li et al., 2019). In this subsection $w_i$ denotes the classifier weight vector of class $i$, not a scalar class weight. For input $x$ with true class $c$:

$w'_i = \frac{\alpha w_c + w_i}{\|\alpha w_c + w_i\|},\;\; i \ne c;\qquad w'_c = w_c.$

The modified logits are $z_c = w_c^T x$ and $z_i = {w'_i}^T x$ for $i \ne c$, and the loss is the cross-entropy over these logits. The hyperparameter $\alpha$ controls the margin: larger $\alpha$ yields more discriminative features at the possible expense of convergence speed. Empirical results on vision benchmarks show absolute accuracy gains of roughly 2–4% over standard softmax, particularly as the number of classes increases.
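The biasing step can be sketched as follows, following the formulas above literally. This is an illustrative plain-Python version, not the authors' implementation; the function names, the toy classifier vectors, the feature `x`, and the value of `alpha` are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def w_softmax_logits(class_vectors, x, c, alpha):
    """Logits with negative-class vectors biased toward the true-class vector w_c."""
    w_c = class_vectors[c]
    logits = []
    for i, w_i in enumerate(class_vectors):
        if i == c:
            logits.append(dot(w_c, x))  # true class: w'_c = w_c
        else:
            biased = [alpha * a + b for a, b in zip(w_c, w_i)]
            n = norm(biased)  # assumes alpha * w_c + w_i is not the zero vector
            logits.append(dot([v / n for v in biased], x))
    return logits

def cross_entropy(logits, c):
    """Standard softmax cross-entropy for target index c."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[c]

# Toy 2-D example with unit-norm classifier vectors (illustrative only).
class_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.8, 0.6]

logits_plain = w_softmax_logits(class_vectors, x, c=0, alpha=0.0)
logits_biased = w_softmax_logits(class_vectors, x, c=0, alpha=0.5)
loss_plain = cross_entropy(logits_plain, 0)
loss_biased = cross_entropy(logits_biased, 0)
```

In this toy case, pulling the negative vector of class 1 toward $w_c$ raises its logit, so the training loss for the same feature is larger than with `alpha = 0`. The feature then has to move closer to $w_c$ to reach the same loss, which is the margin-enlarging effect described above.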

Re-Weighted Softmax for Federated Learning

In federated learning under client heterogeneity, "Re-Weighted Softmax Cross-Entropy" (WSM) modifies the denominator in the softmax to reflect local class frequencies $\beta^{(k)}$ on client $k$ (Legate et al., 2023):

$L_{WSM}(X_k; w, \beta^{(k)}) = -\sum_{x\in X_k} \left[ f_w(x)_{y(x)} - \log \sum_c \beta^{(k)}_c \exp(f_w(x)_c) \right].$

Classes not present in the local dataset ($\beta_c^{(k)}=0$) are omitted from the softmax contrast term, thus mitigating "client forgetting" in the global model. Empirically, WSM reduces forgetting and increases final accuracy by 2–8% in highly non-i.i.d. federated setups.
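A direct transcription of the loss for one client might look like the sketch below. It is written in plain Python for illustration and is not the reference implementation; `wsm_loss`, the toy logits, and the class proportions in `beta` are assumptions. Every label in the local batch is assumed to have a positive entry in `beta`.

```python
import math

def wsm_loss(logits_batch, labels, beta):
    """Sum over local samples of -(f_y - log sum_c beta_c * exp(f_c))."""
    total = 0.0
    for f, y in zip(logits_batch, labels):
        # Classes with beta_c == 0 are absent locally and drop out of the sum.
        terms = [math.log(b) + v for b, v in zip(beta, f) if b > 0.0]
        m = max(terms)
        log_norm = m + math.log(sum(math.exp(t - m) for t in terms))
        total += -(f[y] - log_norm)
    return total

# Client holding only classes 0 and 1 (illustrative proportions).
beta = [0.75, 0.25, 0.0]
labels = [0, 1]
logits_a = [[1.0, 0.2, -0.5], [0.3, 1.5, 2.0]]
# Same logits except for class 2, which this client never observes.
logits_b = [[1.0, 0.2, 9.0], [0.3, 1.5, -4.0]]

loss_a = wsm_loss(logits_a, labels, beta)
loss_b = wsm_loss(logits_b, labels, beta)
```

The two losses coincide because the logit of the absent class never enters the normalizer, so local updates exert no pressure to push that class down.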

6. Weighted Softmax in Segmentation: Dilated Balanced CE

Traditional balanced cross-entropy can degrade segmentation performance, especially under severe class imbalance, due to amplification of noise from rare classes and poor handling of boundaries (Hosseini et al., 2024). The Dilated Balanced Cross-Entropy (DBCE) modifies weights by dilating each class mask with a structuring element, assigning high weights to both small objects and their immediate neighborhoods. This approach achieves performance equal to or greater than Dice + CE losses across polyp, skin lesion, and multi-organ segmentation datasets while avoiding the pathological gradients of plain inverse-frequency weighting. The spatial weight map ensures stable training and improved precision, especially for rare structures.
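The idea of computing weights from a dilated mask can be illustrated with a small sketch. The exact DBCE weighting rule is not reproduced here: the code assumes a binary mask, a square structuring element, and balanced weights $N/(2 n)$ computed from the dilated region, all of which are simplifying assumptions for illustration rather than the formula of Hosseini et al.

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out

def dilated_balanced_weight_map(mask, radius=1):
    """Per-pixel weights balanced over the dilated foreground (assumed rule)."""
    dilated = dilate(mask, radius)
    n_total = len(mask) * len(mask[0])
    n_fg = sum(map(sum, dilated))
    n_bg = n_total - n_fg  # assumes both regions are non-empty
    w_fg = n_total / (2 * n_fg)
    w_bg = n_total / (2 * n_bg)
    return [[w_fg if v else w_bg for v in row] for row in dilated]

# 5x5 mask with a single foreground pixel at the center (illustrative only).
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1

dilated = dilate(mask)
weights = dilated_balanced_weight_map(mask)
# Balanced weight of the lone pixel without dilation, for comparison.
plain_fg_weight = 25 / (2 * 1)
```

In this toy case the undilated balanced weight for the single foreground pixel is 12.5, while the dilated region of nine pixels receives about 1.39 each, so the object and its neighborhood are still emphasized without one pixel dominating the gradient.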

7. Empirical Performance and Implementation

Empirical studies demonstrate that weighted softmax cross-entropy and its variants confer measurable benefits for imbalanced classification, cost-sensitive tasks, federated representation fidelity, and structured prediction:

In image classification, W-Softmax improves accuracy from 90.95% to 93.28% on CIFAR-10 and from 67.26% to 71.38% on CIFAR-100 (Li et al., 2019).
In federated benchmarks, WSM raises accuracy by 2–8% under severe heterogeneity (Legate et al., 2023).
For medical segmentation, DBCE matches or outperforms Dice+CE in mDice and mIoU without instability (Hosseini et al., 2024).

Canonical implementation in deep learning frameworks requires only modification of the softmax operator or per-sample loss weighting; well-structured pseudocode for standard settings is available in the literature (Marchetti et al., 2023, Hosseini et al., 2024).

The $f$-divergence view enables further generalizations: for example, losses generated by the Tsallis $\alpha$-divergence can outperform cross-entropy in language modeling with no change to test-time inference (Roulet et al., 30 Jan 2025). Weighted softmax cross-entropy thus serves as both a practical tool and a theoretical construct in modern deep learning.

References (6)

Loss Functions and Operators Generated by f-Divergences (2025)

A comprehensive theoretical framework for the optimization of neural networks classification performance with respect to weighted metrics (2023)

On the Implicit Geometry of Cross-Entropy Parameterizations for Label-Imbalanced Data (2023)

Dilated Balanced Cross Entropy Loss for Medical Image Segmentation (2024)

Learning Discriminative Features Via Weights-biased Softmax Loss (2019)

Re-Weighted Softmax Cross-Entropy to Control Forgetting in Federated Learning (2023)
