
Weighted Softmax Cross-Entropy Loss

Updated 14 March 2026
  • Weighted softmax cross-entropy loss is a loss function that assigns class-specific weights to modify prediction errors, enabling better handling of imbalances and cost-sensitive scenarios.
  • It is mathematically derived as a Fenchel–Young loss from the Kullback–Leibler divergence, providing interpretable gradients and a weighted softmax operator.
  • Variants like W-Softmax and DBCE improve performance in classification, segmentation, and federated learning by optimizing decision margins and addressing client heterogeneity.

Weighted softmax cross-entropy loss generalizes the standard softmax cross-entropy by assigning each class or sample a weighting factor that scales the contribution of the corresponding prediction to the total loss. This mechanism supports handling of class imbalance, cost-sensitivity, heterogeneity in federated settings, and the explicit enlargement of angular decision margins. The formalization of weighted cross-entropy fits within frameworks based on $f$-divergences and score-oriented learning, yielding interpretable gradients and direct optimization of targeted weighted metrics. Variants have been proposed for applications ranging from image classification and segmentation to federated learning, each with its own way of computing and applying the weights.

1. Mathematical Formulation and Operator Foundation

Weighted softmax cross-entropy can be systematically derived as the Fenchel–Young loss generated by the Kullback–Leibler divergence with a non-uniform prior or reference vector $w \in \mathbb{R}^K$. For logits $z \in \mathbb{R}^K$, label $y$ (one-hot), and class weights $w_k > 0$, the loss reads

$$L_w(z, y) = -\sum_{i=1}^K y_i\, \log\left(\frac{w_i\, e^{z_i}}{\sum_{j=1}^K w_j\, e^{z_j}}\right).$$

This is equivalent to applying a weighted softmax operator

$$p_i = \frac{w_i\, e^{z_i}}{\sum_{j} w_j\, e^{z_j}}$$

and then computing standard cross-entropy between $y$ and $p$ (Roulet et al., 30 Jan 2025). The gradient with respect to $z$ is

$$\nabla_z L_w(z, y) = p - y,$$

mirroring the form for unweighted cross-entropy, but with the class weights tilting the soft prediction. The Fenchel–Young view grants closed-form expressions for the gradient and interprets the weighting as modifying the reference distribution in the $f$-divergence, yielding convexity and a natural generalization to a broad family of losses (Roulet et al., 30 Jan 2025).
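The operator and its gradient can be checked numerically. The following is a minimal sketch in plain Python, not an implementation from the cited work: the function names and the example logits and weights are illustrative assumptions. It computes the weighted softmax, evaluates the loss for a one-hot label, and compares the closed-form gradient $p - y$ against central finite differences.

```python
import math

def weighted_softmax(z, w):
    """p_i = w_i * exp(z_i) / sum_j w_j * exp(z_j), computed stably."""
    m = max(z)  # shifting the logits cancels in the ratio
    terms = [wi * math.exp(zi - m) for zi, wi in zip(z, w)]
    total = sum(terms)
    return [t / total for t in terms]

def weighted_softmax_ce(z, w, target):
    """L_w(z, y) = -log p_target for a one-hot label at index `target`."""
    return -math.log(weighted_softmax(z, w)[target])

def grad_logits(z, w, target):
    """Closed-form gradient with respect to the logits: p - y."""
    p = weighted_softmax(z, w)
    return [pi - (1.0 if i == target else 0.0) for i, pi in enumerate(p)]

# Illustrative values (assumptions, not taken from the source).
z = [2.0, 0.5, -1.0]
w = [1.0, 2.0, 4.0]
target = 1

analytic = grad_logits(z, w, target)

# Central finite differences as an independent check of p - y.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += eps
    z_minus[i] -= eps
    diff = weighted_softmax_ce(z_plus, w, target) - weighted_softmax_ce(z_minus, w, target)
    numeric.append(diff / (2 * eps))
```

Rescaling all weights by a common positive constant leaves $p$ unchanged, so only the ratios between the $w_k$ matter in this formulation.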

2. Class-Weighted Cross-Entropy and Theoretical Guarantees

In supervised learning, weighted softmax cross-entropy is formulated for a data batch $\{(x_i, c_i)\}_{i=1}^N$ with $K$ classes and preassigned class weights $w_k > 0$. The average loss becomes

$$L_w(\theta) = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K w_k\, y_{i,k} \log p_{i,k},$$

where $p_{i,k} = \operatorname{softmax}(z_i)_k$ is the predicted probability of class $k$ for example $i$ and $y_{i,k}$ is the one-hot target (Marchetti et al., 2023, Behnia et al., 2023).

This loss is precisely the differentiable surrogate induced by a linear weighted metric $s(\mathrm{CM}) = -(FP + FN)$ on the confusion matrix, as formalized in the score-oriented learning (wSOL) framework (Marchetti et al., 2023). If the evaluation score is linear in confusion matrix entries and weights depend only on the true class, then the class-weighted loss is consistent for maximizing the expected weighted score. If the score is analytic and non-linear (such as $F_1$), the weighted loss aligns up to higher-order terms.

The gradient with respect to the logits $z_{i,k}$ is

$$\frac{\partial L_w}{\partial z_{i,k}} = \frac{1}{N}\left(w_{c(i)}\, p_{i,k} - w_k\, y_{i,k}\right) = \frac{w_{c(i)}}{N}\left(p_{i,k} - y_{i,k}\right),$$

where $c(i)$ is the ground-truth class for example $i$ (Marchetti et al., 2023). The factor $1/N$ comes from the batch average, and the second equality holds because $y_{i,k}$ is nonzero only for $k = c(i)$. Empirically, weighted cross-entropy accelerates threshold-calibration and improves performance in imbalanced or cost-sensitive scenarios without sacrificing convergence speed.
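As an illustration of the batch form, the sketch below evaluates the class-weighted loss and its gradient in plain Python, with the $1/N$ batch-averaging factor made explicit so that each example contributes $\frac{w_{c(i)}}{N}(p_{i,k} - y_{i,k})$. The function names and the toy batch are assumptions chosen for the example.

```python
import math

def softmax(z):
    """Standard softmax over one logit vector, computed stably."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def class_weighted_ce(logits, labels, w):
    """Batch mean of -w_{c(i)} * log p_{i, c(i)}."""
    n = len(logits)
    return sum(-w[c] * math.log(softmax(z)[c]) for z, c in zip(logits, labels)) / n

def class_weighted_ce_grad(logits, labels, w):
    """Gradient of the batch-mean loss: (w_{c(i)} / N) * (p_{i,k} - y_{i,k})."""
    n = len(logits)
    grads = []
    for z, c in zip(logits, labels):
        p = softmax(z)
        grads.append([w[c] * (pk - (1.0 if k == c else 0.0)) / n
                      for k, pk in enumerate(p)])
    return grads

# Toy batch (illustrative values): two examples, three classes.
logits = [[1.0, 0.0, -1.0], [0.2, 0.3, 2.0]]
labels = [0, 2]
w = [0.5, 1.0, 3.0]

loss_value = class_weighted_ce(logits, labels, w)
grads = class_weighted_ce_grad(logits, labels, w)
```

Each row of `grads` sums to zero, and its magnitude scales with the weight of that example's true class, which is how the weighting shifts gradient mass toward the classes given larger $w_k$.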

3. Weight Selection Strategies and Pitfalls

Selection of weights $w_k$ determines the learning bias. Standard tactics include:

  • Inverse frequency weighting: $w_k \propto 1/n_k$, with $n_k$ the number of samples in class $k$.
  • Cost-matrix design: $w_k \propto C_{k,\cdot} + C_{\cdot,k}$, with $C$ the cost matrix.
  • Prevalence-adjusted balancing: $w_k = N/(K n_k)$.
  • Meta-learned weights: $w_k$ as trainable parameters optimized for a downstream metric.

For segmentation tasks, per-pixel weighting can be spatially varying. The Dilated Balanced Cross-Entropy (DBCE) approach computes local weights via mask dilation, so that dilated object regions and their boundaries receive upweighted loss contributions, addressing the pitfalls of extreme inverse frequency reweighting that would otherwise amplify gradient noise and bias boundary predictions (Hosseini et al., 2024).

Excessively large weights may destabilize training via exploding gradients; normalization (e.g., rescaling so that $\sum_k w_k = K$) or clipping is standard.
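The frequency-based schemes and the stabilization steps above can be sketched in a few lines of plain Python. The helper names, the example class counts, and the clipping threshold are illustrative assumptions; the sketch also assumes every class has at least one sample.

```python
def inverse_frequency_weights(counts):
    """w_k proportional to 1 / n_k (unnormalized)."""
    return [1.0 / n for n in counts]

def prevalence_adjusted_weights(counts):
    """w_k = N / (K * n_k), with N the total sample count and K the class count."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * n) for n in counts]

def normalize_to_sum_k(weights):
    """Rescale so that the weights sum to the number of classes K."""
    k = len(weights)
    s = sum(weights)
    return [k * w / s for w in weights]

def clip_weights(weights, max_weight):
    """Cap each weight at max_weight to limit gradient scale."""
    return [min(w, max_weight) for w in weights]

# Illustrative, heavily imbalanced class counts (an assumption for the example).
counts = [900, 90, 10]

raw = inverse_frequency_weights(counts)
balanced = prevalence_adjusted_weights(counts)
normalized = normalize_to_sum_k(raw)
clipped = clip_weights(balanced, max_weight=10.0)
```

With prevalence-adjusted weights, each class contributes the same total weight $w_k n_k = N/K$, so the weighted sample count stays equal to $N$.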

4. Geometric and Optimization Implications

Weighted softmax cross-entropy induces a cost-sensitive SVM geometry in the unconstrained features model (UFM) abstraction. The global minima of such parameterizations admit closed-form expressions for classifier norms and angles as functions of the class weights and label imbalances (Behnia et al., 2023). Specifically:

  • Assigning higher $w_k$ to minority classes increases their classifier vector norms and correspondingly enlarges SVM margins for those classes.
  • The explicit margin scaling is $w_c \cdot \mathrm{margin}_c \geq 1$, so classes with higher weights carve out wider angular regions and stronger feature separation.
  • Despite these theoretical properties, empirical results show that weighted softmax cross-entropy (WSCE) achieves only small gains in balanced accuracy compared to logit-adjusted alternatives (CDT/LDT), which provide more flexible geometric control under label imbalance.

Thus, while class weighting does tilt the optimization toward underrepresented classes, it offers coarse control and can slow convergence if weights are too imbalanced.

5. Specialized Weighted Softmax Variants

Negative-Focused Weights-biased Softmax (W-Softmax)

W-Softmax, proposed by Li and Wang, injects a class-focused bias $\alpha\, w_c$ (for hyperparameter $\alpha$) into the negative class weights for each sample, geometrically enlarging the angular decision margin and promoting intra-class compactness and inter-class separation (Li et al., 2019). For input $x$ and true class $c$:

$$w'_i = \frac{\alpha w_c + w_i}{\|\alpha w_c + w_i\|},\;\; i \ne c;\qquad w'_c = w_c.$$

The modified logits are $z_c = w_c^T x$ and $z_i = {w'_i}^T x$ for $i \ne c$, and the loss is the cross-entropy over these logits. The hyperparameter $\alpha$ controls the margin: larger $\alpha$ yields more discriminative features at the possible expense of convergence speed. Empirical results on vision benchmarks yield absolute accuracy gains of up to 2–4% over standard softmax, particularly as the number of classes increases.
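The weight modification can be sketched as follows. This is a minimal plain-Python illustration of the formula above rather than the authors' implementation: the function names, the 2-D class vectors, and the input are assumptions, and the sketch assumes unit-norm class vectors and $\alpha w_c + w_i \ne 0$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def w_softmax_logits(weights, x, c, alpha):
    """Logits with negative-class vectors biased toward the true class c.

    weights: list of K class weight vectors; x: feature vector.
    Assumes alpha * w_c + w_i is nonzero for every i != c.
    """
    wc = weights[c]
    logits = []
    for i, wi in enumerate(weights):
        if i == c:
            logits.append(dot(wc, x))  # true-class logit is unchanged
        else:
            biased = [alpha * a + b for a, b in zip(wc, wi)]
            n = norm(biased)
            logits.append(dot([v / n for v in biased], x))
    return logits

def cross_entropy(logits, c):
    """Standard softmax cross-entropy via a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[c]

# Illustrative unit-norm class vectors and input (assumptions).
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [0.8, 0.6]
c = 0

loss_plain = cross_entropy(w_softmax_logits(weights, x, c, alpha=0.0), c)
loss_margin = cross_entropy(w_softmax_logits(weights, x, c, alpha=0.5), c)
```

In this toy case the biased negative vector for class 1 moves toward $w_c$, which raises that negative logit and therefore the loss for the same input; training must then push features further from the negatives to reach the same loss, which is the margin-enlarging effect described above.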

Re-Weighted Softmax for Federated Learning

In federated learning under client heterogeneity, "Re-Weighted Softmax Cross-Entropy" (WSM) modifies the denominator in the softmax to reflect local class frequencies $\beta^{(k)}$ on client $k$ (Legate et al., 2023):

$$L_{WSM}(X_k; w, \beta^{(k)}) = -\sum_{x\in X_k} \left[ f_w(x)_{y(x)} - \log \sum_c \beta^{(k)}_c \exp\left(f_w(x)_c\right) \right].$$

Classes not present in the local dataset ($\beta_c^{(k)} = 0$) are omitted from the softmax contrast term, thus mitigating "client forgetting" in the global model. Empirically, WSM reduces forgetting and increases final accuracy by 2–8% in highly non-i.i.d. federated setups.
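A minimal sketch of the client-side loss, written directly from the formula above in plain Python, is given below. The function names and the toy client data are assumptions, as is the choice to take $\beta^{(k)}$ as normalized local label frequencies; the sketch is not the reference implementation.

```python
import math

def class_frequencies(labels, num_classes):
    """Local class proportions beta_c on one client (assumed normalization)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [c / n for c in counts]

def wsm_loss(logits_batch, labels, beta):
    """Sum over samples of -( f_y - log sum_c beta_c * exp(f_c) ).

    Classes with beta_c == 0 drop out of the log-sum, so their logits
    receive no gradient from this client.
    """
    total = 0.0
    for f, y in zip(logits_batch, labels):
        present = [(b, v) for b, v in zip(beta, f) if b > 0.0]
        m = max(v for _, v in present)  # stabilize the log-sum-exp
        log_denom = m + math.log(sum(b * math.exp(v - m) for b, v in present))
        total += -(f[y] - log_denom)
    return total

# Toy client that only holds classes 0 and 1 out of 3 (illustrative values).
labels = [0, 0, 1]
beta = class_frequencies(labels, num_classes=3)
logits_batch = [[1.5, 0.2, -0.3], [0.9, 0.1, 2.0], [-0.4, 1.1, 0.7]]

loss_value = wsm_loss(logits_batch, labels, beta)
```

Because class 2 is absent on this client, changing its logits leaves `loss_value` unchanged, which is the mechanism by which local updates avoid suppressing classes the client never observes.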

6. Weighted Softmax in Segmentation: Dilated Balanced CE

Traditional balanced cross-entropy can degrade segmentation performance, especially under severe class imbalance, due to amplification of noise from rare classes and poor handling of boundaries (Hosseini et al., 2024). The Dilated Balanced Cross-Entropy (DBCE) modifies weights by dilating each class mask with a structuring element, assigning high weights to both small objects and their immediate neighborhoods. This approach achieves performance equal to or greater than Dice + CE losses across polyp, skin lesion, and multi-organ segmentation datasets while avoiding the pathological gradients of plain inverse-frequency weighting. The spatial weight map ensures stable training and improved precision, especially for rare structures.
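The idea of dilating a mask before balancing can be illustrated with a small binary example. The sketch below is only a rough illustration under stated assumptions: the square structuring element, the two-class inverse-frequency rule applied to the dilated region, and all names are choices made for this example, and the exact weighting used by DBCE may differ.

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        out[rr][cc] = 1
    return out

def dilated_balanced_weights(mask, radius=1):
    """Per-pixel weights from inverse frequency of the dilated region.

    Assumed rule (hypothetical): pixels inside the dilated mask get
    N / (2 * n_in), all other pixels get N / (2 * n_out).
    """
    d = dilate(mask, radius)
    h, w = len(mask), len(mask[0])
    n = h * w
    n_in = sum(map(sum, d))
    n_out = n - n_in
    w_in = n / (2 * n_in) if n_in else 0.0
    w_out = n / (2 * n_out) if n_out else 0.0
    return [[w_in if d[r][c] else w_out for c in range(w)] for r in range(h)]

# A 5x5 mask with a single foreground pixel (illustrative).
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1

plain = dilated_balanced_weights(mask, radius=0)    # no dilation: plain balancing
dilated = dilated_balanced_weights(mask, radius=1)  # object plus its neighborhood
```

Under these assumptions the lone foreground pixel receives a much smaller weight after dilation than under plain balancing, while its immediate neighbors share the same elevated weight, which reflects the tempering of extreme inverse-frequency weights and the emphasis on boundaries described above.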

7. Empirical Performance and Implementation

Empirical studies demonstrate that weighted softmax cross-entropy and its variants confer measurable benefits for imbalanced classification, cost-sensitive tasks, federated representation fidelity, and structured prediction:

  • In image classification, W-Softmax improves accuracy from 90.95% to 93.28% on CIFAR-10 and from 67.26% to 71.38% on CIFAR-100 (Li et al., 2019).
  • In federated benchmarks, WSM raises accuracy by 2–8% under severe heterogeneity (Legate et al., 2023).
  • For medical segmentation, DBCE matches or outperforms Dice+CE in mDice and mIoU without instability (Hosseini et al., 2024).

Canonical implementation in deep learning frameworks requires only modification of the softmax operator or per-sample loss weighting; well-structured pseudocode for standard settings is available in the literature (Marchetti et al., 2023, Hosseini et al., 2024).

The $f$-divergence-based understanding enables further generalizations: e.g., using Tsallis $\alpha$-divergence losses can outperform cross-entropy in language modeling with no change in test-time inference (Roulet et al., 30 Jan 2025). This flexibility affirms the foundational role of weighted softmax cross-entropy as both a practical tool and a theoretical construct in modern deep learning.
