FusedLinearCrossEntropy Loss

Updated 21 February 2026
  • FusedLinearCrossEntropy is a modified cross entropy loss that adds a linear term modulated by the predicted probability to sustain useful gradients.
  • It leverages information-theoretic principles, specifically the Jeffreys divergence, to balance model confidence and penalize overconfidence.
  • Empirical studies show it yields a ~0.5% improvement in top-5 accuracy on CIFAR-100 with minimal computational overhead.

FusedLinearCrossEntropy, also known as Linearly Adaptive Cross Entropy, is a modified classification loss function developed as an enhancement to the standard cross entropy criterion for use in neural network optimization. Characterized by the addition of a linear term modulated by the model’s predicted probability for the true class, it incorporates principles from information theory—specifically, the Jeffreys divergence, a symmetrized version of the Kullback–Leibler (KL) divergence. This construction leads to improved learning dynamics, particularly in settings with one-hot encoded targets, and shows empirical benefits for image classification tasks using modern deep architectures (Shim, 10 Jul 2025).

1. Mathematical Formulation and Properties

Let $C$ denote the number of classes, $z \in \mathbb{R}^C$ the vector of pre-softmax logits output by a neural network, and $q_i = \exp(z_i) / \sum_{k=1}^{C} \exp(z_k)$ the softmax-normalized probability for class $i$. For a sample with true class $c$, the conventional cross entropy loss is

$$L_{CE}(z, c) = -\log q_c.$$

FusedLinearCrossEntropy modifies this by introducing a linearly weighted term:

$$L_{\mathrm{Fused}}(z, c) = -(1 - q_c) \log q_c.$$

Equivalently, the loss decomposes as

$$L_{\mathrm{Fused}} = L_{CE} + L_{\mathrm{lin}},$$

where

$$L_{\mathrm{lin}} = q_c \log q_c.$$

Thus,

$$L_{\mathrm{Fused}} = -\log q_c + q_c \log q_c = -(1 - q_c) \log q_c.$$

No additional hyperparameters are introduced; the coefficient of the linear term is set to 1 in experimental evaluation (Shim, 10 Jul 2025).
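
For concreteness, the following is a minimal PyTorch sketch of the loss; the module name FusedLinearCrossEntropy and its tensor names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

class FusedLinearCrossEntropy(torch.nn.Module):
    """Batch mean of -(1 - q_c) * log q_c, with q = softmax(z)."""

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # log q_c, computed stably from the logits
        log_q_c = F.log_softmax(logits, dim=-1).gather(
            1, target.unsqueeze(1)).squeeze(1)
        q_c = log_q_c.exp()
        # -(1 - q_c) * log q_c, averaged over the batch
        return (-(1.0 - q_c) * log_q_c).mean()
```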

2. Information-Theoretic Motivation

The design of FusedLinearCrossEntropy is grounded in information theory, specifically the Jeffreys divergence $J(P, Q)$ between the empirical one-hot label distribution $P$ and the model distribution $Q$. Jeffreys divergence is defined as

$$J(P, Q) = D(P, Q) + D(Q, P),$$

where $D(P, Q)$ and $D(Q, P)$ are the forward and reverse KL divergences, respectively. For one-hot targets ($P(x_c) = 1$), these reduce to $D(P, Q) \approx -\log q_c$ and $D(Q, P) \approx q_c \log q_c$, so

$$J(P, Q) \approx -\log q_c + q_c \log q_c = -(1 - q_c) \log q_c.$$

Minimizing the Jeffreys divergence encourages a large $q_c$ (as in standard cross entropy) while penalizing excessive model confidence through the $q_c \log q_c$ term. The added linear weighting yields a gradient structure that prevents vanishing signals for highly confident predictions ($q_c \rightarrow 1$), sustaining optimization pressure late in training (Shim, 10 Jul 2025).
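
The decomposition $L_{\mathrm{Fused}} = L_{CE} + q_c \log q_c$ can be checked numerically; the snippet below is a sketch using arbitrary random logits (all tensor names are illustrative).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # hypothetical batch: 4 samples, 10 classes
target = torch.randint(0, 10, (4,))

log_q_c = F.log_softmax(logits, dim=-1).gather(1, target.unsqueeze(1)).squeeze(1)
q_c = log_q_c.exp()

ce = -log_q_c                          # forward term:  -log q_c
lin = q_c * log_q_c                    # linear term:    q_c log q_c
fused = -(1.0 - q_c) * log_q_c         # fused loss

assert torch.allclose(fused, ce + lin)  # L_Fused = L_CE + L_lin, elementwise
```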

3. Gradient Computation and Backpropagation

The partial derivative of the fused loss with respect to the true-class probability $q_c$ is

$$\frac{\partial L}{\partial q_c} = \log q_c + 1 - \frac{1}{q_c}.$$

Since the loss does not depend directly on $q_i$ for $i \neq c$, $\frac{\partial L}{\partial q_i} = 0$ for $i \neq c$. Using the softmax Jacobian

$$\frac{\partial q_j}{\partial z_i} = q_j (\delta_{ij} - q_i),$$

the backpropagation gradients are:

  • For $j = c$:

$$\frac{\partial L}{\partial z_c} = \left[\log q_c + 1 - \frac{1}{q_c}\right] q_c (1 - q_c)$$

  • For $j \neq c$:

$$\frac{\partial L}{\partial z_j} = -\left[\log q_c + 1 - \frac{1}{q_c}\right] q_c q_j$$

These gradients are then propagated through the network layers via the chain rule, analogously to standard cross entropy.
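
These closed-form gradients can be validated against automatic differentiation; the following is a sketch under assumed shapes (one sample, five classes), not part of the original paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 5, requires_grad=True)  # one sample, five classes
c = 2                                           # arbitrary true-class index

q = F.softmax(logits, dim=-1)
loss = -(1.0 - q[0, c]) * torch.log(q[0, c])
loss.backward()

with torch.no_grad():
    qd = q.detach()[0]
    bracket = torch.log(qd[c]) + 1.0 - 1.0 / qd[c]
    grad = -bracket * qd[c] * qd                # case j != c
    grad[c] = bracket * qd[c] * (1.0 - qd[c])   # case j  = c
    assert torch.allclose(logits.grad[0], grad, atol=1e-6)
```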

4. Empirical Evaluation and Performance

Experimental studies were conducted on image classification using a pre-activation ResNet-18 architecture (He et al. 2016), adapted for the CIFAR-100 dataset (100 classes). Training involved standard data augmentation (random crop and flip), per-pixel normalization, SGD with 0.9 momentum, weight decay $5 \times 10^{-4}$, stepwise learning rate decay, batch size 100, and 200 epochs. The loss function was swapped from standard cross entropy to FusedLinearCrossEntropy in the training loop. Computational overhead is minimal: two extra scalar operations per sample (one subtraction, one multiplication) (Shim, 10 Jul 2025).
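
A sketch of the drop-in swap in a single training step is shown below; the linear stand-in model, learning rate value, and dummy batch are assumptions for illustration (the study used a pre-activation ResNet-18 on augmented CIFAR-100 with stepwise learning rate decay).

```python
import torch
from torch import nn

model = nn.Linear(32 * 32 * 3, 100)     # stand-in for the pre-activation ResNet-18
criterion = FusedLinearCrossEntropy()   # sketch from Section 1, replacing nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # lr value assumed
                            momentum=0.9, weight_decay=5e-4)

images = torch.randn(100, 32 * 32 * 3)  # dummy batch standing in for the CIFAR-100 loader
labels = torch.randint(0, 100, (100,))

optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```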

The main quantitative findings are summarized below:

| Loss function | Mean top-5 error ± std |
| --- | --- |
| Cross Entropy | 6.7% ± 0.10% |
| FusedLinearCrossEntropy | 6.2% ± 0.15% |

Results are averaged over epochs 190–200 and five independent trials. FusedLinearCrossEntropy achieved a consistent ~0.5% absolute reduction in top-5 error, with slightly faster convergence after epoch 50, and it never underperformed standard cross entropy in any trial (Shim, 10 Jul 2025).

5. Computational Efficiency and Implementation

Replacing standard cross entropy with FusedLinearCrossEntropy in practice involves a trivial modification: $\log q_c$ is multiplied by $(1 - q_c)$. Only two extra scalar operations per sample are introduced, resulting in a negligible increase in wall-clock training time. No hyperparameter tuning was required, as the coefficient of the linear term remained at its default value of 1 during evaluation (Shim, 10 Jul 2025).
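
In a log-softmax formulation, the change amounts to rescaling each per-sample term; the function below is an illustrative sketch (note that recovering $q_c$ from $\log q_c$ costs one extra exponential when only log-probabilities are materialized).

```python
import torch
import torch.nn.functional as F

def fused_ce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    log_q_c = F.log_softmax(logits, dim=-1).gather(
        1, target.unsqueeze(1)).squeeze(1)
    # Standard cross entropy would return (-log_q_c).mean();
    # the fused loss rescales each term by (1 - q_c):
    return (-(1.0 - log_q_c.exp()) * log_q_c).mean()
```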

6. Limitations and Directions for Further Research

The linear term $q_c \log q_c$ in FusedLinearCrossEntropy may, in some scenarios, excessively penalize very high predicted probabilities ($q_c \rightarrow 1$). This could potentially harm model calibration or early training convergence, especially on highly imbalanced datasets. Introducing a scaling parameter $\alpha$ to modulate the linear term

$$L = -(1 - \alpha q_c) \log q_c$$

is proposed as a means to alleviate this effect; a sketch of this variant appears after the list below. Prospective research topics include:

  • Theoretical convergence analysis for varying $\alpha$.
  • Assessing loss robustness against label noise and adversarial perturbations.
  • Extending the criterion to multi-label, hierarchical, or cost-sensitive settings.
  • Developing adaptive or learned fusion coefficients.
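
A minimal sketch of the proposed $\alpha$-scaled variant, assuming the same log-softmax formulation as above (the function name alpha_fused_ce is hypothetical):

```python
import torch
import torch.nn.functional as F

def alpha_fused_ce(logits: torch.Tensor, target: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """L = -(1 - alpha * q_c) * log q_c.

    alpha = 1 recovers the fused loss; alpha = 0 recovers cross entropy."""
    log_q_c = F.log_softmax(logits, dim=-1).gather(
        1, target.unsqueeze(1)).squeeze(1)
    return (-(1.0 - alpha * log_q_c.exp()) * log_q_c).mean()
```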

FusedLinearCrossEntropy maintains the structural simplicity and computational cost profile of standard cross entropy while introducing an information-theoretic regularizer that sustains useful gradients throughout training and empirically improves classification performance under the experimental conditions studied (Shim, 10 Jul 2025).

References (1)
