FusedLinearCrossEntropy Loss
- FusedLinearCrossEntropy is a modified cross entropy loss that adds a linear term modulated by the predicted probability to sustain useful gradients.
- It leverages information-theoretic principles, specifically the Jeffreys divergence, to balance model confidence and penalize overconfidence.
- Empirical studies report a ≈0.5 percentage-point reduction in top-5 error on CIFAR-100 with minimal computational overhead.
FusedLinearCrossEntropy, also known as Linearly Adaptive Cross Entropy, is a modified classification loss function developed as an enhancement to the standard cross entropy criterion for use in neural network optimization. Characterized by the addition of a linear term modulated by the model’s predicted probability for the true class, it incorporates principles from information theory—specifically, the Jeffreys divergence, a symmetrized version of the Kullback–Leibler (KL) divergence. This construction leads to improved learning dynamics, particularly in settings with one-hot encoded targets, and shows empirical benefits for image classification tasks using modern deep architectures (Shim, 10 Jul 2025).
1. Mathematical Formulation and Properties
Let $C$ denote the number of classes, $z \in \mathbb{R}^C$ the pre-softmax logits output by a neural network, and $p_k = e^{z_k} / \sum_{j=1}^{C} e^{z_j}$ the softmax-normalized probability for class $k$. For a sample with true class $y$, the conventional cross entropy loss is
$$\mathcal{L}_{\mathrm{CE}} = -\log p_y.$$
FusedLinearCrossEntropy modifies this by weighting the cross entropy term with a factor linear in the predicted probability:
$$\mathcal{L}_{\mathrm{FLCE}} = -(1 - p_y)\log p_y.$$
Equivalently, the loss decomposes as
$$\mathcal{L}_{\mathrm{FLCE}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{lin}},$$
where
$$\mathcal{L}_{\mathrm{CE}} = -\log p_y, \qquad \mathcal{L}_{\mathrm{lin}} = p_y \log p_y.$$
Thus,
$$\mathcal{L}_{\mathrm{FLCE}} = -\log p_y + p_y \log p_y.$$
No additional hyperparameters are introduced; the coefficient of the linear term is set to 1 in experimental evaluation (Shim, 10 Jul 2025).
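Under the $-(1 - p_y)\log p_y$ formulation, the loss can be sketched in a few lines of NumPy. This is a minimal single-sample illustration; the function name `flce_loss` and the logits-based interface are assumptions for this sketch, not from the paper:

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def flce_loss(z, y):
    """FusedLinearCrossEntropy for a single sample (illustrative sketch).

    z : (C,) array of pre-softmax logits
    y : integer index of the true class
    Returns -(1 - p_y) * log(p_y): standard cross entropy scaled by a
    linearly adaptive weight (one extra subtraction and one extra
    multiplication relative to cross entropy).
    """
    p_y = softmax(z)[y]
    return -(1.0 - p_y) * np.log(p_y)
```

Like cross entropy, the fused loss is non-negative and attains its minimum of 0 only at $p_y = 1$, since the linear weight $(1 - p_y)$ vanishes there.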
2. Information-Theoretic Motivation
The design of FusedLinearCrossEntropy is grounded in information theory, specifically the Jeffreys divergence between the empirical one-hot label distribution $q$ and the model distribution $p$. The Jeffreys divergence is defined as
$$D_J(q, p) = D_{\mathrm{KL}}(q \,\|\, p) + D_{\mathrm{KL}}(p \,\|\, q),$$
where $D_{\mathrm{KL}}(q \,\|\, p)$ and $D_{\mathrm{KL}}(p \,\|\, q)$ are the forward and reverse KL divergences, respectively. For one-hot targets ($q_y = 1$, $q_k = 0$ for $k \ne y$), these reduce to $-\log p_y$ and $p_y \log p_y$ (retaining the finite true-class contribution), so
$$D_J = -\log p_y + p_y \log p_y = -(1 - p_y)\log p_y = \mathcal{L}_{\mathrm{FLCE}}.$$
Minimizing the Jeffreys divergence both encourages large $p_y$ (as in standard cross entropy) and penalizes excessive model confidence through the $p_y \log p_y$ term. The added linear weighting reshapes the gradient structure for highly confident predictions ($p_y \to 1$), sustaining useful optimization signals late in training (Shim, 10 Jul 2025).
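The one-hot reduction can be verified numerically: the forward-KL part $-\log p_y$ plus the finite true-class part of the reverse KL, $p_y \log p_y$, equals the fused loss $-(1 - p_y)\log p_y$ for every $p_y \in (0, 1)$. The helper names below are illustrative only:

```python
import numpy as np

def jeffreys_one_hot(p_y):
    # Forward KL(q || p) with one-hot q reduces to -log p_y; the finite
    # true-class term of the reverse KL(p || q) is p_y * log p_y.
    forward_kl = -np.log(p_y)
    reverse_kl_term = p_y * np.log(p_y)
    return forward_kl + reverse_kl_term

def fused_loss_from_prob(p_y):
    # The fused loss expressed directly as linearly weighted cross entropy.
    return -(1.0 - p_y) * np.log(p_y)
```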
3. Gradient Computation and Backpropagation
The partial derivative of the fused loss with respect to the true-class probability is
$$\frac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial p_y} = \log p_y + 1 - \frac{1}{p_y}.$$
Since the loss does not depend directly on $p_k$ for $k \ne y$, $\partial \mathcal{L}_{\mathrm{FLCE}} / \partial p_k = 0$ for $k \ne y$. Using the softmax Jacobian
$$\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),$$
the backpropagation gradients with respect to the logits are:
- For $k = y$: $\dfrac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial z_y} = (1 - p_y)\bigl(p_y \log p_y - (1 - p_y)\bigr)$
- For $k \ne y$: $\dfrac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial z_k} = p_k\bigl((1 - p_y) - p_y \log p_y\bigr)$
These gradients are then propagated through the network layers via the chain rule, analogously to standard cross entropy.
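Closed-form logit gradients of this kind are easy to get wrong, so it is worth checking them against finite differences. The sketch below assumes the $-(1 - p_y)\log p_y$ formulation; function names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def flce(z, y):
    # Fused loss: cross entropy scaled by the linear weight (1 - p_y).
    p_y = softmax(z)[y]
    return -(1.0 - p_y) * np.log(p_y)

def flce_grad(z, y):
    """Analytic gradient of the fused loss w.r.t. the logits z."""
    p = softmax(z)
    p_y = p[y]
    # k != y branch: p_k * ((1 - p_y) - p_y * log p_y)
    g = p * ((1.0 - p_y) - p_y * np.log(p_y))
    # k == y branch: (1 - p_y) * (p_y * log p_y - (1 - p_y))
    g[y] = (1.0 - p_y) * (p_y * np.log(p_y) - (1.0 - p_y))
    return g
```

A central finite-difference check confirms the formulas; note also that the gradient components sum to zero, as for standard cross entropy over a softmax.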
4. Empirical Evaluation and Performance
Experimental studies were conducted on image classification using a pre-activation ResNet-18 architecture (He et al. 2016), adapted for the CIFAR-100 dataset (100 classes). Training used standard data augmentation (random crop and horizontal flip), per-pixel normalization, SGD with momentum 0.9, weight decay, stepwise learning rate decay, batch size 100, and 200 epochs. The loss function was swapped from standard cross entropy to FusedLinearCrossEntropy in the training loop. Computational overhead is minimal: two extra scalar operations per sample (one subtraction, one multiplication) (Shim, 10 Jul 2025).
The main quantitative findings are summarized below:
| loss function | mean top-5 error ± std |
|---|---|
| Cross Entropy | 6.7 % ± 0.10 % |
| FusedLinearCrossEntropy | 6.2 % ± 0.15 % |
Results are averaged over epochs 190–200 and five independent trials. FusedLinearCrossEntropy achieved a consistent ≈0.5 percentage-point reduction in top-5 error, with slightly faster convergence after epoch 50. The fused loss never underperformed standard cross entropy in any trial (Shim, 10 Jul 2025).
5. Computational Efficiency and Implementation
Replacing standard cross entropy with FusedLinearCrossEntropy in practice involves a trivial modification: the per-sample cross entropy term $-\log p_y$ is multiplied by $(1 - p_y)$. Only two extra scalar operations per sample are introduced, resulting in a negligible increase in wall-clock training time. No hyperparameter tuning was required, as the coefficient of the linear term remained at its default value of 1 during evaluation (Shim, 10 Jul 2025).
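To illustrate how drop-in the change is, the batched sketch below first computes standard per-sample cross entropy from logits (via a numerically stable log-softmax) and then applies the single extra subtraction and multiplication per sample. The batched interface and function name are assumptions for this sketch:

```python
import numpy as np

def batched_flce(logits, targets):
    """Mean fused loss over a batch (illustrative sketch).

    logits  : (N, C) array of pre-softmax scores
    targets : (N,) integer class indices
    Relative to standard cross entropy, the only extra per-sample work
    is one subtraction (1 - p_y) and one multiplication.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # stability shift
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    idx = np.arange(len(targets))
    log_p_y = log_p[idx, targets]
    p_y = np.exp(log_p_y)
    ce = -log_p_y                      # standard per-sample cross entropy
    return np.mean((1.0 - p_y) * ce)   # fused: multiply by (1 - p_y)
```

In a deep-learning framework the same two operations would be applied to the framework's per-sample cross entropy before reduction, leaving the rest of the training loop untouched.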
6. Limitations and Directions for Further Research
The linear term in FusedLinearCrossEntropy may, in some scenarios, excessively penalize very high predicted probabilities ($p_y \to 1$). This could potentially harm model calibration or early training convergence, especially on highly imbalanced datasets. Introducing a scaling parameter $\lambda$ to modulate the linear term,
$$\mathcal{L} = -\log p_y + \lambda\, p_y \log p_y,$$
is proposed as a means to alleviate this effect; $\lambda = 1$ recovers the default fused loss and $\lambda = 0$ recovers standard cross entropy. Prospective research topics include:
- Theoretical convergence analysis for varying $\lambda$.
- Assessing loss robustness against label noise and adversarial perturbations.
- Extending the criterion to multi-label, hierarchical, or cost-sensitive settings.
- Developing adaptive or learned fusion coefficients.
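The proposed $\lambda$-scaled variant, $\mathcal{L} = -\log p_y + \lambda\, p_y \log p_y$, is straightforward to prototype; the function name and interface below are assumptions for this sketch:

```python
import numpy as np

def scaled_flce(z, y, lam=1.0):
    """Lambda-scaled fused loss: -log p_y + lam * p_y * log p_y.

    lam = 1 recovers FusedLinearCrossEntropy; lam = 0 recovers
    standard cross entropy.
    """
    e = np.exp(z - np.max(z))
    p_y = (e / e.sum())[y]
    return -np.log(p_y) + lam * p_y * np.log(p_y)
```

Because $p_y \log p_y \le 0$, increasing $\lambda$ lowers the loss value for a fixed prediction while strengthening the confidence penalty's share of the gradient.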
FusedLinearCrossEntropy maintains the structural simplicity and computational cost profile of standard cross entropy while introducing an information-theoretic regularizer that sustains useful gradients throughout training and empirically improves classification performance under the experimental conditions studied (Shim, 10 Jul 2025).