FusedLinearCrossEntropy Loss
- FusedLinearCrossEntropy is a modified cross entropy loss that adds a linear term modulated by the predicted probability to sustain useful gradients.
- It leverages information-theoretic principles, specifically the Jeffreys divergence, to balance model confidence and penalize overconfidence.
- Empirical studies report a ≈0.5 percentage-point reduction in top-5 error on CIFAR-100 with minimal computational overhead.
FusedLinearCrossEntropy, also known as Linearly Adaptive Cross Entropy, is a modified classification loss function developed as an enhancement to the standard cross entropy criterion for use in neural network optimization. Characterized by the addition of a linear term modulated by the model’s predicted probability for the true class, it incorporates principles from information theory—specifically, the Jeffreys divergence, a symmetrized version of the Kullback–Leibler (KL) divergence. This construction leads to improved learning dynamics, particularly in settings with one-hot encoded targets, and shows empirical benefits for image classification tasks using modern deep architectures (Shim, 10 Jul 2025).
1. Mathematical Formulation and Properties
Let $C$ denote the number of classes, $z \in \mathbb{R}^C$ the pre-softmax logits output by a neural network, and $p_k = e^{z_k} / \sum_{j=1}^{C} e^{z_j}$ the softmax-normalized probability for class $k$. For a sample with true class $y$, the conventional cross entropy loss is
$$\mathcal{L}_{\mathrm{CE}} = -\log p_y.$$
FusedLinearCrossEntropy modifies this by weighting the cross entropy term with a factor linear in the predicted probability:
$$\mathcal{L}_{\mathrm{FLCE}} = -(1 - p_y)\log p_y.$$
Equivalently, the loss decomposes as
$$\mathcal{L}_{\mathrm{FLCE}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{lin}},$$
where
$$\mathcal{L}_{\mathrm{CE}} = -\log p_y, \qquad \mathcal{L}_{\mathrm{lin}} = p_y \log p_y.$$
Thus,
$$\mathcal{L}_{\mathrm{FLCE}} = -\log p_y + p_y \log p_y.$$
No additional hyperparameters are introduced; the coefficient of the linear term is set to 1 in experimental evaluation (Shim, 10 Jul 2025).
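Under the $-(1 - p_y)\log p_y$ formulation, the loss can be sketched in a few lines of NumPy. This is a minimal single-sample illustration; the function name `flce_loss` and the logits-based interface are assumptions for this sketch, not from the paper:

```python
import numpy as np

def softmax(z):
    # Subtract the max logit for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def flce_loss(z, y):
    """FusedLinearCrossEntropy for a single sample (illustrative sketch).

    z : (C,) array of pre-softmax logits
    y : integer index of the true class
    Returns -(1 - p_y) * log(p_y): standard cross entropy scaled by a
    linearly adaptive weight (one extra subtraction and one extra
    multiplication relative to cross entropy).
    """
    p_y = softmax(z)[y]
    return -(1.0 - p_y) * np.log(p_y)
```

Like cross entropy, the fused loss is non-negative and attains its minimum of 0 only at $p_y = 1$, since the linear weight $(1 - p_y)$ vanishes there.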
2. Information-Theoretic Motivation
The design of FusedLinearCrossEntropy is grounded in information theory, specifically the Jeffreys divergence between the empirical one-hot label distribution $q$ and the model distribution $p$. The Jeffreys divergence is defined as
$$D_J(q, p) = D_{\mathrm{KL}}(q \,\|\, p) + D_{\mathrm{KL}}(p \,\|\, q),$$
where $D_{\mathrm{KL}}(q \,\|\, p)$ and $D_{\mathrm{KL}}(p \,\|\, q)$ are the forward and reverse KL divergences, respectively. For one-hot targets ($q_y = 1$, $q_k = 0$ for $k \ne y$), these reduce to $-\log p_y$ and $p_y \log p_y$ (retaining the finite true-class contribution), so
$$D_J = -\log p_y + p_y \log p_y = -(1 - p_y)\log p_y = \mathcal{L}_{\mathrm{FLCE}}.$$
Minimizing the Jeffreys divergence both encourages large $p_y$ (as in standard cross entropy) and penalizes excessive model confidence through the $p_y \log p_y$ term. The added linear weighting reshapes the gradient structure for highly confident predictions ($p_y \to 1$), sustaining useful optimization signals late in training (Shim, 10 Jul 2025).
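The one-hot reduction can be verified numerically: the forward-KL part $-\log p_y$ plus the finite true-class part of the reverse KL, $p_y \log p_y$, equals the fused loss $-(1 - p_y)\log p_y$ for every $p_y \in (0, 1)$. The helper names below are illustrative only:

```python
import numpy as np

def jeffreys_one_hot(p_y):
    # Forward KL(q || p) with one-hot q reduces to -log p_y; the finite
    # true-class term of the reverse KL(p || q) is p_y * log p_y.
    forward_kl = -np.log(p_y)
    reverse_kl_term = p_y * np.log(p_y)
    return forward_kl + reverse_kl_term

def fused_loss_from_prob(p_y):
    # The fused loss expressed directly as linearly weighted cross entropy.
    return -(1.0 - p_y) * np.log(p_y)
```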
3. Gradient Computation and Backpropagation
The partial derivative of the fused loss with respect to the true-class probability is
$$\frac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial p_y} = \log p_y + 1 - \frac{1}{p_y}.$$
Since the loss does not depend directly on $p_k$ for $k \ne y$, $\partial \mathcal{L}_{\mathrm{FLCE}} / \partial p_k = 0$ for $k \ne y$. Using the softmax Jacobian
$$\frac{\partial p_i}{\partial z_j} = p_i\,(\delta_{ij} - p_j),$$
the backpropagation gradients with respect to the logits are:
- For $k = y$: $\dfrac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial z_y} = (1 - p_y)\bigl(p_y \log p_y - (1 - p_y)\bigr)$
- For $k \ne y$: $\dfrac{\partial \mathcal{L}_{\mathrm{FLCE}}}{\partial z_k} = p_k\bigl((1 - p_y) - p_y \log p_y\bigr)$
These gradients are then propagated through the network layers via the chain rule, analogously to standard cross entropy.
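Closed-form logit gradients of this kind are easy to get wrong, so it is worth checking them against finite differences. The sketch below assumes the $-(1 - p_y)\log p_y$ formulation; function names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def flce(z, y):
    # Fused loss: cross entropy scaled by the linear weight (1 - p_y).
    p_y = softmax(z)[y]
    return -(1.0 - p_y) * np.log(p_y)

def flce_grad(z, y):
    """Analytic gradient of the fused loss w.r.t. the logits z."""
    p = softmax(z)
    p_y = p[y]
    # k != y branch: p_k * ((1 - p_y) - p_y * log p_y)
    g = p * ((1.0 - p_y) - p_y * np.log(p_y))
    # k == y branch: (1 - p_y) * (p_y * log p_y - (1 - p_y))
    g[y] = (1.0 - p_y) * (p_y * np.log(p_y) - (1.0 - p_y))
    return g
```

A central finite-difference check confirms the formulas; note also that the gradient components sum to zero, as for standard cross entropy over a softmax.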
4. Empirical Evaluation and Performance
Experimental studies were conducted on image classification using a pre-activation ResNet-18 architecture (He et al. 2016), adapted for the CIFAR-100 dataset (100 classes). Training used standard data augmentation (random crop and horizontal flip), per-pixel normalization, SGD with momentum 0.9, weight decay, stepwise learning rate decay, batch size 100, and 200 epochs. The loss function was swapped from standard cross entropy to FusedLinearCrossEntropy in the training loop. Computational overhead is minimal: two extra scalar operations per sample (one subtraction, one multiplication) (Shim, 10 Jul 2025).
The main quantitative findings are summarized below:
| loss function | mean top-5 error ± std |
|---|---|
| Cross Entropy | 6.7 % ± 0.10 % |
| FusedLinearCrossEntropy | 6.2 % ± 0.15 % |
Results are averaged over epochs 190–200 and five independent trials. FusedLinearCrossEntropy achieved a consistent ≈0.5 percentage-point reduction in top-5 error, with slightly faster convergence after epoch 50. The fused loss never underperformed standard cross entropy in any trial (Shim, 10 Jul 2025).
5. Computational Efficiency and Implementation
Replacing standard cross entropy with FusedLinearCrossEntropy in practice involves a trivial modification: the per-sample cross entropy term $-\log p_y$ is multiplied by $(1 - p_y)$. Only two extra scalar operations per sample are introduced, resulting in a negligible increase in wall-clock training time. No hyperparameter tuning was required, as the coefficient of the linear term remained at its default value of 1 during evaluation (Shim, 10 Jul 2025).
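To illustrate how drop-in the change is, the batched sketch below first computes standard per-sample cross entropy from logits (via a numerically stable log-softmax) and then applies the single extra subtraction and multiplication per sample. The batched interface and function name are assumptions for this sketch:

```python
import numpy as np

def batched_flce(logits, targets):
    """Mean fused loss over a batch (illustrative sketch).

    logits  : (N, C) array of pre-softmax scores
    targets : (N,) integer class indices
    Relative to standard cross entropy, the only extra per-sample work
    is one subtraction (1 - p_y) and one multiplication.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # stability shift
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log-softmax
    idx = np.arange(len(targets))
    log_p_y = log_p[idx, targets]
    p_y = np.exp(log_p_y)
    ce = -log_p_y                      # standard per-sample cross entropy
    return np.mean((1.0 - p_y) * ce)   # fused: multiply by (1 - p_y)
```

In a deep-learning framework the same two operations would be applied to the framework's per-sample cross entropy before reduction, leaving the rest of the training loop untouched.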
6. Limitations and Directions for Further Research
The linear term in FusedLinearCrossEntropy may, in some scenarios, excessively penalize very high predicted probabilities ($p_y \to 1$). This could potentially harm model calibration or early training convergence, especially on highly imbalanced datasets. Introducing a scaling parameter $\lambda$ to modulate the linear term,
$$\mathcal{L} = -\log p_y + \lambda\, p_y \log p_y,$$
is proposed as a means to alleviate this effect; $\lambda = 1$ recovers the default fused loss and $\lambda = 0$ recovers standard cross entropy. Prospective research topics include:
- Theoretical convergence analysis for varying $\lambda$.
- Assessing loss robustness against label noise and adversarial perturbations.
- Extending the criterion to multi-label, hierarchical, or cost-sensitive settings.
- Developing adaptive or learned fusion coefficients.
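The proposed $\lambda$-scaled variant, $\mathcal{L} = -\log p_y + \lambda\, p_y \log p_y$, is straightforward to prototype; the function name and interface below are assumptions for this sketch:

```python
import numpy as np

def scaled_flce(z, y, lam=1.0):
    """Lambda-scaled fused loss: -log p_y + lam * p_y * log p_y.

    lam = 1 recovers FusedLinearCrossEntropy; lam = 0 recovers
    standard cross entropy.
    """
    e = np.exp(z - np.max(z))
    p_y = (e / e.sum())[y]
    return -np.log(p_y) + lam * p_y * np.log(p_y)
```

Because $p_y \log p_y \le 0$, increasing $\lambda$ lowers the loss value for a fixed prediction while strengthening the confidence penalty's share of the gradient.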
FusedLinearCrossEntropy maintains the structural simplicity and computational cost profile of standard cross entropy while introducing an information-theoretic regularizer that sustains useful gradients throughout training and empirically improves classification performance under the experimental conditions studied (Shim, 10 Jul 2025).