Truncated Cross Entropy (TCE) in Deep Learning

Updated 13 September 2025
  • TCE is a family of loss functions that modifies standard cross-entropy by truncating or downweighting high-confidence predictions to reduce overconfidence and prevent model collapse.
  • It employs methods like confidence thresholding, gradient taming, and algebraic totalization to enhance numerical stability and improve robustness against label noise.
  • Empirical studies show TCE improves performance in recursive training and generative modeling by extending fidelity and reducing divergence compared to conventional approaches.

Truncated Cross Entropy (TCE) refers to a family of loss functions and related information-theoretic measures in which the standard cross-entropy formula is selectively masked, dampened, or algebraically adjusted to achieve additional robustness, bias mitigation, or totality. TCE encompasses approaches designed to address overconfidence in recursive training, label noise in supervised learning, and algebraic edge cases in the computation of entropy and cross-entropy on arbitrary distributions.

1. Mathematical Formulations of Truncated Cross Entropy

The principal variants of TCE are based on modifications to the classical cross-entropy loss

$$\text{CE}(p, q) = -\sum_i p_i \log q_i$$

where $p$ denotes the target distribution (often a one-hot encoding), and $q$ the output probabilities of a classifier or generative model.

Truncation by Confidence Threshold (Shabgahi et al., 10 Sep 2025): $\text{TCE}(p_t) = \chi_\gamma(p_t) \cdot \text{CE}(p_t)$ with

$$\chi_\gamma(p_t) = \begin{cases} 1 & \text{if } p_t \leq \gamma \\ 0 & \text{if } p_t > \gamma \end{cases}$$

where $p_t$ is the predicted probability for the target class, and $\gamma \in [0, 1]$ is a hyperparameter. For $p_t > \gamma$, the loss term is dropped, effectively removing highly confident (and thus potentially over-represented) predictions.
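A minimal PyTorch sketch of this thresholded loss is given below; the function name, the default value of $\gamma$, and the choice to average over the surviving samples are illustrative assumptions rather than details taken from the cited paper.

```python
import torch.nn.functional as F

def truncated_cross_entropy(logits, targets, gamma=0.9):
    """Cross-entropy that drops samples whose target-class probability exceeds gamma.

    logits:  (batch, num_classes) raw scores
    targets: (batch,) integer class indices
    gamma:   confidence threshold in [0, 1]
    """
    # Per-sample cross-entropy, kept unreduced so it can be masked.
    per_sample_ce = F.cross_entropy(logits, targets, reduction="none")
    # p_t: predicted probability of the target class for each sample.
    p_t = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # chi_gamma(p_t): keep only samples with p_t <= gamma.
    mask = (p_t <= gamma).float()
    # Average over the kept samples; the clamp avoids 0/0 if the whole batch is masked.
    return (mask * per_sample_ce).sum() / mask.sum().clamp(min=1.0)
```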

Gradient Taming (Martinez et al., 2018):

$$\frac{\partial \hat{H}_\alpha(p, q)}{\partial \log q_i} = \begin{cases} 0 & \text{if } p_i = 0 \\ -\left[1 - \log q_i\right]^{-\alpha} & \text{if } p_i = 1 \end{cases}$$

where the parameter $\alpha > 0$ controls how sharply the gradient is dampened for low-confidence predictions.

The integrated form is

$$\hat{H}_\alpha(p, q) = \frac{1}{1-\alpha} \sum_i p_i \left\{ (1 - \log q_i)^{1-\alpha} - \frac{1}{1-\alpha} \right\}$$

For $\alpha = 0$, the formulation reduces to standard cross-entropy.
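A sketch of this tamed loss for one-hot targets follows; the function name, the default $\alpha$, and the handling of $\alpha = 1$ (via its logarithmic limit, which differs from the formula only by an additive constant and so leaves gradients unchanged) are my own choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def tamed_cross_entropy(logits, targets, alpha=0.5):
    """Gradient-tamed cross-entropy H_hat_alpha for one-hot targets.

    alpha = 0 recovers standard cross-entropy; larger alpha dampens the
    gradient contributed by low-confidence (possibly mislabeled) samples.
    """
    log_q = F.log_softmax(logits, dim=-1)
    # log q_t: log-probability of the target class for each sample.
    log_q_t = log_q.gather(1, targets.unsqueeze(1)).squeeze(1)
    if alpha == 1.0:
        # 1/(1-alpha) is singular at alpha = 1; use the logarithmic limit instead.
        return torch.log1p(-log_q_t).mean()
    return (((1.0 - log_q_t).pow(1.0 - alpha) - 1.0 / (1.0 - alpha))
            / (1.0 - alpha)).mean()
```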

Algebraic Truncation via Common Meadows (Bergstra et al., 11 Feb 2025): All operations, including logarithms of $q_i \leq 0$, are totalized by assigning a default absorptive value, $L$, such that undefined or numerically unstable instances do not cause exceptions but are algebraically isolated.

2. Motivation: Overconfidence and Model Collapse

Truncated Cross Entropy was formulated in direct response to failure modes in modern machine learning, especially recursive training on synthetic data and noise-ridden supervision. In generative modeling, repeated training on self-generated outputs leads to "model collapse"—a sharp contraction in output diversity and eventual degradation of model performance (Shabgahi et al., 10 Sep 2025). The root cause is the model’s increasing overconfidence in certain tokens, reflected in a distribution that neglects the tail and overfits the mode.

By truncating or downweighting highly confident predictions, TCE discourages feedback loops where the loss landscape is dominated by "easy" samples and forces the model to focus on lower-confidence regions, preserving output diversity over recursive generations.

3. Robustness to Label Noise and Generalization Properties

In supervised learning settings, the TCE loss inherits the smooth convergence of standard cross-entropy in noiseless scenarios, since its gradient is identical to that of CE for high-confidence predictions (Martinez et al., 2018). However, in the presence of label noise, where ground-truth labels are corrupted or ambiguous, the dampened gradient (controlled by the $\alpha$ parameter or the truncation threshold $\gamma$) reduces susceptibility to mislabeled data.

For example, in experimental studies with the ResNet-20 architecture (datasets: MNIST, CIFAR10, CIFAR100, SVHN), TCE outperformed CE under heavy noise, yielding up to $\sim$9.8% improvement in Top-1 accuracy at 80% label randomization. Mild values of $\alpha$ (0.5–1.5) preserved convergence speed; with larger $\alpha$, the behavior mimicked MSE, slowing convergence.

4. Algebraic Totalization and Computational Stability

Within algebras based on common meadows (Bergstra et al., 11 Feb 2025), entropy and cross-entropy are rendered total functions: for all $q_i \leq 0$, $\log_2 q_i = L$. This allows TCE (and related measures) to be defined with no case distinctions or exceptional handling. In practical terms, this provides a mathematically rigorous foundation for routine truncation and avoidance of numerical instabilities, ensuring all computed loss values are well-defined, even for edge cases where standard cross-entropy would diverge.

This facilitates theoretical manipulation, flattening of expressions, and uninterrupted algebraic reasoning about information measures and loss dynamics.
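As a rough numerical illustration of the idea (not the algebraic construction of the cited paper), a totalized $\log_2$ can be emulated by mapping non-positive inputs to a sentinel that absorbs subsequent arithmetic; here NaN stands in for $L$.

```python
import math

L = float("nan")  # sentinel standing in for the absorptive element

def total_log2(q):
    """Total version of log2: defined for every real input, never raises."""
    return math.log2(q) if q > 0 else L

def total_cross_entropy(p, q):
    """Totalized cross-entropy of two discrete distributions given as lists."""
    return -sum(pi * total_log2(qi) for pi, qi in zip(p, q))

# A zero (or negative) probability no longer raises; it propagates the sentinel.
print(total_cross_entropy([0.5, 0.5], [0.5, 0.0]))  # nan
```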

5. Connections to Rényi Entropy, α-leakage, and Information Measures

TCE and its relatives are intimately connected to generalized entropy measures (Ding et al., 26 Jan 2024). Classical Rényi entropy is recast as a “$\tilde{f}$-mean” cross-entropy, where

$$\tilde{f}(t) = \exp\left(\frac{1-\alpha}{\alpha}\, t\right)$$

and the minimization yields scaled probabilities:

$$P_{X_{(\alpha)}}(x) = \frac{P_X(x)^\alpha}{\sum_{x'} P_X(x')^\alpha}$$

This interpretation allows privacy and leakage measures such as $\alpha$-leakage (the difference in prior/posterior uncertainty, operationalized by Arimoto mutual information) to be unified across the order range $\alpha \in [0, \infty)$. TCE’s conceptual role is thus generalized: it is not only a machine learning loss, but an information-theoretic estimator of uncertainty reduction and risk-sensitive soft decision making.

Maximal leakage is recovered as the limiting case ($\alpha \to \infty$), and all pointwise and average-case variants of leakage are generalized as $\tilde{f}$-means.
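A small NumPy sketch of the scaled ($\alpha$-tilted) distribution above; the function name and example values are illustrative only.

```python
import numpy as np

def alpha_tilted(p, alpha):
    """Raise probabilities to the power alpha and renormalize.

    alpha = 1 returns p unchanged; alpha > 1 sharpens the distribution toward
    its mode, while alpha < 1 flattens it toward the tail.
    """
    p = np.asarray(p, dtype=float)
    tilted = p ** alpha
    return tilted / tilted.sum()

print(alpha_tilted([0.7, 0.2, 0.1], alpha=2.0))  # mass concentrates on the mode
print(alpha_tilted([0.7, 0.2, 0.1], alpha=0.5))  # mass spreads toward the tail
```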

6. Mechanism and Empirical Performance

In recursive synthetic data training (Shabgahi et al., 10 Sep 2025), the key empirical results are:

| Method | Collapse Delay (×) | KL Divergence to True Data | Factual Retention |
|---|---|---|---|
| Standard CE | 1 | High | Poor |
| Truncated CE (TCE) | >2.3 | Lower | Superior |

TCE extended the "fidelity window" of generative language and image models, as measured by the number of generations before divergence, and maintained better distributional similarity (lower KL) compared to CE.

Variants have been shown effective across text (language modeling), vision (generative images, VAEs), and statistical models (GMMs), always by applying masking or down-weighting to the high-confidence tail.

7. Practical Implementation and Applicability

TCE loss functions can be implemented as simple wrappers around standard cross-entropy, requiring only a threshold hyperparameter ($\gamma$ or $\alpha$) and a masking function. This allows immediate integration into existing deep learning pipelines. In privacy applications, $\tilde{f}$-mean cross-entropies provide operational formulas for quantifying leakage under Rényi measures.
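Reusing the `truncated_cross_entropy` sketch from Section 1, a drop-in swap into a hypothetical training step might look as follows; the model, optimizer, and batch structure are placeholders.

```python
def training_step(model, batch, optimizer, gamma=0.9):
    """One optimization step with TCE substituted for standard cross-entropy."""
    inputs, targets = batch
    logits = model(inputs)
    # Drop-in replacement for F.cross_entropy(logits, targets).
    loss = truncated_cross_entropy(logits, targets, gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```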

Areas of practical use include recursive training of generative models on synthetic data, supervised learning under label noise, privacy and leakage quantification via Rényi-type measures, and numerically stable computation of entropy and cross-entropy.

TCE is integral to modern AI regimes where either data source trustworthiness or loss propagation must be controlled for robustness, fidelity, and computational stability.