Truncated Cross Entropy (TCE) in Deep Learning

Updated 13 September 2025
  • TCE is a family of loss functions that modifies standard cross-entropy by truncating or downweighting high-confidence predictions to reduce overconfidence and prevent model collapse.
  • It employs methods like confidence thresholding, gradient taming, and algebraic totalization to enhance numerical stability and improve robustness against label noise.
  • Empirical studies show TCE improves performance in recursive training and generative modeling by extending fidelity and reducing divergence compared to conventional approaches.

Truncated Cross Entropy (TCE) refers to a family of loss functions and related information-theoretic measures in which the standard cross-entropy formula is selectively masked, dampened, or algebraically adjusted to achieve additional robustness, bias mitigation, or totality. TCE encompasses approaches designed to address overconfidence in recursive training, label noise in supervised learning, and algebraic edge cases in the computation of entropy and cross-entropy on arbitrary distributions.

1. Mathematical Formulations of Truncated Cross Entropy

The principal variants of TCE are based on modifications to the classical cross-entropy loss

$$\text{CE}(p, q) = -\sum_i p_i \log q_i$$

where $p$ denotes the target distribution (often a one-hot encoding), and $q$ the output probabilities of a classifier or generative model.

Truncation by Confidence Threshold (Shabgahi et al., 10 Sep 2025): $\text{TCE}(p_t) = \chi_\gamma(p_t) \cdot \text{CE}(p_t)$ with

$$\chi_\gamma(p_t) = \begin{cases} 1 & \text{if } p_t \leq \gamma \\ 0 & \text{if } p_t > \gamma \end{cases}$$

where $p_t$ is the predicted probability for the target class, and $\gamma \in [0, 1]$ is a hyperparameter. For $p_t > \gamma$, the loss term is dropped, effectively removing highly confident (and thus potentially over-represented) predictions.
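A minimal PyTorch sketch of this thresholded loss is given below; the function name, the default value of $\gamma$, and the choice to average over the surviving samples are illustrative assumptions rather than details taken from the cited paper.

```python
import torch.nn.functional as F

def truncated_cross_entropy(logits, targets, gamma=0.9):
    """Cross-entropy that drops samples whose target-class probability exceeds gamma.

    logits:  (batch, num_classes) raw scores
    targets: (batch,) integer class indices
    gamma:   confidence threshold in [0, 1]
    """
    # Per-sample cross-entropy, kept unreduced so it can be masked.
    per_sample_ce = F.cross_entropy(logits, targets, reduction="none")
    # p_t: predicted probability of the target class for each sample.
    p_t = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    # chi_gamma(p_t): keep only samples with p_t <= gamma.
    mask = (p_t <= gamma).float()
    # Average over the kept samples; the clamp avoids 0/0 if the whole batch is masked.
    return (mask * per_sample_ce).sum() / mask.sum().clamp(min=1.0)
```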

Gradient Taming (Martinez et al., 2018):

$$\frac{\partial \hat{H}_\alpha(p, q)}{\partial \log q_i} = \begin{cases} 0 & \text{if } p_i = 0 \\ -\left[1 - \log q_i\right]^{-\alpha} & \text{if } p_i = 1 \end{cases}$$

where the parameter $\alpha > 0$ controls how sharply the gradient is dampened for low-confidence predictions.

The integrated form is

$$\hat{H}_\alpha(p, q) = \frac{1}{1-\alpha} \sum_i p_i \left\{ (1 - \log q_i)^{1-\alpha} - \frac{1}{1-\alpha} \right\}$$

For $\alpha = 0$, the formulation reduces to standard cross-entropy.
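A sketch of this tamed loss for one-hot targets follows; the function name, the default $\alpha$, and the handling of $\alpha = 1$ (via its logarithmic limit, which differs from the formula only by an additive constant and so leaves gradients unchanged) are my own choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def tamed_cross_entropy(logits, targets, alpha=0.5):
    """Gradient-tamed cross-entropy H_hat_alpha for one-hot targets.

    alpha = 0 recovers standard cross-entropy; larger alpha dampens the
    gradient contributed by low-confidence (possibly mislabeled) samples.
    """
    log_q = F.log_softmax(logits, dim=-1)
    # log q_t: log-probability of the target class for each sample.
    log_q_t = log_q.gather(1, targets.unsqueeze(1)).squeeze(1)
    if alpha == 1.0:
        # 1/(1-alpha) is singular at alpha = 1; use the logarithmic limit instead.
        return torch.log1p(-log_q_t).mean()
    return (((1.0 - log_q_t).pow(1.0 - alpha) - 1.0 / (1.0 - alpha))
            / (1.0 - alpha)).mean()
```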

Algebraic Truncation via Common Meadows (Bergstra et al., 11 Feb 2025): All operations, including logarithms of $q_i \leq 0$, are totalized by assigning a default absorptive value, $L$, such that undefined or numerically unstable instances do not cause exceptions but are algebraically isolated.

2. Motivation: Overconfidence and Model Collapse

Truncated Cross Entropy was formulated in direct response to failure modes in modern machine learning, especially recursive training on synthetic data and noise-ridden supervision. In generative modeling, repeated training on self-generated outputs leads to "model collapse"—a sharp contraction in output diversity and eventual degradation of model performance (Shabgahi et al., 10 Sep 2025). The root cause is the model’s increasing overconfidence in certain tokens, reflected in a distribution that neglects the tail and overfits the mode.

By truncating or downweighting highly confident predictions, TCE discourages feedback loops where the loss landscape is dominated by "easy" samples and forces the model to focus on lower-confidence regions, preserving output diversity over recursive generations.

3. Robustness to Label Noise and Generalization Properties

In supervised learning settings, the TCE loss inherits the smooth convergence of standard cross-entropy in noiseless scenarios, since its gradient is identical to that of CE for high-confidence predictions (Martinez et al., 2018). However, in the presence of label noise, where ground-truth labels are corrupted or ambiguous, the dampened gradient (controlled by the $\alpha$ parameter or the truncation threshold $\gamma$) reduces susceptibility to mislabeled data.

For example, in experimental studies with the ResNet-20 architecture (datasets: MNIST, CIFAR10, CIFAR100, SVHN), TCE outperformed CE under heavy noise, yielding up to $\sim$9.8% improvement in Top-1 accuracy at 80% label randomization. Mild values of $\alpha$ (0.5–1.5) preserved convergence speed; with larger $\alpha$, the behavior mimicked MSE, slowing convergence.

4. Algebraic Totalization and Computational Stability

Within algebras based on common meadows (Bergstra et al., 11 Feb 2025), entropy and cross-entropy are rendered total functions: for all $q_i \leq 0$, $\log_2 q_i = L$. This allows TCE (and related measures) to be defined with no case distinctions or exceptional handling. In practical terms, this provides a mathematically rigorous foundation for routine truncation and avoidance of numerical instabilities, ensuring all computed loss values are well-defined, even for edge cases where standard cross-entropy would diverge.

This facilitates theoretical manipulation, flattening of expressions, and uninterrupted algebraic reasoning about information measures and loss dynamics.
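As a rough numerical illustration of the idea (not the algebraic construction of the cited paper), a totalized $\log_2$ can be emulated by mapping non-positive inputs to a sentinel that absorbs subsequent arithmetic; here NaN stands in for $L$.

```python
import math

L = float("nan")  # sentinel standing in for the absorptive element

def total_log2(q):
    """Total version of log2: defined for every real input, never raises."""
    return math.log2(q) if q > 0 else L

def total_cross_entropy(p, q):
    """Totalized cross-entropy of two discrete distributions given as lists."""
    return -sum(pi * total_log2(qi) for pi, qi in zip(p, q))

# A zero (or negative) probability no longer raises; it propagates the sentinel.
print(total_cross_entropy([0.5, 0.5], [0.5, 0.0]))  # nan
```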

5. Connections to Rényi Entropy, α-leakage, and Information Measures

TCE and its relatives are intimately connected to generalized entropy measures (Ding et al., 26 Jan 2024). Classical Rényi entropy is recast as a “$\tilde{f}$-mean” cross-entropy, where

$$\tilde{f}(t) = \exp\left(\frac{1-\alpha}{\alpha}\, t\right)$$

and the minimization yields scaled probabilities:

$$P_{X_{(\alpha)}}(x) = \frac{P_X(x)^\alpha}{\sum_{x'} P_X(x')^\alpha}$$

This interpretation allows privacy and leakage measures such as $\alpha$-leakage (the difference in prior/posterior uncertainty, operationalized by Arimoto mutual information) to be unified across the order range $\alpha \in [0, \infty)$. TCE’s conceptual role is thus generalized: it is not only a machine learning loss, but an information-theoretic estimator of uncertainty reduction and risk-sensitive soft decision making.

Maximal leakage is recovered as the limiting case ($\alpha \to \infty$), and all pointwise and average-case variants of leakage are generalized as $\tilde{f}$-means.
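A small NumPy sketch of the scaled ($\alpha$-tilted) distribution above; the function name and example values are illustrative only.

```python
import numpy as np

def alpha_tilted(p, alpha):
    """Raise probabilities to the power alpha and renormalize.

    alpha = 1 returns p unchanged; alpha > 1 sharpens the distribution toward
    its mode, while alpha < 1 flattens it toward the tail.
    """
    p = np.asarray(p, dtype=float)
    tilted = p ** alpha
    return tilted / tilted.sum()

print(alpha_tilted([0.7, 0.2, 0.1], alpha=2.0))  # mass concentrates on the mode
print(alpha_tilted([0.7, 0.2, 0.1], alpha=0.5))  # mass spreads toward the tail
```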

6. Mechanism and Empirical Performance

In recursive synthetic data training (Shabgahi et al., 10 Sep 2025), the key empirical results are:

| Method | Collapse Delay (×) | KL Divergence to True Data | Factual Retention |
|---|---|---|---|
| Standard CE | 1 | High | Poor |
| Truncated CE (TCE) | >2.3 | Lower | Superior |

TCE extended the "fidelity window" of generative language and image models, as measured by the number of generations before divergence, and maintained better distributional similarity (lower KL) compared to CE.

Variants have been shown effective across text (language modeling), vision (generative images, VAEs), and statistical models (GMMs), always by applying masking or down-weighting to the high-confidence tail.

7. Practical Implementation and Applicability

TCE loss functions can be implemented as simple wrappers around standard cross-entropy, requiring only a threshold hyperparameter ($\gamma$ or $\alpha$) and a masking function. This allows immediate integration into existing deep learning pipelines. In privacy applications, $\tilde{f}$-mean cross-entropies provide operational formulas for quantifying leakage under Rényi measures.
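Reusing the `truncated_cross_entropy` sketch from Section 1, a drop-in swap into a hypothetical training step might look as follows; the model, optimizer, and batch structure are placeholders.

```python
def training_step(model, batch, optimizer, gamma=0.9):
    """One optimization step with TCE substituted for standard cross-entropy."""
    inputs, targets = batch
    logits = model(inputs)
    # Drop-in replacement for F.cross_entropy(logits, targets).
    loss = truncated_cross_entropy(logits, targets, gamma=gamma)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```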

Areas of practical use include recursive training of generative models on synthetic data, supervised learning under label noise, privacy and leakage quantification via Rényi-type measures, and numerically stable computation of entropy and cross-entropy.

TCE is integral to modern AI regimes where either data source trustworthiness or loss propagation must be controlled for robustness, fidelity, and computational stability.