
Forward Cross-Entropy in Machine Learning

Updated 23 June 2026
  • Forward cross-entropy is a loss function that quantifies the expected negative log-likelihood of a model relative to an empirical data distribution, serving as a foundation for MLE in deep learning.
  • It decomposes into next-token losses in autoregressive models and equals the forward KL divergence plus a constant, which pushes the model to cover all data modes.
  • Despite its tractability and efficiency, its zero-avoiding nature can lead to over-generalization in generative tasks, prompting hybrid approaches in information gathering and robust control.

The forward cross-entropy objective is a dominant paradigm in machine learning for fitting probabilistic models, particularly within deep learning, language modeling, robust control, and sequential information gathering. It quantifies the expected negative log-likelihood of a predictive model with respect to a reference, typically empirical, data distribution, and has mathematically precise links to Kullback–Leibler (KL) divergence, maximum likelihood estimation (MLE), and robust policy optimization.

1. Formal Definition and Autoregressive Model Context

For a data-generating distribution $P$ over sequences $x = (x_1, \dots, x_T)$, and an autoregressive model $Q_\theta$ parameterized by $\theta$ such that $Q_\theta(x) = \prod_{t=1}^T Q_\theta(x_t \mid x_{<t})$, the forward cross-entropy is

$$H(P, Q_\theta) = -\mathbb{E}_{x\sim P}\left[ \log Q_\theta(x) \right] = -\sum_x P(x)\log Q_\theta(x).$$

This decomposes, by the chain rule, into the sum of next-token losses:

$$H(P, Q_\theta) = -\mathbb{E}_{x\sim P} \left[\sum_{t=1}^T \log Q_\theta(x_t \mid x_{<t})\right].$$

Empirically, over $N$ sequences of lengths $T_i$:

$$L_{CE}(\theta) = -\frac{1}{\sum_i T_i} \sum_{i=1}^N \sum_{t=1}^{T_i} \log Q_\theta\left(x_t^{(i)} \mid x_{<t}^{(i)}\right).$$
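As a minimal sketch of the empirical loss above, the snippet below averages the negative log-probabilities of the observed tokens over all tokens in a batch of variable-length sequences. The function name `token_level_cross_entropy` and the toy probabilities are illustrative assumptions, not part of any cited work.

```python
import numpy as np

def token_level_cross_entropy(token_probs):
    """Average negative log-likelihood per token.

    token_probs: list of 1-D arrays, one per sequence; entry t holds the
    model probability Q_theta(x_t | x_<t) of the token actually observed.
    """
    total_tokens = sum(len(p) for p in token_probs)
    total_nll = -sum(np.log(np.asarray(p, dtype=float)).sum() for p in token_probs)
    return total_nll / total_tokens

# Two toy sequences of lengths 3 and 2 (made-up probabilities).
probs = [np.array([0.5, 0.25, 0.5]), np.array([0.125, 1.0])]
loss = token_level_cross_entropy(probs)
print(loss)
```

Normalizing by the total token count, rather than by the number of sequences, matches the $1/\sum_i T_i$ factor in the formula, so long sequences contribute in proportion to their length.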

This formulation is fundamental to modern LLMs and sequential predictors (Zhang et al., 2023).

2. Equivalence to Maximum Likelihood Estimation

Minimizing the forward cross-entropy loss with respect to the model parameters $\theta$ is mathematically equivalent to maximizing the expected log-likelihood of the observed data, i.e., MLE. Since $P$ is unknown, the expectation is replaced by the empirical mean over a training dataset:

  • MLE objective: maximize $\frac{1}{N}\sum_{i=1}^{N} \log Q_\theta(x^{(i)})$
  • Forward cross-entropy: minimize $-\frac{1}{N}\sum_{i=1}^{N} \log Q_\theta(x^{(i)})$

This equivalence underpins most supervised deep learning algorithms where the forward cross-entropy is interpreted as the training criterion for probabilistic classifiers and sequence models (Zhang et al., 2023, Skarbek, 2023).
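The equivalence can be checked numerically on a toy categorical model. The sketch below assumes a three-symbol alphabet and made-up samples; it shows that the cross-entropy against the empirical distribution equals the negative mean log-likelihood, and that the empirical frequencies (the categorical MLE) give a lower loss than an arbitrary alternative.

```python
import numpy as np

# Toy samples over 3 symbols (illustrative data only).
data = np.array([0, 0, 1, 2, 0, 1, 0, 2, 0, 1])
counts = np.bincount(data, minlength=3)
p_hat = counts / counts.sum()  # empirical distribution

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), skipping terms with p(x) = 0."""
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def mean_log_likelihood(samples, q):
    """Average log-probability the model q assigns to the samples."""
    return np.mean(np.log(q[samples]))

q_mle = p_hat                       # categorical MLE = empirical frequencies
q_other = np.array([0.4, 0.4, 0.2]) # an arbitrary alternative model

print(cross_entropy(p_hat, q_mle), cross_entropy(p_hat, q_other))
```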

3. Probabilistic and Information-Theoretic Interpretations

Forward cross-entropy measures how well the model’s probability mass covers the data distribution. It admits a decomposition:

$$H(P, Q_\theta) = H(P) + D_{KL}(P \,\|\, Q_\theta)$$

where $H(P)$ is the entropy of the data distribution (constant w.r.t. $\theta$), and $D_{KL}(P \,\|\, Q_\theta)$ is the forward KL divergence. This structure yields a “zero-avoiding” property: modes that have nonzero $P(x)$ must also be assigned nonzero $Q_\theta(x)$, otherwise the loss diverges. Consequently, the learned model is penalized heavily for missing modes of $P$, but penalized only weakly for spreading probability mass onto areas where $P(x) = 0$ (Zhang et al., 2023, Kulick et al., 2014).
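A small numerical check may make the decomposition and the zero-avoiding behavior concrete. The distributions `P`, `Q_cover`, and `Q_miss` below are made-up examples: one model covers both data modes while leaking mass elsewhere, the other drops a mode entirely.

```python
import numpy as np

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    mask = p > 0
    with np.errstate(divide="ignore"):  # allow log(0) = -inf silently
        return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    mask = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask])))

P = np.array([0.5, 0.5, 0.0])        # two data modes, one impossible outcome
Q_cover = np.array([0.4, 0.4, 0.2])  # covers both modes, leaks mass off-support
Q_miss = np.array([1.0, 0.0, 0.0])   # drops the second mode

# Decomposition: H(P, Q) = H(P) + KL(P || Q)
print(cross_entropy(P, Q_cover), entropy(P) + kl(P, Q_cover))
# Missing a mode makes the loss diverge; leaking mass only costs a finite amount.
print(cross_entropy(P, Q_miss))
```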

In information-gathering and Bayesian active learning, maximizing expected cross-entropy (“MaxCE”) between prior and posterior beliefs acts as a one-step criterion to promote exploration. It is formally operationalized as maximizing the expected KL divergence from the existing to the updated posterior, a quantity that is notably asymmetric (Kulick et al., 2014).

4. Variations and Computational Implementations

Forward cross-entropy serves as a core loss function in classifiers with normalized output layers, such as softmax:

$$p_k = Q_\theta(k \mid x) = \frac{\exp(z_k)}{\sum_{j} \exp(z_j)}$$

with the categorical cross-entropy loss:

$$L_{CE} = -\sum_{k} y_k \log p_k$$

where $z$ denotes the logits and the target vector $y$ is typically one-hot or label-smoothed. The gradient of this composite (softmax + cross-entropy) with respect to the logits is $\partial L_{CE} / \partial z = p - y$, a result established both formally and empirically; thus, explicit cross-entropy computation is redundant for gradient-based updates, as exploited in the ISBE approach for computational efficiency without impacting learning dynamics or accuracy (Skarbek, 2023).
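The gradient identity for softmax followed by cross-entropy (softmax output minus target) can be verified with a finite-difference check. This is only a sketch of the identity itself, not the ISBE implementation; the logits and target below are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_from_logits(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])  # arbitrary logits
y = np.array([0.0, 1.0, 0.0])   # one-hot target

# Analytic gradient w.r.t. the logits: softmax output minus target.
analytic_grad = softmax(z) - y

# Central finite differences as an independent check.
eps = 1e-6
numeric_grad = np.zeros_like(z)
for k in range(len(z)):
    step = np.zeros_like(z)
    step[k] = eps
    numeric_grad[k] = (
        cross_entropy_from_logits(z + step, y)
        - cross_entropy_from_logits(z - step, y)
    ) / (2 * eps)

print(analytic_grad, numeric_grad)
```

The same identity holds for any target vector that sums to one, so a label-smoothed `y` can be substituted without changing the check.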

5. Limitations and Contrast with Reverse Cross-Entropy

In high-capacity, large-data settings, minimizing forward cross-entropy theoretically recovers the true data distribution. However, in practical regimes with finite data and model capacity, its zero-avoiding nature leads to several drawbacks:

  • Over-generalization: $Q_\theta$ is encouraged to “spread” probability mass into regions of the sample space where $P(x) \approx 0$, resulting in non-humanlike or nonsensical outputs in generative modeling (Zhang et al., 2023).
  • Language degeneration: Unbiased decoding often yields incoherent or repetitive outputs due to the weak penalty for $Q_\theta$ assigning positive mass outside the support of $P$.
  • Data noise sensitivity: The objective enforces coverage of even very rare or noisy empirical examples (Zhang et al., 2023).

Reverse cross-entropy $H(Q_\theta, P)$, which equals the reverse KL $D_{KL}(Q_\theta \,\|\, P)$ plus the model entropy $H(Q_\theta)$, exhibits “zero-forcing” behavior and avoids assigning probability mass to off-support $x$, but is intractable except in synthetic or fully-known $P$ cases. Mixtures of forward and reverse cross-entropy (e.g., the MixCE objective) are designed to mitigate the respective deficiencies, promoting both mode coverage and avoidance of impossible regions (Zhang et al., 2023).
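The contrast between the two directions can be illustrated on a toy problem where $P$ is fully known, which is exactly the setting in which the reverse term is computable. The sketch below is an assumption-laden illustration: the distributions are made up, and the convex combination with a hypothetical weight `eta` is only a simple stand-in for a mixture objective, not the exact formulation or estimator of the cited MixCE work.

```python
import numpy as np

def forward_ce(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x)."""
    return -np.sum(p * np.log(q))

def reverse_ce(q, p):
    """H(Q, P) = -sum_x Q(x) log P(x); needs P to be known."""
    return -np.sum(q * np.log(p))

def mixed_ce(p, q, eta):
    """Illustrative convex mix; eta is a hypothetical weight in [0, 1]."""
    return eta * forward_ce(p, q) + (1.0 - eta) * reverse_ce(q, p)

P = np.array([0.495, 0.495, 0.01])      # two real modes, one near-impossible outcome
Q_spread = np.array([0.35, 0.35, 0.30]) # covers both modes, leaks mass
Q_narrow = np.array([0.98, 0.01, 0.01]) # drops a mode, stays near the support

# Forward CE prefers the spread model; reverse CE prefers the narrow one.
print(forward_ce(P, Q_spread), forward_ce(P, Q_narrow))
print(reverse_ce(Q_spread, P), reverse_ce(Q_narrow, P))
print(mixed_ce(P, Q_spread, eta=0.5), mixed_ce(P, Q_narrow, eta=0.5))
```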

6. Generalizations in Risk-Aware Control and Information Gathering

In robust dynamic programming contexts, forward cross-entropy, combined with entropy regularization, calibrates the adversarial nature and stochasticity of policies. In minsoftmax DP, the regularization term

[regularization term: equation not recoverable from the source]

balances likelihood-driven minimization and the injection of randomness, controlling robustness and optimism. Independently tuning the forward cross-entropy and entropy weights interpolates between pure minimax, KL-regularized, risk-sensitive, and standard stochastic DP, a range that a single KL-regularization term cannot cover (Zutphen et al., 16 May 2025).

In iterative information-gathering and experimental design, the expected forward cross-entropy (MaxCE) between prior and posterior beliefs promotes challenge to current hypotheses and mitigates local-optima pathologies observed with pure expected-entropy (negative entropy) objectives. Empirical results, in domains ranging from Gaussian process model selection to robotic structure discovery, demonstrate faster model identification compared to entropy-minimization approaches, with additional benefit from mixture objectives combining MaxCE and predictive-uncertainty sampling (Kulick et al., 2014).

7. Computational and Practical Considerations

Computing the forward cross-entropy objective and its gradient is efficient when combined with normalized output layers, enabling vectorized implementation as in ISBE, where the explicit loss layer is omitted in favor of direct error signal computation. In control-theoretic or active information acquisition settings, forward cross-entropy governs the exploration/exploitation trade-off and can be efficiently incorporated into dynamic programming updates via log-sum-exp operations or quadratic analytic forms in linear-Gaussian problems (Skarbek, 2023, Zutphen et al., 16 May 2025).

A summary of key roles and properties of the forward cross-entropy objective is provided in the table below:

Context | Mathematical Role | Noted Outcomes/Limitations
Language modeling | MLE via forward CE, $H(P, Q_\theta)$ | Over-generalization, degeneration (Zhang et al., 2023)
Classifier training (softmax) | Efficient gradient, identical w/ ISBE | Explicit CE layer is redundant (Skarbek, 2023)
Robust control (minsoftmax) | Likelihood-driven adversarial regularization | Tunable robustness/optimism (Zutphen et al., 16 May 2025)
Bayesian information gathering | MaxCE: expected forward CE vs. prior | Avoids entropic local optima (Kulick et al., 2014)

The forward cross-entropy objective remains a central tool in statistical learning due to its analytical tractability, tight links to probabilistic inference, and versatility, but practical weaknesses in open-domain generation and active learning have driven research into hybrid or alternative objective functions.
