Forward Cross-Entropy in Machine Learning

Updated 23 June 2026

Forward cross-entropy is a loss function that quantifies the expected negative log-likelihood of a model relative to an empirical data distribution, serving as a foundation for MLE in deep learning.
It decomposes into next-token losses in autoregressive models and equals the forward KL divergence plus a constant data entropy, which pushes the model to cover every mode of the data.
Despite its tractability and efficiency, its zero-avoiding nature can lead to over-generalization in generative tasks, prompting hybrid approaches in information gathering and robust control.

The forward cross-entropy objective is a dominant paradigm in machine learning for fitting probabilistic models, particularly within deep learning, language modeling, robust control, and sequential information gathering. It quantifies the expected negative log-likelihood of a predictive model with respect to a reference, typically empirical, data distribution, and has mathematically precise links to Kullback–Leibler (KL) divergence, maximum likelihood estimation (MLE), and robust policy optimization.

1. Formal Definition and Autoregressive Model Context

For a data-generating distribution $P$ over sequences $x=(x_1,...,x_T)$, and an autoregressive model $Q_\theta$ parameterized by $\theta$ such that $Q_\theta(x) = \prod_{t=1}^T Q_\theta(x_t | x_{<t})$, the forward cross-entropy is

$H(P, Q_\theta) = -\mathbb{E}_{x\sim P}\left[ \log Q_\theta(x) \right] = -\sum_x P(x)\log Q_\theta(x).$

This decomposes, by the chain rule, into the sum of next-token losses:

$H(P,Q_\theta) = - \mathbb{E}_{x\sim P} \left[\sum_{t=1}^T \log Q_\theta(x_t|x_{<t})\right].$

Empirically, over $N$ sequences of lengths $T_i$:

$L_{CE}(\theta) = - \frac{1}{\sum_i T_i} \sum_{i=1}^N \sum_{t=1}^{T_i} \log Q_\theta(x_t^{(i)}|x_{<t}^{(i)}).$

This formulation is fundamental to modern LLMs and sequential predictors (Zhang et al., 2023).
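The token-averaged empirical loss can be sketched in a few lines. This is a minimal illustration, assuming the per-token model probabilities $Q_\theta(x_t^{(i)}|x_{<t}^{(i)})$ are already available as plain floats; the function name and the toy values are hypothetical.

```python
import math

def token_level_cross_entropy(token_probs):
    """Average negative log-probability over all tokens of all sequences.

    token_probs[i][t] is assumed to hold Q_theta(x_t | x_<t) for the
    observed token t of sequence i (illustrative input format).
    """
    total_tokens = sum(len(seq) for seq in token_probs)
    total_nll = -sum(math.log(p) for seq in token_probs for p in seq)
    return total_nll / total_tokens

# Two toy sequences of lengths 3 and 2; the values are made up.
probs = [[0.5, 0.25, 0.5], [0.25, 0.5]]
loss = token_level_cross_entropy(probs)
```

Note that the normalization is by the total token count $\sum_i T_i$, not by the number of sequences, so longer sequences contribute proportionally more terms.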

2. Equivalence to Maximum Likelihood Estimation

Minimizing the forward cross-entropy loss with respect to the model parameters $\theta$ is mathematically equivalent to maximizing the expected log-likelihood of the observed data, i.e., MLE. Since $P$ is unknown, the expectation is replaced by the empirical mean over a training dataset:

MLE objective: maximize $\frac{1}{N}\sum_{i=1}^N \log Q_\theta(x^{(i)})$
Forward cross-entropy: minimize $-\frac{1}{N}\sum_{i=1}^N \log Q_\theta(x^{(i)})$

This equivalence underpins most supervised deep learning algorithms where the forward cross-entropy is interpreted as the training criterion for probabilistic classifiers and sequence models (Zhang et al., 2023; Skarbek, 2023).
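The equivalence can be checked on a toy categorical model, where the maximum-likelihood estimate is the vector of relative frequencies. The sketch below is only an illustration; the sample and the two competing models are made up.

```python
import math
from collections import Counter

def empirical_cross_entropy(data, q):
    """Mean negative log-likelihood of the samples under categorical model q."""
    return -sum(math.log(q[x]) for x in data) / len(data)

data = ["a", "a", "b", "a", "c", "b", "a", "b"]  # toy sample
counts = Counter(data)

# Maximum-likelihood estimate for a categorical model: relative frequencies.
mle = {k: c / len(data) for k, c in counts.items()}

# Two arbitrary alternatives for comparison.
uniform = {k: 1 / 3 for k in counts}
skewed = {"a": 0.7, "b": 0.2, "c": 0.1}

ce_mle = empirical_cross_entropy(data, mle)
ce_uniform = empirical_cross_entropy(data, uniform)
ce_skewed = empirical_cross_entropy(data, skewed)
```

Here `ce_mle` is the smallest of the three values: the model that maximizes the likelihood of the sample is the same one that minimizes the empirical cross-entropy.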

3. Probabilistic and Information-Theoretic Interpretations

Forward cross-entropy measures how well the model’s probability mass covers the data distribution. It admits a decomposition:

$H(P, Q_\theta) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q_\theta),$

where $H(P)$ is the entropy of the data distribution (constant w.r.t. $\theta$), and $D_{\mathrm{KL}}(P \,\|\, Q_\theta)$ is the forward KL divergence. This structure yields a “zero-avoiding” property: modes that have nonzero $P(x)$ must also be assigned nonzero $Q_\theta(x)$, otherwise the loss diverges. Consequently, the learned model is penalized heavily for missing modes of $P$, but penalized only weakly for spreading probability mass onto areas where $P(x) \approx 0$ (Zhang et al., 2023; Kulick et al., 2014).
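The decomposition of cross-entropy into data entropy plus forward KL, and the divergence of the loss when the model misses a mode, can be verified on small discrete distributions. This is a minimal sketch with made-up numbers; the helper names are illustrative.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x); infinite if Q(x) = 0 where P(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log(qi)
    return total

def forward_kl(p, q):
    """KL(P || Q); infinite if Q(x) = 0 where P(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0]        # data distribution with two modes
q_cover = [0.4, 0.4, 0.2]  # covers both modes, leaks mass off-support
q_miss = [1.0, 0.0, 0.0]   # drops the second mode entirely

lhs = cross_entropy(p, q_cover)
rhs = entropy(p) + forward_kl(p, q_cover)
loss_when_missing = cross_entropy(p, q_miss)
```

`lhs` and `rhs` agree up to floating-point error, while `loss_when_missing` is infinite. The leaked mass in `q_cover` costs only a finite amount, which is the asymmetry described above.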

In information gathering and Bayesian active learning, maximizing the expected cross-entropy (“MaxCE”) between prior and posterior beliefs acts as a one-step criterion that promotes exploration. Formally, it amounts to maximizing the expected KL divergence from the existing belief to the updated posterior, a quantity that is notably asymmetric (Kulick et al., 2014).

4. Variations and Computational Implementations

Forward cross-entropy serves as a core loss function in classifiers with normalized output layers, such as softmax:

$q_k = Q_\theta(y=k \mid x) = \frac{\exp(z_k)}{\sum_{j=1}^K \exp(z_j)},$

with the categorical cross-entropy loss:

$L_{CE} = -\sum_{k=1}^K p_k \log q_k,$

where $z \in \mathbb{R}^K$ denotes the logits and the target distribution $p$ is typically one-hot or label-smoothed. The gradient of this composite (softmax + cross-entropy) with respect to the logits is $\partial L_{CE} / \partial z = q - p$; thus, explicit cross-entropy computation is redundant for gradient-based updates, as exploited in the ISBE approach for computational efficiency without impacting learning dynamics or accuracy (Skarbek, 2023).
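The gradient identity for softmax followed by cross-entropy (softmax output minus target, taken with respect to the logits) can be checked numerically. The sketch below compares the closed form with central finite differences. It is a generic verification, not the ISBE implementation, and the logits and label-smoothed target are made up.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_from_logits(z, target):
    q = softmax(z)
    return -sum(t * math.log(qk) for t, qk in zip(target, q) if t > 0)

def analytic_grad(z, target):
    """Closed form: softmax output minus target (target must sum to 1)."""
    return [qk - t for qk, t in zip(softmax(z), target)]

def numeric_grad(z, target, eps=1e-6):
    """Central finite differences of the loss with respect to each logit."""
    grad = []
    for k in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[k] += eps
        z_minus[k] -= eps
        diff = (cross_entropy_from_logits(z_plus, target)
                - cross_entropy_from_logits(z_minus, target))
        grad.append(diff / (2 * eps))
    return grad

z = [2.0, -1.0, 0.5]
target = [0.9, 0.05, 0.05]  # label-smoothed target, illustrative values
g_analytic = analytic_grad(z, target)
g_numeric = numeric_grad(z, target)
max_gap = max(abs(a - b) for a, b in zip(g_analytic, g_numeric))
```

Because the error signal is available directly from the softmax output and the target, a backward pass never needs the scalar loss value itself. The loss remains useful for monitoring.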

5. Limitations and Contrast with Reverse Cross-Entropy

In high-capacity, large-data settings, minimizing forward cross-entropy theoretically recovers the true data distribution. However, in practical regimes with finite data and model capacity, its zero-avoiding nature leads to several drawbacks:

Over-generalization: $Q_\theta$ is encouraged to “spread” probability mass into regions of $x$ where $P(x) \approx 0$, resulting in non-humanlike or nonsensical outputs in generative modeling (Zhang et al., 2023).
Language degeneration: Unbiased decoding often yields incoherent or repetitive outputs due to the weak penalty for $Q_\theta$ assigning positive mass outside the support of $P$.
Data noise sensitivity: The objective enforces coverage of even very rare or noisy empirical examples (Zhang et al., 2023).

Reverse cross-entropy $H(Q_\theta, P) = -\mathbb{E}_{x\sim Q_\theta}\left[\log P(x)\right]$, which equals the reverse KL divergence $D_{\mathrm{KL}}(Q_\theta \,\|\, P)$ plus the model entropy $H(Q_\theta)$, exhibits “zero-forcing” behavior and avoids assigning probability mass to off-support $x$, but is intractable except in synthetic or fully-known $P$ cases. Mixtures of forward and reverse cross-entropy (e.g., the MixCE objective) are designed to mitigate the respective deficiencies, promoting both mode coverage and avoidance of impossible regions (Zhang et al., 2023).
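The contrast between the two directions shows up already on a four-outcome toy problem in which $P$ is known exactly. The sketch below compares a broad model with a single-mode model. The convex combination `mixed_objective` and its weight `eta` are illustrative assumptions in the spirit of the mixtures mentioned above, not a reproduction of the MixCE training procedure.

```python
import math

def cross_entropy(a, b):
    """H(a, b) = -sum_x a(x) log b(x); assumes b(x) > 0 wherever a(x) > 0."""
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b) if ai > 0)

def mixed_objective(p, q, eta):
    """eta * forward CE + (1 - eta) * reverse CE (hypothetical weight eta)."""
    return eta * cross_entropy(p, q) + (1.0 - eta) * cross_entropy(q, p)

p = [0.49, 0.01, 0.01, 0.49]          # bimodal "data" distribution
q_broad = [0.25, 0.25, 0.25, 0.25]    # covers everything, including unlikely x
q_narrow = [0.97, 0.01, 0.01, 0.01]   # commits to one mode

forward_broad = cross_entropy(p, q_broad)
forward_narrow = cross_entropy(p, q_narrow)
reverse_broad = cross_entropy(q_broad, p)
reverse_narrow = cross_entropy(q_narrow, p)
```

Forward cross-entropy ranks `q_broad` above `q_narrow` because the narrow model nearly misses a mode. Reverse cross-entropy ranks them the other way because the broad model puts a lot of mass on unlikely outcomes. Evaluating the reverse term here is only possible because the toy $P$ is fully specified.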

6. Generalizations in Risk-Aware Control and Information Gathering

In robust dynamic programming contexts, forward cross-entropy, combined with entropy regularization, calibrates the adversarial nature and stochasticity of policies. In minsoftmax DP, the regularization term, which weights a forward cross-entropy contribution and an entropy contribution separately, balances likelihood-driven minimization and the injection of randomness, controlling robustness and optimism. Interpolation between pure minimax, KL-regularized, risk-sensitive, and standard stochastic DP is achieved via independent tuning of the forward cross-entropy and entropy weights, which goes beyond what a single KL regularizer admits (Zutphen et al., 2025).

In iterative information gathering and experimental design, the expected forward cross-entropy (MaxCE) between prior and posterior beliefs favors experiments that challenge current hypotheses and mitigates the local-optima pathologies observed with pure expected-entropy (negative entropy) objectives. Empirical results in domains ranging from Gaussian process model selection to robotic structure discovery show faster model identification than entropy-minimization approaches, with additional benefit from mixture objectives combining MaxCE and predictive-uncertainty sampling (Kulick et al., 2014).
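A one-step criterion of this kind can be sketched for a finite hypothesis set. The score below is the expected KL divergence from the current belief to the Bayesian posterior, averaged over predicted outcomes. The function names, the two-hypothesis coin example, and the exact form of the score are illustrative assumptions and are not taken from Kulick et al.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete beliefs; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_prior_to_posterior_kl(prior, likelihoods):
    """Score an experiment; likelihoods[h][y] = P(outcome y | hypothesis h)."""
    n_outcomes = len(likelihoods[0])
    score = 0.0
    for y in range(n_outcomes):
        joint = [prior[h] * likelihoods[h][y] for h in range(len(prior))]
        p_y = sum(joint)  # predictive probability of outcome y
        if p_y == 0:
            continue
        posterior = [j / p_y for j in joint]  # Bayes update
        score += p_y * kl(prior, posterior)
    return score

prior = [0.5, 0.5]                         # fair coin vs. biased coin
informative = [[0.5, 0.5], [0.9, 0.1]]     # outcomes separate the hypotheses
uninformative = [[0.5, 0.5], [0.5, 0.5]]   # outcomes carry no information

score_informative = expected_prior_to_posterior_kl(prior, informative)
score_uninformative = expected_prior_to_posterior_kl(prior, uninformative)
```

The uninformative experiment leaves the belief unchanged and scores zero, while the informative one scores strictly higher. This sketch does not handle outcomes that rule a hypothesis out completely, where the divergence in this direction is infinite.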

7. Computational and Practical Considerations

Computing the forward cross-entropy objective and its gradient is efficient when combined with normalized output layers, enabling vectorized implementation as in ISBE, where the explicit loss layer is omitted in favor of direct error signal computation. In control-theoretic or active information acquisition settings, forward cross-entropy governs the exploration/exploitation trade-off and can be efficiently incorporated into dynamic programming updates via log-sum-exp operations or quadratic analytic forms in linear-Gaussian problems (Skarbek, 2023; Zutphen et al., 2025).
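As a rough illustration of how a log-sum-exp operation enters a dynamic programming backup, the sketch below implements a generic temperature-controlled soft minimum over candidate costs. It is a textbook-style construction under assumed names (`soft_min`, `beta`), not the minsoftmax operator of Zutphen et al.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) via the max-shift trick."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_min(costs, beta):
    """-(1/beta) * log(mean_i exp(-beta * c_i)) for beta > 0.

    Large beta approaches min(costs); small beta approaches the mean.
    """
    n = len(costs)
    return -(logsumexp([-beta * c for c in costs]) - math.log(n)) / beta

costs = [3.0, 1.0, 2.0]          # made-up costs-to-go of three actions
near_min = soft_min(costs, 100.0)
near_mean = soft_min(costs, 1e-4)
in_between = soft_min(costs, 1.0)
```

The single parameter `beta` moves the backup between an average over actions and a hard minimum, which gives a sense of how regularization weights can trade off stochasticity against optimality in such updates.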

A summary of key roles and properties of the forward cross-entropy objective is provided in the table below:

Context	Mathematical Role	Noted Outcomes/Limitations
Language modeling	MLE via forward CE, $H(P, Q_\theta)$	Over-generalization, degeneration (Zhang et al., 2023)
Classifier training (softmax)	Efficient gradient, identical with ISBE	Explicit CE layer is redundant (Skarbek, 2023)
Robust control (minsoftmax)	Likelihood-driven adversarial regularization	Tunable robustness/optimism (Zutphen et al., 2025)
Bayesian information gathering	MaxCE: expected forward CE vs. prior	Avoids entropic local optima (Kulick et al., 2014)

The forward cross-entropy objective remains a central tool in statistical learning due to its analytical tractability, tight links to probabilistic inference, and versatility, but practical weaknesses in open-domain generation and active learning have driven research into hybrid or alternative objective functions.

References

Zhang et al. (2023). MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies.

Skarbek (2023). Cross Entropy in Deep Learning of Classifiers Is Unnecessary -- ISBE Error is All You Need.

Kulick et al. (2014). The Advantage of Cross Entropy over Entropy in Iterative Information Gathering.

Zutphen et al. (2025). Beyond KL-divergence: Risk Aware Control Through Cross Entropy and Adversarial Entropy Regularization.
