Negative Log-Likelihood (NLL)
- Negative log-likelihood (NLL) is a core probability-based objective that minimizes the negative logarithm of observed data likelihood, forming the basis of maximum likelihood estimation.
- It underpins methods in classification, generative modeling, and Bayesian inference, closely relating to cross-entropy loss and ensuring proper probability calibration.
- Variants such as NLL ratio loss and β-NLL adjust for discriminative margins and uncertainty estimation, optimizing performance in deep learning and latent variable models.
Negative log-likelihood (NLL) is a foundational objective in probabilistic modeling and statistical learning, defined as the negative logarithm of the probability assigned to observed data under a parameterized model. Minimizing NLL, equivalent to maximizing likelihood, is central to maximum likelihood estimation (MLE) and forms the basis of numerous training paradigms in deep learning, supervised classification, generative modeling, and Bayesian inference. The NLL and its refinements underpin a spectrum of methods, including cross-entropy loss, discriminative margin-based criteria, preference optimization, uncertainty calibration, energy-based modeling, and information-theoretic analysis of model expressiveness.
1. Formal Definition, Classical Properties, and Interpretation
Let $p_\theta(\mathcal{D})$ denote the probability (density or mass) assigned by a parameterized model to a dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$. The average negative log-likelihood is

$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i).$$

In language modeling and sequence prediction, this expands as

$$\mathcal{L}_{\mathrm{NLL}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log p_\theta\!\left(x_{i,t} \mid x_{i,<t}\right),$$

where each prediction is conditioned autoregressively on the preceding tokens.
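As a concrete reference point, the following sketch computes the average per-token NLL for an autoregressive model in PyTorch; tensor shapes and names are illustrative rather than tied to any specific model from the cited works.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits, targets):
    """Average negative log-likelihood over all tokens.
    logits: (batch, seq_len, vocab) autoregressive predictions;
    targets: (batch, seq_len) ground-truth token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(x_t | x_<t)
    return -token_logp.mean()

logits = torch.randn(4, 16, 32000)            # e.g., batch of 4 sequences, 32k-token vocabulary
targets = torch.randint(0, 32000, (4, 16))
print(sequence_nll(logits, targets).item())
```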
NLL satisfies several optimality criteria under classical assumptions: if data are i.i.d. from the model's true distribution and the model's parameterization is sufficiently flexible, minimizing NLL yields a statistically consistent estimator; moreover, NLL is a strictly proper local scoring rule, which ensures probability calibration (Li et al., 1 Oct 2025). In simple classification settings, it coincides (up to an additive constant) with the cross-entropy loss. As a function of the logits, NLL is a convex surrogate for the 0–1 loss in multiclass classification, which benefits gradient-based optimization.
2. Negative Log-Likelihood in Deep Classification and Discriminative Alternatives
For $N$-way classification with neural networks, the typical output is a softmax over logits $z_1, \dots, z_N$:

$$p_k = \frac{\exp(z_k)}{\sum_{j=1}^{N} \exp(z_j)}.$$

The cross-entropy loss for a sample with ground-truth class $y$,

$$\mathcal{L}_{\mathrm{CE}} = -\log p_y,$$

is equivalent to NLL under assumptions of uniform class priors and feature distributions (Zhu et al., 2018). However, in its standard form, CE/NLL only aims to maximize the probability of the correct class and does not explicitly suppress competing classes.
The negative log-likelihood ratio (NLLR) loss is proposed for improved discriminative performance:

$$\mathcal{L}_{\mathrm{NLLR}} = -\log \frac{p_y}{\sum_{k \neq y} p_k}.$$
NLLR operationalizes margin maximization by explicitly comparing the probability of the true class to all alternatives, producing sharper class boundaries. Empirical results on CIFAR-10 demonstrate that NLLR yields lower classification error (∼5.7%) compared to cross-entropy (∼6.2%) under identical architectures and training regimes. The method introduces no architectural or computational overhead beyond a loss function swap, making it viable when maximizing inter-class discriminability is paramount, such as in fine-grained classification (Zhu et al., 2018).
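As a concrete comparison, the sketch below implements both criteria as drop-in PyTorch losses; the NLLR form follows the ratio of the true-class probability to the summed probabilities of all other classes, and the `eps` clamp is an added numerical-stability detail, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(logits, targets):
    # Standard CE/NLL: -log p_y with p = softmax(logits).
    return F.cross_entropy(logits, targets)

def nllr_loss(logits, targets, eps=1e-12):
    # Negative log-likelihood ratio: -log( p_y / sum_{k != y} p_k ).
    probs = F.softmax(logits, dim=-1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    p_rest = (1.0 - p_true).clamp_min(eps)
    return (-(p_true.clamp_min(eps).log() - p_rest.log())).mean()

# Example: swapping the loss requires no architectural change.
logits = torch.randn(8, 10)               # batch of 8 samples, 10 classes
targets = torch.randint(0, 10, (8,))
print(cross_entropy_loss(logits, targets).item(), nllr_loss(logits, targets).item())
```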
3. Negative Log-Likelihood in LLM Fine-Tuning and Probability-based Objectives
In post-training or supervised fine-tuning (SFT) of LLMs, NLL is widely deployed, but its optimality is not guaranteed outside classical assumptions (Li et al., 1 Oct 2025). Key deviations are pre-existing, strongly biased priors in the pretrained parameters and lengthy, often noisy, supervision signals. Under these settings, NLL's emphasis on low-probability tokens can misalign with model generalization by overfitting to annotation artifacts or uninformative trajectory portions.
A general class of objectives is parameterized by a per-token function $f$, $\mathcal{L}_f(\theta) = \frac{1}{T} \sum_{t} f\!\left(p_\theta(x_t \mid x_{<t})\right)$, reducing to NLL for $f(p) = -\log p$. "Prior-leaning" alternatives, e.g., $f(p) = -p$ or $f(p) = -p^{10}$, reweight the objective toward high-probability tokens. Empirically, performance segregates along a "model-capability continuum":
- Model-Strong: base model priors are accurate (high mean probability on ground-truth tokens); prior-leaning objectives outperform NLL by up to 16% in math reasoning and similarly structured domains (e.g., thresholded NLL, $-p$, $-p^{10}$ variants).
- Model-Weak: base priors are flat (low mean probability on ground-truth tokens); standard NLL is superior.
- Model-Intermediate: performances of these objectives are nearly indistinguishable.
Theoretical analysis using gradient flows confirms that prior-leaning objectives accelerate risk reduction when the base model is already strong, but harm generalization in model-weak settings. Practically, hybrid strategies such as thresholded NLL (clipping low probabilities) or quantile token filtering offer improved fine-tuning performance by mitigating the impact of supervision noise (Li et al., 1 Oct 2025).
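To make the distinction concrete, the sketch below computes standard NLL alongside a prior-leaning $-p$ objective and a thresholded NLL that caps the per-token loss below a probability threshold `tau`; the threshold value and exact clipping rule are illustrative choices, not the precise formulation of (Li et al., 1 Oct 2025).

```python
import math
import torch
import torch.nn.functional as F

def token_objectives(logits, targets, tau=0.1):
    """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
    logp = F.log_softmax(logits, dim=-1)
    logp_tgt = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # log p(x_t | x_<t)
    p_tgt = logp_tgt.exp()

    nll = -logp_tgt.mean()                   # standard NLL
    minus_p = -p_tgt.mean()                  # prior-leaning "-p" objective
    # Thresholded NLL: cap each token's loss at -log(tau), muting the influence
    # of very low-probability (potentially noisy) supervision tokens.
    thr_nll = (-logp_tgt).clamp(max=-math.log(tau)).mean()
    return nll, minus_p, thr_nll

logits = torch.randn(2, 8, 1000)
targets = torch.randint(0, 1000, (2, 8))
print([round(v.item(), 3) for v in token_objectives(logits, targets)])
```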
4. NLL in Latent Variable Modeling: Information-Theoretic Bounds and Rate-Distortion
In latent variable models, the NLL is

$$-\log p_\theta(x) = -\log \int p_\theta(x \mid z)\, p(z)\, dz,$$

where $p_\theta(x \mid z)$ is the observation likelihood and $p(z)$ the prior. Optimizing $\theta$ can be cast as a rate-distortion variational problem by taking the distortion $d(x, z) = -\log p_\theta(x \mid z)$, yielding an equivalence between NLL minimization and classic rate-distortion Lagrangian duality (Lastras, 2019).
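For reference, a common way to evaluate this marginal NLL in practice is an importance-sampled (IWAE-style) estimate with an amortized Gaussian proposal $q(z \mid x)$; the sketch below assumes user-supplied `encoder`, `decoder_loglik`, and `prior_logp` callables, all of which are illustrative names rather than components of the cited work.

```python
import math
import torch

def latent_nll_estimate(x, encoder, decoder_loglik, prior_logp, K=64):
    """Importance-sampled estimate of -log p(x) = -log ∫ p(x|z) p(z) dz.
    encoder(x) -> (mu, log_var) for q(z|x); decoder_loglik(x, z) and prior_logp(z)
    return per-sample log densities of shape (K, batch)."""
    mu, log_var = encoder(x)                                   # (batch, d) each
    std = (0.5 * log_var).exp()
    z = mu + std * torch.randn(K, *mu.shape)                   # (K, batch, d) samples from q(z|x)
    log_q = -0.5 * (((z - mu) / std) ** 2 + log_var + math.log(2 * math.pi))
    log_q = log_q.sum(-1)                                      # log q(z|x), shape (K, batch)
    log_w = decoder_loglik(x, z) + prior_logp(z) - log_q       # log p(x|z) + log p(z) - log q(z|x)
    return -(torch.logsumexp(log_w, dim=0) - math.log(K))      # tightens toward -log p(x) as K grows
```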
Information-theoretic analysis provides a sharp lower bound on the achievable NLL in terms of the rate-distortion behavior of the data source under this distortion. The difference between the attained NLL and this bound defines an "optimality gap" that determines whether further improvements in NLL are possible via prior or likelihood modification: if the gap is nonzero, improvement is possible by either approach, and the bound is tight at the optimal parameterization.
Empirically, on image datasets, variance in the bound's controlling statistic correlates with the extent to which improvements (e.g., VampPrior vs. standard prior, hierarchical vs. shallow architecture) drive test-set NLL down (Lastras, 2019).
5. NLL for Energy-Based and Unnormalized Models: Compositional Optimization
In energy-based models with unnormalized density $\tilde{p}_\theta(x) = \exp(-E_\theta(x))$, NLL minimization,

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} E_\theta(x_i) + \log Z(\theta),$$

is computationally challenging because the partition function $Z(\theta) = \int \exp(-E_\theta(x))\, dx$ is intractable. Introducing a noise distribution $q(x)$ allows rewriting $Z(\theta) = \mathbb{E}_{x \sim q}\!\left[\exp(-E_\theta(x)) / q(x)\right]$, so that $\log Z(\theta)$ becomes a compositional expectation (a nonlinear function of an inner expectation).
By treating the objective as a two-level stochastic composition, algorithmic advances (MECO) with moving-average estimators yield unbiased or low-bias stochastic gradients and provably fast convergence rates under mild assumptions. On synthetic and vision benchmarks, this approach outperforms noise-contrastive estimation (NCE) in both learning efficiency and density estimation metrics, overcoming NCE's flat-loss pathology when the noise distribution is far from the data manifold (Jiang et al., 2023).
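A minimal sketch of this reformulation is shown below, assuming a fixed noise distribution whose `sample` and `log_prob` return per-sample values, an energy network returning per-sample scalar energies, and a simple exponential moving average as the inner estimator; it illustrates the compositional structure rather than reproducing the exact MECO update.

```python
import torch

def ebm_nll_step(energy_net, x_data, noise_dist, z_ema, beta=0.9, n_noise=256):
    """One stochastic estimate of the EBM NLL, E[E_theta(x)] + log Z(theta), with
    Z(theta) = E_{x~q}[exp(-E_theta(x)) / q(x)] estimated by importance sampling.
    All names here are illustrative."""
    x_noise = noise_dist.sample((n_noise,))
    log_w = -energy_net(x_noise) - noise_dist.log_prob(x_noise)     # log[exp(-E)/q], per sample
    z_hat = log_w.exp().mean()                                      # inner-expectation estimate
    z_ema = beta * z_ema.detach() + (1.0 - beta) * z_hat            # moving-average tracking of Z
    loss = energy_net(x_data).mean() + torch.log(z_ema)             # outer (nonlinear) composition
    return loss, z_ema
```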
6. NLL in Preference Optimization, Contrastive Divergence, and Policy Alignment
Preference optimization (PO) for reinforcement learning from human or synthetic feedback employs an NLL objective over prompt–completion pairs: for a prompt $x$ with preferred completion $y^{+}$, the target distribution takes the Gibbs form $p_\theta(y \mid x) \propto \exp\!\left(r_\theta(x, y)\right)$, with $r_\theta$ as the learned reward function, and the loss is $-\log p_\theta(y^{+} \mid x)$. The gradient of the NLL decomposes into a data term and a model (partition) term:

$$\nabla_\theta \mathcal{L}(\theta) = -\nabla_\theta r_\theta(x, y^{+}) + \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\nabla_\theta r_\theta(x, y)\right].$$
The intractability of the partition term motivates the use of Monte Carlo-Contrastive Divergence (MC-CD) as a sampling strategy for negative completions. Algorithm MC-PO, utilizing a one-step MCMC transition over sampled candidates, yields low-bias gradient estimators and outperforms previous margin-based and contrastive alignment methods on alignment benchmarks (Chen et al., 6 Feb 2025). In the online variant, unbiased gradient estimation leads to further improvement. By sampling candidates with probability proportional to the exponentiated reward, the MC kernel finds "hard negatives" that more faithfully approximate the NLL gradient, improving on alternative heuristics.
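The sketch below shows how a finite candidate pool can approximate this NLL and its partition term; `reward_net`, the candidate set, and the temperature are illustrative assumptions, and this is not the exact MC-PO estimator.

```python
import torch

def preference_nll_loss(reward_net, x, y_pos, y_candidates, temperature=1.0):
    """Approximate -log p(y_pos | x) under p(y | x) ∝ exp(r(x, y)) using the
    preferred completion plus a finite set of sampled negative completions."""
    r_pos = reward_net(x, y_pos)                                     # scalar reward of preferred completion
    r_neg = torch.stack([reward_net(x, y) for y in y_candidates])    # rewards of sampled candidates
    all_r = torch.cat([r_pos.unsqueeze(0), r_neg]) / temperature
    # Data term minus the log-partition estimated over the candidate pool.
    return -(r_pos / temperature - torch.logsumexp(all_r, dim=0))
```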
7. Heteroscedastic NLL for Uncertainty Estimation and Pitfalls in Optimization
For Gaussian conditional modeling with heteroscedastic variance, the per-sample NLL is

$$\mathcal{L}_{\mathrm{NLL}}(x, y) = \frac{1}{2}\!\left[\frac{\left(y - \mu_\theta(x)\right)^2}{\sigma_\theta^2(x)} + \log \sigma_\theta^2(x)\right] + \mathrm{const}.$$
Minimizing this loss may cause optimization pathologies: if the network overestimates $\sigma_\theta^2(x)$ on poorly predicted regions, their gradient contribution becomes negligible, impeding subsequent correction. This degeneracy is particularly acute when feature initialization lacks diversity or when optimization “locks in” early high-error regions with excessive variance (Seitzer et al., 2022).
To counteract this, the $\beta$-NLL loss introduces a soft weighting,

$$\mathcal{L}_{\beta\text{-NLL}}(x, y) = \mathrm{sg}\!\left[\sigma_\theta^2(x)\right]^{\beta} \cdot \mathcal{L}_{\mathrm{NLL}}(x, y),$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator, distributing gradient updates more uniformly across samples. Empirically, intermediate values such as $\beta = 0.5$ yield stable training, reduced RMSE, and improved calibration across regression, VAE, and depth estimation tasks. The mechanism ensures that early mis-fits cannot indefinitely suppress subsequent model correction. However, $\beta$-NLL adjusts only aleatoric uncertainty estimation; epistemic uncertainty remains unaffected (Seitzer et al., 2022).
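A minimal PyTorch sketch of this weighting, assuming a network that outputs a mean and a log-variance per input; the default $\beta = 0.5$ follows the recommendation above, and $\beta = 0$ recovers the standard heteroscedastic NLL.

```python
import torch

def beta_nll_loss(mu, log_var, y, beta=0.5):
    """Heteroscedastic Gaussian NLL with beta-weighting.
    mu, log_var, y: tensors of shape (batch, dims); beta=0 gives standard NLL."""
    var = log_var.exp()
    nll = 0.5 * ((y - mu) ** 2 / var + log_var)   # per-element Gaussian NLL (up to a constant)
    weight = var.detach() ** beta                 # stop-gradient on the variance-based weight
    return (weight * nll).sum(dim=-1).mean()
```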
Negative log-likelihood is indispensable across machine learning, generative modeling, and Bayesian inference, but its statistical properties, practical behavior, and optimization landscape depend strongly on the underlying modeling paradigm, data priors, and application requirements. Careful design of the NLL objective, whether through discriminative ratios, prior-leaning reweighting, information-theoretic bounds, stochastic compositional optimization, or tailored variance weighting, directly impacts model expressiveness, calibration, generalization, and convergence behavior.