
Normalized Maximum Likelihood Distribution

Updated 6 February 2026
  • Normalized Maximum Likelihood (NML) is a universal coding method that defines probability via maximized likelihood and normalization (the Shtarkov sum), balancing model fit with intrinsic complexity.
  • NML achieves minimax regret by normalizing over all possible data sequences, forming the basis for the MDL principle and providing robust, scale-invariant model selection.
  • Extensions such as weighted NML, α-NML, and pNML broaden its applicability, particularly in predictive modeling and deep learning, by addressing computational challenges and enhancing inference accuracy.

The normalized maximum likelihood (NML) distribution is a foundational construct in universal coding, model selection, and statistical inference, providing a formal mechanism for balancing model fit against the intrinsic complexity of a model family. It uniquely achieves minimax regret in data compression and prediction and underpins the modern minimum description length (MDL) principle. The NML is defined via maximization and normalization over all possible data realizations, with the normalization constant—known as the Shtarkov sum—quantifying parametric complexity. In both discrete and continuous settings, NML tightly links information theory, statistics, and learning theory; it gives rise to practical inference criteria and inspires new developments through its predictive and generalized forms.

1. Formal Definition and Minimax Regret

Let $\mathcal{P}_\Theta = \{p_\theta(x) : \theta \in \Theta\}$ denote a parametric model family on data $x \in \mathcal{X}$. For an observed sample (or sequence) $x^n$, the NML density is given by

$$p_{\mathrm{NML}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{C_n}, \qquad \hat\theta(x^n) = \arg\max_{\theta \in \Theta} p_\theta(x^n),$$

with normalization

$$C_n = \sum_{y^n \in \mathcal{X}^n} \max_{\theta \in \Theta} p_\theta(y^n) \quad \text{(discrete)} \qquad \text{or} \qquad C_n = \int_{\mathcal{X}^n} p_{\hat\theta(y^n)}(y^n)\, dy^n \quad \text{(continuous)}$$

(Bickel, 2010, Suzuki et al., 2018, Suzuki et al., 2024).
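As a concrete illustration (a minimal sketch, not drawn from the cited papers), the following Python snippet computes the exact NML distribution for the Bernoulli model, where the Shtarkov sum reduces to a finite sum over the sufficient statistic, the number of ones $k$:

```python
from math import comb, log, exp

def bernoulli_max_loglik(k: int, n: int) -> float:
    """log max_theta p_theta(x^n) for a binary sequence with k ones out of n."""
    if k in (0, n):
        return 0.0                       # maximized likelihood equals 1 at the boundary
    p = k / n
    return k * log(p) + (n - k) * log(1 - p)

def bernoulli_shtarkov_sum(n: int) -> float:
    """C_n = sum over all 2^n sequences of max_theta p_theta(y^n), grouped by the count of ones."""
    return sum(comb(n, k) * exp(bernoulli_max_loglik(k, n)) for k in range(n + 1))

def nml_probability(k: int, n: int) -> float:
    """p_NML(x^n) for one particular sequence containing k ones."""
    return exp(bernoulli_max_loglik(k, n)) / bernoulli_shtarkov_sum(n)

n = 10
print(f"log C_n = {log(bernoulli_shtarkov_sum(n)):.4f}")          # parametric complexity in nats
print(f"p_NML(all-ones sequence) = {nml_probability(n, n):.6f}")
```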

The NML achieves the minimax regret criterion
$$\min_{q} \max_{x^n} \Big[ \log \frac{1}{q(x^n)} - \min_{\theta} \log \frac{1}{p_\theta(x^n)} \Big] = \min_{q} \max_{x^n} \log \frac{p_{\hat\theta(x^n)}(x^n)}{q(x^n)},$$
ensuring that for any data sequence the excess codelength over the best model in the class equals $\log C_n$, a data-independent constant. No explicit prior is assumed, and the optimal $q^*$ is precisely $p_{\mathrm{NML}}$ (Barron et al., 2014, Bondaschi et al., 2022).
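Substituting $q = p_{\mathrm{NML}}$ makes the constant-regret property explicit; the check follows in one line from the definition above:

```latex
\log\frac{p_{\hat\theta(x^n)}(x^n)}{p_{\mathrm{NML}}(x^n)}
  \;=\; \log\frac{p_{\hat\theta(x^n)}(x^n)\, C_n}{p_{\hat\theta(x^n)}(x^n)}
  \;=\; \log C_n
  \qquad \text{for every } x^n,
```

so the regret of $p_{\mathrm{NML}}$ is the same constant $\log C_n$ on every sequence, which is the minimax value.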

2. Parametric Complexity and the Shtarkov Sum

The normalization $C_n$, known as the Shtarkov sum, quantifies the capacity of the model class to fit all possible data sequences:

  • In discrete models, $C_n$ is a finite sum over the countable sample space and can often be computed directly or approximated asymptotically (Boullé et al., 2016, Heck et al., 2014).
  • In continuous models, $C_n$ is an integral that generally diverges unless the data domain or parameter range is restricted. The foundation for its rigorous computation in continuous spaces has been established using the coarea formula from geometric measure theory, which decomposes the space according to the MLE mapping and incorporates a Jacobian determinant as a correction (Suzuki et al., 2024). Specifically,

$$C_n = \int_\Theta p_{\hat\theta\sharp}(\theta)\, d\theta,$$

where $p_{\hat\theta\sharp}(\theta)$ is determined via Hausdorff measures on the level sets of the estimator.
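As a standard worked example of how a restriction restores finiteness (textbook material, not taken from any single cited paper), consider i.i.d. $N(\theta, \sigma^2)$ observations with known $\sigma^2$, with the data domain restricted to sequences whose MLE $\bar y$ falls in an interval $[a, b]$. Factoring the likelihood through the sufficient statistic $\bar y \sim N(\theta, \sigma^2/n)$ collapses the $n$-dimensional integral onto the range of the estimator:

```latex
% Gaussian location family, known variance, MLE restricted to [a, b]
C_n \;=\; \int_{\{y^n :\; \bar y \in [a,b]\}} p_{\bar y}(y^n)\, dy^n
    \;=\; (b - a)\,\sqrt{\frac{n}{2\pi\sigma^2}},
\qquad
\log C_n \;=\; \log(b - a) + \tfrac{1}{2}\log\frac{n}{2\pi\sigma^2}.
```

Without the restriction, the integral over $\bar y$ is infinite, which is exactly the divergence noted above; the $\tfrac{1}{2}\log n$ growth matches the $d = 1$ asymptotics given in Section 3.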

In the MDL context, the total stochastic complexity of $x^n$ under NML splits as

$$L_{\mathrm{NML}}(x^n) = -\log p(x^n; \hat\theta(x^n)) + \log C_n,$$

with the first term assessing model fit and the second penalizing model complexity (Boullé et al., 2016, Suzuki et al., 2018).
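The split can be made concrete with a small brute-force computation (an illustrative sketch; the cited papers use far more efficient recursions and Fourier-based formulas for $C_n$). For a $K$-category multinomial sample, both terms depend only on the observed counts:

```python
from itertools import product
from math import lgamma, log, exp

def log_multinomial_coef(counts):
    """log of the number of sequences sharing these category counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def max_loglik(counts):
    """log max_theta p_theta(x^n) for one sequence: plug in theta_hat_i = k_i / n."""
    n = sum(counts)
    return sum(k * log(k / n) for k in counts if k > 0)

def log_shtarkov_sum(n, K):
    """log C_n by brute force over all count vectors (feasible only for small n and K)."""
    total = 0.0
    for head in product(range(n + 1), repeat=K - 1):
        if sum(head) <= n:
            c = head + (n - sum(head),)
            total += exp(log_multinomial_coef(c) + max_loglik(c))
    return log(total)

counts = (7, 2, 1)                       # observed counts, n = 10, K = 3
n, K = sum(counts), len(counts)
fit = -max_loglik(counts)                # -log p(x^n; theta_hat(x^n))
complexity = log_shtarkov_sum(n, K)      # log C_n
print(f"fit = {fit:.3f} nats, complexity = {complexity:.3f} nats, "
      f"L_NML = {fit + complexity:.3f} nats")
```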

3. Asymptotics and Analytic Calculation

In regular exponential families, the asymptotic behavior of $C_n$ is

$$\log C_n = \frac{d}{2} \log\frac{n}{2\pi} + \log \int_\Theta \sqrt{\det I(\theta)}\, d\theta + o(1),$$

where $d$ is the parameter dimension and $I(\theta)$ the Fisher information (Bondaschi et al., 2022, Suzuki et al., 2018, Bickel, 2010). Recent advances have produced efficient, non-asymptotic formulas for $C_n$ in exponential families using Fourier analysis, which transforms the max-over-$\theta$ operation into a tractable integral (Suzuki et al., 2018). For discrete families such as Bernoulli or multinomial, the computation reduces to finite sums over data summaries (sufficient statistics); for mixtures and continuous exponential families, explicit reparametrization or compactification is generally necessary to guarantee finiteness (Hirai et al., 2012, Hirai et al., 2017, Suzuki et al., 2024).
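For instance, in the Bernoulli family $d = 1$ and $I(\theta) = 1/(\theta(1-\theta))$, so a standard evaluation of the formula above gives

```latex
\int_0^1 \sqrt{\det I(\theta)}\, d\theta
  = \int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}}
  = \pi,
\qquad
\log C_n = \frac{1}{2}\log\frac{n}{2\pi} + \log\pi + o(1).
```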

4. Extensions: Weighted and Luckiness NML

Canonical NML is undefined for many unbounded or singular models (including most location families and overparameterized settings). Weighted NML (WNML), or "luckiness NML" (LNML), remedies this by introducing a weighting ("luckiness") function $\pi(\theta)$ in both numerator and denominator:
$$p_{\mathrm{LNML}}(x^n) = \frac{\sup_{\theta} [\pi(\theta)\, p_\theta(x^n)]}{\sum_{y^n} \sup_\theta [\pi(\theta)\, p_\theta(y^n)]}$$
(Bondaschi et al., 2022, Bibas et al., 2022, Bickel, 2010). This approach regularizes the effective model class (e.g., it is equivalent to ridge regression for linear models under an $\ell_2$ luckiness function), sidestepping the divergence of $C_n$ inherent in unconstrained continuous families.
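A minimal sketch of the construction, under the assumption of a Bernoulli model with luckiness function $\pi(\theta) = \theta^a(1-\theta)^b$ (an illustrative choice, not the ridge-regression case of the cited papers):

```python
from math import comb, log, exp

def sup_weighted_loglik(k: int, n: int, a: float, b: float) -> float:
    """log sup_theta [ pi(theta) p_theta(x^n) ] for a binary sequence with k ones,
    with pi(theta) = theta^a (1 - theta)^b; the supremum sits at (k + a) / (n + a + b)."""
    num, den = k + a, n + a + b
    t = num / den
    val = 0.0
    if num > 0:
        val += num * log(t)
    if den - num > 0:
        val += (den - num) * log(1.0 - t)
    return val

def lnml_probability(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """p_LNML(x^n) for one sequence with k ones; the 2^n sequences in the
    denominator are grouped by their number of ones."""
    Z = sum(comb(n, j) * exp(sup_weighted_loglik(j, n, a, b)) for j in range(n + 1))
    return exp(sup_weighted_loglik(k, n, a, b)) / Z

print(lnml_probability(k=10, n=10))   # the all-ones sequence, shrunk relative to plain NML
```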

The α-NML predictor further interpolates between Bayes-mix (average regret, α=1), NML (worst-case regret, α→∞), and LNML (weighted models), optimizing Rényi divergence of order α as the regret criterion (Bondaschi et al., 2022).

5. Predictive NML and Deep Learning

The predictive NML (pNML) is designed for supervised settings, defining for a new input $x$ (with training set $z^N$) the predictive distribution
$$p_{\mathrm{pNML}}(y \mid x; z^N) = \frac{p_{\hat\theta(z^N, x, y)}(y \mid x)}{\sum_{y' \in \mathcal{Y}} p_{\hat\theta(z^N, x, y')}(y' \mid x)},$$
where $\hat\theta(z^N, x, y)$ is the ML estimate on $z^N$ with $(x, y)$ appended (Bibas et al., 2019, Bibas et al., 2022). pNML achieves minimax pointwise regret for individual test samples, yielding strong guarantees on confidence and robustness.
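As a minimal sanity check of the definition (a feature-free toy case, not the regression or deep-network settings of the cited papers), the following sketch applies pNML to a categorical label model: appending each candidate label, refitting the ML estimate, scoring that label, and normalizing yields an add-one style predictor.

```python
from collections import Counter

def pnml_predict(train_labels, label_set):
    """pNML for a feature-free categorical model: refit the ML estimate with each
    hypothesized label appended, score that label, then normalize over labels."""
    N = len(train_labels)
    counts = Counter(train_labels)
    scores = {}
    for y in label_set:
        theta_hat_y = (counts[y] + 1) / (N + 1)   # ML estimate of P(y) after appending y
        scores[y] = theta_hat_y                   # p_{theta_hat(z^N, y)}(y)
    Z = sum(scores.values())                      # pNML normalizer; log Z is the pointwise regret
    return {y: s / Z for y, s in scores.items()}

train = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]            # 8 ones, 2 zeros
print(pnml_predict(train, label_set=[0, 1]))      # {0: 0.25, 1: 0.75}: an add-one (Laplace) rule
```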

In deep networks, retraining for all candidate labels is intractable. The Deep-pNML method approximates pNML by fine-tuning only the final classification layer for each label hypothesis, yielding robust calibration, improved OOD detection, and adversarial robustness, particularly when the regret metric $R(z^N, x)$ spikes for low-confidence or adversarial examples (Bibas et al., 2019).
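A schematic numpy sketch of this idea follows; it assumes a frozen two-dimensional feature extractor and a plain softmax last layer refit from scratch by full-batch gradient descent, and all names, data, and hyperparameters are illustrative rather than the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_last_layer(F, Y, num_classes, W0, steps=200, lr=0.5):
    """Multinomial logistic regression on frozen features F by full-batch gradient descent."""
    W = W0.copy()
    Y_onehot = np.eye(num_classes)[Y]
    for _ in range(steps):
        P = softmax(F @ W)                       # (N, C) class probabilities
        W -= lr * F.T @ (P - Y_onehot) / len(F)  # cross-entropy gradient step
    return W

def deep_pnml(F_train, Y_train, f_test, num_classes, W0):
    """Approximate pNML: refit only the last layer for each hypothesized test label
    (for simplicity, refit from scratch rather than fine-tune trained weights)."""
    scores = np.zeros(num_classes)
    for y in range(num_classes):
        F = np.vstack([F_train, f_test])
        Y = np.append(Y_train, y)
        W_y = fit_last_layer(F, Y, num_classes, W0)
        scores[y] = softmax(f_test @ W_y)[y]     # p_{theta_hat(z^N, x, y)}(y | x)
    regret = np.log(scores.sum())                # large regret flags unfamiliar inputs
    return scores / scores.sum(), regret

rng = np.random.default_rng(0)
F_train = np.vstack([rng.normal(-1.0, 0.3, (20, 2)), rng.normal(1.0, 0.3, (20, 2))])
Y_train = np.array([0] * 20 + [1] * 20)
W0 = np.zeros((2, 2))                            # feature dim x number of classes
probs, regret = deep_pnml(F_train, Y_train, rng.normal(0.0, 3.0, 2), 2, W0)
print(probs, regret)
```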

6. Model Selection and Practical Applications

NML and its variants define core model selection criteria within MDL. The MDL-optimal model (e.g., the number of clusters in a Gaussian mixture model) minimizes the NML code length $L_{\mathrm{NML}}(x^n; M)$. For Gaussian mixtures, practical computation of the normalizer requires domain restriction (e.g., bounding the norm of means and the eigenvalues of covariances), with well-defined upper bounds on the NML code length providing universal criteria that are invariant under scaling and robust to domain parametrization (Hirai et al., 2012, Hirai et al., 2017).
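As a toy illustration of the selection rule (far simpler than the GMM setting, and not taken from the cited papers), one can compare the NML code length of the full Bernoulli family against a parameter-free fair-coin model and select the shorter code:

```python
from math import comb, log, exp

def bernoulli_nml_code_length(k: int, n: int) -> float:
    """L_NML(x^n) in nats for the full Bernoulli family, given k ones in n flips."""
    def max_loglik(j):
        return 0.0 if j in (0, n) else j * log(j / n) + (n - j) * log(1 - j / n)
    fit = -max_loglik(k)                                        # -log p(x^n; theta_hat)
    log_C = log(sum(comb(n, j) * exp(max_loglik(j)) for j in range(n + 1)))
    return fit + log_C

def fair_coin_code_length(n: int) -> float:
    """A single fixed distribution: no parametric complexity term."""
    return n * log(2)

n, k = 40, 31                                                   # 31 heads in 40 flips
L_full, L_fair = bernoulli_nml_code_length(k, n), fair_coin_code_length(n)
winner = "full Bernoulli family" if L_full < L_fair else "fair coin"
print(f"L_NML(full) = {L_full:.2f} nats, L(fair) = {L_fair:.2f} nats -> select the {winner}")
```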

Further, discrimination information (DI), defined as the log-ratio of NML codes for competing hypotheses, directly quantifies the strength of evidence and has strong asymptotic properties: unlike p-values, DI vanishes in probability under the null and requires neither priors nor averaging over hypothetical data (Bickel, 2010).

7. Bayesian Properties and Connections

Although NML is rooted in minimax regret and universal compression, it admits a Bayes-like mixture representation
$$p_{\mathrm{NML}}(x^n) = \sum_k p(x^n; \theta_k)\, W_{k,n} / C_n,$$
where the weights $W_{k,n}$ may be positive or negative. For certain one-dimensional exponential families, positive weights are possible; in general, a signed prior is required. This representation clarifies the relationship between MDL and Bayesian inference and also provides computational benefits, enabling fast marginal and conditional calculation for coding and prediction (Barron et al., 2014). Asymptotically, NML and Bayesian mixtures with Jeffreys prior coincide, but for finite samples differences persist, especially under sharp model constraints or order restrictions (Heck et al., 2014).

Table: Core NML Variants and Their Domains

| Variant | Definition / Formula | Typical Use |
|---|---|---|
| NML | $p_{\mathrm{NML}}(x^n) = p_{\hat\theta(x^n)}(x^n)/C_n$ | Minimax regret; finite or well-behaved families |
| LNML / WNML | Weighted ML: numerator and denominator use a luckiness function $\pi(\theta)$ | Models with divergent $C_n$; overparametrized settings |
| α-NML | Minimizes Rényi-divergence regret of order α; interpolates between the Bayes mixture and NML | Trading off average and worst-case regret |
| pNML | $p_{\mathrm{pNML}}(y\mid x;z^N) = p_{\hat\theta(z^N,x,y)}(y\mid x)\big/\sum_{y'}p_{\hat\theta(z^N,x,y')}(y'\mid x)$ | Predictive, distribution-free supervised learning |

References

  • “Alpha-NML Universal Predictors” (Bondaschi et al., 2022)
  • “Deep pNML: Predictive Normalized Maximum Likelihood for Deep Neural Networks” (Bibas et al., 2019)
  • “Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models” (Suzuki et al., 2024)
  • “Bayesian Properties of Normalized Maximum Likelihood and its Fast Computation” (Barron et al., 2014)
  • “Statistical inference optimized with respect to the observed sample for single or multiple comparisons” (Bickel, 2010)
  • “Beyond Ridge Regression for Distribution-Free Data” (Bibas et al., 2022)
  • “Normalized Maximum Likelihood Coding for Exponential Family with Its Applications to Optimal Clustering” (Hirai et al., 2012)
  • “Upper Bound on Normalized Maximum Likelihood Codes for Gaussian Mixture Models” (Hirai et al., 2017)
  • “Revisiting enumerative two-part crude MDL for Bernoulli and multinomial distributions” (Boullé et al., 2016)
  • “Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis” (Suzuki et al., 2018)
  • “Testing Order Constraints: Qualitative Differences Between Bayes Factors and Normalized Maximum Likelihood” (Heck et al., 2014)

The NML framework thus constitutes a rigorous cornerstone for universal coding, statistical learning, and robust inference, with increasingly tractable computation and ongoing extensions informing predictive modeling, regularization, and deep learning applications.
