
Normalized Maximum Likelihood Distribution

Updated 6 February 2026
  • Normalized Maximum Likelihood (NML) is a universal coding method that defines probability via maximized likelihood and normalization (the Shtarkov sum), balancing model fit with intrinsic complexity.
  • NML achieves minimax regret by normalizing over all possible data sequences, forming the basis for the MDL principle and providing robust, scale-invariant model selection.
  • Extensions such as weighted NML, α-NML, and pNML broaden its applicability, particularly in predictive modeling and deep learning, by addressing computational challenges and enhancing inference accuracy.

The normalized maximum likelihood (NML) distribution is a foundational construct in universal coding, model selection, and statistical inference, providing a formal mechanism for balancing model fit against the intrinsic complexity of a model family. It uniquely achieves minimax regret in data compression and prediction and underpins the modern minimum description length (MDL) principle. The NML is defined via maximization and normalization over all possible data realizations, with the normalization constant—known as the Shtarkov sum—quantifying parametric complexity. In both discrete and continuous settings, NML tightly links information theory, statistics, and learning theory; it gives rise to practical inference criteria and inspires new developments through its predictive and generalized forms.

1. Formal Definition and Minimax Regret

Let $\mathcal{P}_\Theta = \{p_\theta(x) : \theta \in \Theta\}$ denote a parametric model family on data $x \in \mathcal{X}$. For an observed sample (or sequence) $x^n$, the NML density is given by

$$p_{\mathrm{NML}}(x^n) = \frac{p_{\hat\theta(x^n)}(x^n)}{C_n}, \qquad \hat\theta(x^n) = \arg\max_{\theta \in \Theta} p_\theta(x^n),$$

with normalization

$$C_n = \sum_{y^n \in \mathcal{X}^n} \max_{\theta \in \Theta} p_\theta(y^n) \quad \text{(discrete)} \qquad \text{or} \qquad C_n = \int_{\mathcal{X}^n} p_{\hat\theta(y^n)}(y^n)\, dy^n \quad \text{(continuous)}$$

(Bickel, 2010, Suzuki et al., 2018, Suzuki et al., 2024).
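As a concrete illustration (a minimal sketch, not drawn from the cited papers), the following Python snippet computes the exact NML distribution for the Bernoulli model, where the Shtarkov sum reduces to a finite sum over the sufficient statistic, the number of ones $k$:

```python
from math import comb, log, exp

def bernoulli_max_loglik(k: int, n: int) -> float:
    """log max_theta p_theta(x^n) for a binary sequence with k ones out of n."""
    if k in (0, n):
        return 0.0                       # maximized likelihood equals 1 at the boundary
    p = k / n
    return k * log(p) + (n - k) * log(1 - p)

def bernoulli_shtarkov_sum(n: int) -> float:
    """C_n = sum over all 2^n sequences of max_theta p_theta(y^n), grouped by the count of ones."""
    return sum(comb(n, k) * exp(bernoulli_max_loglik(k, n)) for k in range(n + 1))

def nml_probability(k: int, n: int) -> float:
    """p_NML(x^n) for one particular sequence containing k ones."""
    return exp(bernoulli_max_loglik(k, n)) / bernoulli_shtarkov_sum(n)

n = 10
print(f"log C_n = {log(bernoulli_shtarkov_sum(n)):.4f}")          # parametric complexity in nats
print(f"p_NML(all-ones sequence) = {nml_probability(n, n):.6f}")
```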

The NML achieves the minimax regret criterion
$$\min_{q} \max_{x^n} \Big[ \log \frac{1}{q(x^n)} - \min_{\theta} \log \frac{1}{p_\theta(x^n)} \Big] = \min_{q} \max_{x^n} \log \frac{p_{\hat\theta(x^n)}(x^n)}{q(x^n)},$$
ensuring that for any data sequence the excess codelength over the best model in the class equals $\log C_n$, a data-independent constant. No explicit prior is assumed, and the optimal $q^*$ is precisely $p_{\mathrm{NML}}$ (Barron et al., 2014, Bondaschi et al., 2022).
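Substituting $q = p_{\mathrm{NML}}$ makes the constant-regret property explicit; the check follows in one line from the definition above:

```latex
\log\frac{p_{\hat\theta(x^n)}(x^n)}{p_{\mathrm{NML}}(x^n)}
  \;=\; \log\frac{p_{\hat\theta(x^n)}(x^n)\, C_n}{p_{\hat\theta(x^n)}(x^n)}
  \;=\; \log C_n
  \qquad \text{for every } x^n,
```

so the regret of $p_{\mathrm{NML}}$ is the same constant $\log C_n$ on every sequence, which is the minimax value.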

2. Parametric Complexity and the Shtarkov Sum

The normalization $C_n$, known as the Shtarkov sum, quantifies the capacity of the model class to fit all possible data sequences:

  • In discrete models, $C_n$ is a finite sum over the countable sample space and can often be computed directly or approximated asymptotically (Boullé et al., 2016, Heck et al., 2014).
  • In continuous models, $C_n$ is an integral that generally diverges unless the data domain or parameter range is restricted. The foundation for its rigorous computation in continuous spaces has been established using the coarea formula from geometric measure theory, which decomposes the space according to the MLE mapping and incorporates a Jacobian determinant as a correction (Suzuki et al., 2024). Specifically,

$$C_n = \int_\Theta p_{\hat\theta\sharp}(\theta)\, d\theta,$$

where $p_{\hat\theta\sharp}(\theta)$ is determined via Hausdorff measures on the level sets of the estimator.
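As a standard worked example of how a restriction restores finiteness (textbook material, not taken from any single cited paper), consider i.i.d. $N(\theta, \sigma^2)$ observations with known $\sigma^2$, with the data domain restricted to sequences whose MLE $\bar y$ falls in an interval $[a, b]$. Factoring the likelihood through the sufficient statistic $\bar y \sim N(\theta, \sigma^2/n)$ collapses the $n$-dimensional integral onto the range of the estimator:

```latex
% Gaussian location family, known variance, MLE restricted to [a, b]
C_n \;=\; \int_{\{y^n :\; \bar y \in [a,b]\}} p_{\bar y}(y^n)\, dy^n
    \;=\; (b - a)\,\sqrt{\frac{n}{2\pi\sigma^2}},
\qquad
\log C_n \;=\; \log(b - a) + \tfrac{1}{2}\log\frac{n}{2\pi\sigma^2}.
```

Without the restriction, the integral over $\bar y$ is infinite, which is exactly the divergence noted above; the $\tfrac{1}{2}\log n$ growth matches the $d = 1$ asymptotics given in Section 3.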

In the MDL context, the total stochastic complexity of $x^n$ under NML splits as

$$L_{\mathrm{NML}}(x^n) = -\log p(x^n; \hat\theta(x^n)) + \log C_n,$$

with the first term assessing model fit and the second penalizing model complexity (Boullé et al., 2016, Suzuki et al., 2018).
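The split can be made concrete with a small brute-force computation (an illustrative sketch; the cited papers use far more efficient recursions and Fourier-based formulas for $C_n$). For a $K$-category multinomial sample, both terms depend only on the observed counts:

```python
from itertools import product
from math import lgamma, log, exp

def log_multinomial_coef(counts):
    """log of the number of sequences sharing these category counts."""
    n = sum(counts)
    return lgamma(n + 1) - sum(lgamma(k + 1) for k in counts)

def max_loglik(counts):
    """log max_theta p_theta(x^n) for one sequence: plug in theta_hat_i = k_i / n."""
    n = sum(counts)
    return sum(k * log(k / n) for k in counts if k > 0)

def log_shtarkov_sum(n, K):
    """log C_n by brute force over all count vectors (feasible only for small n and K)."""
    total = 0.0
    for head in product(range(n + 1), repeat=K - 1):
        if sum(head) <= n:
            c = head + (n - sum(head),)
            total += exp(log_multinomial_coef(c) + max_loglik(c))
    return log(total)

counts = (7, 2, 1)                       # observed counts, n = 10, K = 3
n, K = sum(counts), len(counts)
fit = -max_loglik(counts)                # -log p(x^n; theta_hat(x^n))
complexity = log_shtarkov_sum(n, K)      # log C_n
print(f"fit = {fit:.3f} nats, complexity = {complexity:.3f} nats, "
      f"L_NML = {fit + complexity:.3f} nats")
```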

3. Asymptotics and Analytic Calculation

In regular exponential families, the asymptotic behavior of $C_n$ is

$$\log C_n = \frac{d}{2} \log\frac{n}{2\pi} + \log \int_\Theta \sqrt{\det I(\theta)}\, d\theta + o(1),$$

where $d$ is the parameter dimension and $I(\theta)$ the Fisher information (Bondaschi et al., 2022, Suzuki et al., 2018, Bickel, 2010). Recent advances have produced efficient, non-asymptotic formulas for $C_n$ in exponential families using Fourier analysis, which transforms the max-over-$\theta$ operation into a tractable integral (Suzuki et al., 2018). For discrete families such as Bernoulli or multinomial, the computation reduces to finite sums over data summaries (sufficient statistics); for mixtures and continuous exponential families, explicit reparametrization or compactification is generally necessary to guarantee finiteness (Hirai et al., 2012, Hirai et al., 2017, Suzuki et al., 2024).
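For instance, in the Bernoulli family $d = 1$ and $I(\theta) = 1/(\theta(1-\theta))$, so a standard evaluation of the formula above gives

```latex
\int_0^1 \sqrt{\det I(\theta)}\, d\theta
  = \int_0^1 \frac{d\theta}{\sqrt{\theta(1-\theta)}}
  = \pi,
\qquad
\log C_n = \frac{1}{2}\log\frac{n}{2\pi} + \log\pi + o(1).
```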

4. Extensions: Weighted and Luckiness NML

Canonical NML is undefined for many unbounded or singular models (including most location families and overparameterized settings). Weighted NML (WNML), or "luckiness NML" (LNML), remedies this by introducing a weighting ("luckiness") function $\pi(\theta)$ in both numerator and denominator:
$$p_{\mathrm{LNML}}(x^n) = \frac{\sup_{\theta} [\pi(\theta)\, p_\theta(x^n)]}{\sum_{y^n} \sup_\theta [\pi(\theta)\, p_\theta(y^n)]}$$
(Bondaschi et al., 2022, Bibas et al., 2022, Bickel, 2010). This approach regularizes the effective model class (e.g., it is equivalent to ridge regression for linear models under an $\ell_2$ luckiness function), sidestepping the divergence of $C_n$ inherent in unconstrained continuous families.
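A minimal sketch of the construction, under the assumption of a Bernoulli model with luckiness function $\pi(\theta) = \theta^a(1-\theta)^b$ (an illustrative choice, not the ridge-regression case of the cited papers):

```python
from math import comb, log, exp

def sup_weighted_loglik(k: int, n: int, a: float, b: float) -> float:
    """log sup_theta [ pi(theta) p_theta(x^n) ] for a binary sequence with k ones,
    with pi(theta) = theta^a (1 - theta)^b; the supremum sits at (k + a) / (n + a + b)."""
    num, den = k + a, n + a + b
    t = num / den
    val = 0.0
    if num > 0:
        val += num * log(t)
    if den - num > 0:
        val += (den - num) * log(1.0 - t)
    return val

def lnml_probability(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """p_LNML(x^n) for one sequence with k ones; the 2^n sequences in the
    denominator are grouped by their number of ones."""
    Z = sum(comb(n, j) * exp(sup_weighted_loglik(j, n, a, b)) for j in range(n + 1))
    return exp(sup_weighted_loglik(k, n, a, b)) / Z

print(lnml_probability(k=10, n=10))   # the all-ones sequence, shrunk relative to plain NML
```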

The α-NML predictor further interpolates between Bayes-mix (average regret, α=1), NML (worst-case regret, α→∞), and LNML (weighted models), optimizing Rényi divergence of order α as the regret criterion (Bondaschi et al., 2022).

5. Predictive NML and Deep Learning

The predictive NML (pNML) is designed for supervised settings, defining for a new input $x$ (with training set $z^N$) the predictive distribution
$$p_{\mathrm{pNML}}(y \mid x; z^N) = \frac{p_{\hat\theta(z^N, x, y)}(y \mid x)}{\sum_{y' \in \mathcal{Y}} p_{\hat\theta(z^N, x, y')}(y' \mid x)},$$
where $\hat\theta(z^N, x, y)$ is the ML estimate on $z^N$ with $(x, y)$ appended (Bibas et al., 2019, Bibas et al., 2022). pNML achieves minimax pointwise regret for individual test samples, yielding strong guarantees on confidence and robustness.
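As a minimal sanity check of the definition (a feature-free toy case, not the regression or deep-network settings of the cited papers), the following sketch applies pNML to a categorical label model: appending each candidate label, refitting the ML estimate, scoring that label, and normalizing yields an add-one style predictor.

```python
from collections import Counter

def pnml_predict(train_labels, label_set):
    """pNML for a feature-free categorical model: refit the ML estimate with each
    hypothesized label appended, score that label, then normalize over labels."""
    N = len(train_labels)
    counts = Counter(train_labels)
    scores = {}
    for y in label_set:
        theta_hat_y = (counts[y] + 1) / (N + 1)   # ML estimate of P(y) after appending y
        scores[y] = theta_hat_y                   # p_{theta_hat(z^N, y)}(y)
    Z = sum(scores.values())                      # pNML normalizer; log Z is the pointwise regret
    return {y: s / Z for y, s in scores.items()}

train = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]            # 8 ones, 2 zeros
print(pnml_predict(train, label_set=[0, 1]))      # {0: 0.25, 1: 0.75}: an add-one (Laplace) rule
```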

In deep networks, retraining for all candidate labels is intractable. The Deep-pNML method approximates pNML by fine-tuning only the final classification layer for each label hypothesis, yielding robust calibration, improved OOD detection, and adversarial robustness, particularly when the regret metric $R(z^N, x)$ spikes for low-confidence or adversarial examples (Bibas et al., 2019).
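A schematic numpy sketch of this idea follows; it assumes a frozen two-dimensional feature extractor and a plain softmax last layer refit from scratch by full-batch gradient descent, and all names, data, and hyperparameters are illustrative rather than the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fit_last_layer(F, Y, num_classes, W0, steps=200, lr=0.5):
    """Multinomial logistic regression on frozen features F by full-batch gradient descent."""
    W = W0.copy()
    Y_onehot = np.eye(num_classes)[Y]
    for _ in range(steps):
        P = softmax(F @ W)                       # (N, C) class probabilities
        W -= lr * F.T @ (P - Y_onehot) / len(F)  # cross-entropy gradient step
    return W

def deep_pnml(F_train, Y_train, f_test, num_classes, W0):
    """Approximate pNML: refit only the last layer for each hypothesized test label
    (for simplicity, refit from scratch rather than fine-tune trained weights)."""
    scores = np.zeros(num_classes)
    for y in range(num_classes):
        F = np.vstack([F_train, f_test])
        Y = np.append(Y_train, y)
        W_y = fit_last_layer(F, Y, num_classes, W0)
        scores[y] = softmax(f_test @ W_y)[y]     # p_{theta_hat(z^N, x, y)}(y | x)
    regret = np.log(scores.sum())                # large regret flags unfamiliar inputs
    return scores / scores.sum(), regret

rng = np.random.default_rng(0)
F_train = np.vstack([rng.normal(-1.0, 0.3, (20, 2)), rng.normal(1.0, 0.3, (20, 2))])
Y_train = np.array([0] * 20 + [1] * 20)
W0 = np.zeros((2, 2))                            # feature dim x number of classes
probs, regret = deep_pnml(F_train, Y_train, rng.normal(0.0, 3.0, 2), 2, W0)
print(probs, regret)
```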

6. Model Selection and Practical Applications

NML and its variants define core model selection criteria within MDL. The MDL-optimal model (e.g., the number of clusters in a Gaussian mixture model) minimizes the NML code length $L_{\mathrm{NML}}(x^n; M)$. For Gaussian mixtures, practical computation of the normalizer requires domain restriction (e.g., bounding the norm of means and the eigenvalues of covariances), with well-defined upper bounds on the NML code length providing universal criteria that are invariant under scaling and robust to domain parametrization (Hirai et al., 2012, Hirai et al., 2017).
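As a toy illustration of the selection rule (far simpler than the GMM setting, and not taken from the cited papers), one can compare the NML code length of the full Bernoulli family against a parameter-free fair-coin model and select the shorter code:

```python
from math import comb, log, exp

def bernoulli_nml_code_length(k: int, n: int) -> float:
    """L_NML(x^n) in nats for the full Bernoulli family, given k ones in n flips."""
    def max_loglik(j):
        return 0.0 if j in (0, n) else j * log(j / n) + (n - j) * log(1 - j / n)
    fit = -max_loglik(k)                                        # -log p(x^n; theta_hat)
    log_C = log(sum(comb(n, j) * exp(max_loglik(j)) for j in range(n + 1)))
    return fit + log_C

def fair_coin_code_length(n: int) -> float:
    """A single fixed distribution: no parametric complexity term."""
    return n * log(2)

n, k = 40, 31                                                   # 31 heads in 40 flips
L_full, L_fair = bernoulli_nml_code_length(k, n), fair_coin_code_length(n)
winner = "full Bernoulli family" if L_full < L_fair else "fair coin"
print(f"L_NML(full) = {L_full:.2f} nats, L(fair) = {L_fair:.2f} nats -> select the {winner}")
```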

Further, discrimination information (DI), defined as the log-ratio of NML codes for competing hypotheses, directly quantifies the strength of evidence and has strong asymptotic properties: unlike p-values, DI vanishes in probability under the null and requires neither priors nor averaging over hypothetical data (Bickel, 2010).

7. Bayesian Properties and Connections

Although NML is rooted in minimax regret and universal compression, it admits a Bayes-like mixture representation
$$p_{\mathrm{NML}}(x^n) = \sum_k p(x^n; \theta_k)\, W_{k,n} / C_n,$$
where the weights $W_{k,n}$ may be positive or negative. For certain one-dimensional exponential families, positive weights are possible; in general, a signed prior is required. This representation clarifies the relationship between MDL and Bayesian inference and also provides computational benefits, enabling fast marginal and conditional calculation for coding and prediction (Barron et al., 2014). Asymptotically, NML and Bayesian mixtures with Jeffreys prior coincide, but for finite samples differences persist, especially under sharp model constraints or order restrictions (Heck et al., 2014).

Table: Core NML Variants and Their Domains

| Variant | Definition / Formula | Typical Use |
|---|---|---|
| NML | $p_{\mathrm{NML}}(x^n) = p_{\hat\theta(x^n)}(x^n)/C_n$ | Minimax regret; finite or well-behaved families |
| LNML / WNML | Weighted ML: numerator and denominator use a luckiness function $\pi(\theta)$ | Models with divergent $C_n$; overparametrized settings |
| α-NML | Minimizes Rényi-divergence regret of order α; interpolates between the Bayes mixture and NML | Trading off average and worst-case regret |
| pNML | $p_{\mathrm{pNML}}(y\mid x;z^N) = p_{\hat\theta(z^N,x,y)}(y\mid x)\big/\sum_{y'}p_{\hat\theta(z^N,x,y')}(y'\mid x)$ | Predictive, distribution-free supervised learning |

References

  • “Alpha-NML Universal Predictors” (Bondaschi et al., 2022)
  • “Deep pNML: Predictive Normalized Maximum Likelihood for Deep Neural Networks” (Bibas et al., 2019)
  • “Foundation of Calculating Normalized Maximum Likelihood for Continuous Probability Models” (Suzuki et al., 2024)
  • “Bayesian Properties of Normalized Maximum Likelihood and its Fast Computation” (Barron et al., 2014)
  • “Statistical inference optimized with respect to the observed sample for single or multiple comparisons” (Bickel, 2010)
  • “Beyond Ridge Regression for Distribution-Free Data” (Bibas et al., 2022)
  • “Normalized Maximum Likelihood Coding for Exponential Family with Its Applications to Optimal Clustering” (Hirai et al., 2012)
  • “Upper Bound on Normalized Maximum Likelihood Codes for Gaussian Mixture Models” (Hirai et al., 2017)
  • “Revisiting enumerative two-part crude MDL for Bernoulli and multinomial distributions” (Boullé et al., 2016)
  • “Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis” (Suzuki et al., 2018)
  • “Testing Order Constraints: Qualitative Differences Between Bayes Factors and Normalized Maximum Likelihood” (Heck et al., 2014)

The NML framework thus constitutes a rigorous cornerstone for universal coding, statistical learning, and robust inference, with increasingly tractable computation and ongoing extensions informing predictive modeling, regularization, and deep learning applications.
