Normalized Maximum Likelihood (NML)
- Normalized Maximum Likelihood (NML) is a universal statistical modeling principle that normalizes the maximum likelihood over all data samples to achieve minimax optimal regret.
- It underpins model selection under the Minimum Description Length principle by balancing data fit and model complexity through a rigorous normalization constant.
- Extensions like luckiness-weighted NML and α-NML address computational challenges and non-existence issues in continuous and high-dimensional settings.
Normalized Maximum Likelihood (NML) is a universal statistical modeling principle that achieves minimax optimality with respect to regret by normalizing the maximum-likelihood function over all possible data samples. It provides an objective, non-asymptotic, and parameter-free foundation for model selection, coding, prediction, and evidence quantification under the Minimum Description Length (MDL) principle, and serves as the cornerstone of modern universal coding theory. NML’s minimax regret property, its computational and representational challenges in continuous domains, and its principled extensions (e.g., luckiness-weighted NML, quotient-NML, α-NML) have fueled a large literature spanning statistics, information theory, machine learning, and computational biology.
1. Formal Definition and Minimax Regret
Given a parametric family $\{p(x^n \mid \theta) : \theta \in \Theta\}$ for samples $x^n = (x_1, \ldots, x_n)$, the normalized maximum likelihood density is

$$
p_{\mathrm{NML}}(x^n) \;=\; \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{\sum_{y^n} p\!\left(y^n \mid \hat{\theta}(y^n)\right)},
$$

where $\hat{\theta}(x^n)$ denotes the maximum-likelihood estimate computed from $x^n$ (the sum is replaced by an integral for continuous data). The denominator $C_n = \sum_{y^n} p(y^n \mid \hat{\theta}(y^n))$ (the Shtarkov normalization constant, whose logarithm is the parametric complexity) aggregates the maximized likelihood over all possible data samples.
The key optimality property, established by Shtarkov, is that $p_{\mathrm{NML}}$ uniquely achieves the minimax pointwise regret:

$$
p_{\mathrm{NML}} \;=\; \arg\min_{q}\; \max_{x^n}\, \log \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{q(x^n)},
\qquad
\log \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{p_{\mathrm{NML}}(x^n)} \;=\; \log C_n \quad \text{for every } x^n.
$$

This yields a universal code—one that matches the ideal (oracle) code for every sample, up to a constant penalty $\log C_n$ independent of $x^n$ (Bickel, 2010, Barron et al., 2014, Rosas et al., 2020).
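To make the definition concrete, the following minimal Python sketch (our own illustration, not drawn from the cited works) computes the exact NML distribution for an i.i.d. Bernoulli model, where the Shtarkov sum over all $2^n$ binary sequences collapses to a sum over the number of ones:

```python
# A minimal sketch (our own, not from the cited papers): exact NML for an i.i.d.
# Bernoulli model of length n.  The Shtarkov sum over all 2^n binary sequences
# collapses to a sum over the number of ones k, because the maximized likelihood
# of a sequence depends only on k.
from math import comb, exp, log

def log_max_likelihood(k: int, n: int) -> float:
    """log p(x^n | theta_hat(x^n)) for a sequence with k ones; theta_hat = k/n."""
    return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)

def log_parametric_complexity(n: int) -> float:
    """log C_n: log of the maximized likelihood summed over all sequences."""
    return log(sum(comb(n, k) * exp(log_max_likelihood(k, n)) for k in range(n + 1)))

def log_nml(k: int, n: int) -> float:
    """log p_NML(x^n); identical for every sequence with the same count k."""
    return log_max_likelihood(k, n) - log_parametric_complexity(n)

n = 20
print(f"log C_n = {log_parametric_complexity(n):.4f}")
# p_NML sums to one over all 2^n sequences ...
total = sum(comb(n, k) * exp(log_nml(k, n)) for k in range(n + 1))
print(f"total probability = {total:.6f}")
# ... and every sequence attains regret exactly log C_n:
#   log p(x^n | theta_hat) - log p_NML(x^n) = log C_n.
```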
2. Properties and Regret Analysis
- Stochastic Complexity: The negative log of the NML distribution provides a two-term code length,

  $$
  -\log p_{\mathrm{NML}}(x^n) \;=\; -\log p\!\left(x^n \mid \hat{\theta}(x^n)\right) \;+\; \log C_n,
  $$

  where $\log C_n$ is the parametric complexity.
- Minimax Regret: For any code $q$, the worst-case regret $\max_{x^n} \log\!\big[\, p(x^n \mid \hat{\theta}(x^n)) / q(x^n) \,\big]$ is lower bounded by $\log C_n$; NML attains this bound, with regret equal to $\log C_n$ for every $x^n$.
- Invariance: The NML criterion is invariant to reparametrization and to relabeling of the data space (for finite sample spaces) (Boullé et al., 2016, Bickel, 2010).
- Asymptotics: For regular $k$-parameter exponential families,

  $$
  \log C_n \;=\; \frac{k}{2}\,\log\frac{n}{2\pi} \;+\; \log \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta \;+\; o(1),
  $$

  with $I(\theta)$ the Fisher information matrix (Suzuki et al., 2018, Bickel, 2010, Fukuzawa et al., 29 Aug 2025).
- Finite Sample Effects: In small-sample regimes, the parametric complexity and stochastic complexity deviate from BIC-type penalties and can materially influence model selection and hypothesis testing (Boullé et al., 2016, Tavory, 2018); a small numerical comparison follows this list.
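The finite-sample point can be checked numerically. The sketch below (our own, for the one-parameter Bernoulli family, where the Fisher-information integral equals $\pi$) compares the exact parametric complexity with the Rissanen asymptotic expansion and with the BIC-style penalty $\tfrac{k}{2}\log n$:

```python
# Numerical check (our own sketch, Bernoulli case, k = 1 parameter): exact
# parametric complexity log C_n versus the Rissanen asymptotic expansion
#   (k/2) log(n / (2*pi)) + log \int sqrt(det I(theta)) dtheta
# and versus the BIC-style penalty (k/2) log n.  For the Bernoulli model the
# Fisher-information integral equals pi.
from math import comb, exp, log, pi

def log_parametric_complexity(n: int) -> float:
    def log_ml(k: int) -> float:   # maximized log-likelihood for k ones out of n
        return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)
    return log(sum(comb(n, k) * exp(log_ml(k)) for k in range(n + 1)))

for n in (10, 50, 200, 1000):
    exact = log_parametric_complexity(n)
    rissanen = 0.5 * log(n / (2 * pi)) + log(pi)
    bic = 0.5 * log(n)
    print(f"n={n:5d}  exact={exact:6.3f}  asymptotic={rissanen:6.3f}  BIC={bic:6.3f}")
```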
3. Computation in Practice and Extensions
NML’s generically intractable normalization—an exponentially large sum or high-dimensional integral—has motivated algorithmic strategies and theoretical extensions:
| Domain/Model | Key Computation Approach | Citation |
|---|---|---|
| Finite discrete models | Direct sum, recurrences, asymptotics | (Boullé et al., 2016, Bickel, 2010) |
| Exponential families | Fourier analysis, saddlepoint, density of MLE | (Suzuki et al., 2018, Hirai et al., 2012, Suzuki et al., 2024) |
| Continuous models | Reparametrization, coarea formula, luckiness regularization | (Suzuki et al., 2024, Hirai et al., 2012, Miyaguchi, 2017) |
| Riemannian manifold data | Riemannian volume measures, asymptotic Fisher info | (Fukuzawa et al., 29 Aug 2025) |
| High-dimensional settings | Approximations, re-normalization (e.g., restricted domains for GMMs) | (Hirai et al., 2017) |
With continuous and unbounded parameter spaces, the NML normalization may diverge; remedies include restricting the domain, introducing "luckiness" weight functions (LNML), or employing robust alternatives such as α-NML (Miyaguchi, 2017, Suzuki et al., 2024, Bondaschi et al., 2022).
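To see why the normalization can diverge, consider the Gaussian location model with known variance $\sigma^2$ (a standard example; the derivation below is our own sketch). Rotating coordinates so that one axis is aligned with the sample mean $\bar{y}$ shows that the residual directions integrate to a finite constant while the mean direction does not, and that restricting $|\bar{y}| \le R$ restores finiteness:

$$
C_n \;=\; \int_{\mathbb{R}^n} p\!\left(y^n \mid \hat{\mu}(y^n)\right) dy^n
\;=\; \sqrt{\frac{n}{2\pi\sigma^2}} \int_{\mathbb{R}} d\bar{y} \;=\; \infty,
\qquad
C_n^{(R)} \;=\; \sqrt{\frac{n}{2\pi\sigma^2}} \int_{-R}^{R} d\bar{y} \;=\; 2R\,\sqrt{\frac{n}{2\pi\sigma^2}} \;<\; \infty.
$$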
4. Luckiness, Generalizations, and Surrogate Criteria
- Luckiness-weighted NML (LNML): Augments the model class with a luckiness weight function $w(\theta)$:

  $$
  p_{\mathrm{LNML}}(x^n) \;=\; \frac{\sup_{\theta}\, p(x^n \mid \theta)\, w(\theta)}{\sum_{y^n} \sup_{\theta}\, p(y^n \mid \theta)\, w(\theta)}.
  $$

  LNML is uniquely minimax for the luckiness-weighted regret and enables NML-type inference when the ordinary normalization is infinite (Bickel, 2010, Miyaguchi, 2017, Bibas et al., 2022); a worked Gaussian-location example follows this list.
- Quotient-NML (qNML): For Bayesian networks, quotient-NML constructs a decomposable, hyperparameter-free, and score-equivalent criterion using ratios of "local" 1D-NMLs (Silander et al., 2024).
- α-NML: Generalizes NML to minimize Rényi-divergence–based regret, interpolating between mixture (Bayesian) and worst-case (NML) predictors, and robust when NML is inapplicable (Bondaschi et al., 2022).
- Weighted NML (NMWL): Used for multiple hypothesis testing; it incorporates side information or pseudo-data to yield robust discrimination information and control over multiple comparisons (Bickel, 2010).
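As a concrete illustration of the LNML remedy above, the following Python sketch (our own worked example, not from the cited papers) evaluates the LNML code length for the Gaussian location model with known variance $\sigma^2$ and a Gaussian luckiness weight $w(\mu) = \mathcal{N}(\mu; 0, \tau^2)$, for which both the inner maximization over $\mu$ and the normalizer admit closed forms:

```python
# A hedged LNML sketch (our own worked example): Gaussian location model with
# known variance sigma^2 and Gaussian luckiness weight w(mu) = N(mu; 0, tau^2).
# Maximizing p(x^n | mu) * w(mu) over mu gives a ridge-type estimate, and the
# LNML normalizer has the closed form (derived for this specific model)
#   C_n^w = sqrt((n * tau^2 + sigma^2) / (2 * pi * sigma^2 * tau^2)),
# which is finite even though the plain NML normalizer (w == const) diverges.
from math import log, pi
from statistics import mean

def lnml_code_length(x, sigma=1.0, tau=1.0):
    """-log p_LNML(x^n) = -log max_mu [p(x^n | mu) w(mu)] + log C_n^w."""
    n = len(x)
    xbar = mean(x)
    rss = sum((xi - xbar) ** 2 for xi in x)
    # negative log of the maximized weighted likelihood (closed form)
    neg_log_pw = (0.5 * n * log(2 * pi * sigma ** 2) + rss / (2 * sigma ** 2)
                  + 0.5 * log(2 * pi * tau ** 2)
                  + n * xbar ** 2 / (2 * (n * tau ** 2 + sigma ** 2)))
    log_Cw = 0.5 * log((n * tau ** 2 + sigma ** 2) / (2 * pi * sigma ** 2 * tau ** 2))
    return neg_log_pw + log_Cw

data = [0.3, -0.8, 1.1, 0.4, -0.2]
print(f"LNML code length: {lnml_code_length(data):.3f} nats")
```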
5. NML and Model Selection
NML code length forms the foundation of objective, parameter-free model selection under the MDL principle. By encoding both data fit (via the maximized likelihood) and model complexity (via the parametric complexity term), the NML criterion embodies a rigorous Occam’s razor, penalizing over-flexible models more heavily than BIC/AIC in finite samples (Boullé et al., 2016, Rosas et al., 2020); a toy two-model comparison follows the list below. It is used to select:
- Model order in PCA: Closed-form NML bounds enable non-asymptotic rank selection (Tavory, 2018).
- Number of clusters in GMMs: NML or re-normalized NML yields higher accuracy and robustness than classical information criteria (Hirai et al., 2012, Hirai et al., 2017).
- Feature sets in maximum-entropy models: NML quantifies both complexity and fit, connecting to the minimax entropy principle (Pandey et al., 2012).
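The Occam trade-off can be seen in a toy instance (our own example): deciding between a zero-parameter fair-coin model and the one-parameter Bernoulli model by comparing NML code lengths in nats.

```python
# Toy MDL model selection (our own example, not from the cited papers):
# M0 = fair coin (zero parameters), M1 = Bernoulli with a free bias parameter.
# M1 pays the parametric complexity log C_n on top of its maximized likelihood;
# MDL picks whichever model assigns the shorter code length to the data.
from math import comb, exp, log

def log_ml_bernoulli(k: int, n: int) -> float:
    """Maximized Bernoulli log-likelihood for k ones out of n."""
    return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)

def log_complexity_bernoulli(n: int) -> float:
    """log C_n for the Bernoulli model, by exact summation over counts."""
    return log(sum(comb(n, k) * exp(log_ml_bernoulli(k, n)) for k in range(n + 1)))

def code_length_fair_coin(n: int) -> float:
    return n * log(2)                      # -log 2^{-n}

def code_length_bernoulli_nml(k: int, n: int) -> float:
    return -log_ml_bernoulli(k, n) + log_complexity_bernoulli(n)

n, k = 40, 27                              # e.g. 27 heads in 40 tosses
print(f"M0 (fair coin)      : {code_length_fair_coin(n):.2f} nats")
print(f"M1 (Bernoulli + NML): {code_length_bernoulli_nml(k, n):.2f} nats")
```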
6. Sequential Prediction, Universal Coding, and Bayesian Connections
NML admits a (possibly signed) mixture representation over parameter values, bridging MDL and Bayesian approaches—even though it is generally not a Bayes marginal with respect to any genuine nonnegative prior. This decomposition enables linear-time computation of marginals and predictive distributions in exponential family models (Barron et al., 2014). NML-based predictors and classifiers are minimax regret optimal for universal prediction and coding, delivering finite-sample PAC-type guarantees and automatic regularization, especially in small-sample regimes (Rosas et al., 2020).
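One standard way to turn NML into an online prediction rule is the sequential (conditional) NML construction, in which the probability of the next symbol is proportional to the maximized likelihood of the sequence extended by that symbol. The Bernoulli sketch below is our own illustration of this idea and is not claimed to reproduce the exact predictors of the cited works:

```python
# Sequential (conditional) NML-style predictor for a Bernoulli source: the
# probability of the next symbol is proportional to the maximized likelihood
# of the observed sequence extended by that symbol.
from math import log

def max_likelihood(k: int, n: int) -> float:
    """Maximized Bernoulli likelihood of a sequence with k ones out of n."""
    if n == 0 or k in (0, n):
        return 1.0
    p = k / n
    return p ** k * (1 - p) ** (n - k)

def snml_predict_one(k: int, n: int) -> float:
    """P(next = 1 | k ones in n symbols): normalize the two extended maxima."""
    up = max_likelihood(k + 1, n + 1)      # sequence extended by a 1
    down = max_likelihood(k, n + 1)        # sequence extended by a 0
    return up / (up + down)

# Accumulate the code length of a sequence under the sequential predictor.
seq = [1, 0, 1, 1, 0, 1, 1, 1]
k, code = 0, 0.0
for n, x in enumerate(seq):
    p1 = snml_predict_one(k, n)
    code += -log(p1 if x == 1 else 1.0 - p1)
    k += x
print(f"sequential NML code length: {code:.3f} nats")
```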
7. Theoretical Limitations and Geometry
- Non-existence: NML (without regularization) is undefined for many continuous unbounded models, including univariate/multivariate Gaussians, due to normalization divergence (Miyaguchi, 2017, Suzuki et al., 2024).
- Geometric Measure Theory: The rigorous extension of NML normalization to continuous models requires the coarea formula, accounting for the pushforward density of the MLE and avoiding the failure of naive integral decompositions (Suzuki et al., 2024).
- Riemannian NML: For data in non-Euclidean spaces (e.g., hyperbolic embeddings), NML must be defined over the intrinsic Riemannian volume, with correspondingly invariant code lengths and Fisher information geometry (Fukuzawa et al., 29 Aug 2025).
8. Applications and Empirical Impact
NML underpins compression-based learning, supervised classification with optimal small-sample risk, robust multiple hypothesis testing, objective model selection for PCA/rank selection, maximum entropy modeling, and structure learning in graphical models. NML and its variants, such as luckiness-NML and α-NML, exhibit superior robustness, improved sample efficiency, principled regularization, and minimax guarantees in both theoretical analysis and empirical benchmarks (Bibas et al., 2022, Rosas et al., 2020, Fukuzawa et al., 29 Aug 2025, Silander et al., 2024, Boullé et al., 2016).
In summary, NML furnishes a universal, minimax-regret–optimal, and formidably general framework for model complexity quantification, statistical inference, and predictive modeling across discrete and continuous, parametric and nonparametric, Euclidean and non-Euclidean domains. Its extensions and algorithmic refinements address intractability and non-existence in high-dimensional, unbounded, or geometrically structured settings, cementing its central position in contemporary statistical modeling and information-theoretic inference.