Normalized Maximum Likelihood (NML)
- Normalized Maximum Likelihood (NML) is a universal statistical modeling principle that normalizes the maximum likelihood over all data samples to achieve minimax optimal regret.
- It underpins model selection under the Minimum Description Length principle by balancing data fit and model complexity through a rigorous normalization constant.
- Extensions like luckiness-weighted NML and α-NML address computational challenges and non-existence issues in continuous and high-dimensional settings.
Normalized Maximum Likelihood (NML) is a universal statistical modeling principle that achieves minimax optimality with respect to regret by normalizing the maximum-likelihood function over all possible data samples. It provides an objective, non-asymptotic, and parameter-free foundation for model selection, coding, prediction, and evidence quantification under the Minimum Description Length (MDL) principle, and serves as the cornerstone of modern universal coding theory. NML’s minimax regret property, its computational and representational challenges in continuous domains, and its principled extensions (e.g., luckiness-weighted NML, quotient-NML, α-NML) have fueled a large literature spanning statistics, information theory, machine learning, and computational biology.
1. Formal Definition and Minimax Regret
Given a parametric family $\{p(x^n \mid \theta) : \theta \in \Theta\}$ for samples $x^n = (x_1, \ldots, x_n)$, the normalized maximum likelihood density is

$$
p_{\mathrm{NML}}(x^n) \;=\; \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{\sum_{y^n} p\!\left(y^n \mid \hat{\theta}(y^n)\right)},
$$

where $\hat{\theta}(x^n)$ denotes the maximum-likelihood estimate computed from $x^n$ (the sum is replaced by an integral for continuous data). The denominator $C_n = \sum_{y^n} p(y^n \mid \hat{\theta}(y^n))$ (the Shtarkov normalization constant, whose logarithm is the parametric complexity) aggregates the maximized likelihood over all possible data samples.
The key optimality property, established by Shtarkov, is that $p_{\mathrm{NML}}$ uniquely achieves the minimax pointwise regret:

$$
p_{\mathrm{NML}} \;=\; \arg\min_{q}\; \max_{x^n}\, \log \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{q(x^n)},
\qquad
\log \frac{p\!\left(x^n \mid \hat{\theta}(x^n)\right)}{p_{\mathrm{NML}}(x^n)} \;=\; \log C_n \quad \text{for every } x^n.
$$

This yields a universal code—one that matches the ideal (oracle) code for every sample, up to a constant penalty $\log C_n$ independent of $x^n$ (Bickel, 2010, Barron et al., 2014, Rosas et al., 2020).
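To make the definition concrete, the following minimal Python sketch (our own illustration, not drawn from the cited works) computes the exact NML distribution for an i.i.d. Bernoulli model, where the Shtarkov sum over all $2^n$ binary sequences collapses to a sum over the number of ones:

```python
# A minimal sketch (our own, not from the cited papers): exact NML for an i.i.d.
# Bernoulli model of length n.  The Shtarkov sum over all 2^n binary sequences
# collapses to a sum over the number of ones k, because the maximized likelihood
# of a sequence depends only on k.
from math import comb, exp, log

def log_max_likelihood(k: int, n: int) -> float:
    """log p(x^n | theta_hat(x^n)) for a sequence with k ones; theta_hat = k/n."""
    return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)

def log_parametric_complexity(n: int) -> float:
    """log C_n: log of the maximized likelihood summed over all sequences."""
    return log(sum(comb(n, k) * exp(log_max_likelihood(k, n)) for k in range(n + 1)))

def log_nml(k: int, n: int) -> float:
    """log p_NML(x^n); identical for every sequence with the same count k."""
    return log_max_likelihood(k, n) - log_parametric_complexity(n)

n = 20
print(f"log C_n = {log_parametric_complexity(n):.4f}")
# p_NML sums to one over all 2^n sequences ...
total = sum(comb(n, k) * exp(log_nml(k, n)) for k in range(n + 1))
print(f"total probability = {total:.6f}")
# ... and every sequence attains regret exactly log C_n:
#   log p(x^n | theta_hat) - log p_NML(x^n) = log C_n.
```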
2. Properties and Regret Analysis
- Stochastic Complexity: The negative log of the NML distribution provides a two-term code length,

  $$
  -\log p_{\mathrm{NML}}(x^n) \;=\; -\log p\!\left(x^n \mid \hat{\theta}(x^n)\right) \;+\; \log C_n,
  $$

  where $\log C_n$ is the parametric complexity.
- Minimax Regret: For any code $q$, the worst-case regret $\max_{x^n} \log\!\big[\, p(x^n \mid \hat{\theta}(x^n)) / q(x^n) \,\big]$ is lower bounded by $\log C_n$; NML attains this bound, with regret equal to $\log C_n$ for every $x^n$.
- Invariance: The NML criterion is invariant to reparametrization and to relabeling of the data space (for finite sample spaces) (Boullé et al., 2016, Bickel, 2010).
- Asymptotics: For regular $k$-parameter exponential families,

  $$
  \log C_n \;=\; \frac{k}{2}\,\log\frac{n}{2\pi} \;+\; \log \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta \;+\; o(1),
  $$

  with $I(\theta)$ the Fisher information matrix (Suzuki et al., 2018, Bickel, 2010, Fukuzawa et al., 29 Aug 2025).
- Finite Sample Effects: In small-sample regimes, the parametric complexity and stochastic complexity deviate from BIC-type penalties and can materially influence model selection and hypothesis testing (Boullé et al., 2016, Tavory, 2018); a small numerical comparison follows this list.
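The finite-sample point can be checked numerically. The sketch below (our own, for the one-parameter Bernoulli family, where the Fisher-information integral equals $\pi$) compares the exact parametric complexity with the Rissanen asymptotic expansion and with the BIC-style penalty $\tfrac{k}{2}\log n$:

```python
# Numerical check (our own sketch, Bernoulli case, k = 1 parameter): exact
# parametric complexity log C_n versus the Rissanen asymptotic expansion
#   (k/2) log(n / (2*pi)) + log \int sqrt(det I(theta)) dtheta
# and versus the BIC-style penalty (k/2) log n.  For the Bernoulli model the
# Fisher-information integral equals pi.
from math import comb, exp, log, pi

def log_parametric_complexity(n: int) -> float:
    def log_ml(k: int) -> float:   # maximized log-likelihood for k ones out of n
        return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)
    return log(sum(comb(n, k) * exp(log_ml(k)) for k in range(n + 1)))

for n in (10, 50, 200, 1000):
    exact = log_parametric_complexity(n)
    rissanen = 0.5 * log(n / (2 * pi)) + log(pi)
    bic = 0.5 * log(n)
    print(f"n={n:5d}  exact={exact:6.3f}  asymptotic={rissanen:6.3f}  BIC={bic:6.3f}")
```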
3. Computation in Practice and Extensions
NML’s generically intractable normalization—an exponentially large sum or high-dimensional integral—has motivated algorithmic strategies and theoretical extensions:
| Domain/Model | Key Computation Approach | Citation |
|---|---|---|
| Finite discrete models | Direct sum, recurrences, asymptotics | (Boullé et al., 2016, Bickel, 2010) |
| Exponential families | Fourier analysis, saddlepoint, density of MLE | (Suzuki et al., 2018, Hirai et al., 2012, Suzuki et al., 2024) |
| Continuous models | Reparametrization, coarea formula, luckiness regularization | (Suzuki et al., 2024, Hirai et al., 2012, Miyaguchi, 2017) |
| Riemannian manifold data | Riemannian volume measures, asymptotic Fisher info | (Fukuzawa et al., 29 Aug 2025) |
| High-dimensional settings | Approximations, re-normalization (e.g., restricted domains for GMMs) | (Hirai et al., 2017) |
With continuous and unbounded parameter spaces, the NML normalization may diverge; remedies include restricting the domain, introducing "luckiness" weight functions (LNML), or employing robust alternatives such as α-NML (Miyaguchi, 2017, Suzuki et al., 2024, Bondaschi et al., 2022).
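To see why the normalization can diverge, consider the Gaussian location model with known variance $\sigma^2$ (a standard example; the derivation below is our own sketch). Rotating coordinates so that one axis is aligned with the sample mean $\bar{y}$ shows that the residual directions integrate to a finite constant while the mean direction does not, and that restricting $|\bar{y}| \le R$ restores finiteness:

$$
C_n \;=\; \int_{\mathbb{R}^n} p\!\left(y^n \mid \hat{\mu}(y^n)\right) dy^n
\;=\; \sqrt{\frac{n}{2\pi\sigma^2}} \int_{\mathbb{R}} d\bar{y} \;=\; \infty,
\qquad
C_n^{(R)} \;=\; \sqrt{\frac{n}{2\pi\sigma^2}} \int_{-R}^{R} d\bar{y} \;=\; 2R\,\sqrt{\frac{n}{2\pi\sigma^2}} \;<\; \infty.
$$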
4. Luckiness, Generalizations, and Surrogate Criteria
- Luckiness-weighted NML (LNML): Augments the model class with a luckiness weight function $w(\theta)$:

  $$
  p_{\mathrm{LNML}}(x^n) \;=\; \frac{\sup_{\theta}\, p(x^n \mid \theta)\, w(\theta)}{\sum_{y^n} \sup_{\theta}\, p(y^n \mid \theta)\, w(\theta)}.
  $$

  LNML is uniquely minimax for the luckiness-weighted regret and enables NML-type inference when the ordinary normalization is infinite (Bickel, 2010, Miyaguchi, 2017, Bibas et al., 2022); a worked Gaussian-location example follows this list.
- Quotient-NML (qNML): For Bayesian networks, quotient-NML constructs a decomposable, hyperparameter-free, and score-equivalent criterion using ratios of "local" 1D-NMLs (Silander et al., 2024).
- α-NML: Generalizes NML to minimize Rényi-divergence–based regret, interpolating between mixture (Bayesian) and worst-case (NML) predictors, and robust when NML is inapplicable (Bondaschi et al., 2022).
- Weighted NML (NMWL): Used for multiple hypothesis testing; it incorporates side information or pseudo-data to yield robust discrimination information and control over multiple comparisons (Bickel, 2010).
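As a concrete illustration of the LNML remedy above, the following Python sketch (our own worked example, not from the cited papers) evaluates the LNML code length for the Gaussian location model with known variance $\sigma^2$ and a Gaussian luckiness weight $w(\mu) = \mathcal{N}(\mu; 0, \tau^2)$, for which both the inner maximization over $\mu$ and the normalizer admit closed forms:

```python
# A hedged LNML sketch (our own worked example): Gaussian location model with
# known variance sigma^2 and Gaussian luckiness weight w(mu) = N(mu; 0, tau^2).
# Maximizing p(x^n | mu) * w(mu) over mu gives a ridge-type estimate, and the
# LNML normalizer has the closed form (derived for this specific model)
#   C_n^w = sqrt((n * tau^2 + sigma^2) / (2 * pi * sigma^2 * tau^2)),
# which is finite even though the plain NML normalizer (w == const) diverges.
from math import log, pi
from statistics import mean

def lnml_code_length(x, sigma=1.0, tau=1.0):
    """-log p_LNML(x^n) = -log max_mu [p(x^n | mu) w(mu)] + log C_n^w."""
    n = len(x)
    xbar = mean(x)
    rss = sum((xi - xbar) ** 2 for xi in x)
    # negative log of the maximized weighted likelihood (closed form)
    neg_log_pw = (0.5 * n * log(2 * pi * sigma ** 2) + rss / (2 * sigma ** 2)
                  + 0.5 * log(2 * pi * tau ** 2)
                  + n * xbar ** 2 / (2 * (n * tau ** 2 + sigma ** 2)))
    log_Cw = 0.5 * log((n * tau ** 2 + sigma ** 2) / (2 * pi * sigma ** 2 * tau ** 2))
    return neg_log_pw + log_Cw

data = [0.3, -0.8, 1.1, 0.4, -0.2]
print(f"LNML code length: {lnml_code_length(data):.3f} nats")
```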
5. NML and Model Selection
NML code length forms the foundation of objective, parameter-free model selection under the MDL principle. By encoding both data fit (via the maximized likelihood) and model complexity (via the parametric complexity term), the NML criterion embodies a rigorous Occam’s razor, penalizing over-flexible models more heavily than BIC/AIC in finite samples (Boullé et al., 2016, Rosas et al., 2020); a toy two-model comparison follows the list below. It is used to select:
- Model order in PCA: Closed-form NML bounds enable non-asymptotic rank selection (Tavory, 2018).
- Number of clusters in GMMs: NML or re-normalized NML yields higher accuracy and robustness than classical information criteria (Hirai et al., 2012, Hirai et al., 2017).
- Feature sets in maximum-entropy models: NML quantifies both complexity and fit, connecting to the minimax entropy principle (Pandey et al., 2012).
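The Occam trade-off can be seen in a toy instance (our own example): deciding between a zero-parameter fair-coin model and the one-parameter Bernoulli model by comparing NML code lengths in nats.

```python
# Toy MDL model selection (our own example, not from the cited papers):
# M0 = fair coin (zero parameters), M1 = Bernoulli with a free bias parameter.
# M1 pays the parametric complexity log C_n on top of its maximized likelihood;
# MDL picks whichever model assigns the shorter code length to the data.
from math import comb, exp, log

def log_ml_bernoulli(k: int, n: int) -> float:
    """Maximized Bernoulli log-likelihood for k ones out of n."""
    return 0.0 if k in (0, n) else k * log(k / n) + (n - k) * log((n - k) / n)

def log_complexity_bernoulli(n: int) -> float:
    """log C_n for the Bernoulli model, by exact summation over counts."""
    return log(sum(comb(n, k) * exp(log_ml_bernoulli(k, n)) for k in range(n + 1)))

def code_length_fair_coin(n: int) -> float:
    return n * log(2)                      # -log 2^{-n}

def code_length_bernoulli_nml(k: int, n: int) -> float:
    return -log_ml_bernoulli(k, n) + log_complexity_bernoulli(n)

n, k = 40, 27                              # e.g. 27 heads in 40 tosses
print(f"M0 (fair coin)      : {code_length_fair_coin(n):.2f} nats")
print(f"M1 (Bernoulli + NML): {code_length_bernoulli_nml(k, n):.2f} nats")
```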
6. Sequential Prediction, Universal Coding, and Bayesian Connections
NML admits a (possibly signed) mixture representation over parameter values, bridging MDL and Bayesian approaches—even though it is generally not a Bayes marginal with respect to any genuine nonnegative prior. This decomposition enables linear-time computation of marginals and predictive distributions in exponential family models (Barron et al., 2014). NML-based predictors and classifiers are minimax regret optimal for universal prediction and coding, delivering finite-sample PAC-type guarantees and automatic regularization, especially in small-sample regimes (Rosas et al., 2020).
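One standard way to turn NML into an online prediction rule is the sequential (conditional) NML construction, in which the probability of the next symbol is proportional to the maximized likelihood of the sequence extended by that symbol. The Bernoulli sketch below is our own illustration of this idea and is not claimed to reproduce the exact predictors of the cited works:

```python
# Sequential (conditional) NML-style predictor for a Bernoulli source: the
# probability of the next symbol is proportional to the maximized likelihood
# of the observed sequence extended by that symbol.
from math import log

def max_likelihood(k: int, n: int) -> float:
    """Maximized Bernoulli likelihood of a sequence with k ones out of n."""
    if n == 0 or k in (0, n):
        return 1.0
    p = k / n
    return p ** k * (1 - p) ** (n - k)

def snml_predict_one(k: int, n: int) -> float:
    """P(next = 1 | k ones in n symbols): normalize the two extended maxima."""
    up = max_likelihood(k + 1, n + 1)      # sequence extended by a 1
    down = max_likelihood(k, n + 1)        # sequence extended by a 0
    return up / (up + down)

# Accumulate the code length of a sequence under the sequential predictor.
seq = [1, 0, 1, 1, 0, 1, 1, 1]
k, code = 0, 0.0
for n, x in enumerate(seq):
    p1 = snml_predict_one(k, n)
    code += -log(p1 if x == 1 else 1.0 - p1)
    k += x
print(f"sequential NML code length: {code:.3f} nats")
```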
7. Theoretical Limitations and Geometry
- Non-existence: NML (without regularization) is undefined for many continuous unbounded models, including univariate/multivariate Gaussians, due to normalization divergence (Miyaguchi, 2017, Suzuki et al., 2024).
- Geometric Measure Theory: The rigorous extension of NML normalization to continuous models requires the coarea formula, accounting for the pushforward density of the MLE and avoiding the failure of naive integral decompositions (Suzuki et al., 2024).
- Riemannian NML: For data in non-Euclidean spaces (e.g., hyperbolic embeddings), NML must be defined over the intrinsic Riemannian volume, with correspondingly invariant code lengths and Fisher information geometry (Fukuzawa et al., 29 Aug 2025).
8. Applications and Empirical Impact
NML underpins compression-based learning, supervised classification with optimal small-sample risk, robust multiple hypothesis testing, objective model selection for PCA/rank selection, maximum entropy modeling, and structure learning in graphical models. NML and its variants, such as luckiness-NML and α-NML, exhibit superior robustness, improved sample efficiency, principled regularization, and minimax guarantees in both theoretical analysis and empirical benchmarks (Bibas et al., 2022, Rosas et al., 2020, Fukuzawa et al., 29 Aug 2025, Silander et al., 2024, Boullé et al., 2016).
In summary, NML furnishes a universal, minimax-regret–optimal, and formidably general framework for model complexity quantification, statistical inference, and predictive modeling across discrete and continuous, parametric and nonparametric, Euclidean and non-Euclidean domains. Its extensions and algorithmic refinements address intractability and non-existence in high-dimensional, unbounded, or geometrically structured settings, cementing its central position in contemporary statistical modeling and information-theoretic inference.