Normalized Maximum Likelihood (NML) Code-Length

Updated 11 April 2026

Normalized Maximum Likelihood (NML) code-length is defined as the negative log of the maximized likelihood plus the log of a normalizing complexity term, which makes it minimax-regret optimal for model selection.
It extends to both discrete and continuous models via summation or integration techniques, leveraging tools like the coarea formula for accurate computation.
NML underpins MDL-based model selection in various domains by balancing data fit and model complexity, and by offering connections to Bayesian mixture representations.

The normalized maximum likelihood (NML) code-length is the canonical criterion in the minimum description length (MDL) framework for universal model selection and data compression. Defined for a parametric model by combining the maximum likelihood fit to the observed data with a universal penalty—the parametric complexity—NML achieves minimax optimality with respect to worst-case code-length regret. Its theoretical definition, exact formulas, integral representations, advanced computation methods, and implications for the tractability and optimality of statistical model selection are central to information-theoretic statistics and the theory of universal coding.

1. Formal Definition and Minimax Regret Optimality

Given a statistical model class $\{p(x^n; \theta) : \theta \in \Theta\}$ over sequences $x^n$ in a sample or data space $\mathcal{X}^n$, the normalized maximum likelihood density (or probability mass) is defined by

$p_{\mathrm{NML}}(x^n) = \frac{p(x^n; \hat{\theta}(x^n))}{C_n}$

where $\hat{\theta}(x^n)$ is the maximum likelihood estimator (MLE) for data $x^n$, and the normalization $C_n$ (parametric complexity) is

$C_n = \int_{y^n \in \mathcal{X}^n} p(y^n; \hat{\theta}(y^n)) \, dy^n$

(or $\sum$ over $\mathcal{X}^n$ if discrete). The associated stochastic complexity or code-length is

$L_{\mathrm{NML}}(x^n) = -\log p_{\mathrm{NML}}(x^n) = -\log p(x^n; \hat{\theta}(x^n)) + \log C_n$

This form extends directly to continuous (Lebesgue) or Riemannian volume measures as appropriate.

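The NML code-length achieves the minimax pointwise regret: $\min_{q} \max_{x^n \in \mathcal{X}^n} \left[ -\log q(x^n) + \log p(x^n; \hat{\theta}(x^n)) \right] = \log C_n$, with $q = p_{\mathrm{NML}}$ being the unique solution (Suzuki et al., 2024, Li, 2023, Barron et al., 2014).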
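The definition and the constant-regret property can be checked by brute force on a tiny case. The sketch below assumes a Bernoulli model and enumerates all binary sequences of length $n$; the function names are illustrative only, and the approach is feasible only for very small $n$.

```python
import itertools
import math

def max_likelihood(seq):
    """Maximized Bernoulli likelihood p(x^n; theta_hat(x^n)) for a 0/1 sequence."""
    n, h = len(seq), sum(seq)
    theta = h / n  # MLE; Python evaluates 0.0 ** 0 as 1.0, matching the 0^0 = 1 convention
    return theta ** h * (1.0 - theta) ** (n - h)

def nml_distribution(n):
    """Return (C_n, {sequence: p_NML(sequence)}) by enumerating all 2**n sequences."""
    sequences = list(itertools.product((0, 1), repeat=n))
    c_n = sum(max_likelihood(s) for s in sequences)
    return c_n, {s: max_likelihood(s) / c_n for s in sequences}

n = 6
c_n, p_nml = nml_distribution(n)

# Pointwise regret of the NML code: log p(x^n; theta_hat) - log p_NML(x^n).
regrets = [math.log(max_likelihood(s)) - math.log(p) for s, p in p_nml.items()]

print("C_n =", c_n)
print("total probability =", sum(p_nml.values()))
print("min/max regret =", min(regrets), max(regrets), "log C_n =", math.log(c_n))
```

Every sequence incurs the same regret, $\log C_n$, which is what makes the worst case as small as possible.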

2. Discrete and Continuous Model Formulations

Discrete Models

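$x^n \in \mathcal{X}^n$ with $\mathcal{X}$ finite or countable.
NML code-length:

$L_{\mathrm{NML}}(x^n) = -\log p(x^n; \hat{\theta}(x^n)) + \log \sum_{y^n \in \mathcal{X}^n} p(y^n; \hat{\theta}(y^n))$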

Example: Bernoulli, multinomial, and categorical families yield explicit multinomial coefficient representations and enable recurrence-based computation of the normalizer (Boullé et al., 2016, Kobayashi et al., 2024).
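As an illustration of the multinomial coefficient representation, the sketch below computes the normalizer of a $K$-category multinomial model by summing over count vectors rather than over all $K^n$ sequences. The helper names `compositions` and `multinomial_complexity` are hypothetical, chosen for this example.

```python
import math

def compositions(n, k):
    """Yield all count vectors (h_1, ..., h_k) of nonnegative integers summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def multinomial_complexity(n, k):
    """C_n for a k-category multinomial model, grouped by sufficient statistic."""
    total = 0.0
    for counts in compositions(n, k):
        # Number of sequences sharing these counts: n! / (h_1! ... h_k!).
        coeff = math.factorial(n)
        for h in counts:
            coeff //= math.factorial(h)
        # Maximized likelihood of any such sequence: prod_j (h_j / n) ** h_j.
        ml = math.prod((h / n) ** h for h in counts)
        total += coeff * ml
    return total

print(multinomial_complexity(2, 2))  # Bernoulli case, n = 2
print(multinomial_complexity(2, 3))  # three categories, n = 2
```

For $n = 2$ and three categories, the three constant sequences contribute likelihood 1 each and the six mixed sequences contribute 1/4 each, so the sum is 4.5.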

Continuous Models and the Coarea Theorem

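$\mathcal{X}^n$ is a subset of Euclidean space; $p(\cdot; \theta)$ absolutely continuous with respect to Lebesgue measure.
Naïvely replacing summations by integrals is invalid:
- The level sets $\{x^n : \hat{\theta}(x^n) = \theta\}$ have Lebesgue measure zero.
- Jacobian correction is required.
The coarea formula from geometric measure theory resolves this:

$C_n = \int_{\Theta} g(\theta; \theta) \, d\theta$

where $g(\hat{\theta}; \theta)$ is the pushforward density of the MLE under $p(\cdot; \theta)$, obtained by integrating $p(x^n; \theta)$, with the Jacobian correction, over the corresponding level sets (Suzuki et al., 2024).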

Practical Impact

The normalization term can now be computed by a $k$-dimensional integral over parameter space, where $k$ is the number of parameters, instead of a high-dimensional integral over the data space $\mathcal{X}^n$.
This establishes the correctness of the "integral-of-MLE-density-over-parameter-space" method for continuous models under mild regularity (Lipschitzness, nondegenerate Jacobian), not just asymptotically (Suzuki et al., 2024).
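A standard worked case shows how the parameter-space integral is used. The derivation below assumes a Gaussian location model with known variance $\sigma^2$ and the mean restricted to an interval $[a, b]$ (with the data restricted so that the sample mean falls in that interval); these choices are illustrative assumptions, not taken from the cited work.

```latex
\begin{align*}
\hat{\mu}(x^n) &= \bar{x} = \tfrac{1}{n} \textstyle\sum_{i=1}^{n} x_i,
  \qquad \hat{\mu} \sim \mathcal{N}(\mu, \sigma^2 / n), \\
g(\hat{\mu}; \mu) &= \sqrt{\frac{n}{2\pi\sigma^2}}
  \exp\!\left( -\frac{n (\hat{\mu} - \mu)^2}{2\sigma^2} \right), \\
g(\mu; \mu) &= \sqrt{\frac{n}{2\pi\sigma^2}}, \\
C_n &= \int_a^b g(\mu; \mu) \, d\mu = (b - a) \sqrt{\frac{n}{2\pi\sigma^2}}.
\end{align*}
```

The one-dimensional integral over $\mu$ replaces an $n$-dimensional integral over the data, and the result grows without bound as the interval $[a, b]$ widens.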

3. Analytic, Asymptotic, and Algorithmic Techniques

Asymptotic Expansion

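For regular $k$-parameter exponential family models, asymptotic expansion yields

$L_{\mathrm{NML}}(x^n) = n H(\hat{\theta}(x^n)) + \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)} \, d\theta + o(1)$

where $H(\hat{\theta})$ is the Shannon entropy (or differential entropy for continuous case) of the fitted distribution, so that the first term plays the role of the fit term $-\log p(x^n; \hat{\theta}(x^n))$, and $I(\theta)$ is the Fisher information matrix. The complexity penalty thus refines both BIC and two-part MDL through the explicit Fisher integral (Li, 2023, Hirai et al., 2012, Boullé et al., 2016).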
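To see how close the asymptotic penalty is at finite sample sizes, the sketch below compares the exact log-complexity of the Bernoulli model with the asymptotic expression. It assumes $k = 1$ and uses the fact that the Bernoulli Fisher information is $1 / (\theta (1 - \theta))$, whose square root integrates to $\pi$ over $[0, 1]$; function names are illustrative.

```python
import math

def log_binomial_complexity(n):
    """Exact log C_n for the Bernoulli model, summed over head counts in log space."""
    log_terms = []
    for h in range(n + 1):
        log_coeff = math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
        log_ml = 0.0
        if h > 0:
            log_ml += h * math.log(h / n)
        if h < n:
            log_ml += (n - h) * math.log((n - h) / n)
        log_terms.append(log_coeff + log_ml)
    peak = max(log_terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def asymptotic_log_complexity(n):
    """(k/2) log(n / (2 pi)) + log(Fisher integral), with k = 1 and integral = pi."""
    return 0.5 * math.log(n / (2.0 * math.pi)) + math.log(math.pi)

for n in (10, 100, 1000):
    exact = log_binomial_complexity(n)
    approx = asymptotic_log_complexity(n)
    print(n, exact, approx, exact - approx)
```

The gap shrinks as $n$ grows but remains visible for small samples, consistent with the caveat that asymptotic formulas can misestimate the complexity at small $n$.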

Computational Methods

Discrete models: Binomial and multinomial complexities admit closed forms or recurrences (e.g., Kontkanen–Myllymäki algorithm) (Boullé et al., 2016, Kobayashi et al., 2024).
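The recurrence-based computation can be sketched as follows. Writing $C(K, n)$ for the multinomial complexity with $K$ categories and sample size $n$, the recurrence $C(K+2, n) = C(K+1, n) + \frac{n}{K} C(K, n)$ is started from $C(1, n) = 1$ and the binomial sum for $C(2, n)$. The function names are illustrative, and the direct float arithmetic is only suitable for moderate $n$.

```python
import math

def binomial_complexity(n):
    """C(2, n): exact sum over head counts h = 0..n."""
    return sum(
        math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )

def multinomial_complexity_recurrence(n, k):
    """C(k, n) via the recurrence C(j + 2, n) = C(j + 1, n) + (n / j) * C(j, n)."""
    if k == 1:
        return 1.0
    prev, curr = 1.0, binomial_complexity(n)  # C(1, n) and C(2, n)
    for j in range(1, k - 1):
        prev, curr = curr, curr + (n / j) * prev
    return curr

for k in (2, 3, 4, 10):
    print(k, multinomial_complexity_recurrence(20, k))
```

After the initial binomial sum, each additional category costs a constant number of operations, so the total work is linear in $n + K$.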

Continuous/exponential families: Available tools include asymptotic Laplace-type approximations and Fourier-analytic integral representations (including non-asymptotic forms for exponential families via characteristic function and partition function techniques) (Suzuki et al., 2018).

Divergence in Continuous Models: RNML ("renormalized NML") and LNML ("NML with luckiness") modify the denominator by restricting the domain or by weighting with a prior ('luckiness'), ensuring finite normalization even for unbounded parameter spaces such as multivariate normal models (Miyaguchi, 2017, Alipourfard et al., 2018, Hirai et al., 2012).

Geometric Generalization: On Riemannian manifolds, the Rm-NML generalizes the concept with intrinsic measure, achieving invariance under coordinate transformations, essential for manifolds such as hyperbolic spaces (Fukuzawa et al., 29 Aug 2025).

4. Exact, Finite-Sample, and Mixture Representations

NML can be represented as a (possibly signed) Bayesian mixture over a sufficient set of parameter points, constructed to exactly reproduce the maximized likelihood and normalizer (Barron et al., 2014). This representation speeds marginalization and sequential prediction, although signed weights are algebraic artifacts.

Key results:

For exponential families, the mixture representation revolves around solving

$\sum_{j} w_j \exp\left( -n D(\hat{\theta} \,\|\, \theta_j) \right) = \frac{1}{C_n}$

for all possible MLEs $\hat{\theta}$, where $w_j$ are the (possibly signed) mixture weights on the parameter points $\theta_j$ and $D(\hat{\theta} \,\|\, \theta_j)$ is the Kullback–Leibler divergence.

Fast sequential coding and predictive updating are enabled, reducing computational cost from exponential in data size to linear in the sufficient statistic's range (Barron et al., 2014).
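A small numerical sketch of the mixture idea for the Bernoulli model: choose $n + 1$ parameter points and solve a linear system so that the weighted mixture reproduces $p_{\mathrm{NML}}$ for every head count. The grid of points (the possible MLE values $0, 1/n, \ldots, 1$) and the hand-written solver are assumptions made for this illustration, not the construction of the cited paper. The system is written with likelihoods directly, which is equivalent to the divergence form after dividing by the maximized likelihood.

```python
import math

def solve_linear(a, b):
    """Solve a w = b by Gaussian elimination with partial pivoting (small systems)."""
    m = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(aug[r][c] * w[c] for c in range(r + 1, m))
        w[r] = (aug[r][m] - tail) / aug[r][r]
    return w

def likelihood(h, n, theta):
    """Bernoulli likelihood of any length-n sequence containing h ones."""
    return theta ** h * (1.0 - theta) ** (n - h)

n = 4
thetas = [j / n for j in range(n + 1)]  # assumed mixture points
c_n = sum(math.comb(n, h) * likelihood(h, n, h / n) for h in range(n + 1))

# One equation per head count h: sum_j w_j p(x^n; theta_j) = p_NML(x^n).
a = [[likelihood(h, n, t) for t in thetas] for h in range(n + 1)]
b = [likelihood(h, n, h / n) / c_n for h in range(n + 1)]
weights = solve_linear(a, b)

print("weights:", weights)
print("sum of weights:", sum(weights))
```

Because both the mixture and $p_{\mathrm{NML}}$ sum to one over all sequences, the weights sum to one, whether or not every individual weight is nonnegative.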

5. Specialized Applications and Model Selection Implications

Model Selection via MDL Principle

NML code-length is central to MDL-based model selection. In practice, the model minimizing the NML code-length $L_{\mathrm{NML}}(x^n) = -\log p_{\mathrm{NML}}(x^n)$ is selected. For exponential families, maximum entropy models, PCA rank estimation, clustering (via GMM), and Bayesian network learning, minimization of $L_{\mathrm{NML}}(x^n)$ over the candidate model classes directly implements the MDL paradigm (Hirai et al., 2012, Pandey et al., 2012, Tavory, 2018, Alipourfard et al., 2018).
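A toy selection problem makes the rule concrete. The sketch below compares two hypothetical candidates for binary data: a fair-coin model with no free parameter (so its complexity is 1) and the full Bernoulli model. The candidate set and function names are assumptions for illustration.

```python
import math

def bernoulli_nml_code_length(h, n):
    """L_NML in nats for the full Bernoulli model: -log max-likelihood + log C_n."""
    def max_lik(j):
        return (j / n) ** j * ((n - j) / n) ** (n - j)
    c_n = sum(math.comb(n, j) * max_lik(j) for j in range(n + 1))
    return -math.log(max_lik(h)) + math.log(c_n)

def fair_coin_code_length(n):
    """Code-length in nats under the single distribution theta = 1/2 (C_n = 1)."""
    return n * math.log(2.0)

def select_model(h, n):
    """Return the name and code-length of the candidate with the shorter code."""
    full = bernoulli_nml_code_length(h, n)
    fair = fair_coin_code_length(n)
    return ("bernoulli", full) if full < fair else ("fair_coin", fair)

print(select_model(50, 100))  # balanced data: the complexity penalty favors the fair coin
print(select_model(90, 100))  # skewed data: the better fit outweighs the penalty
```

With balanced data both models fit equally well, so the penalty $\log C_n$ decides; with strongly skewed data the gain in fit dominates.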

Causal Inference

NML-based stochastic complexity is deployed in causal discovery, e.g., selecting between latent confounding and direct causality via model comparison among NML code-lengths, with rigorous guarantees of statistical consistency (Kobayashi et al., 2024).

Statistical Physics

In model comparison between canonical and microcanonical ensembles (e.g., maximum entropy subject to hard vs soft constraints), NML code-length reveals the non-equivalence of ensembles through parametric complexity differences, with consequences for per-node description length in the thermodynamic limit (Giuffrida et al., 2023).

Bayes-NML Connections

Mixture codes with least-favorable priors (e.g., Jeffreys prior) can asymptotically match NML code-length for regular models.
NML is Bayes-optimal for uniform priors in microcanonical models and for Jeffreys prior in single-parameter canonical models, but deviates with extensive constraints (Li, 2023, Giuffrida et al., 2023).
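The asymptotic agreement between a Jeffreys mixture and NML can be inspected numerically for the Bernoulli model, where the Jeffreys prior is Beta(1/2, 1/2). The sketch below compares the regret of the mixture code with $\log C_n$ at an interior head count and at the boundary; the function names are illustrative.

```python
import math

def log_max_likelihood(h, n):
    """log p(x^n; theta_hat) for a sequence with h ones."""
    out = 0.0
    if h > 0:
        out += h * math.log(h / n)
    if h < n:
        out += (n - h) * math.log((n - h) / n)
    return out

def log_jeffreys_mixture(h, n):
    """log of the Beta(1/2, 1/2) mixture probability of a sequence with h ones."""
    return (math.lgamma(h + 0.5) + math.lgamma(n - h + 0.5)
            - math.lgamma(n + 1.0) - math.log(math.pi))

def log_binomial_complexity(n):
    """Exact log C_n for the Bernoulli model via log-sum-exp over head counts."""
    terms = [
        math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
        + log_max_likelihood(h, n)
        for h in range(n + 1)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

n = 1000
nml_regret = log_binomial_complexity(n)  # the same for every sequence
interior = log_max_likelihood(n // 2, n) - log_jeffreys_mixture(n // 2, n)
boundary = log_max_likelihood(0, n) - log_jeffreys_mixture(0, n)

print("NML regret:", nml_regret)
print("Jeffreys regret at h = n/2:", interior)
print("Jeffreys regret at h = 0:", boundary)
```

At interior counts the two regrets nearly coincide, while at the boundary the mixture pays an extra amount close to $\frac{1}{2} \log 2$, which is why the match holds for regular (interior) cases.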

6. Technical Subtleties, Limitations, and Extensions

In unbounded continuous models, $C_n$ diverges unless renormalization or restriction (domain/prior weighting) is introduced; LNML and RNML are principled solutions, optimizing a tilted minimax regret (Miyaguchi, 2017, Alipourfard et al., 2018).
Asymptotic approximations (Laplace, Stirling) are valid under regularity but may misestimate parametric complexity in small-sample or boundary cases (Li, 2023).
For non-Euclidean data spaces (e.g., Riemannian, hyperbolic manifolds), the coordinate-invariant Rm-NML formulation incorporates the manifold's metric via the volume element, ensuring the geometric correctness of parametric complexity (Fukuzawa et al., 29 Aug 2025).
Fourier-based constructions yield non-asymptotic formulas, in cases where the partition function is analytic, especially for exponential families (Suzuki et al., 2018).

7. Summary Table: Key NML Code-Length Formulae

Context (Model class)	General NML Code-Length	Parametric Complexity Term
Discrete	$-\log p(x^n; \hat{\theta}(x^n)) + \log C_n$	$C_n = \sum_{y^n \in \mathcal{X}^n} p(y^n; \hat{\theta}(y^n))$
Continuous (Euclidean)	$-\log p(x^n; \hat{\theta}(x^n)) + \log C_n$	$C_n = \int_{\mathcal{X}^n} p(y^n; \hat{\theta}(y^n)) \, dy^n$
Continuous (Coarea-pushforward)	$-\log p(x^n; \hat{\theta}(x^n)) + \log \int_{\Theta} g(\theta; \theta) \, d\theta$	$g(\theta; \theta)$: MLE pushforward density
Exponential family (asymptotic)	$n H(\hat{\theta}) + \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)} \, d\theta + o(1)$	$\frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)} \, d\theta$
Riemannian manifold	$-\log p(x^n; \hat{\theta}(x^n)) + \log C_n$, with densities taken with respect to the Riemannian volume measure	intrinsic, coordinate-invariant parametric complexity

All code-length formulas strictly separate model fit (negative log-likelihood at the MLE) from the (typically model-dependent) parametric complexity term, which captures the worst-case excess regret and quantifies Occam’s penalty in model selection.

In conclusion, the normalized maximum likelihood code-length provides a unified, information-theoretic, and decision-theoretic basis for model selection and universal lossless compression. Subtleties in the continuous case, now resolved via the coarea formula and MLE pushforward, cement the mathematical foundation of the method for general models (Suzuki et al., 2024). As a result, NML is central both to the rigorous understanding and to the practical implementation of the minimum description length principle (Li, 2023, Alipourfard et al., 2018).