Luckiness-weighted NML
- LNML is a generalized form of Normalized Maximum Likelihood that incorporates a non-negative luckiness weight, ensuring a proper minimax regret solution.
- The method regularizes models, particularly in continuous or high-dimensional settings, by modifying the likelihood with a luckiness function to avoid divergence.
- Practical applications of LNML include multivariate normal models, discrete memoryless sources, and ridge-regression-like scenarios, bridging traditional NML and Bayesian techniques.
Luckiness-weighted Normalized Maximum Likelihood (LNML) is a generalized universal distribution extending the normalized maximum likelihood (NML) to parametric models where NML is ill-defined or divergent, particularly continuous or high-capacity settings. LNML introduces a non-negative "luckiness" or weight function over the parameter space, regularizing the model and ensuring well-posedness of the minimax regret solution. LNML appears in statistical inference, coding theory, model selection, and recent advances in regularized estimation and high-dimensional settings.
1. Formal Definition and Core Properties
Given a parametric family $\{p_\theta : \theta \in \Theta\}$ and a sample $x^n = (x_1, \dots, x_n)$, standard NML is defined as
$$p_{\mathrm{NML}}(x^n) = \frac{\sup_{\theta} p_\theta(x^n)}{C_n}, \qquad C_n = \int \sup_{\theta} p_\theta(y^n)\, dy^n.$$
If $C_n$ diverges (e.g., for Gaussian models), NML is not defined. LNML replaces the maximum likelihood in both numerator and denominator with a luckiness-weighted form using a non-negative weight (luckiness) function $w(\theta)$:
$$p_{\mathrm{LNML}}(x^n) = \frac{\sup_{\theta} p_\theta(x^n)\, w(\theta)}{C_n^w}, \qquad C_n^w = \int \sup_{\theta} p_\theta(y^n)\, w(\theta)\, dy^n.$$
With $w \equiv 1$, LNML reduces to ordinary NML. LNML is the unique pointwise minimax solution to the luckiness regret
$$\mathrm{REG}_w(q; x^n) = \log \frac{\sup_{\theta} p_\theta(x^n)\, w(\theta)}{q(x^n)},$$
so that
$$p_{\mathrm{LNML}} = \operatorname*{arg\,min}_{q} \sup_{x^n} \mathrm{REG}_w(q; x^n).$$
LNML always yields a proper (normalized) distribution assuming integrability of the weighted likelihood.
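As a concrete illustration, the definition can be evaluated exactly for a Bernoulli model by enumerating count classes. A minimal sketch, assuming a Beta-type luckiness $w(\theta) = \theta^a (1-\theta)^a$ (an illustrative choice of weight, not prescribed by the sources):

```python
import math

def lnml_bernoulli(n, a=0.5):
    """LNML over length-n binary sequences with luckiness w(t) = t^a (1-t)^a.

    The weighted likelihood t^(k+a) (1-t)^(n-k+a) of a sequence with k ones
    is maximized at t* = (k+a)/(n+2a), a 'tilted' maximum-likelihood estimate.
    Returns (probs, C): probs[k] is the LNML probability of any single
    sequence containing k ones, and C is the normalizer C_n^w.
    """
    def weighted_max(k):
        t = (k + a) / (n + 2 * a)
        return t ** (k + a) * (1 - t) ** (n - k + a)

    # Normalizer C_n^w: sum the weighted maximum over all 2^n sequences,
    # grouping the C(n, k) sequences that share the same count k.
    C = sum(math.comb(n, k) * weighted_max(k) for k in range(n + 1))
    return [weighted_max(k) / C for k in range(n + 1)], C

probs, C = lnml_bernoulli(10)
total = sum(math.comb(10, k) * p for k, p in enumerate(probs))  # sums to 1
```

By construction, every sequence attains regret exactly $\log C_n^w$, the constant minimax value.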
2. Minimax Regret, Asymptotics, and Interpretations
The regret of a distribution $q$ under the luckiness-weighted regime is defined as
$$\mathrm{REG}_w(q; x^n) = \log \frac{\sup_{\theta} p_\theta(x^n)\, w(\theta)}{q(x^n)},$$
where $w(\theta) \geq 0$. The worst-case regret of LNML is
$$\sup_{x^n} \mathrm{REG}_w(p_{\mathrm{LNML}}; x^n) = \log C_n^w,$$
so LNML achieves constant regret determined by the log normalization term. For regular (smooth) parametric families of dimension $k$, the asymptotic expansion of $\log C_n^w$ under the Laplace method is
$$\log C_n^w = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\; w(\theta)\, d\theta + o(1),$$
where $I(\theta)$ is the Fisher information matrix. This yields the same leading minimax $\frac{k}{2} \log n$ growth as NML for appropriate $w$.
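The expansion can be checked numerically for the one-dimensional Bernoulli family, where $\sqrt{I(\theta)} = 1/\sqrt{\theta(1-\theta)}$ and, for the luckiness $w(\theta) = \theta^a (1-\theta)^a$, the integral is the Beta function $B(a+\tfrac12, a+\tfrac12)$. A rough sketch (the choice of $w$ is illustrative):

```python
import math

def log_C_exact(n, a=0.5):
    """Exact log C_n^w for Bernoulli LNML with w(t) = t^a (1-t)^a (a > 0),
    summing the weighted maximum over count classes in log-space."""
    def log_wmax(k):
        t = (k + a) / (n + 2 * a)
        return (k + a) * math.log(t) + (n - k + a) * math.log(1 - t)
    terms = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
             + log_wmax(k) for k in range(n + 1)]
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def log_C_laplace(n, a=0.5):
    """Laplace approximation: (1/2) log(n / 2 pi) + log B(a + 1/2, a + 1/2),
    since the weighted Fisher integral is int t^(a-1/2) (1-t)^(a-1/2) dt."""
    log_beta = 2 * math.lgamma(a + 0.5) - math.lgamma(2 * a + 1)
    return 0.5 * math.log(n / (2 * math.pi)) + log_beta

gap = abs(log_C_exact(2000) - log_C_laplace(2000))  # the o(1) remainder
```

The gap shrinks as $n$ grows, consistent with the $o(1)$ remainder in the expansion.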
3. Construction and Examples in Key Parametric Families
3.1 Multivariate Normal Models
For observations $x_1, \dots, x_n \in \mathbb{R}^m$, with $p_{\mu, \Sigma}$ the standard $m$-dimensional Gaussian likelihood, LNML uses a conjugate-like (normal-inverse-Wishart-type) weight $w(\mu, \Sigma)$ as luckiness, with hyperparameters that regularize the mean and lower-bound the covariance scale. This choice ensures convergence of both the maximization and the normalization integrals. The resulting LNML has a closed form: the maximizing parameters $(\hat{\mu}, \hat{\Sigma})$ are weighted MAP estimators, and the normalizing constant $C_n^w$ involves special functions (the multivariate Gamma function).
3.2 Discrete Memoryless Sources (DMS)
With categorical probabilities $\theta = (\theta_1, \dots, \theta_m)$ and luckiness $w(\theta) \propto \prod_j \theta_j^{\alpha_j}$, the LNML numerator becomes
$$\sup_{\theta} \prod_j \theta_j^{n_j + \alpha_j} = \prod_j \left( \frac{n_j + \alpha_j}{n + \sum_k \alpha_k} \right)^{n_j + \alpha_j}$$
for counts $n_j$ ($\sum_j n_j = n$), and normalization is via summing over count vectors. For $\alpha_j = 1/2$ ("Jeffreys" luckiness), the leading-order regret matches NML, with a different $O(1)$ offset.
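A small enumeration makes the construction concrete. A sketch for $m$ categories with a Dirichlet-type luckiness (the hyperparameters below are illustrative):

```python
import itertools
import math

def weighted_max(counts, alpha):
    """Closed-form LNML numerator: sup over the simplex of
    prod_j theta_j^(n_j + a_j), attained at theta_j* = (n_j+a_j)/(n+sum a)."""
    c = [n_j + a_j for n_j, a_j in zip(counts, alpha)]
    s = sum(c)
    return math.prod((cj / s) ** cj for cj in c)

def lnml_dms(n, alpha):
    """LNML probabilities for n draws from m = len(alpha) categories,
    normalizing by the sum over count vectors weighted by multinomial
    coefficients (the number of sequences sharing each count vector)."""
    m = len(alpha)
    vecs = [v for v in itertools.product(range(n + 1), repeat=m)
            if sum(v) == n]
    def multinom(v):
        return math.factorial(n) // math.prod(math.factorial(x) for x in v)
    C = sum(multinom(v) * weighted_max(v, alpha) for v in vecs)
    # probability of each *individual* sequence with count vector v
    return {v: weighted_max(v, alpha) / C for v in vecs}

probs = lnml_dms(6, alpha=(0.5, 0.5, 0.5))  # Jeffreys-type luckiness
```

Summing each per-sequence probability times its multinomial coefficient recovers a total of 1, confirming the normalization.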
3.3 Linear Regression with $\ell_2$ Luckiness
For linear regression with Gaussian errors and a ridge-like luckiness $w(\theta) \propto \exp(-\tfrac{\lambda}{2} \|\theta\|_2^2)$, LNML in the supervised predictive version (LpNML) yields not only consistent regularization but a predictive distribution that can be computed exactly as a shifted Gaussian, blending in-sample interpolation with conservative extrapolation in under-determined cases (Bibas et al., 2022).
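A numeric sketch of this predictive construction (a grid-based approximation, not the paper's closed form; the data, grid, and $\lambda$ below are illustrative): for each candidate label, a "genie" refits ridge regression on the training set plus the candidate test pair, the candidate is scored under the refit model, and the scores are normalized.

```python
import numpy as np

def lpnml_predictive(X, y, x_test, lam=1.0, sigma=1.0, y_grid=None):
    """Grid-based LpNML-style predictive density for ridge regression.

    For each candidate label yc, refit ridge on (X, y) plus (x_test, yc)
    -- the genie that sees the test label -- then score yc under the
    refit model and normalize over the candidate grid.
    """
    if y_grid is None:
        y_grid = np.linspace(-10.0, 10.0, 2001)
    d = X.shape[1]
    Xa = np.vstack([X, x_test])
    scores = []
    for yc in y_grid:
        ya = np.append(y, yc)
        # ridge = MAP under Gaussian luckiness w(theta) ~ exp(-lam ||theta||^2 / 2)
        theta = np.linalg.solve(Xa.T @ Xa + lam * np.eye(d), Xa.T @ ya)
        mu = float(x_test @ theta)
        scores.append(np.exp(-(yc - mu) ** 2 / (2 * sigma ** 2)))
    scores = np.array(scores)
    dy = y_grid[1] - y_grid[0]
    return y_grid, scores / (scores.sum() * dy)

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])
grid, dens = lpnml_predictive(X, y, np.array([2.0]))
```

Because the genie partially fits each candidate label, the density is wider than the plain ridge predictive at high-leverage test points, which is the conservative-extrapolation behavior described above.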
4. Algorithmic and Theoretical Insights
LNML density and its normalization constant typically require inner maximization and outer integration, often intractable in high dimensions. For penalized empirical risk minimization, let $g(\theta) = -\log w(\theta)$ be a penalty (negative log-luckiness); then the LNML code length is
$$L_{\mathrm{LNML}}(x^n) = \min_{\theta} \left[ -\log p_\theta(x^n) + g(\theta) \right] + \log \int \exp\!\left( -\min_{\theta} \left[ -\log p_\theta(y^n) + g(\theta) \right] \right) dy^n.$$
To address computational challenges, analytic upper bounds ("uLNML") were derived (Miyaguchi et al., 2018). Under smoothness and convexity assumptions, the uLNML bound is computable in closed form for $\ell_1$/$\ell_2$-type penalties and is uniformly close to the true LNML code length, enabling practical penalty-parameter selection (MDL-RS) in high dimensions.
5. Role of Luckiness and Incorporation of Side Information
The choice of luckiness function $w(\theta)$ (equivalently, the penalty $g(\theta) = -\log w(\theta)$) encodes prior beliefs, regularization, or auxiliary information:
- Priors or pseudo-priors: Conjugate-like weights encode beliefs or enforce lower bounds (e.g., for covariance matrices).
- Side information: Incidental data or null-hypothesis values can be incorporated by constructing $w$ to bias estimators towards plausible regions or to allow finite regret when ordinary NML diverges.
- Regularization: Gaussian ($\ell_2$) luckiness directly yields ridge-regression behavior, regularizing the hypothesis space and controlling complexity in high-capacity settings (Bibas et al., 2022).
- Statistical evidence measures: LNML enables discrimination-information measures to assess evidence for model comparison, with asymptotic calibration and robustness to multiplicity (Bickel, 2010).
6. Connections to NML, Bayesian Mixtures, and $\alpha$-NML
LNML unifies and interpolates various universal coding/prediction paradigms:
- With $w \equiv 1$, ordinary NML is recovered.
- LNML is a limiting case of $\alpha$-NML as $\alpha \to \infty$ with a prior $q(\theta)$ (Bondaschi et al., 2022). Mixture/Bayesian codes correspond to $\alpha = 1$, LNML to $\alpha \to \infty$.
- LNML (with appropriate luckiness) sits at the edge of the trade-off between mixture predictors and hard minimax NML, providing a uniform constant-regret bound and avoiding divergences present in unconstrained NML.
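The interpolation can be traced numerically for the Bernoulli model with a uniform prior $q(\theta)$ (an illustrative sketch of the $\alpha$-NML family; the grid integration is a crude approximation):

```python
import math

def alpha_nml_bernoulli(n, alpha, grid=20001):
    """alpha-NML probabilities for length-n binary sequences, uniform prior.

    Unnormalized score of a sequence with k ones:
        ( integral over t of [t^k (1-t)^(n-k)]^alpha dt )^(1/alpha).
    alpha = 1 gives the Bayes mixture (Laplace rule); as alpha grows the
    score approaches the maximized likelihood, i.e., the NML numerator.
    """
    scores = []
    for k in range(n + 1):
        s = 0.0
        for i in range(grid):  # midpoint rule on (0, 1)
            t = (i + 0.5) / grid
            s += (t ** k * (1 - t) ** (n - k)) ** alpha / grid
        scores.append(s ** (1.0 / alpha))
    C = sum(math.comb(n, k) * s for k, s in enumerate(scores))
    return [s / C for s in scores]  # per-sequence probability, by count k

p_bayes = alpha_nml_bernoulli(5, 1.0)   # Bayes mixture endpoint
p_large = alpha_nml_bernoulli(5, 50.0)  # approaching the NML endpoint
```

As $\alpha$ increases, the per-sequence probabilities move from the mixture values toward the NML values, illustrating the interpolation described above.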
7. Practical Applications and Model Selection
LNML is particularly relevant in:
- Model selection under MDL: Explicit, finite-complexity penalty even for non-compact or continuous models, augmenting MDL-based criteria (Miyaguchi, 2017; Miyaguchi et al., 2018).
- High-dimensional penalty selection: The MDL-RS method leverages analytic uLNML to select regularization parameters efficiently, outperforming cross-validation and BIC/AIC in highly-redundant/high-dimensional regimes (Miyaguchi et al., 2018).
- Prediction under distribution shift: LNML/LpNML provides bounded, calibrated regret and improved robustness over empirical risk minimization, with improved out-of-distribution characteristics (Bibas et al., 2022).
- Robust inference and multiple comparisons: LNML-based discrimination information adapts robustly when integrating diverse side information and controls error rates in high-throughput testing (Bickel, 2010).
References:
- "Normalized Maximum Likelihood with Luckiness for Multivariate Normal Distributions" (Miyaguchi, 2017)
- "Statistical inference optimized with respect to the observed sample for single or multiple comparisons" (Bickel, 2010)
- "High-dimensional Penalty Selection via Minimum Description Length Principle" (Miyaguchi et al., 2018)
- "Alpha-NML Universal Predictors" (Bondaschi et al., 2022)
- "Beyond Ridge Regression for Distribution-Free Data" (Bibas et al., 2022)