Quotient-NML (qNML): Model Selection Framework

Updated 6 April 2026
  • qNML is an information-theoretic framework that uses normalized maximum likelihood to achieve parameter- and prior-free model selection and structure learning.
  • It quantifies evidence through ratios of NML scores, providing asymptotically reliable discrimination information for hypothesis testing and multiple comparisons.
  • qNML underpins Bayesian network learning with decomposable, efficient scores that maintain minimax optimality and robust performance in high-dimensional settings.

Quotient-NML (qNML) is an information-theoretic penalized likelihood framework for statistical model comparison and structure learning, rooted in the minimax optimality of normalized maximum likelihood (NML) coding. It provides a parameter-free, prior-free, and sample-optimal approach for hypothesis testing, discrimination quantification, and model selection, particularly in contexts such as multiple comparisons and Bayesian network structure learning. qNML evaluates the strength of evidence by forming ratios of NML scores for competing models, generalizes to weighted likelihoods for cases where standard NML is undefined, and admits efficient, decomposable formulations for high-dimensional applications.

1. Theoretical Foundations

The NML density, defined for a parametric model $M = \{f(\cdot \mid \theta) : \theta \in \Theta\}$ and observed data $x^n = (x_1, \ldots, x_n)$, is given by

$$\bar f_0(x^n) = \frac{f\big(x^n \mid \hat\theta(x^n)\big)}{C_n}, \qquad C_n = \int_{\mathcal X^n} f\big(u^n \mid \hat\theta(u^n)\big)\, du^n,$$

where $\hat\theta(x^n)$ denotes the MLE for sample $x^n$ and $C_n$ is the normalization (regret) constant ensuring minimax worst-case log-loss performance (Bickel, 2010).
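For discrete families, $C_n$ is a finite sum and can be evaluated exactly. The following is a minimal sketch (illustrative code, not from the cited papers) that computes the exact multinomial $\log C_n$ via the well-known linear-time recurrence of Kontkanen and Myllymäki; function names are assumptions for exposition.

```python
from math import comb, log

def multinomial_log_regret(n: int, K: int) -> float:
    """Exact log C_n for a categorical model with K states and n samples."""
    if n == 0 or K <= 1:
        return 0.0
    # Base cases: C(n, 1) = 1 and
    # C(n, 2) = sum_k  nCk (k/n)^k ((n-k)/n)^(n-k)   (0^0 taken as 1).
    c_prev = 1.0
    c_curr = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))
    # Kontkanen-Myllymaki recurrence: C(n, K) = C(n, K-1) + n/(K-2) C(n, K-2).
    for j in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n * c_prev / (j - 2)
    return log(c_curr)

# Example: regret penalty of a 4-state multinomial with 100 observations.
print(multinomial_log_regret(100, 4))
```

For very large $n$, the $K = 2$ base case should be computed in log space to avoid floating-point underflow; the naive sum above suffices for moderate sample sizes.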

qNML quantifies the evidence in favor of one model $M_1$ over another $M_0$ by the ratio

$$\mathrm{qNML}(x^n) = \frac{\bar f_{0,1}(x^n)}{\bar f_{0,0}(x^n)},$$

where each $\bar f_{0,j}$ is the NML density for model $M_j$ with parameter space $\Theta_j$. The log of this ratio,

$$\mathrm{DI}(x^n) = \log \frac{\bar f_{0,1}(x^n)}{\bar f_{0,0}(x^n)} = \log \bar f_{0,1}(x^n) - \log \bar f_{0,0}(x^n),$$

is termed the discrimination information (DI) and represents the difference in NML code lengths between the two models. In contrast to Bayes factors, DI does not require prior specification, and its minimax optimality holds for the observed sample rather than on average over unobserved samples.
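As a hedged illustration of how DI is assembled from NML code lengths, the sketch below compares a Bernoulli model with a free parameter against a fixed point null $\theta = 1/2$, whose normalizer is trivially 1; the setup and names are assumptions chosen for exposition, not an example from the cited papers.

```python
from math import comb, log

def bernoulli_log_regret(n: int) -> float:
    """Exact log C_n for the Bernoulli model (K = 2 multinomial)."""
    return log(sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                   for k in range(n + 1)))

def discrimination_information(x: list) -> float:
    """DI = log NML(M1) - log NML(M0), in nats; positive values favor M1."""
    n, s = len(x), sum(x)
    theta = s / n
    loglik_m1 = ((s * log(theta) if s else 0.0)
                 + ((n - s) * log(1 - theta) if n - s else 0.0))
    loglik_m0 = n * log(0.5)          # point null has no parameters: C_n = 1
    return (loglik_m1 - bernoulli_log_regret(n)) - loglik_m0

print(discrimination_information([1] * 70 + [0] * 30))  # ~5.7 nats for M1
```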

2. Key Properties and Interpretability

qNML inherits several desirable theoretical guarantees:

  • Minimax observed-sample optimality: Each NML component achieves minimax regret for the observed data (Bickel, 2010).
  • Asymptotic reliability: For any fixed threshold $d > 0$, the probability of misleading evidence—DI favoring the incorrect model by at least $d$—decays exponentially as sample size increases (a Monte Carlo sketch after this list illustrates the decay):

$$P\big(\mathrm{DI}(X^n) \ge d \mid M_0\big) \to 0 \quad \text{exponentially as } n \to \infty.$$

  • Prior-free operation: No prior distributions are needed for nuisance or interest parameters, in contrast to procedures like the Bayes factor.
  • Strong evidence calibration: DI can favor a simple null hypothesis, behaves predictably under increasing sample size, and satisfies vanishing misleading evidence criteria.
  • Score equivalence (in structure learning): In Bayesian network applications, qNML assigns identical scores to Markov equivalent DAGs, crucial for search algorithms working over equivalence classes (Silander et al., 2024).
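The following Monte Carlo sketch (an assumed setup, not an experiment from the cited papers) illustrates the asymptotic-reliability property: data are drawn from the null Bernoulli(0.5) model, and the estimated rate at which DI misleadingly favors the free-parameter alternative by at least $d$ shrinks as $n$ grows.

```python
import random
from math import comb, log

def bernoulli_log_regret(n: int) -> float:
    """Exact log C_n for the Bernoulli model."""
    return log(sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                   for k in range(n + 1)))

random.seed(0)
d, reps = 1.0, 2000
for n in (25, 100, 400):
    reg = bernoulli_log_regret(n)            # precompute once per n
    misleading = 0
    for _ in range(reps):
        s = sum(random.random() < 0.5 for _ in range(n))
        theta = s / n
        loglik_m1 = ((s * log(theta) if s else 0.0)
                     + ((n - s) * log(1 - theta) if n - s else 0.0))
        di = loglik_m1 - reg - n * log(0.5)  # DI against the true null M0
        misleading += di >= d
    print(f"n={n}: estimated P(DI >= {d}) = {misleading / reps:.3f}")
```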

3. Weighted Quotient-NML and Extensions

When the standard NML is undefined or inapplicable (such as for sufficient statistics or conditional models), qNML can be generalized using weighted likelihoods of the form $f_w(x^n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)^{w_i}$ with nonnegative weights $w_i$, leading to the normalized maximum weighted likelihood (NMWL)

$$\bar f_w(x^n) = \frac{f_w\big(x^n \mid \hat\theta_w(x^n)\big)}{C_{n,w}}, \qquad C_{n,w} = \int_{\mathcal X^n} f_w\big(u^n \mid \hat\theta_w(u^n)\big)\, du^n,$$

where $\hat\theta_w(x^n)$ maximizes the weighted likelihood.

A weighted-qNML ratio and its log-DI then result by analogy, extending the applicability of DI to a broad class of models and settings (Bickel, 2010).
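As a brute-force illustration only, the sketch below computes an NMWL log-density for a tiny weighted-Bernoulli sample by enumerating all binary sequences of the same length; the weighted-likelihood form and all names are assumptions matching the generic construction above, and the enumeration is intractable beyond toy sizes.

```python
from itertools import product
from math import exp, log

def weighted_mle(x, w):
    """Weighted Bernoulli MLE: sum(w_i x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def max_weighted_likelihood(x, w):
    """Weighted likelihood prod_i f(x_i | theta)^{w_i} at its maximizer."""
    theta = weighted_mle(x, w)
    if theta in (0.0, 1.0):               # degenerate MLE: likelihood is 1
        return 1.0
    return exp(sum(wi * (log(theta) if xi else log(1.0 - theta))
                   for wi, xi in zip(w, x)))

def nmwl_log_density(x, w):
    """log f_w-bar(x): max weighted log-likelihood minus log C_{n,w}."""
    c_nw = sum(max_weighted_likelihood(u, w)
               for u in product((0, 1), repeat=len(x)))
    return log(max_weighted_likelihood(x, w)) - log(c_nw)

x = (1, 1, 0, 1, 0, 1)
w = (1.0, 0.5, 1.0, 0.5, 1.0, 1.0)        # illustrative weights
print(nmwl_log_density(x, w))
```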

Empirical studies, such as the eight-school SAT comparison and protein-feature analyses in proteomics, demonstrate the robustness of DI to the choice of weights, especially when sample sizes are moderate or large.

4. Practical Computation and Approximations

For low-dimensional or discrete models, the normalizing constant $C_n$ can be computed exactly. In higher dimensions or with continuous parameters, a Laplace approximation yields

$$\log C_n = \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta + o(1),$$

with $k$ the number of free parameters and $I(\theta)$ the Fisher information matrix. The code length consequently approximates to

$$-\log \bar f_0(x^n) \approx -\log f\big(x^n \mid \hat\theta(x^n)\big) + \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta.$$
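A minimal sketch of this approximation for the $K$-category multinomial, where $k = K - 1$ and the Fisher-information integral has the closed form $\pi^{K/2} / \Gamma(K/2)$ (a standard result for this family), compared against the exact regret; names are illustrative, and the base case is computed in log space for numerical stability.

```python
from math import exp, lgamma, log, pi

def exact_log_regret(n: int, K: int) -> float:
    """Exact multinomial log C_n via the Kontkanen-Myllymaki recurrence."""
    def log_term(k):
        lt = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        if 0 < k < n:
            lt += k * log(k / n) + (n - k) * log((n - k) / n)
        return lt
    c_prev = 1.0
    c_curr = sum(exp(log_term(k)) for k in range(n + 1))
    for j in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n * c_prev / (j - 2)
    return log(c_curr)

def laplace_log_regret(n: int, K: int) -> float:
    """(k/2) log(n / 2 pi) + log of the Fisher-information integral."""
    k = K - 1                                    # free parameters
    log_fisher_integral = (K / 2.0) * log(pi) - lgamma(K / 2.0)
    return (k / 2.0) * log(n / (2.0 * pi)) + log_fisher_integral

for n in (50, 500, 5000):
    print(n, round(exact_log_regret(n, 4), 3), round(laplace_log_regret(n, 4), 3))
```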

In Bayesian network learning, the Szpankowski–Weinberger closed-form approximation provides an efficient and numerically accurate surrogate for NML regret terms:

$$\mathrm{reg}(n, K) \approx n\left(\log \alpha + (\alpha + 2) \log C_\alpha - \frac{1}{C_\alpha}\right) - \frac{1}{2} \log\left(C_\alpha + \frac{2}{\alpha}\right),$$

with $\alpha = K/n$ and $C_\alpha = \tfrac{1}{2} + \tfrac{1}{2}\sqrt{1 + 4/\alpha}$, enabling constant-time evaluation even in large models (Silander et al., 2024).
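The approximation is direct to implement; the hedged sketch below codes the closed form quoted above and checks it against the exact multinomial regret on small cases (function names are illustrative).

```python
from math import exp, lgamma, log, sqrt

def exact_log_regret(n: int, K: int) -> float:
    """Exact multinomial log-regret; base case in log space for stability."""
    def log_term(k):
        lt = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
        if 0 < k < n:
            lt += k * log(k / n) + (n - k) * log((n - k) / n)
        return lt
    c_prev, c_curr = 1.0, sum(exp(log_term(k)) for k in range(n + 1))
    for j in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n * c_prev / (j - 2)
    return log(c_curr)

def sw_log_regret(n: int, K: int) -> float:
    """Szpankowski-Weinberger closed-form surrogate for the log-regret."""
    alpha = K / n
    c = 0.5 + 0.5 * sqrt(1.0 + 4.0 / alpha)
    return (n * (log(alpha) + (alpha + 2.0) * log(c) - 1.0 / c)
            - 0.5 * log(c + 2.0 / alpha))

for n, K in ((100, 4), (1000, 8), (200, 64)):
    print(n, K, round(exact_log_regret(n, K), 2), round(sw_log_regret(n, K), 2))
```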

5. Applications in Model Comparison and Structure Learning

qNML provides a general framework for rigorous model comparison, hypothesis testing, and network structure selection:

  • Multiple hypothesis testing: The DI statistic offers a calibrated and robust measure of evidence strength across multiple comparisons, with empirical results indicating little need for further multiplicity adjustments when sample sizes are moderate (Bickel, 2010).
  • Bayesian network structure learning: qNML defines the score for a network $G$ over variables $X_1, \ldots, X_m$ on data $D$ of size $n$ as the sum over nodes:

$$s^{\mathrm{qNML}}(G; D) = \sum_{i=1}^{m} \log \frac{\bar f_{\mathrm{NML}}\big(D_{\{i\} \cup \mathrm{pa}_i}\big)}{\bar f_{\mathrm{NML}}\big(D_{\mathrm{pa}_i}\big)},$$

where $D_S$ is the data for the variables in $S$ treated as a single categorical variable, $\mathrm{pa}_i$ denotes the parents of node $i$, $r_i$ the number of its states, $q_i$ the number of parental configurations, and the regret difference $\mathrm{reg}(n, r_i q_i) - \mathrm{reg}(n, q_i)$ serves as a universal penalty. This score is decomposable, hyperparameter-free, and consistent; a sketch of the local score appears below.
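As a hedged sketch of the local score (an illustrative implementation, not the authors' code), the family and parent columns are collapsed into single categorical variables and scored with one-dimensional NML code lengths, with regret supplied by the Szpankowski–Weinberger form from Section 4; the data layout and names are assumptions.

```python
from collections import Counter
from math import log, sqrt

def sw_log_regret(n: int, K: int) -> float:
    """Szpankowski-Weinberger surrogate for the multinomial log-regret."""
    if n == 0 or K <= 1:
        return 0.0
    alpha = K / n
    c = 0.5 + 0.5 * sqrt(1.0 + 4.0 / alpha)
    return (n * (log(alpha) + (alpha + 2.0) * log(c) - 1.0 / c)
            - 0.5 * log(c + 2.0 / alpha))

def log_nml_1d(rows, K):
    """One-dimensional NML log-score: max log-likelihood minus log-regret."""
    n = len(rows)
    loglik = sum(c * log(c / n) for c in Counter(rows).values())
    return loglik - sw_log_regret(n, K)

def qnml_local_score(data, node, parents, arities):
    """qNML local score: log NML(D_{node+parents}) - log NML(D_parents)."""
    fam = [tuple(row[v] for v in (node, *parents)) for row in data]
    pa = [tuple(row[v] for v in parents) for row in data]
    q = 1
    for v in parents:
        q *= arities[v]
    return log_nml_1d(fam, arities[node] * q) - log_nml_1d(pa, q)

# Toy data: B mostly follows A, so A should be a helpful parent for B.
data = [{"A": a, "B": (a + b) % 2} for a in (0, 1) for b in (0, 0, 0, 1)] * 5
arities = {"A": 2, "B": 2}
print(qnml_local_score(data, "B", ["A"], arities))   # with parent A
print(qnml_local_score(data, "B", [], arities))      # no parents
```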

Empirical benchmarks demonstrate that qNML achieves low structural Hamming distance (SHD) to ground truth, robust predictive accuracy, and often yields the most parsimonious networks among compared methods (notably BIC, BDeu, and factorized NML), with minimal tuning or computational overhead (Silander et al., 2024).

6. Implementation Guidelines and Empirical Performance

Implementation of qNML in network learning workflows involves:

  • Calculating multinomial MLE-based log-likelihoods for each variable conditioned on its parent set.
  • Evaluating regret term differences using the Szpankowski–Weinberger approximation.
  • Aggregating decomposable local qNML scores for global model selection.

Due to its node-wise decomposability and lack of adjustable hyperparameters, qNML integrates seamlessly into existing BN structure-search algorithms (greedy search, dynamic programming, etc.), matching the asymptotic running time of BIC and factorized NML.
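As a toy usage sketch under the same assumptions (it expects a local-score function such as `qnml_local_score` from the previous sketch), a decomposable score reduces per-node parent-set selection to an enumeration that any acyclicity-aware search can build on; this is an illustrative workflow, not a reference implementation.

```python
from itertools import combinations

def best_parent_set(data, node, candidates, arities, local_score, max_size=2):
    """Exhaustively score all parent sets up to max_size for one node."""
    best = (local_score(data, node, [], arities), ())
    for k in range(1, max_size + 1):
        for pa in combinations(candidates, k):
            best = max(best, (local_score(data, node, list(pa), arities), pa))
    return best   # (score, parent tuple) for this node

# e.g. best_parent_set(data, "B", ["A"], arities, qnml_local_score)
```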

Empirical results indicate that qNML:

  • Excels in model parsimony relative to fNML and BDeu.
  • Maintains predictive log-likelihood close to or surpassing competing criteria at moderate to large sample sizes.
  • Produces stable and interpretable network structures with the lowest performance variance across varying data sizes.

7. Context, Significance, and Recommendations

qNML advances information-theoretic model selection by combining the minimax foundation of NML with operational tractability and statistical resilience. Its prior-free, hyperparameter-free nature distinguishes it from Bayesian model selection techniques. In complex inference settings—such as high-dimensional multiple comparisons or Bayesian network learning—qNML provides tuning-free, optimally calibrated, and interpretable model selection, with strong asymptotic guarantees and robust empirical performance (Bickel, 2010, Silander et al., 2024).

A plausible implication is that, for practitioners concerned with multiple comparisons or network modeling, qNML offers a principled criterion with desirable theoretical and computational properties; moreover, its penalty converges to the classic BIC penalty asymptotically, ensuring consistency and efficiency in large-sample regimes.
