Bayesian Model Comparison

Updated 17 December 2025

Bayesian model comparison is a framework that evaluates competing probabilistic models by computing marginal likelihoods and Bayes factors, balancing model fit against complexity.
The Hyvärinen score is a proper scoring rule that remains invariant to prior scaling, making it useful when dealing with improper or vague priors and complex models.
Advanced computational techniques such as sequential Monte Carlo methods enable robust and efficient model comparisons even in high-dimensional or latent variable settings.

Bayesian model comparison is the framework within Bayesian statistics for quantifying the relative plausibility of competing probabilistic models given observed data. The cornerstone of this framework is the calculation of marginal likelihoods (or model evidences) and their ratios—Bayes factors—which balance model fit and complexity in a principled manner. Modern research has advanced both the theoretical foundation and computational methodology for Bayesian model comparison, spanning improper-prior settings, high-dimensional and hierarchical models, simulation-based inference, and approaches beyond the classical Bayes factor.

1. Foundations: Marginal Likelihoods, Bayes Factors, and Interpretability

The marginal likelihood, or model evidence, for a statistical model $M$ with parameter vector $\theta$ and data $y$ is defined as

$p(y\mid M) = \int p(y\mid\theta,M)\,p(\theta\mid M)\,d\theta,$

where $p(\theta\mid M)$ is the prior and $p(y\mid\theta,M)$ the likelihood. For two models $M_1$ and $M_2$, their relative support is quantified by the Bayes factor $\mathrm{BF}_{12} = \frac{p(y\mid M_1)}{p(y\mid M_2)}.$ The logarithm of the Bayes factor admits several interpretations, e.g., as a difference of accumulated out-of-sample log predictive scores, and its asymptotic behavior under repeated sampling can be linked to Kullback–Leibler divergences between the true data-generating process $p_\star$ and each model. Specifically, under regularity conditions and with $T$ observations, $\frac{1}{T}\log\mathrm{BF}_{12}$ converges almost surely to $\mathrm{KL}(p_\star\|M_2) - \mathrm{KL}(p_\star\|M_1)$, providing a frequentist justification for Bayesian model comparison in the so-called M-open setting, where neither model need contain $p_\star$ (Shao et al., 2017).
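As a minimal illustration of these definitions, the sketch below computes a log Bayes factor in closed form for a Bernoulli sequence. The two models (a Beta(1, 1) prior on the success probability versus a fixed probability of 0.5), the data counts, and the function names are assumptions chosen for this example and do not come from the source.

```python
from math import lgamma, log, exp

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_beta_bernoulli(k, n, a, b):
    """Log marginal likelihood of one Bernoulli sequence with k successes
    in n trials, integrating the success probability over a Beta(a, b) prior."""
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def log_evidence_fixed(k, n, p):
    """Log likelihood of the same sequence when p is fixed (no free parameter)."""
    return k * log(p) + (n - k) * log(1.0 - p)

# Hypothetical data: 14 successes in 20 trials.
k, n = 14, 20
log_ev_1 = log_evidence_beta_bernoulli(k, n, 1.0, 1.0)  # M1: p ~ Beta(1, 1)
log_ev_2 = log_evidence_fixed(k, n, 0.5)                # M2: p = 0.5
log_bf_12 = log_ev_1 - log_ev_2
print(f"log BF_12 = {log_bf_12:.3f}, BF_12 = {exp(log_bf_12):.3f}")
```

Here the integral over the parameter is available analytically; in most realistic models it has to be approximated numerically, which is where the computational methods discussed later come in.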

A fundamental issue arises when marginal likelihoods are computed under vague or improper priors: the evidence is then defined only up to an arbitrary multiplicative constant (an additive constant on the log scale), and the Bayes factor inherits this arbitrariness, which undermines its operational meaning (Bartlett’s paradox). This motivates alternative scoring-based approaches or specialized workflows that restore interpretability.
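A short worked equation makes the arbitrariness explicit. Assume, purely for illustration, that each model $M_j$ uses a flat improper prior $p(\theta_j\mid M_j) = c_j$ with an arbitrary constant $c_j > 0$ (the constants $c_1, c_2$ are notation introduced here, not in the source). Then

$p(y\mid M_j) = c_j \int p(y\mid\theta_j,M_j)\,d\theta_j, \qquad \log\mathrm{BF}_{12} = \log\frac{c_1}{c_2} + \log\frac{\int p(y\mid\theta_1,M_1)\,d\theta_1}{\int p(y\mid\theta_2,M_2)\,d\theta_2}.$

Since no choice of $c_1/c_2$ is privileged, the first term can shift $\log\mathrm{BF}_{12}$ to any value, regardless of the data.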

2. Proper Scoring Rules and the Hyvärinen Score

Proper scoring rules generalize the log-score ($-\log p(y)$) used in standard Bayesian model selection. The Hyvärinen score, defined for a twice-differentiable density $p_\theta(x)$ as

$S_H(p_{\theta},x) = 2\,\Delta_x\log p_{\theta}(x)+\|\nabla_x\log p_{\theta}(x)\|^2,$

is homogeneous of order 0, i.e., invariant to the normalization of $p_\theta(x)$, because it depends on the density only through derivatives of its logarithm. This makes it well-suited for situations with improper or vague priors. With this sign convention the score acts as a loss: smaller values indicate better predictions.
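The invariance to normalization can be checked numerically. The sketch below uses the convention $2\,\Delta_x\log p + \|\nabla_x\log p\|^2$ (the scaling used in Shao et al., 2017) in one dimension, with derivatives approximated by central finite differences. The Gaussian example, the constant 1000, and all function names are assumptions made for this illustration.

```python
import numpy as np

def hyvarinen_score_1d(log_density, x, h=1e-3):
    """2 * (log p)''(x) + ((log p)'(x))**2, using central finite differences.
    Only derivatives of log p appear, so additive constants in log p drop out."""
    f_minus, f_0, f_plus = log_density(x - h), log_density(x), log_density(x + h)
    grad = (f_plus - f_minus) / (2.0 * h)
    lap = (f_plus - 2.0 * f_0 + f_minus) / h**2
    return 2.0 * lap + grad**2

mu, sigma = 1.0, 2.0  # hypothetical Gaussian parameters

def log_normalized(x):
    """Log density of N(mu, sigma^2), properly normalized."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def log_unnormalized(x):
    """Same shape, but multiplied by an arbitrary constant (here 1000)."""
    return np.log(1000.0) - (x - mu) ** 2 / (2.0 * sigma**2)

x = 0.3
s_norm = hyvarinen_score_1d(log_normalized, x)
s_unnorm = hyvarinen_score_1d(log_unnormalized, x)
# Closed form for a Gaussian: -2 / sigma^2 + (x - mu)^2 / sigma^4
s_exact = -2.0 / sigma**2 + (x - mu) ** 2 / sigma**4
print(s_norm, s_unnorm, s_exact)
```

The two finite-difference values agree with each other and with the closed form up to numerical error, whereas the log-score $-\log p(x)$ of the two functions would differ by the logarithm of the ratio of their normalizing constants.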

Prequentially, for sequence data $y_{1:T}$, the accumulated Hyvärinen score for a model $M$ is given by

$\mathcal{H}_T(M) = \sum_{t=1}^T S_H\!\bigl(p_M(\cdot\mid y_{1:t-1}),\,y_t\bigr),$

and models are compared via $\mathcal{H}_T(M_2) - \mathcal{H}_T(M_1)$, termed the “H-factor.” In the large-sample limit, this difference, scaled by $1/T$, converges to the difference in Fisher (score) divergences between the data-generating process and each of the two models, and consistency results are available in both i.i.d. and state-space settings (Shao et al., 2017). The Hyvärinen score thus retains operational interpretability even when the Bayes factor does not, and can be computed empirically using sequential Monte Carlo (SMC or SMC$^2$) estimators of the relevant score moments in both tractable and intractable likelihood models.
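The prequential accumulation can be sketched for a case where every one-step predictive density is Gaussian, so each increment has the closed form $-2/v + (y-m)^2/v^2$ for a predictive $N(m, v)$ under the convention $2\,\Delta\log p + \|\nabla\log p\|^2$. The two models below (unknown mean with a conjugate normal prior versus a mean fixed at zero), the simulated data, and the function names are assumptions for illustration only.

```python
import numpy as np

def gaussian_hyvarinen(y, mean, var):
    """Hyvarinen score of a N(mean, var) predictive density evaluated at y."""
    return -2.0 / var + (y - mean) ** 2 / var**2

def accumulated_score_unknown_mean(y, prior_var):
    """M1: y_t ~ N(theta, 1) with theta ~ N(0, prior_var).
    The one-step predictive is N(post_mean, 1 + post_var)."""
    post_mean, post_var, total = 0.0, prior_var, 0.0
    for y_t in y:
        total += gaussian_hyvarinen(y_t, post_mean, 1.0 + post_var)
        gain = post_var / (1.0 + post_var)      # conjugate normal update
        post_mean += gain * (y_t - post_mean)
        post_var *= 1.0 - gain
    return total

def accumulated_score_fixed_mean(y):
    """M2: y_t ~ N(0, 1) with no free parameters."""
    return float(np.sum(gaussian_hyvarinen(y, 0.0, 1.0)))

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=500)   # simulated data, true mean 1

h_1 = accumulated_score_unknown_mean(y, prior_var=100.0)
h_2 = accumulated_score_fixed_mean(y)
h_factor = h_2 - h_1                            # positive values favor M1
print(f"H_T(M1) = {h_1:.1f}, H_T(M2) = {h_2:.1f}, H-factor = {h_factor:.1f}")
```

Because the simulated data have mean 1, the fixed-mean model is misspecified and accumulates a larger score, so the H-factor grows roughly linearly in $T$ in favor of $M_1$.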

3. Theoretical Properties and Asymptotics

Both the Bayes factor and the Hyvärinen-based H-factor select the model closest in divergence (KL for the Bayes factor; Fisher divergence for the Hyvärinen score) to the true data law in the M-open world. For non-nested models, the large-sample behavior is governed by these divergence differences. For nested models, the Hyvärinen score introduces a log-sample-size penalty analogous to the BIC penalty in the log-Bayes factor,

$\mathcal{H}_T(M_2)-\mathcal{H}_T(M_1) = \delta_{21}\log T + o_p(\log T),$

where $\delta_{21}$ is the difference in model dimensionality (Shao et al., 2017).

The Hyvärinen score can also be constructed for discrete data by replacing derivatives with finite differences, which preserves its locality, homogeneity, and propriety. This enables its use in models with discrete observations, such as count data driven by latent diffusions, and in other settings relevant to population dynamics and stochastic process modeling.

4. Practical Algorithms and Computation

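Calculation of the Hyvärinen score for each observation entails evaluating derivatives of the log predictive density with respect to the observation. Under regularity conditions that permit differentiation under the integral sign, these derivatives can be expressed through posterior expectations:

$S_H\bigl(p(\cdot\mid y_{1:t-1}),y_t\bigr) = \sum_{k=1}^{d_y} \left\{ 2\, \mathbb{E}_{t}\left[\partial^2_k\log p(y_t\mid\theta) + \bigl(\partial_k\log p(y_t\mid\theta)\bigr)^2\right] - \left(\mathbb{E}_{t}[\partial_k\log p(y_t\mid\theta)]\right)^2 \right\},$

where $\partial_k$ denotes differentiation with respect to the $k$-th component of $y_t$ and $\mathbb{E}_{t}$ denotes expectation under the posterior $p(\theta\mid y_{1:t})$. The posterior includes $y_t$ because differentiating the predictive density $\int p(y_t\mid\theta)\,p(\theta\mid y_{1:t-1})\,d\theta$ and dividing by it reweights $p(\theta\mid y_{1:t-1})$ by the likelihood of $y_t$. For nonlinear non-Gaussian state-space models, the required derivatives can be written as expectations over both parameters and latent states, amenable to SMC$^2$ estimators (Shao et al., 2017).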
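A numerical check of the posterior-expectation form can be sketched in a conjugate setting where the answer is known exactly. The identity being checked is $2\,\mathbb{E}[\partial^2\log p(y_t\mid\theta) + (\partial\log p(y_t\mid\theta))^2] - (\mathbb{E}[\partial\log p(y_t\mid\theta)])^2$, with derivatives taken in $y_t$ and the expectation under $p(\theta\mid y_{1:t})$. The model ($y_t\sim N(\theta,1)$ with a $N(0,4)$ prior), the simulated data, and the variable names are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
prior_var = 4.0                          # theta ~ N(0, prior_var), y | theta ~ N(theta, 1)
y_past = rng.normal(0.5, 1.0, size=10)   # hypothetical past observations y_{1:t-1}
y_t = 1.3                                # hypothetical new observation
n = len(y_past)

# Exact route: the predictive p(y_t | y_{1:t-1}) is N(mean_prev, 1 + var_prev).
var_prev = 1.0 / (1.0 / prior_var + n)
mean_prev = var_prev * y_past.sum()
s = 1.0 + var_prev
exact = -2.0 / s + (y_t - mean_prev) ** 2 / s**2

# Monte Carlo route: draw theta from the posterior given y_{1:t} (y_t included).
var_t = 1.0 / (1.0 / prior_var + n + 1)
mean_t = var_t * (y_past.sum() + y_t)
theta = rng.normal(mean_t, np.sqrt(var_t), size=200_000)

d1 = theta - y_t             # d/dy log N(y; theta, 1) evaluated at y_t
d2 = -np.ones_like(theta)    # d^2/dy^2 log N(y; theta, 1)
mc = 2.0 * np.mean(d2 + d1**2) - np.mean(d1) ** 2

print(f"exact = {exact:.4f}, Monte Carlo = {mc:.4f}")
```

In this example the two values agree up to Monte Carlo error. In models without conjugacy, the exact route is unavailable and the posterior draws would instead come from an SMC or MCMC sampler.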

Algorithmic workflow (for parametric models):

Initialize parameter particles and weights.
For each time $t$, update the weights with the incremental likelihood of $y_t$, resample when the weights degenerate, and apply MCMC steps to rejuvenate the particles.
At each time, estimate the required partial derivatives and accumulate the Hyvärinen score increment.
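The three steps above can be sketched for a toy static-parameter model. This is a simplified reweight, resample, and move scheme written for illustration, not the authors' implementation and not SMC$^2$. The model ($y_t\sim N(\theta,1)$, $\theta\sim N(0,4)$), the effective-sample-size threshold, the random-walk proposal scale, and all function names are assumptions.

```python
import numpy as np

def log_post(theta, y_seen, prior_var):
    """Unnormalized log posterior for y_i ~ N(theta, 1), theta ~ N(0, prior_var)."""
    n, s1 = len(y_seen), np.sum(y_seen)
    return -0.5 * theta**2 / prior_var - 0.5 * (n * theta**2 - 2.0 * theta * s1)

def mh_move(theta, y_seen, prior_var, rng, n_steps=3):
    """Random-walk Metropolis steps applied independently to every particle."""
    step = 2.38 * np.std(theta)
    for _ in range(n_steps):
        prop = theta + step * rng.normal(size=theta.shape)
        log_ratio = log_post(prop, y_seen, prior_var) - log_post(theta, y_seen, prior_var)
        accept = np.log(rng.uniform(size=theta.shape)) < log_ratio
        theta = np.where(accept, prop, theta)
    return theta

def smc_hyvarinen(y, prior_var=4.0, n_particles=2000, seed=0):
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: initialize particles from the prior with equal weights.
    theta = rng.normal(0.0, np.sqrt(prior_var), size=n_particles)
    logw = np.zeros(n_particles)
    total = 0.0
    for t, y_t in enumerate(y, start=1):
        # Step 2a: reweight by the likelihood of y_t (constants cancel on
        # normalization); the weights now target p(theta | y_{1:t}).
        logw += -0.5 * (y_t - theta) ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Step 3: score increment from weighted moments of the y-derivatives.
        d1 = theta - y_t                      # first derivative; second is -1
        total += 2.0 * np.sum(w * (d1**2 - 1.0)) - np.sum(w * d1) ** 2
        # Step 2b: resample and move when the effective sample size is low.
        if 1.0 / np.sum(w**2) < n_particles / 2:
            theta = theta[rng.choice(n_particles, size=n_particles, p=w)]
            logw = np.zeros(n_particles)
            theta = mh_move(theta, y[:t], prior_var, rng)
    return total

def exact_hyvarinen(y, prior_var=4.0):
    """Closed-form accumulated score for the same conjugate model (for checking)."""
    m, v, total = 0.0, prior_var, 0.0
    for y_t in y:
        s = 1.0 + v
        total += -2.0 / s + (y_t - m) ** 2 / s**2
        m, v = m + (v / s) * (y_t - m), v / s
    return total

y = np.random.default_rng(42).normal(0.5, 1.0, size=50)
h_smc, h_exact = smc_hyvarinen(y), exact_hyvarinen(y)
print(f"SMC estimate = {h_smc:.2f}, exact = {h_exact:.2f}")
```

The closed-form comparison is only possible because the toy model is conjugate; it serves as a sanity check that the particle estimate tracks the exact accumulated score.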

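For hierarchical or latent variable models with intractable likelihoods, the Hyvärinen score can be estimated by nesting particle filters within the SMC scheme over parameters, provided the latent process can be simulated and the observation density can be differentiated with respect to the observation.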

5. Robustness to Priors and Empirical Examples

The main operational advantage of the Hyvärinen score is its invariance to the normalization of $p(\theta)$, which ensures well-posed model comparison even when improper priors are used. This property was illustrated in numerical studies comparing (i) Lévy-driven stochastic volatility models with intractable transition densities and (ii) population diffusion models for red kangaroo counts with discrete count observations and vague priors. In both cases, the Hyvärinen criterion selected the true or simplest model and was stable to prior scaling, whereas the Bayes factor’s numerical value could be arbitrarily shifted by changes to the prior normalizing constant (Shao et al., 2017).
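The contrast between the two criteria under prior scaling can be reproduced in a toy conjugate model. The sketch below is not one of the source's experiments: the model ($y_t\sim N(\theta,1)$, $\theta\sim N(0,\tau^2)$), the simulated data, and the chosen values of $\tau^2$ are assumptions. It accumulates the log evidence and the Hyvärinen score (convention $2\,\Delta\log p + \|\nabla\log p\|^2$) from the same Gaussian one-step predictives.

```python
import numpy as np

def log_evidence_and_hscore(y, prior_var):
    """Accumulate the log evidence and the Hyvarinen score prequentially for
    y_t ~ N(theta, 1), theta ~ N(0, prior_var), using Gaussian predictives."""
    post_mean, post_var = 0.0, prior_var
    log_ev, h = 0.0, 0.0
    for y_t in y:
        s = 1.0 + post_var                    # predictive variance
        resid = y_t - post_mean
        log_ev += -0.5 * np.log(2.0 * np.pi * s) - 0.5 * resid**2 / s
        h += -2.0 / s + resid**2 / s**2
        gain = post_var / s                   # conjugate normal update
        post_mean += gain * resid
        post_var *= 1.0 - gain
    return log_ev, h

rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.0, size=200)            # simulated data

results = {tau2: log_evidence_and_hscore(y, tau2) for tau2 in (1e2, 1e4, 1e6)}
for tau2, (log_ev, h) in results.items():
    print(f"prior variance {tau2:>9.0e}: log evidence = {log_ev:9.3f}, H-score = {h:9.3f}")
```

Each hundredfold increase in the prior variance lowers the log evidence by roughly $\tfrac12\log 100 \approx 2.3$, without bound as the prior becomes vaguer, while the accumulated Hyvärinen score changes only marginally.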

6. Comparison with Classical Evidence-Based Criteria

While Bayes factors remain preferred when priors are proper and sensitivity to substantively meaningful priors is desired, the H-factor and similar proper scoring-rule criteria present a pragmatic alternative when (i) uninformative or diffuse priors are unavoidable, (ii) marginal likelihoods are intractable or not available in closed form, or (iii) nested models with subtle parameter redundancy are compared. The Hyvärinen score offers consistency, robustness to prior scaling, and computability via SMC methods, which makes it a practical option when the traditional evidence framework breaks down.

7. Summary and Outlook

Bayesian model comparison underpins model selection in modern statistical science, with the Bayes factor as its canonical tool. However, improper priors, high dimensionality, latent variable models, and simulation-based scenarios necessitate principled alternatives. The Hyvärinen score provides a robust, consistent, and computationally efficient approach based on homogeneous proper scoring rules, circumventing the limitations imposed by prior normalization. Empirical and theoretical results confirm its suitability for a wide range of parametric and nonparametric modeling contexts, especially in doubly-intractable or improper-prior scenarios, such as those encountered in population dynamics and stochastic volatility modeling (Shao et al., 2017).

References (1)

Shao et al. (2017). Bayesian model comparison with the Hyvärinen score: computation and consistency.
