Bayes Factors in Model Comparison
- Bayes factors are defined as the ratio of marginal likelihoods, integrating data fit and prior structure to compare competing models.
- They are computed using methods like Laplace approximation, power posterior techniques, bridge sampling, and the Savage–Dickey density ratio.
- Bayes factors offer nuanced evidence quantification, bridging Bayesian and frequentist frameworks and informing robust decision theory.
A Bayes factor (BF) is a central statistical quantity in Bayesian model comparison and hypothesis testing. It quantifies the relative predictive adequacy of two competing models or hypotheses, integrating both fit to the observed data and model complexity as encoded by the parameter space and its prior. Bayes factors are foundational in modern statistical inference, evidence synthesis, and model selection across disciplines.
1. Mathematical Definition and Fundamental Properties
Let $M_1$ and $M_2$ denote two rival hypotheses or models for observed data $y$. The Bayes factor in favor of $M_1$ over $M_2$ is

$$BF_{12} = \frac{p(y \mid M_1)}{p(y \mid M_2)},$$

where $p(y \mid M_i)$ is the marginal likelihood or evidence under $M_i$, integrating the likelihood over the prior for all model parameters:

$$p(y \mid M_i) = \int p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.$$

A $BF_{12} > 1$ indicates evidence for $M_1$, and $BF_{12} < 1$ supports $M_2$ (Isi et al., 2022, Mulder et al., 27 Nov 2025, Schad et al., 2021). The Bayes factor multiplies prior odds to yield posterior odds:

$$\frac{P(M_1 \mid y)}{P(M_2 \mid y)} = BF_{12} \times \frac{P(M_1)}{P(M_2)}.$$

Interpretation guidelines (Kass & Raftery): $1 < BF \le 3$ (“anecdotal”), $3 < BF \le 20$ (“positive”), $20 < BF \le 150$ (“strong”), $BF > 150$ (“very strong”) (Mulder et al., 27 Nov 2025).
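As a concrete numeric illustration of these definitions (a toy example, not drawn from the cited papers), the following sketch compares a binomial point null against a uniform Beta(1, 1) alternative, a rare case where both marginal likelihoods have closed forms:

```python
from math import comb, lgamma, exp

def marginal_h0(y, n, theta0=0.5):
    """Marginal likelihood under a point null: the binomial pmf at theta0."""
    return comb(n, y) * theta0 ** y * (1 - theta0) ** (n - y)

def marginal_h1(y, n, a=1.0, b=1.0):
    """Marginal likelihood under a Beta(a, b) prior: beta-binomial evidence,
    integrating the binomial likelihood over the prior on theta."""
    logB = lambda p, q: lgamma(p) + lgamma(q) - lgamma(p + q)
    return comb(n, y) * exp(logB(y + a, n - y + b) - logB(a, b))

def bayes_factor_01(y, n):
    """BF_01 = evidence ratio of the point null to the uniform alternative."""
    return marginal_h0(y, n) / marginal_h1(y, n)

# 60 successes in 100 trials: BF near 1, so the data barely discriminate
bf01 = bayes_factor_01(60, 100)
```

With a uniform prior the alternative's evidence collapses to $1/(n+1)$, independent of $y$, which makes the Occam trade-off explicit: the point null concentrates its predictions, the diffuse alternative spreads them.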
2. Computational Methods and Practical Estimation
Evaluating marginal likelihoods for complex models is challenging, particularly as model dimension increases. Several practical computational strategies include:
- Analytic/Laplace Approximation: For models where the likelihood near the maximum is approximately Gaussian and parameter priors are uniform over finite ranges, marginal likelihoods can be evaluated in closed form using the model's maximum-likelihood fit, covariance matrix, and prior widths. The resulting Bayes factor includes an “Occam penalty” via posterior/prior volume ratio, refining simple BIC approximations (Dunstan et al., 2020).
- Power Posterior (Thermodynamic Integration): For high-dimensional models (e.g., linear mixed models), the evidence can be computed as a one-dimensional integral over a “temperature” parameter:

  $$\log p(y) = \int_0^1 \mathbb{E}_{p_t(\theta \mid y)}\big[\log p(y \mid \theta)\big]\, dt,$$

  where $p_t(\theta \mid y) \propto p(y \mid \theta)^t\, p(\theta)$ denotes the power posterior at temperature $t$. Grid quadrature and MCMC chains are used at a ladder of $t$ values (Calvo et al., 2022).
- Bridge Sampling: General-purpose estimator for normalizing constants, used for evidence computation by relating samples from posterior and bridge distributions (Schad et al., 2021).
- Savage–Dickey Density Ratio (SDDR): For nested models, the Bayes factor reduces to a prior-to-posterior density ratio at the test point. This is particularly efficient in the context of supermodel constructions (Mootoovaloo et al., 2016).
- Prior-Free and Cross-Validated Alternatives: Cross-Validation Bayes factors (CVBFs) and Geometric Intrinsic Bayes Factors (GIBFs) avoid explicit prior specification by splitting data into training and validation sets, computing geometric means of likelihood ratios, and calibrating the sample-split size for consistency (the “Bridge Rule,” which ties the training-set size to the number of model parameters) (Wang et al., 2020).
- Minimal-Summary/Closed Forms: For $t$- and $F$-tests, analytic Bayes factors can be computed using only summary statistics (e.g., the “Pearson Bayes factor”) (Faulkenberry, 2020, Faulkenberry et al., 2022).
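The Laplace strategy can be sketched in one dimension. Everything below is a toy illustration under the stated assumptions (uniform prior of known width, log-likelihood approximately quadratic at its maximum); it is not code from Dunstan et al.:

```python
import math

def laplace_evidence(loglik, theta_hat, prior_width, eps=1e-4):
    """One-dimensional Laplace approximation to the marginal likelihood,
    assuming a uniform prior of total width `prior_width` covering the
    posterior mass:  Z ~ L(theta_hat) * sqrt(2*pi/h) / prior_width,
    where h = -d^2 log L / d theta^2 at the maximum (finite differences)."""
    h = -(loglik(theta_hat + eps) - 2 * loglik(theta_hat)
          + loglik(theta_hat - eps)) / eps ** 2
    return math.exp(loglik(theta_hat)) * math.sqrt(2 * math.pi / h) / prior_width

# Toy check: N(mu, 1) likelihood, where the quadratic assumption is exact
data = [0.3, -0.1, 0.5, 0.2]
n, xbar = len(data), sum(data) / len(data)
ll = lambda mu: (-0.5 * sum((x - mu) ** 2 for x in data)
                 - 0.5 * n * math.log(2 * math.pi))
evidence = laplace_evidence(ll, xbar, prior_width=10.0)
```

The factor $\sqrt{2\pi/h}/\Delta$ is the posterior-to-prior volume ratio: this is the “Occam penalty” that shrinks the evidence as the prior widens.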
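The power-posterior identity can be checked numerically on a small example. In the sketch below (illustrative; grid quadrature stands in for the MCMC chains used in practice, and all names are assumptions), the one-dimensional temperature integral is evaluated with the trapezoidal rule over a ladder concentrated near $t=0$:

```python
import math

def thermodynamic_log_evidence(loglik, logprior, grid, temps):
    """Power-posterior estimate of log p(y): the expectation of log L under
    the power posterior p_t ~ L^t * prior, integrated over t in [0, 1].
    Expectations use grid quadrature (a toy stand-in for MCMC)."""
    means = []
    for t in temps:
        logw = [t * loglik(th) + logprior(th) for th in grid]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]
        z = sum(w)
        means.append(sum(wi * loglik(th) for wi, th in zip(w, grid)) / z)
    # trapezoidal rule over the temperature ladder
    return sum(0.5 * (means[i] + means[i + 1]) * (temps[i + 1] - temps[i])
               for i in range(len(temps) - 1))

# Toy conjugate model: y ~ N(mu, 1), prior mu ~ N(0, 1)
data = [0.4, 1.1, 0.3, 0.8]
ll = lambda mu: sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)
                    for x in data)
lp = lambda mu: -0.5 * mu ** 2 - 0.5 * math.log(2 * math.pi)
grid = [-6 + 12 * i / 2000 for i in range(2001)]
temps = [(i / 40) ** 3 for i in range(41)]  # ladder dense near t = 0
log_ev = thermodynamic_log_evidence(ll, lp, grid, temps)
```

The cubic spacing of the ladder reflects the standard practice of concentrating temperatures where the integrand changes fastest.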
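Bridge sampling can likewise be demonstrated on a toy normalizing-constant problem. The estimator below is a minimal sketch of the Meng–Wong fixed-point iteration, not the implementation used in the cited work; the densities and names are assumptions:

```python
import math, random

def bridge_ratio(q1, q2, samples1, samples2, iters=100):
    """Iterative bridge-sampling estimate of r = Z1/Z2 for unnormalized
    densities q1, q2, given draws from their normalized counterparts
    (Meng-Wong optimal-bridge fixed-point iteration)."""
    n1, n2 = len(samples1), len(samples2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    l1 = [q1(x) / q2(x) for x in samples1]  # ratios at draws from p1
    l2 = [q1(x) / q2(x) for x in samples2]  # ratios at draws from p2
    r = 1.0
    for _ in range(iters):
        num = sum(l / (s1 * l + s2 * r) for l in l2) / n2
        den = sum(1.0 / (s1 * l + s2 * r) for l in l1) / n1
        r = num / den
    return r

# Toy: recover Z = sqrt(2*pi) for the unnormalized density exp(-x^2/2),
# using a normalized N(0, 1.5^2) as the second (bridge-side) distribution.
random.seed(7)
q1 = lambda x: math.exp(-x * x / 2)                       # unnormalized
q2 = lambda x: math.exp(-x * x / 4.5) / (1.5 * math.sqrt(2 * math.pi))
draws1 = [random.gauss(0, 1.0) for _ in range(20000)]
draws2 = [random.gauss(0, 1.5) for _ in range(20000)]
z_hat = bridge_ratio(q1, q2, draws1, draws2)
```

In evidence computation the same scheme is applied with the posterior as one density and a tractable proposal as the other, so that $r$ recovers the marginal likelihood.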
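The Savage–Dickey approach is easy to make concrete in a conjugate setting. The sketch below (hypothetical names, toy normal-mean model with known variance) computes $BF_{01}$ as a posterior-to-prior density ratio at the null point:

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def savage_dickey_bf01(data, sigma2, prior_var, theta0=0.0):
    """Savage-Dickey ratio for the point null H0: mu = theta0 nested in
    H1: mu ~ N(0, prior_var), with data ~ N(mu, sigma2) and sigma2 known,
    so the posterior is conjugate and available in closed form."""
    n = len(data)
    xbar = sum(data) / n
    post_var = 1.0 / (n / sigma2 + 1.0 / prior_var)
    post_mean = post_var * (n * xbar / sigma2)
    # BF01 = posterior density at theta0 / prior density at theta0
    return (normal_pdf(theta0, post_mean, post_var)
            / normal_pdf(theta0, 0.0, prior_var))

bf01 = savage_dickey_bf01([0.2, 0.5, -0.1], sigma2=1.0, prior_var=1.0)
```

For this conjugate case the ratio agrees exactly with the direct marginal-likelihood ratio, which is what makes the SDDR attractive: it needs only the marginal posterior at the test point, not the full evidence integrals.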
3. Prior Sensitivity, Occam's Razor, and Hierarchical Inference
The value of a Bayes factor exhibits strong sensitivity to the specification of priors—especially for parameters in regions where the data are uninformative (“prior-volume effect” or “Occam penalty”). For nested models, broadening priors in directions the data cannot constrain can arbitrarily amplify the Occam penalty and thus the Bayes factor, yielding falsely high support for overly simple models even with data generated from the alternative (Isi et al., 2022).
Multiplying Bayes factors across independent datasets can lead to spurious evidence when each single-event factor is prior-volume dominated. Isi et al. demonstrate that the combined Bayes factor from many independent events, each individually “ambivalent,” can become exponentially decisive in the wrong direction if prior widths are not calibrated to the actual population or measurement precision.
Hierarchical models, which treat the population-level parameter distribution as unknown and infer it from the data, circumvent this pathology. When population hyperparameters are integrated explicitly, posterior mass is concentrated where the data provide evidence, suppressing the influence of unconstrained prior regions and yielding robust Bayes-factor inference (Isi et al., 2022).
4. Decision-Theoretic Context and Frequentist Connections
Bayes factors provide graded evidence rather than a forced binary decision, in contrast to frequentist hypothesis tests. They can, however, be embedded directly into Bayesian decision-theoretic frameworks by specifying loss functions for different decision-action/hypothesis combinations. The optimal action balances the Bayes factor, prior odds, and the ratio of Type I to Type II error losses. A robust decision rule can be formulated based on an interval for the loss ratio, leading to transparent reporting or, if necessary, a decision to withhold action due to insufficient robustness (Schwaferts et al., 2021, Schad et al., 2021).
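A minimal sketch of such a robust three-way rule, with hypothetical function names and illustrative thresholds (not the formulation in Schwaferts et al.), might look like:

```python
def robust_decision(bf10, prior_odds, loss_ratio_interval):
    """Three-way decision rule: reject H0 only if the posterior odds beat
    every plausible ratio of Type I to Type II error losses; retain H0 if
    they beat none; otherwise withhold, because the decision is not
    robust to the loss specification."""
    post_odds = bf10 * prior_odds
    lo, hi = loss_ratio_interval
    if post_odds > hi:
        return "reject H0"
    if post_odds < lo:
        return "retain H0"
    return "withhold"

# With even prior odds and a loss ratio believed to lie in [1, 5]:
decision = robust_decision(bf10=3.0, prior_odds=1.0,
                           loss_ratio_interval=(1.0, 5.0))  # "withhold"
```

The interval on the loss ratio is what makes the rule robust: when the posterior odds fall inside it, the optimal action depends on an unresolved utility judgment, and the honest report is to withhold.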
A frequentist interpretation arises by recognizing that the Bayes factor is Neyman–Pearson-optimal when error rates are averaged over priors (i.e., among all tests with a given expected Type I error, the BF-threshold test maximizes expected power) (Fowlie, 2021). In special cases (simple hypotheses), the Bayes factor test coincides with the classical likelihood-ratio test; under monotone likelihood-ratio conditions, it recovers the UMP test. Bayesian large-sample arguments also demonstrate that under correct model specification, BF-based procedures control classical error rates through e-value connections (Mulder et al., 27 Nov 2025).
5. Extensions: Bayes Factor Functions, Surfaces, and Summary Methods
Contemporary usage goes beyond reporting a single Bayes factor. Several generalizations yield richer inferential summaries:
- Bayes Factor Functions (BFFs): For hypothesis testing based on classical test statistics, BFFs consider the BF as a function of effect size or noncentrality parameter. They display evidence profiles across a continuum of hypothesized effects and support aggregation across studies. Under a suitable non-local prior (e.g., normal moment or inverse-moment), BFFs can be computed in closed or semi-closed form, with favorable frequentist operating characteristics (enhanced Type I error control and power) (Johnson et al., 2022, Datta et al., 2023, Datta et al., 20 Jun 2025, Datta et al., 13 Mar 2025).
- Support Curves: The Bayes-factor function provides a level set of evidence over the parameter space and enables construction of “support intervals” and maximum evidence estimates via inversion, paralleling the confidence-interval logic of frequentist methods. This approach unifies estimation and testing under the BF framework (Pawel, 14 Mar 2024).
- Bayes Factor Surfaces: In high-energy physics and cosmology, BF surfaces quantify evidence for or against signal hypotheses across two-dimensional grids of phenomenological parameters (e.g., mass and cross-section in WIMP searches). These surfaces can be used for reinterpretation, combining evidence across experiments, and robust visualization of exclusion/discovery claims, and enjoy coverage guarantees due to properties such as the Kerridge theorem (Fowlie, 22 Jan 2024).
- Analytic Bayes Factors from Minimal Summaries: Closed-form expressions for repeated-measures/ANOVA designs or pairwise $t$-tests (the “Pearson Bayes factor”) allow evidence to be quantified from minimal statistics ($F$ or $t$ values and their degrees of freedom) without access to raw data, facilitating evidence quantification and meta-analytic synthesis (Faulkenberry, 2020, Faulkenberry et al., 2022).
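The support-interval construction can be illustrated in a conjugate normal setting. The sketch below (illustrative names, toy posterior; not code from the cited paper) inverts the Savage–Dickey ratio over a grid to find all parameter values the data support by at least a factor $k$:

```python
import math

def support_interval(post_mean, post_var, prior_var, k=1.0, grid_n=4001):
    """Interval of theta values whose Savage-Dickey Bayes factor
    (posterior density / prior density) is at least k, for a
    N(post_mean, post_var) posterior against a N(0, prior_var) prior.
    Returns (low, high) or None if no point reaches support level k."""
    pdf = lambda x, m, v: (math.exp(-(x - m) ** 2 / (2 * v))
                           / math.sqrt(2 * math.pi * v))
    sd = math.sqrt(post_var)
    # scan +/- 5 posterior sd around the posterior mean
    grid = [post_mean + (i / (grid_n - 1) - 0.5) * 10 * sd
            for i in range(grid_n)]
    supported = [th for th in grid
                 if pdf(th, post_mean, post_var) / pdf(th, 0.0, prior_var) >= k]
    return (min(supported), max(supported)) if supported else None

# k = 1: every theta the data favor at all relative to the prior
lo, hi = support_interval(1.0, 0.1, 1.0, k=1.0)
```

Raising $k$ shrinks the interval, mirroring how raising the confidence level widens a frequentist interval; the level set here is indexed by evidence rather than coverage.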
6. Applications and Impact in Evidence Synthesis and Model Assessment
Bayes factors are prominent in meta-analysis for their ability to quantify sequentially updated, coherent, and symmetric evidence for or against effects. Due to their e-value property, they permit valid optional stopping and error control in cumulative evidence scenarios (Mulder et al., 27 Nov 2025). For complex hierarchical and mixed models, computational schemes such as the power-posterior enable practical estimation of BFs in longitudinal or high-dimensional settings (Calvo et al., 2022).
In the cognitive and psychological sciences, recommended Bayesian workflows explicitly rely on robust Bayes-factor estimation pipelines—incorporating prior and posterior predictive checks, simulation-based calibration for estimator bias, and decisions guided by domain-appropriate utility functions. Well-calibrated BFs contribute to transparent, repeatable scientific inference and clear connections between evidence and decision (Schad et al., 2021).
7. Summary Table: Core Bayes Factor Concepts
| Topic | Key Formula/Principle | Reference |
|---|---|---|
| Marginal Likelihood (Evidence) | $p(y \mid M_i) = \int p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i$ | (Isi et al., 2022, Mulder et al., 27 Nov 2025) |
| Bayes Factor | $BF_{12} = p(y \mid M_1) / p(y \mid M_2)$ | (Mulder et al., 27 Nov 2025) |
| Occam Penalty | Prior-volume effect in $p(y \mid M_i)$; see Eq. (3) in (Isi et al., 2022) | (Isi et al., 2022) |
| Hierarchical Bayes Factor | Hierarchical inference on ensemble parameters; Eqs. (6)–(7) in (Isi et al., 2022) | (Isi et al., 2022) |
| Decision Rule (Hypothesis Test) | Act against $H_0$ when posterior odds exceed the ratio of Type I to Type II error losses | (Schwaferts et al., 2021) |
| Bridge Rule (CVBF) | Training-set size calibrated to the number of model parameters in cross-validated BFs | (Wang et al., 2020) |
| Bayes Factor Surface | BF evaluated over a two-dimensional grid of phenomenological parameters | (Fowlie, 22 Jan 2024) |
| Bayes Factor Function (BFF) | $BF(\tau)$: BF as a function of standardized effect size $\tau$ | (Johnson et al., 2022, Datta et al., 20 Jun 2025) |
| Neyman–Pearson Optimality | The BF-threshold test maximizes expected power at fixed expected Type I error | (Fowlie, 2021) |
Bayes factors provide a general, calibrated, and flexible framework for quantifying and accumulating statistical evidence. They bridge Bayesian and frequentist paradigms, unify testing and estimation, and are extensible via support curves, BFFs, and hierarchical/model-averaged extensions for robust evidence synthesis. Their interpretability, aggregation rules, and principled handling of model complexity make them a central tool in contemporary statistical methodology (Isi et al., 2022, Mulder et al., 27 Nov 2025, Johnson et al., 2022, Datta et al., 2023, Pawel, 14 Mar 2024, Datta et al., 20 Jun 2025).