Jeffreys–Lindley Paradox: Bayesian vs Frequentist
- The Jeffreys–Lindley paradox is a statistical phenomenon where Bayesian analysis increasingly favors the point null hypothesis as sample size grows, despite frequentist rejection via p-values.
- It highlights the divergence between the two methodologies: for "just significant" data, the p-value stays pinned at the significance threshold while the Bayes factor in favor of the null tends to infinity under a fixed prior as the sample size grows.
- The paradox urges researchers to adopt interval nulls and calibrated priors, aligning statistical testing more closely with practical relevance and effect size interpretation.
The Jeffreys–Lindley paradox is a central result in statistical theory illustrating a fundamental asymptotic divergence between frequentist hypothesis test conclusions and Bayesian posterior inference, specifically in the context of point null versus composite alternative hypothesis testing. Emerging from the contrasting behaviors of $p$-values and Bayes factors as the sample size increases while the significance level and prior structure are held fixed, the paradox raises critical questions about practical and philosophical interpretations of "evidence" in statistical inference frameworks. Its implications span theoretical statistics, experimental design, and the foundational debate between Bayesian and frequentist methodologies.
1. Precise Statement and Mathematical Formulation
Consider testing a simple point null hypothesis $H_0: \theta = \theta_0$ against the composite alternative $H_1: \theta \neq \theta_0$ under a normal location model, with observations $X_1, \dots, X_n$ independently drawn from $N(\theta, \sigma^2)$, $\sigma^2$ known. The frequentist test utilizes the statistic
$$z_n = \frac{\sqrt{n}\,(\bar{x} - \theta_0)}{\sigma}$$
and rejects $H_0$ at level $\alpha$ if $|z_n| > z_{\alpha/2}$. The two-sided $p$-value is given by $p = 2\,(1 - \Phi(|z_n|))$. In a Bayesian framework, prior mass $\pi_0$ is placed on $H_0$ (a Dirac delta at $\theta_0$), while under $H_1$ a diffuse prior such as $\theta \sim N(\theta_0, \tau^2)$ is employed. The Bayes factor for $H_0$ versus $H_1$ is then
$$B_{01} = \frac{m_0(x)}{m_1(x)},$$
where $m_0(x)$, $m_1(x)$ denote the marginal likelihoods under $H_0$ and $H_1$ respectively.
The paradox manifests when, as $n \to \infty$ at fixed $\alpha$ and fixed prior parameters, the $p$-value remains at the threshold $\alpha$ given "just significant" data, while the Bayes factor $B_{01} \to \infty$ and thus the Bayesian posterior probability of $H_0$ tends to 1. Explicitly, for data with $|z_n| = z_{\alpha/2}$,
$$B_{01} = \sqrt{1 + \frac{n\tau^2}{\sigma^2}}\,\exp\!\left(-\frac{z_n^2}{2}\cdot\frac{n\tau^2}{\sigma^2 + n\tau^2}\right) \sim \frac{\tau\sqrt{n}}{\sigma}\,e^{-z_{\alpha/2}^2/2} \to \infty,$$
indicating ever-increasing Bayesian support for $H_0$ despite the frequentist criterion continuously rejecting $H_0$ (Lovric, 28 Nov 2025, Wijayatunga, 18 Mar 2025, Cousins, 2013).
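As a quick numerical sketch (not from the cited papers), the closed-form conjugate-normal Bayes factor for this model can be evaluated at a "just significant" $z = 1.96$ for growing $n$, with $\sigma = \tau = 1$ assumed:

```python
import math

def bf01(z, n, sigma=1.0, tau=1.0):
    """Bayes factor for H0: theta = theta0 vs H1: theta ~ N(theta0, tau^2),
    given a z-statistic from n observations of N(theta, sigma^2)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

z = 1.96  # "just significant": the two-sided p-value stays at 0.05 for every n
for n in (10, 1_000, 100_000, 10_000_000):
    print(n, bf01(z, n))  # B01 grows roughly like sqrt(n) * exp(-z^2 / 2)
```

The p-value is identical in every row, yet the Bayes factor in favor of the null grows without bound, which is exactly the divergence the paradox describes.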
2. Distinction from Bartlett’s Anomaly and Conceptual Clarifications
A persistent misconception has conflated the Jeffreys–Lindley paradox with what is properly termed Bartlett’s anomaly, wherein the prior variance $\tau^2$ under $H_1$ diverges ($\tau^2 \to \infty$) at fixed sample size. Both phenomena result in $B_{01} \to \infty$ under different asymptotic regimes: the Jeffreys–Lindley paradox is driven by $n \to \infty$ at fixed $\tau^2$, while Bartlett’s anomaly is driven by $\tau^2 \to \infty$ at fixed $n$. These situations possess distinct mathematical structures and implications, and require separate resolutions (Lovric, 28 Nov 2025).
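The two regimes are easy to separate numerically. Below is a sketch of Bartlett's anomaly using the same conjugate-normal Bayes factor as in Section 1, holding $n$ and $z$ fixed (the values are illustrative) while the prior variance diverges:

```python
import math

def bf01(z, n, sigma=1.0, tau=1.0):
    """Conjugate-normal Bayes factor B01 (see Section 1)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

# Bartlett's anomaly: sample size fixed, prior variance tau^2 -> infinity.
n, z = 100, 2.5
for tau in (1.0, 10.0, 100.0, 1000.0):
    print(tau, bf01(z, n, tau=tau))  # B01 grows roughly linearly in tau
```

Here $B_{01} \to \infty$ because the ever-wider prior dilutes the marginal likelihood under $H_1$, not because data accumulate; the Jeffreys–Lindley regime of the previous section instead fixes $\tau$ and sends $n \to \infty$.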
3. Mechanistic Origins and Statistical versus Practical Significance
At its core, the paradox is a consequence of tension between statistical and practical significance. The frequentist method assesses the observed gap $|\bar{x} - \theta_0|$ relative to the rapidly decreasing standard error $\sigma/\sqrt{n}$, leading to "statistical significance" for arbitrarily small deviations as $n$ increases. In contrast, the Bayesian framework, penalizing the alternative for spreading prior mass over a large parameter space, increasingly favors the point null as data accumulate near $\theta_0$—a property termed the "Ockham’s razor" effect (Wijayatunga, 18 Mar 2025, Cousins, 2013). This is further compounded at large $n$ by the fact that even negligible differences become statistically significant under the frequentist protocol, whereas the Bayes factor continues to reward parsimony unless the observed effect is substantial relative to the region covered by the alternative prior.
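The statistical-versus-practical tension can be made concrete with an idealized calculation (a sketch with assumed values, not from the cited papers): suppose the sample mean sits exactly $0.001$ above $\theta_0$ with $\sigma = 1$, a practically negligible effect, and watch what happens as $n$ grows:

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bf01(z, n, sigma=1.0, tau=1.0):
    """Conjugate-normal Bayes factor B01 (see Section 1)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

# Idealized data: xbar - theta0 = 0.001 exactly, sigma = 1.
eps = 1e-3
for n in (1_000_000, 4_000_000, 16_000_000):
    z = math.sqrt(n) * eps
    p = 2.0 * (1.0 - phi(abs(z)))
    print(n, round(z, 2), round(p, 4), round(bf01(z, n), 1))
```

At $n = 4{,}000{,}000$ the tiny effect is already "statistically significant" ($p \approx 0.045$), yet the Bayes factor still favors the null by a factor of a few hundred, rewarding parsimony because the effect is minute relative to the prior's spread.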
4. Extensions and Implications in Testing and Estimation
The paradox critically impacts both hypothesis testing and interval estimation. With positive prior mass allocated to a point null, the Bayesian posterior can concentrate so strongly on that point as $n \to \infty$ (for a fixed Bayes factor or posterior odds target) that credible intervals constructed from the posterior mixture distribution may become undefined for certain credibility levels. This "incredibility gap" means that for some credibility levels, no central credible interval exists—a phenomenon exclusive to Bayesian procedures with point-mass mixture posteriors, not mirrored in frequentist confidence intervals (Campbell et al., 2022).
The table illustrates credible interval definability:
| Posterior Model | Mixture Posterior with Point Mass | Purely Continuous Posterior |
|---|---|---|
| Interval Issue | Credibility gap may arise | Credible intervals always exist |
Frequentist confidence intervals, by contrast, retain well-defined coverage properties at every confidence level, emphasizing fundamental inferential discrepancies in the presence of point-mass priors.
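A minimal sketch of the underlying concentration effect (assumed values; not the construction in Campbell et al.): with prior mass $\pi_0 = 1/2$ on the point null, the posterior atom at $\theta_0$ is $\pi_0 B_{01} / (\pi_0 B_{01} + 1 - \pi_0)$, and along "just significant" data it swallows essentially all posterior mass as $n$ grows:

```python
import math

def bf01(z, n, sigma=1.0, tau=1.0):
    """Conjugate-normal Bayes factor B01 (see Section 1)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

def post_null(z, n, pi0=0.5):
    """Posterior probability of the point null (prior mass pi0 on H0)."""
    b = bf01(z, n)
    return pi0 * b / (pi0 * b + (1.0 - pi0))

for n in (100, 10_000, 1_000_000, 100_000_000):
    print(n, round(post_null(1.96, n), 4))  # posterior atom at theta0 -> 1
```

Once the atom at $\theta_0$ exceeds a given credibility level, interval summaries at that level degenerate onto the single point, which is the mechanism behind the definability problems in the table above.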
5. Resolutions: Interval Nulls and Alternative Bayesian Calibrations
A central theme in current research is that the only principled resolution to the Jeffreys–Lindley paradox is the reformulation of the hypothesis from a point null to an interval (or "practical equivalence") null $H_0: |\theta - \theta_0| \le \delta$, where $\delta$ captures the minimum effect size of scientific interest (Lovric, 28 Nov 2025, Wijayatunga, 18 Mar 2025). With interval nulls, both frequentist and Bayesian evidence accumulation are commensurate: frequentist equivalence testing compares the confidence interval with the equivalence region $[\theta_0 - \delta, \theta_0 + \delta]$, and the Bayesian approach compares continuous prior probabilities over the interval null and its complement. In this setting, the paradox disappears; both approaches cohere and reflect practical significance rather than a measure-zero point null.
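The coherence under interval nulls can be sketched numerically (assumed values; a one-sample TOST with known $\sigma$ and a conjugate-normal posterior, not a procedure from the cited papers). With a true effect well inside the equivalence band, both the frequentist equivalence $p$-value and the Bayesian posterior probability of the interval null move in the same direction as $n$ grows:

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tost_p(xbar, n, delta, theta0=0.0, sigma=1.0):
    """Frequentist equivalence (TOST) p-value for H0: |theta - theta0| >= delta."""
    se = sigma / math.sqrt(n)
    t_lo = (xbar - (theta0 - delta)) / se
    t_hi = ((theta0 + delta) - xbar) / se
    return max(1.0 - phi(t_lo), 1.0 - phi(t_hi))

def post_interval(xbar, n, delta, theta0=0.0, sigma=1.0, tau=1.0):
    """Posterior P(|theta - theta0| <= delta) under a continuous N(theta0, tau^2) prior."""
    v = 1.0 / (n / sigma**2 + 1.0 / tau**2)
    m = v * (n * xbar / sigma**2 + theta0 / tau**2)
    s = math.sqrt(v)
    return phi((theta0 + delta - m) / s) - phi((theta0 - delta - m) / s)

xbar, delta = 0.001, 0.01  # observed mean well inside the equivalence band
for n in (10_000, 100_000, 1_000_000):
    print(n, round(tost_p(xbar, n, delta), 4), round(post_interval(xbar, n, delta), 4))
```

Both columns agree: the equivalence $p$-value falls toward 0 while the posterior probability of the interval null rises toward 1, with no sample-size-driven reversal.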
Alternative calibrations have been proposed:
- Use of finite, scientifically motivated priors for $\theta$ under $H_1$ to control the size of Bayes factor penalties (Villa et al., 2015),
- Cake priors that diffuse at rates matched to the number of parameters, leading to automatic BIC-like penalties and Chernoff consistency (asymptotically zero type I and II errors) (Ormerod et al., 2017),
- Predictive model selection (AIC- or cross-validation-based criteria) in place of postdictive Bayes factors to maintain detection resolution and avoid the paradox for large $n$ (LaMont et al., 2016).
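The BIC-like penalty mentioned above can be sketched for the normal model of Section 1: with $\tau = \sigma$, the exact conjugate-normal Bayes factor is closely tracked by the BIC approximation $B_{01} \approx \sqrt{n}\,e^{-z^2/2}$, whose $\tfrac{1}{2}\log n$ penalty per extra parameter is the source of the null's asymptotic dominance (illustrative values; not the cake-prior construction itself):

```python
import math

def bf01(z, n, sigma=1.0, tau=1.0):
    """Conjugate-normal Bayes factor B01 (see Section 1)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

def bf01_bic(z, n):
    """BIC approximation: the extra parameter under H1 costs log(n),
    giving B01 ~ exp((BIC_1 - BIC_0) / 2) = sqrt(n) * exp(-z^2 / 2)."""
    return math.sqrt(n) * math.exp(-0.5 * z**2)

z = 2.5
for n in (100, 10_000, 1_000_000):
    print(n, round(bf01(z, n), 3), round(bf01_bic(z, n), 3))  # converging columns
```

AIC, by contrast, charges a constant penalty of 2 per parameter rather than $\log n$, which is why predictive criteria keep rejecting the null at fixed $z$ where BIC-like penalties do not.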
6. Broader Impact, Domain-Specific Manifestations, and Remaining Issues
The paradox has significant practical implications in fields where large samples are common and sharp nulls are tested, including high energy physics and precision metrology. For example, in particle physics, established practice often requires $5\sigma$ significance for discovery declarations, but as the paradox demonstrates, fixed thresholds on the $p$-value or $z$-score can be at odds with Bayesian conclusions as sample sizes become enormous and systematic uncertainty dominates inference (Cousins, 2013). Similar discordance has been documented in phase estimation with optical interferometry: Bayesian conclusions may depend strongly on prior width, while the frequentist test signals almost certain "discovery," revealing the extent to which experimental context and scientific prior knowledge must be incorporated to avert misleading inference (Mauri et al., 2015).
Even with "objective" or diffuse priors justified by lack of knowledge, the paradox reveals a mathematical impossibility: truly indifferent inference requires improper (scale-invariant) priors that cannot be normalized, so any proper, truncated prior inevitably introduces dependence on the truncation scale and can only delay, not remove, the paradox (Fowlie, 2020).
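The truncation dependence is direct to exhibit (a sketch with assumed values, not Fowlie's own computation): give $H_1$ a Uniform$(\theta_0 - L, \theta_0 + L)$ prior and vary the cutoff $L$. Once $L$ dwarfs the likelihood's width, the Bayes factor scales linearly in $L$, so the "objective" conclusion is set by an arbitrary normalization choice:

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bf01_uniform(z, n, L, sigma=1.0):
    """B01 with a Uniform(theta0 - L, theta0 + L) prior under H1 (theta0 = 0)."""
    se = sigma / math.sqrt(n)
    xbar = z * se
    f0 = math.exp(-0.5 * z**2) / (se * math.sqrt(2.0 * math.pi))  # density of xbar under H0
    m1 = (phi((L - xbar) / se) - phi((-L - xbar) / se)) / (2.0 * L)  # marginal under H1
    return f0 / m1

n, z = 100, 1.96
for L in (1.0, 10.0, 100.0):
    print(L, round(bf01_uniform(z, n, L), 3))  # B01 grows ~ linearly in the cutoff L
```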
7. Summary of Recommendations and Theoretical Insights
Established recommendations for practitioners seeking to avoid the Jeffreys–Lindley paradox include:
- Replace point nulls with scientifically meaningful interval nulls whenever possible, aligning statistical testing with practical relevance (Lovric, 28 Nov 2025, Wijayatunga, 18 Mar 2025).
- Employ finite, problem-specific priors for alternatives to prevent automatic dominance by $H_0$, calibrating the prior scale $\tau$ or prior mass to practical effect sizes (Villa et al., 2015, Mauri et al., 2015).
- Use predictive or cross-validation criteria rather than pure marginal likelihoods or postdictive Bayes factors in scenarios where maximizing detection resolution and minimizing paradoxical inconsistencies is paramount (LaMont et al., 2016).
- Exercise caution in interpreting model-averaged credible intervals with point-mass priors, as standard interval procedures may become undefined for large $n$ (Campbell et al., 2022).
- Conduct sensitivity analyses with respect to prior distribution, width, and truncation points, especially when the goal is to align statistical with substantive scientific significance.
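The last recommendation can be sketched as a one-line prior-sensitivity sweep (illustrative values): evaluate the conjugate-normal Bayes factor of Section 1 for the same data across a grid of prior scales and report the range. When the conclusion flips across plausible scales, the inference is prior-driven rather than data-driven:

```python
import math

def bf01(z, n, sigma=1.0, tau=1.0):
    """Conjugate-normal Bayes factor B01 (see Section 1)."""
    r = n * tau**2 / sigma**2
    return math.sqrt(1.0 + r) * math.exp(-0.5 * z**2 * r / (1.0 + r))

# Same data (z = 2.1, n = 500); only the prior scale tau varies.
z, n = 2.1, 500
report = {tau: round(bf01(z, n, tau=tau), 3) for tau in (0.1, 0.5, 1.0, 5.0)}
print(report)  # B01 swings from favoring H1 (tau = 0.1) to favoring H0 (tau = 5.0)
```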
The Jeffreys–Lindley paradox highlights the necessity of integrating practical effect sizes into hypothesis testing, the limitations of uncritical use of $p$-values or Bayes factors for point nulls in high-dimensional or large-sample regimes, and the conceptual need for careful prior specification in Bayesian model comparison. These findings reinforce the need for scientific context and problem-specific calibration in modern statistical inference (Lovric, 28 Nov 2025, Wijayatunga, 18 Mar 2025, Cousins, 2013, Campbell et al., 2022, Mauri et al., 2015, LaMont et al., 2016, Ormerod et al., 2017, Villa et al., 2015, Fowlie, 2020).