- The paper shows that bootstrap-based p-values may be anti-conservative under alternative hypotheses, leading to misleading error rates.
- It employs the functional delta method to demonstrate that classical bootstrap consistency is guaranteed only under the null hypothesis.
- The study emphasizes the need for alternative resampling strategies to achieve valid inference in scenarios where standard bootstrapping fails.
Critical Examination of the Validity of Bootstrapping Tests Outside the Null Hypothesis
Introduction
This paper investigates the theoretical underpinnings of the bootstrap procedure, with particular emphasis on its behavior when the null hypothesis does not hold. The central issue is the validity of bootstrapped inferential methods, especially tests, which are typically justified asymptotically under the null but whose properties under alternatives have received far less analysis. This is a significant gap, because practitioners often interpret bootstrapped p-values and confidence sets as reliable even under departures from the null hypothesis.
Overview and Key Results
The analysis proceeds by formalizing conditions under which the bootstrap distribution approximates the sampling distribution of statistics, focusing on general functionals and their sample estimators. The primary technical result is that classical bootstrap consistency results are only guaranteed under the null, and that in a range of scenarios, the bootstrap can provide systematically misleading inference under alternative hypotheses.
Specifically, the authors prove that for many functionals, including several widely used in statistical testing, the resampling distribution produced by the bootstrap is not, in general, a valid approximation of the finite-sample or asymptotic distribution under alternatives. The reason is that at points off the null, the relevant limiting distribution may depend strongly on nuisance parameters or on features of the underlying data-generating process that the plug-in principle does not capture. The exposition makes extensive use of functional delta method tools, specifically differentiability properties of estimators, and considers both parametric and nonparametric settings.
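For orientation, the textbook form of the delta method for the bootstrap can be sketched as follows. The notation ($P$ for the data-generating distribution, $\mathbb{P}_n$ and $\mathbb{P}_n^*$ for the empirical and bootstrap empirical measures, $\phi$ for the functional, $\mathbb{G}$ for the limit process) is an assumption of this sketch and is not taken from the paper, whose exact conditions are not reproduced in this summary.

```latex
% Assumed: \phi is Hadamard differentiable at P with derivative \phi'_P,
% \sqrt{n}(\mathbb{P}_n - P) \rightsquigarrow \mathbb{G}, and the bootstrap
% empirical process has the same limit conditionally on the data.
\begin{align*}
\sqrt{n}\,\bigl(\phi(\mathbb{P}_n) - \phi(P)\bigr)
  &\rightsquigarrow \phi'_P(\mathbb{G}), \\
\sqrt{n}\,\bigl(\phi(\mathbb{P}_n^*) - \phi(\mathbb{P}_n)\bigr)
  &\rightsquigarrow \phi'_P(\mathbb{G})
  \quad \text{conditionally on the data, in probability.}
\end{align*}
```

Both lines rely on differentiability of $\phi$ at the true $P$, so whether the bootstrap limit matches the sampling limit has to be checked at the distribution that actually generated the data, which under an alternative is not a null distribution.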
Contradictory and Strong Claims
The paper establishes a notable claim: bootstrap-based p-values and critical values can be heavily anti-conservative and may systematically misrepresent the actual type I and type II error rates under alternatives. The result is shown theoretically and supported by explicit counterexamples covering popular test statistics.
Furthermore, the paper clarifies the limitations of naive "off-the-shelf" application of bootstrap testing procedures. It shows that practitioners should not expect bootstrap-based inference to be valid except under very restrictive circumstances, primarily when testing simple hypotheses, or in parametric models where plug-in estimators are unbiased and sufficiently regular.
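A small, self-contained sketch can make the risk of "off-the-shelf" use concrete. The example below is a standard textbook illustration, not one of the paper's counterexamples: it tests a hypothesized mean with data drawn under an alternative and compares a bootstrap p-value whose resampled statistic is centered at the sample mean with a naive one centered at the hypothesized value. The helper `bootstrap_pvalue`, its parameters, and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def bootstrap_pvalue(data, mu0, center_at_estimate, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for H0: mean == mu0 (illustrative helper).

    center_at_estimate=True  -> resampled statistic uses |mean* - sample mean|
    center_at_estimate=False -> resampled statistic uses |mean* - mu0| (naive)
    """
    rng = random.Random(seed)
    n = len(data)
    xbar = statistics.fmean(data)
    observed = math.sqrt(n) * abs(xbar - mu0)
    center = xbar if center_at_estimate else mu0

    exceed = 0
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)  # draw n points with replacement
        stat = math.sqrt(n) * abs(statistics.fmean(resample) - center)
        if stat >= observed:
            exceed += 1
    return exceed / n_boot


# Data generated under an alternative: true mean 1, hypothesized mean 0.
gen = random.Random(1)
sample = [gen.gauss(1.0, 1.0) for _ in range(50)]

p_centered = bootstrap_pvalue(sample, mu0=0.0, center_at_estimate=True)
p_naive = bootstrap_pvalue(sample, mu0=0.0, center_at_estimate=False)
print(f"centered p-value: {p_centered:.3f}")
print(f"naive p-value:    {p_naive:.3f}")
```

In this setup the centered version should return a p-value near zero, while the naive version tends to sit near 0.5, because its resampled statistic is centered close to the observed one. The failure shown here is a loss of power, not anti-conservativeness; it only illustrates that a resampling distribution built from data generated under an alternative need not resemble the null distribution.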
Practical and Theoretical Implications
On a practical level, these results sharply delimit the class of inferential questions to which bootstrapping yields trustworthy control of error rates. In many ML and statistics applications where the null hypothesis is false or poorly specified, standard bootstrap confidence measures may be invalid. This compels reconsideration of established workflows in econometrics, biostatistics, and high-dimensional inference, where resampling is ubiquitous.
Theoretically, the findings highlight the need for alternative resampling strategies or new theoretical developments, possibly incorporating conditioning or pivotal statistics, that can yield valid inference under both the null and alternatives.
Future Directions
This work suggests several avenues for future inquiry:
- Developing resampling schemes tailored for robustness under alternatives, perhaps by adapting the bootstrap to local alternatives or incorporating higher-order influence functions.
- Systematically characterizing the functionals and tests for which the bootstrap remains valid off the null, so that practitioners can identify settings where it is safe.
- Exploring conditional inference and subsampling as potentially more robust alternatives.
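As a point of reference for the subsampling direction listed above, here is a minimal sketch of a subsampling confidence interval for a mean. It follows the generic recipe of drawing subsamples of size b < n without replacement; it is not a procedure proposed in the paper. The helper `subsampling_ci`, the choice b = 40, and the simulated data are assumptions made purely for illustration.

```python
import math
import random
import statistics


def subsampling_ci(data, b, alpha=0.05, n_sub=2000, seed=0):
    """Subsampling confidence interval for the mean (illustrative helper).

    Draws subsamples of size b < n WITHOUT replacement and uses the
    quantiles of sqrt(b) * (subsample mean - full-sample mean).
    """
    n = len(data)
    if not 0 < b < n:
        raise ValueError("subsample size b must satisfy 0 < b < n")
    rng = random.Random(seed)
    theta_n = statistics.fmean(data)

    roots = sorted(
        math.sqrt(b) * (statistics.fmean(rng.sample(data, b)) - theta_n)
        for _ in range(n_sub)
    )
    # Empirical quantiles of the subsampling distribution of the root.
    lo_q = roots[int((alpha / 2) * (n_sub - 1))]
    hi_q = roots[int((1 - alpha / 2) * (n_sub - 1))]

    # Invert the root: the interval is
    # [theta_n - hi_q / sqrt(n), theta_n - lo_q / sqrt(n)].
    return theta_n - hi_q / math.sqrt(n), theta_n - lo_q / math.sqrt(n)


gen = random.Random(2)
sample = [gen.gauss(1.0, 1.0) for _ in range(200)]
lower, upper = subsampling_ci(sample, b=40)
print(f"95% subsampling interval for the mean: [{lower:.3f}, {upper:.3f}]")
```

The subsample size is the main tuning choice: the usual asymptotic justification needs b to grow with n while b/n shrinks to zero, so a fixed ratio such as the one used here is only a convenience for the sketch.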
Conclusion
This paper presents a thorough and technically precise critique of the bootstrap's validity outside the null hypothesis, establishing both broad limitations and specific pitfalls of its use in hypothesis testing. As bootstrapping remains a foundational tool in applied statistics and ML, a clear understanding of its inferential boundaries is necessary. This analysis will likely inform both methodological development and the practical interpretation of resampling-based inference going forward.
Reference: "Bootstrapping not under the null?" (2512.10546)