Bootstrap Goodness-of-Fit Testing
- Bootstrap GoF testing is a resampling-based procedure that assesses model fit when test statistic distributions are analytically intractable or depend on nuisance parameters.
- It employs parametric, wild, and weighted bootstrap methods to approximate critical values, ensuring proper type I error control even in complex settings.
- Practical implementations require careful calibration, simulation of synthetic datasets, and efficient computation to handle high-dimensional, functional, or network models.
Bootstrap goodness-of-fit (GoF) testing is a suite of procedures that leverages resampling techniques to assess model fit, particularly when the null distribution of a test statistic is analytically intractable or depends on unknown nuisance parameters. These methods are now foundational in empirical process theory, high-dimensional statistical inference, functional data analysis, and modern regression modeling. The following sections provide a technical overview with emphasis on state-of-the-art methodologies, statistical guarantees, and practical implementation.
1. Foundations and Rationale
Bootstrap GoF tests exploit resampling consistent with the null model (either parametric or nonparametric) to approximate the critical values of a GoF statistic whose sampling distribution is not universal or does not admit a tractable tabulation. In a typical setting with a parametric or semiparametric model featuring unknown nuisance parameters, the plug-in principle and bootstrap resampling together offer a route to valid, often nonparametric, calibration. This framework is essential whenever maximum-likelihood or method-of-moments estimation induces a test statistic whose null law depends intricately on estimated parameters or the entire data generating process.
2. Construction of Test Statistics
GoF statistics eligible for bootstrap calibration encompass a wide array of empirical process and regression-based functionals. Core archetypes include:
- Kolmogorov–Smirnov/Cramér–von Mises Statistics: Quantify the empirical discrepancy between parametric or semiparametric fits and observed cumulative quantities, either in the sup-norm or L²-norm.
- Residual Prediction (RP) functionals: In high-dimensional regression, such as the Lasso or OLS, scaled residuals are used as pseudoresponses to probe for unexplained structure by regressing against covariates, transformations, or additional predictors. The resultant test statistic is typically a prediction error proxy such as residual sum of squares or an out-of-bag error (Shah et al., 2015).
- Functional Data Statistics: In the functional linear model (FLM) and its functional-response extension (FLMFR), doubly projected Cramér–von Mises (PCvM) norms of marked empirical processes are constructed from projections onto leading functional principal components or other basis expansions (García-Portugués et al., 2020, García-Portugués et al., 2019).
- Graph Functionals: For random networks, integrated deviations of subgraph counts (e.g., triangles, stars) from their null means/variances, as in subgraph-count GoF tests (Brune et al., 2023).
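To make the graph-functional archetype concrete, here is a minimal NumPy sketch that samples a homogeneous Erdős–Rényi graph and measures the deviation of the triangle count from its null mean. The helper names (`er_adjacency`, `triangle_statistic`) are illustrative, and the exact variance standardization used by Brune et al. (2023) is omitted:

```python
import numpy as np

def er_adjacency(n, p, rng):
    """Sample a symmetric adjacency matrix of a homogeneous Erdős–Rényi graph G(n, p)."""
    upper = rng.random((n, n)) < p
    A = np.triu(upper, k=1)          # keep the strict upper triangle
    return (A + A.T).astype(float)   # symmetrize; diagonal stays zero

def triangle_statistic(A, p):
    """Deviation of the triangle count from its null mean under G(n, p).
    (The exact null-variance normalization is omitted for brevity.)"""
    n = A.shape[0]
    triangles = np.trace(A @ A @ A) / 6.0              # each triangle counted 6 times
    null_mean = (n * (n - 1) * (n - 2) / 6.0) * p**3   # C(n, 3) * p^3
    return triangles - null_mean

rng = np.random.default_rng(0)
A = er_adjacency(200, 0.1, rng)
dev = triangle_statistic(A, 0.1)   # typically near 0 under the null, large under block alternatives
```

Under a block-model alternative, triangles concentrate within blocks, so this deviation grows and the test gains power.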
Key formulations include the RP statistic $T = f(\hat{R})$, where $\hat{R}$ denotes the normalized residuals and $f$ is a prediction-error functional, and, in the FLMFR, the doubly projected Cramér–von Mises (PCvM) statistic, an integrated quadratic norm of the marked empirical process of projected residuals.
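For instance, a plug-in Cramér–von Mises statistic against a fitted parametric CDF can be computed as follows. This is an illustrative sketch for an exponential null; because the rate is estimated, the statistic's null law is non-pivotal, which is precisely what the bootstrap calibration of Section 3 addresses:

```python
import numpy as np

def cvm_statistic(x, cdf):
    """Cramér–von Mises statistic via the standard computing formula,
    evaluated at a (possibly fitted) null CDF."""
    n = len(x)
    u = np.sort(cdf(x))                 # probability-integral transforms, ordered
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)
lam_hat = 1.0 / x.mean()                # MLE of the exponential rate (plug-in)
stat = cvm_statistic(x, lambda t: 1 - np.exp(-lam_hat * t))
```

Classical CvM tables assume a fully specified null; with `lam_hat` plugged in they are invalid, so the statistic must be calibrated by resampling.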
3. Bootstrap Calibration Algorithms
Parametric Bootstrap (Classical Approach)
The critical value is simulated by generating synthetic datasets under the null model, with nuisance parameters replaced by their estimators:
- Fit null model to data, obtain parameter estimates.
- Compute the observed test statistic $T_n$.
- For $b = 1, \dots, B$:
  - Simulate new data under the fitted model.
  - Recompute the parameter estimates and the bootstrap statistic $T_n^{*(b)}$.
- Approximate the p-value by the empirical exceedance proportion:
  $$\hat{p} = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\{T_n^{*(b)} \ge T_n\}.$$
- Reject $H_0$ at level $\alpha$ if $\hat{p} \le \alpha$.
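The steps above can be sketched in NumPy for an exponential null with a Cramér–von Mises statistic. The model choice and helper names are illustrative assumptions, not taken from the cited papers; note that the rate is re-estimated on every bootstrap replicate, which is what makes the calibration valid under a composite null:

```python
import numpy as np

def cvm_stat(x, cdf):
    """Cramér–von Mises statistic at a given (fitted) null CDF."""
    n = len(x)
    u = np.sort(cdf(x))
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + np.sum((u - (2 * i - 1) / (2 * n)) ** 2)

def parametric_bootstrap_pvalue(x, B=499, seed=0):
    """Parametric bootstrap GoF p-value for an exponential null with estimated rate."""
    rng = np.random.default_rng(seed)
    n = len(x)
    lam_hat = 1.0 / x.mean()                                   # step 1: fit the null model
    t_obs = cvm_stat(x, lambda t: 1 - np.exp(-lam_hat * t))    # step 2: observed statistic
    t_boot = np.empty(B)
    for b in range(B):                                         # step 3: bootstrap replicates
        x_star = rng.exponential(scale=1.0 / lam_hat, size=n)  # simulate under the fitted null
        lam_star = 1.0 / x_star.mean()                         # re-estimate on each replicate
        t_boot[b] = cvm_stat(x_star, lambda t: 1 - np.exp(-lam_star * t))
    return (1 + np.sum(t_boot >= t_obs)) / (B + 1)             # step 4: empirical p-value

rng = np.random.default_rng(42)
p_null = parametric_bootstrap_pvalue(rng.exponential(2.0, size=150))    # data from the null
p_alt = parametric_bootstrap_pvalue(rng.lognormal(0.0, 1.0, size=150))  # misspecified data
```

The `(1 + count) / (B + 1)` form guards against zero p-values and keeps the test exact-in-expectation under the null.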
Wild and Weighted Bootstraps
For heteroscedastic data or empirical process-based tests, recentering and rescaling via random weights allow for bootstrap simulation without explicit resampling of the data. A typical scheme, especially in high-dimensional or functional data, involves wild bootstrap on model residuals—multiplying residuals by i.i.d. symmetric random variables (Rademacher or Mammen’s two-point law)—followed by recomputation of fitted models and statistics in each replicate (Kojadinovic et al., 2012, García-Portugués et al., 2012, García-Portugués et al., 2019, García-Portugués et al., 2020).
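A compact sketch of this residual wild bootstrap with Rademacher multipliers follows, using an illustrative CUSUM-type marked-residual statistic; `cusum_stat` and the data-generating setup are assumptions for the example, not constructions from the cited papers:

```python
import numpy as np

def wild_bootstrap(X, y, stat_fn, B=500, seed=0):
    """Wild bootstrap of an OLS-residual GoF statistic with Rademacher weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    t_obs = stat_fn(X, resid)
    t_boot = np.empty(B)
    for b in range(B):
        v = rng.choice([-1.0, 1.0], size=n)   # i.i.d. Rademacher multipliers
        y_star = fitted + v * resid           # perturb residuals, keep the fitted mean
        beta_star, *_ = np.linalg.lstsq(X, y_star, rcond=None)  # refit in each replicate
        t_boot[b] = stat_fn(X, y_star - X @ beta_star)
    return t_obs, t_boot

def cusum_stat(X, resid):
    """Illustrative marked-empirical-process statistic: max |cumulative residual sum|
    along the ordering induced by the first non-intercept covariate."""
    order = np.argsort(X[:, 1])
    return np.max(np.abs(np.cumsum(resid[order]))) / np.sqrt(len(resid))

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)       # data generated under the null (linear model)
t_obs, t_boot = wild_bootstrap(X, y, cusum_stat)
p_value = (1 + np.sum(t_boot >= t_obs)) / (len(t_boot) + 1)
```

Because only the residuals are perturbed, the scheme preserves conditional heteroscedasticity and avoids resampling the design.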
Advanced U-statistic Corrections and Composite Nulls
In tests based on degenerate U-statistics or involving parameter estimation, the naive bootstrap needs correction for the additional stochastic variability induced by estimation. Recent advances formalize such bootstraps, for instance, the Kernel Stein Discrepancy GoF test, where the bootstrap includes second-order Taylor corrections to account for parameter estimation effects (Brueck et al., 2025).
4. Theoretical Guarantees and Type I Error Control
Uniform Validity and Consistency
Under regularity conditions—identifiability, smoothness, and consistent parameter estimation—the bootstrap distribution converges in the bounded Lipschitz metric to the true null distribution of the statistic. In high-dimensional linear models, control of the type I error hinges critically on ancillary or near-ancillary properties of residuals (exact under OLS, approximate under Lasso with compatibility and sparsity conditions) (Shah et al., 2015).
Likewise, for empirical process and U-statistic-based test statistics, the limiting null distributions are often non-pivotal infinite weighted sums of chi-squared variables, but the bootstrap (parametric or wild) consistently estimates the null quantiles, provided the influence functions (score, gradient, Hessian) are well-behaved and the Donsker property holds as needed (Kojadinovic et al., 2012, García-Portugués et al., 2020, Brueck et al., 2025). Bootstrap-based GoF tests accommodate dimension and parameter growth as long as the corresponding empirical processes remain tight.
5. Practical Implementation and Recommendations
| Regime | Typical B | Statistic type | Notes |
|---|---|---|---|
| Low-dim, OLS | 50–100 | Ancillary-based | Exchangeable |
| High-dim/Lasso | 200–500 | Prediction error | Bootstrap Lasso penalty carefully (Shah et al., 2015) |
| Functional LM | 500–1000 | PCvM, wild boot | Pre-compute geometry; use FPCR-L1S (García-Portugués et al., 2020, García-Portugués et al., 2019) |
| Graphical | 500–2000 | Subgraph count | Relies on fast graph simulation (Brune et al., 2023) |
| General Empirical Proc. | ≥1000 | KS/CvM, weighted | Multiplier (MP) versus parametric bootstrap (PB) trade-offs for large n (Kojadinovic et al., 2012) |
- Penalty and parameter tuning: Use cross-validation for regularization parameters; for group Lasso or functional principal component regression (FPCR-L1S), the "1 SE" rule is robust.
- Basis truncation and matrix computations in FLM/FLMFR: Use leading principal components explaining ≥99% of the variance; pre-compute matrix structures for trace/quadratic forms.
- Wild bootstrap weights: Mammen’s law or Rademacher preferred for stable variance and higher-moment control.
- Number of bootstraps (B): 50–100 suffices for pointwise critical values; 200–2000 stabilize p-values for omnibus or aggregated tests.
- Computational complexity: Many methods require O(n²) to O(n³) operations per bootstrap replicate; parallelization, warm starts, and analytic hat matrices (which avoid full refitting) reduce the computational load.
- Nonparametric vs parametric bootstrap: Nonparametric is more robust to model misfit; parametric is preferred for null laws depending critically on fitted parameters.
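Since the choice of wild bootstrap weights is governed by moment conditions, the two standard laws recommended above can be sketched and their moments checked empirically (function names are illustrative):

```python
import numpy as np

SQRT5 = np.sqrt(5.0)

def mammen_weights(size, rng):
    """Mammen's two-point law: E[v] = 0, E[v^2] = 1, and E[v^3] = 1
    (the nonzero third moment helps capture skewness)."""
    a, b = (1 - SQRT5) / 2, (1 + SQRT5) / 2        # the two support points
    p_a = (SQRT5 + 1) / (2 * SQRT5)                # probability of the negative point
    return np.where(rng.random(size) < p_a, a, b)

def rademacher_weights(size, rng):
    """Rademacher law: E[v] = 0, E[v^2] = 1, E[v^3] = 0 (symmetric)."""
    return rng.choice([-1.0, 1.0], size=size)

rng = np.random.default_rng(0)
v = mammen_weights(1_000_000, rng)
moments = (v.mean(), (v**2).mean(), (v**3).mean())  # approximately (0, 1, 1)
```

Rademacher weights are simpler and variance-stable; Mammen's law additionally matches the third moment, which can improve finite-sample accuracy under skewed errors.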
6. Simulation Studies and Empirical Robustness
Extensive simulations in high-dimensional settings, functional data contexts, and network analysis confirm:
- Level accuracy: Bootstrap tests attain nominal size in finite samples, outperforming asymptotic-approximation rivals for moderate n, especially for statistics whose null distributions are non-universal or depend intricately on the model.
- Power: Tests are consistent against generic alternatives—mean-misspecification, heteroscedasticity, nonlinearity, omitted variables, or structural alternatives in graphs or functional models (Shah et al., 2015, GarcÃa-Portugués et al., 2020, Brune et al., 2023).
- Omnibus detection: Modern bootstrap GoF tests are tuned for broad sensitivity and do not presuppose a specific alternative.
- Comparative performance: In functional settings, the regularized PCvM with FPCR-L1S and wild bootstrap maintains correct type I error and high power in both additive and non-additive scenarios, outperforming kernel and martingale-difference correlation-based competitors in irregular contexts (García-Portugués et al., 2019, García-Portugués et al., 2020).
7. Extensions and Advanced Applications
Bootstrap GoF methodologies extend to:
- Conditional distribution regression: Assessing full distributional regression models for both continuous and discrete outcomes using CDF process-based statistics with a parametric bootstrap for calibration (Kremling et al., 2024).
- Composite nulls and parameter estimation: Full correction for estimation-induced degeneracy in U-statistics, crucial for Stein Discrepancy and likelihood-ratio type tests (Brueck et al., 2025).
- Robust and nonparametric settings: Weighted or multiplier bootstraps provide computationally efficient alternatives for high-dimensional models, large-scale graphs, and spatially correlated data (Kojadinovic et al., 2012, Meilán-Vila et al., 2024).
- Network and graph models: Parametric bootstrapping is used for testing ER-models or block-model alternatives via resampling random graphs with fitted parameters (Brune et al., 2023, Garrard, 2017).
References
- "Goodness of fit tests for high-dimensional linear models" (Shah et al., 2015)
- "Goodness-of-fit tests for functional linear models based on integrated projections" (GarcÃa-Portugués et al., 2020)
- "A goodness-of-fit test for the functional linear model with functional response" (GarcÃa-Portugués et al., 2019)
- "Goodness of fit testing based on graph functionals for homogenous Erdös Renyi graphs" (Brune et al., 2023)
- "KSD Aggregated Goodness-of-fit Test" (Schrab et al., 2022)
- "Composite goodness-of-fit test with the Kernel Stein Discrepancy and a bootstrap for degenerate U-statistics with estimated parameters" (Brueck et al., 26 Oct 2025)
- "Goodness-of-fit testing based on a weighted bootstrap" (Kojadinovic et al., 2012)
These works collectively establish a rigorous theoretical and methodological framework for bootstrap GoF testing in modern applied statistics and probability.