Design-Based Cross-Validation
- Design-based cross-validation is a framework that extends classical methods by incorporating experimental design elements such as blocking, randomization, and complex sampling.
- It employs techniques like Horvitz–Thompson weighting, leave-one-out validation, and kernel herding to correct bias, variance, and instability in predictive error estimation.
- The method is applied in survey sampling, computer experiments, and designed experiments to enable robust model evaluation and selection in non-i.i.d. settings.
Design-based cross-validation is a framework that generalizes classical cross-validation methods by explicitly accounting for the experiment’s design structure—such as blocking, randomization, complex sampling, or adaptive selection—when estimating predictive error, selecting models, or constructing validation strategies. It subsumes adaptations in computer experiments, survey sampling, and classical designed experiments, yielding methodology that addresses bias, variance, and algorithmic brittleness arising from naïve cross-validation when the data-generating mechanism departs from the i.i.d. assumption or when the design structure is critical for valid inference. Key applications include adaptive sampling in Gaussian process emulation, risk estimation under finite-population sampling, sequential design for computer experiments, and construction of efficient validation/test sets for surrogate modeling.
1. Motivations and Fundamentals
Standard cross-validation assumes i.i.d. data with test sets representative of unseen populations, an assumption disrupted by structured designs, complex sampling, or deterministic space-filling experiments. In finite-population or survey settings, or when analysis involves structured experimental designs (e.g., factorial, split-plot, or adaptive sequential designs), ignoring the design can result in systematically biased prediction error estimates, invalid inference, or unstable model selection.
In design-based cross-validation frameworks, randomization and sampling indicators—not the outcomes—are treated as stochastic, so all inference (e.g., out-of-sample risk or validation error) is conducted with respect to the known sampling or assignment mechanism. This approach provides unbiased or consistent estimates of true out-of-sample error, corrects for test-train distribution mismatch, and leverages the structure of the design for more robust and generalizable model assessment (Zhang et al., 2023, Weese et al., 17 Jun 2025).
2. Methodological Variants and Algorithmic Structure
2.1. Survey and Finite-Population Settings
In survey analysis and similar finite-population frameworks, design-based cross-validation treats the finite population $U = \{1, \dots, N\}$ as fixed and attributes all stochasticity to the sampling mechanism (possibly an unequal-probability or complex sample design). For a model $\hat f$ fit on a training sample $s \subset U$, the design-based risk is the finite-population average loss
$$R_N(\hat f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big),$$
with $L$ a loss function such as squared error. Cross-validation error is estimated using Horvitz–Thompson–type weights $1/\pi_i$, where $\pi_i$ denotes the inclusion probability of unit $i$, to obtain unbiased estimators of out-of-sample predictive error (Zhang et al., 2023).
Algorithmic steps:
- Specify the probability sampling scheme $p(s)$ and obtain the inclusion probabilities $\pi_i = \Pr(i \in s)$.
- Split the observed sample into train/test sets according to a secondary splitting design.
- Compute prediction residuals on the test sets.
- Construct design-unbiased estimators via inverse-probability weighting (e.g., Horvitz–Thompson weights $1/\pi_i$).
- Aggregate over splits, apply Rao–Blackwellization, and provide variance estimation for valid inference.
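The steps above can be sketched in a minimal simulation. The population, the Poisson-sampling scheme, and all sizes below are invented for illustration and are not taken from Zhang et al. (2023):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical finite population of N units (sizes illustrative).
N = 10_000
x = rng.uniform(0.0, 1.0, N)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, N)

# Unequal-probability Poisson sampling: pi_i proportional to a size measure.
size = 0.5 + x                        # larger-x units are sampled more often
pi = 600.0 * size / size.sum()        # expected sample size about 600
sampled = rng.random(N) < pi
xs, ys, pis = x[sampled], y[sampled], pi[sampled]

# Secondary design: split the observed sample into train/test halves.
test = rng.random(xs.size) < 0.5
w_tr = 1.0 / pis[~test]
Xtr = np.column_stack([np.ones((~test).sum()), xs[~test]])
beta = np.linalg.solve(Xtr.T @ (w_tr[:, None] * Xtr), Xtr.T @ (w_tr * ys[~test]))

# Horvitz-Thompson-type risk estimate: each squared test residual is
# inflated by 1/pi_i (and by 2 for the 50/50 secondary split).
resid2 = (ys[test] - (beta[0] + beta[1] * xs[test])) ** 2
risk_ht = (resid2 / pis[test]).sum() * 2.0 / N
print(round(risk_ht, 3))  # roughly the residual noise variance (0.25)
```

Because the weighting undoes the unequal-probability selection, the estimate targets the finite-population risk rather than the (tilted) sample average.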
2.2. Computer Experiments and Adaptive Sampling
In computer experiments or surrogate modeling using Gaussian process (GP) emulators, design-based cross-validation is used both for adaptive data acquisition and for robust validation set construction.
GP Emulator Adaptive Sampling:
The method uses expected squared leave-one-out (ES-LOO) error at each observed design point to identify high-impact areas for further sampling (Mohammadi et al., 2020):
- Fit a GP to the current data $\{(x_i, y_i)\}_{i=1}^n$.
- For each design point $x_i$, calculate the ES-LOO error
$$\mathcal{E}_i = \big(y_i - \mu_{-i}(x_i)\big)^2 + \sigma^2_{-i}(x_i),$$
where $\mu_{-i}$ and $\sigma^2_{-i}$ are the leave-one-out predictive mean and variance, using single-matrix-inverse formulas for computational efficiency.
- Emulate the ES-LOO surface by fitting a secondary GP to the pairs $\{(x_i, \mathcal{E}_i)\}$.
- Formulate a pseudo expected improvement (pseudo-EI) acquisition criterion by multiplying the standard EI by a repulsion function (RF) to enforce global exploration and avoid design clustering.
- Select new points (sequential or batch) maximizing pseudo-EI, update, and repeat.
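A compact sketch of the ES-LOO computation from a single matrix inverse, assuming a zero-mean GP with an RBF kernel and fixed hyperparameters (the paper's exact normalization of the criterion may differ; the closed-form LOO identities used here are classical):

```python
import numpy as np

def es_loo(X, y, lengthscale=0.2, var=1.0, nugget=1e-6):
    """Expected squared leave-one-out error for a zero-mean GP with an RBF
    kernel, from a single matrix inverse (closed-form LOO identities:
    y_i - mu_{-i} = [K^{-1}y]_i / [K^{-1}]_{ii}, sigma^2_{-i} = 1/[K^{-1}]_{ii})."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    Q = np.linalg.inv(var * np.exp(-0.5 * d2 / lengthscale**2)
                      + nugget * np.eye(len(y)))
    loo_err = (Q @ y) / np.diag(Q)      # leave-one-out residuals
    loo_var = 1.0 / np.diag(Q)          # leave-one-out predictive variances
    return loo_err**2 + loo_var         # expected squared LOO error

# 1-D toy target that varies faster as x grows: points in the fast-varying
# region should receive larger ES-LOO scores, flagging it for refinement.
X = np.linspace(0.0, 1.0, 12)[:, None]
y = np.sin(8.0 * np.pi * X[:, 0] ** 2)
scores = es_loo(X, y)
print(scores.argmax())
```

The shortcut matters in the sequential setting: naive LOO refits the GP $n$ times per iteration, whereas this computes all $n$ scores from one inverse.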
Kernel Herding for Validation Design:
Validation sets for integrated squared error estimation can be constructed via conditional maximum mean discrepancy (MMD) minimization, using kernel herding methods to place validation points in “holes” left by the training design (Pronzato et al., 2021). Reweighting of validation points (e.g., MN variant) counteracts overestimation of global error due to large conditional variances away from the training points.
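A toy sketch of the herding idea: greedy selection under the GP's conditional (posterior) covariance, so that validation points land in regions the training design leaves uncovered. The kernel, lengthscale, candidate grid, and 1-D setup are illustrative, and the paper's reweighting step is omitted:

```python
import numpy as np

def rbf(A, B, ls=0.15):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def conditional_kernel(cand, X_train, ls=0.15, nugget=1e-9):
    """GP posterior covariance of the candidate points given the training
    design: k_c(x, x') = k(x, x') - k(x, X) K(X, X)^{-1} k(X, x')."""
    KcX = rbf(cand, X_train, ls)
    KXX = rbf(X_train, X_train, ls) + nugget * np.eye(len(X_train))
    return rbf(cand, cand, ls) - KcX @ np.linalg.solve(KXX, KcX.T)

def herd_validation(cand, X_train, m):
    """Greedy kernel herding under the conditional kernel: each new point
    maximizes average conditional-kernel mass over the candidate grid minus
    similarity to the points already chosen (a repulsion term)."""
    Kc = conditional_kernel(cand, X_train)
    potential = Kc.mean(axis=1)
    chosen = []
    for t in range(m):
        repulsion = Kc[:, chosen].sum(axis=1) / (t + 1) if chosen else 0.0
        chosen.append(int(np.argmax(potential - repulsion)))
    return cand[chosen]

# Training design clustered on [0, 0.4]; the herded validation points should
# land mostly in the uncovered right-hand part of the domain.
rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 0.4, (10, 1))
cand = np.linspace(0.0, 1.0, 201)[:, None]
X_val = herd_validation(cand, X_train, 5)
print(np.sort(X_val[:, 0]))
```

Conditioning on the training design zeroes the kernel near existing points, which is what steers the validation set into the design's "holes".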
2.3. Classical Designed Experiments
In small, highly structured classical designs, standard $k$-fold CV can be unreliable due to fixed-design confounding and lack of replication. Design-based adaptations involve:
- Preferential use of leave-one-out CV (LOOCV), which respects design structure and maximizes use of the limited data.
- Forming validation folds based on blocks or whole plots in the presence of split-plot or blocking structures.
- Employing LOOCV-based best-subsets model selection for response surface or screening experiments (Weese et al., 17 Jun 2025).
Empirical findings demonstrate that LOOCV-based model selection is often superior to, or competitive with, $k$-fold CV and the little bootstrap for both prediction and factor screening in designed experiments with small run sizes.
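LOOCV-based best-subsets selection can be sketched with the classical PRESS shortcut, which avoids refitting the model for each held-out run. The design, factor count, and effect sizes below are invented for illustration:

```python
import numpy as np
from itertools import combinations

def press(X, y):
    """PRESS = sum of squared LOO residuals for least squares, via the
    hat-matrix shortcut e_{-i} = e_i / (1 - h_ii): no model refitting."""
    H = X @ np.linalg.pinv(X)
    e = y - H @ y
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))

def loocv_best_subset(X, y, max_size=3):
    """Exhaustive best-subsets selection scored by LOOCV (PRESS);
    an intercept column is always included."""
    n, p = X.shape
    best = (np.inf, ())
    for k in range(1, max_size + 1):
        for cols in combinations(range(p), k):
            Xs = np.column_stack([np.ones(n), X[:, cols]])
            best = min(best, (press(Xs, y), cols))
    return best

# Toy screening scenario: 12-run random two-level design with 5 candidate
# factors, of which only factors 0 and 2 are truly active.
rng = np.random.default_rng(3)
X = rng.choice([-1.0, 1.0], size=(12, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0.0, 0.3, 12)
score, active = loocv_best_subset(X, y)
print(active)  # the selected subset should contain factors 0 and 2
```

The $e_i/(1-h_{ii})$ identity is what makes LOOCV cheap enough for exhaustive subset enumeration in small designs.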
3. Statistical Properties and Theory
Design-based cross-validation estimators possess several key properties:
- Unbiasedness/Consistency: Design-based estimators (e.g., Horvitz–Thompson estimators in survey settings) yield unbiased or consistent estimates of true prediction error, provided the design is known and properly incorporated (Zhang et al., 2023).
- Variance Estimation: Closed-form variance estimators exist via design-based theory, enabling confidence interval construction.
- Exploration-Exploitation Trade-off: In adaptive sampling for computer experiments, combining ES-LOO-based exploitation with repulsion-enforcing exploration yields empirical improvements in model accuracy and space-filling (Mohammadi et al., 2020).
- Oracle Properties: In kernel herding-based validation designs, the minimization of conditional MMD ensures rapid convergence to the optimal validation distribution under convexity (Pronzato et al., 2021).
LOO-CV in small, structured designs mitigates overfitting relative to $k$-fold CV and is robust to instability in model selection (Weese et al., 17 Jun 2025).
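For concreteness, the design-based variance machinery behind the second property is the standard Horvitz–Thompson form, stated here as a sketch in generic notation ($e_i$ is the unit-level loss, $\pi_i$ and $\pi_{ij}$ are first- and second-order inclusion probabilities, with the convention $\pi_{ii} = \pi_i$):

```latex
% Design variance of the HT risk estimator over the sampling design
\mathbb{V}\!\left(\hat R_{\mathrm{HT}}\right)
  = \frac{1}{N^2} \sum_{i \in U} \sum_{j \in U}
    \left(\pi_{ij} - \pi_i \pi_j\right) \frac{e_i}{\pi_i}\,\frac{e_j}{\pi_j},
\qquad
% Unbiased variance estimator computable from the observed sample s
\widehat{\mathbb{V}}
  = \frac{1}{N^2} \sum_{i \in s} \sum_{j \in s}
    \frac{\pi_{ij} - \pi_i \pi_j}{\pi_{ij}}\,
    \frac{e_i}{\pi_i}\,\frac{e_j}{\pi_j}.
```

The second expression requires the joint inclusion probabilities $\pi_{ij}$ to be positive, which is why the sampling design must be fully known.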
4. Empirical Comparisons and Recommendations
Comparative studies across application domains have established the following:
- In survey/registry settings: Naïve CV methods can be severely biased under unequal-probability or without-replacement sampling. Design-based approaches correct this bias and allow valid inference for out-of-sample prediction (Zhang et al., 2023).
- In computer experiments: ES-LOO-based adaptive sampling and kernel herding-based validation outperform random, Sobol’, or LOOCV-based validation strategies—especially when the objective is accurate global error estimation or efficient emulator learning (Pronzato et al., 2021, Mohammadi et al., 2020).
- In designed experiments: LOOCV enables effective model selection and prediction, often outperforming $k$-fold CV and the little bootstrap in both response-surface and screening regimes, with Gauss–Lasso providing a fast non-CV alternative for factor screening (Weese et al., 17 Jun 2025).
Recommended practices include:
- LOOCV when the run size is small or the design is highly structured;
- Construction of validation/test sets via kernel herding with conditional kernels, followed by variance-adapted reweighting;
- Respecting blocking or randomization units when forming CV folds.
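The last recommendation can be sketched as a simple leave-one-block-out fold constructor (a generic utility written for this article, not code from the cited papers):

```python
from collections import defaultdict

def blocked_folds(block_ids):
    """Leave-one-block-out folds: every run that shares a blocking unit
    (e.g., a whole plot) is held out together, never split across folds."""
    groups = defaultdict(list)
    for idx, b in enumerate(block_ids):
        groups[b].append(idx)
    n = len(block_ids)
    folds = []
    for b, test_idx in groups.items():
        held = set(test_idx)
        folds.append(([i for i in range(n) if i not in held], test_idx))
    return folds

# Split-plot style example: whole plots "A", "B", "C" each contain several runs.
blocks = ["A", "A", "A", "B", "B", "C", "C", "C"]
for train_idx, test_idx in blocked_folds(blocks):
    print(test_idx)
# Each held-out set is one intact whole plot:
# [0, 1, 2], then [3, 4], then [5, 6, 7]
```

Holding out whole blocks keeps the train/test split aligned with the randomization units, so the CV error reflects prediction for a new block rather than interpolation within one.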
5. Connections to Broader Methodological Themes
Design-based cross-validation bridges multiple research domains:
- It generalizes classical cross-validation to settings with complex or non-i.i.d. designs, addressing deficiencies in naïve validation and risk estimation frameworks.
- In surrogate modeling, it unifies space-filling, adaptive, and error-aware sampling via principled exploitation of cross-validated uncertainties.
- In survey inference, it provides a rigorous framework for out-of-sample error estimation, model selection, and ensemble prediction under known complex sampling designs.
- For experimental design research, these tools support robust model validation and selection in the presence of aliasing, confounding, and limited replication.
6. Notable Algorithms and Implementation Templates
Below is a concise tabular summary of prominent design-based cross-validation algorithmic templates.
| Application | Core Algorithmic Steps | Key References |
|---|---|---|
| Survey/fixed population | Sampling design → Data split → CV with design-unbiased weighting → Rao–Blackwellization → Variance estimation | (Zhang et al., 2023) |
| GP emulator/adaptive design | Fit GP → Compute fast ES-LOO via Dubrule identities → Emulate ES-LOO surface → Pseudo-EI criterion with repulsion → Sequential/batch sampling | (Mohammadi et al., 2020, Gratiet et al., 2012) |
| Validation/test set construction | Fit GP covariance → Kernel herding with the conditional kernel to place validation points → Weight correction (MN-type variants) → Estimate generalization error | (Pronzato et al., 2021) |
| DOE model selection | LOOCV-based best-subsets/penalized regression → Fold construction respects blocking/plot → Out-of-sample RMSPE for each model | (Weese et al., 17 Jun 2025) |
7. Limitations and Future Directions
Open challenges include theoretical guarantees (e.g., proofs of convergence for adaptive sampling rules such as ES-LOO/pseudo-EI in GP emulation (Mohammadi et al., 2020)), extension to further design structures or to non-Gaussian surrogates, and broader incorporation of design-based validation in community benchmarks. Further studies are warranted to develop rigorous variance estimation for arbitrarily complex design-based cross-validation procedures, and to establish best practices for design-based validation in deep learning and high-dimensional settings.
Design-based cross-validation stands as a unifying methodological principle for validating predictive models and selecting adaptive or test designs when classical assumptions are inadequate, with broad applicability in survey statistics, computer experiments, and structured experimental design.