Cross-Fitting: Sample Splitting
- Cross-fitting is a statistical method that splits the data into folds to eliminate empirical-process bias from data-adaptive nuisance estimators.
- It ensures root-n consistency and asymptotic normality by using out-of-fold evaluations, even in high-dimensional and dependent data scenarios.
- The technique underpins methods like doubly robust estimation and meta-learning, improving variance estimation and practical inference in complex models.
Cross-fitting, also known as sample splitting or out-of-fold evaluation, is a statistical methodology designed to eliminate empirical-process bias that arises when estimating target functionals using data-adaptive nuisance parameter estimates. Cross-fitting is now central to high-dimensional and nonparametric inference, particularly in causal machine learning, semiparametric estimation, changepoint detection with flexible models, high-dimensional two-sample testing, and variance-debiased estimation in complex dependence structures.
1. Foundational Definition and Rationale
In the canonical semiparametric or causal inference paradigm, the parameter of interest is often of the form $\theta_0 = \mathbb{E}[f_{\eta_0}(X)]$, where $f_{\eta_0}$ depends on one or more nuisance functions $\eta_0$ (such as propensity scores or outcome regressions). Plug-in estimators that use a machine-learned $\hat\eta$ trained on the entire sample induce a bias term of the form $(\mathbb{P}_n - P)(f_{\hat\eta} - f_{\eta_0})$, where $\mathbb{P}_n$ denotes the empirical measure. This empirical-process bias can dominate the stochastic fluctuation, especially with a high-dimensional or highly adaptive $\hat\eta$, and may violate the conditions for root-$n$ consistency and valid CLT-based inference (Balkus et al., 15 Jan 2026, Newey et al., 2018, Okasa, 2022, Bia et al., 2020).
Cross-fitting eliminates this bias by splitting the sample into $K$ folds and, for each fold, evaluating $f$ using a nuisance function $\hat\eta^{-k}$ trained only on the complement of that fold. The cross-fitted estimator aggregates the out-of-fold functionals:
$$\hat\theta = \frac{1}{n} \sum_{k=1}^{K} \sum_{i \in I_k} f_{\hat\eta^{-k}}(X_i),$$
where $I_1, \dots, I_K$ are the index sets for the folds. This ensures that $X_i$ and $\hat\eta^{-k}$ are independent conditional on the training split, neutralizing own-observation bias and permitting the use of complex ML methods for $\hat\eta$ (Okasa, 2022, Balkus et al., 15 Jan 2026, Zeng, 2022).
2. Cross-Fitting Protocols and Theoretical Guarantees
The essential algorithm for $K$-fold cross-fitting is as follows:
- Partition $\{1, \dots, n\}$ into $K$ disjoint (roughly equal-sized) folds $I_1, \dots, I_K$.
- For each $k = 1, \dots, K$:
  - Train the nuisance $\hat\eta^{-k}$ on $\{1, \dots, n\} \setminus I_k$.
  - Compute $f_{\hat\eta^{-k}}(X_i)$ for $i \in I_k$.
- Aggregate across folds as above.
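The steps above can be sketched in code. The following is a minimal illustration on simulated data from a randomized experiment with a known treatment probability; the data-generating process, function names, and the use of OLS outcome regressions are illustrative assumptions, not prescriptions from the cited works:

```python
import numpy as np

def crossfit_ate(X, T, y, K=5, p=0.5, seed=0):
    """K-fold cross-fitted AIPW estimate of the ATE in a randomized
    experiment with known treatment probability p.  The outcome
    regressions mu_1, mu_0 are OLS fits trained out-of-fold."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % K            # random fold assignment
    Xd = np.column_stack([np.ones(n), X])     # design matrix with intercept
    scores = np.empty(n)
    for k in range(K):
        test, train = folds == k, folds != k
        # train each outcome regression only on the complement of fold k
        beta1, *_ = np.linalg.lstsq(Xd[train & (T == 1)], y[train & (T == 1)], rcond=None)
        beta0, *_ = np.linalg.lstsq(Xd[train & (T == 0)], y[train & (T == 0)], rcond=None)
        mu1, mu0 = Xd[test] @ beta1, Xd[test] @ beta0
        # AIPW influence-function score, evaluated only on held-out fold k
        scores[test] = (mu1 - mu0
                        + T[test] * (y[test] - mu1) / p
                        - (1 - T[test]) * (y[test] - mu0) / (1 - p))
    return scores.mean(), scores

# toy randomized experiment with true ATE = 2
rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n)
y = 1.0 + X + 2.0 * T + rng.normal(size=n)
theta, scores = crossfit_ate(X, T, y, K=5)
```

Because each score uses only out-of-fold nuisance fits, arbitrarily flexible learners could replace the OLS step without re-deriving the estimator.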
This protocol applies to doubly robust estimation, meta-learners for CATE, sample selection models, AIPW/TMLE, and many others (Balkus et al., 15 Jan 2026, Qian et al., 2024, Bia et al., 2020, Okasa, 2022, Newey et al., 2018, Ellul et al., 2024), with extensions to three-way cross-fitting (using separate folds for each nuisance and the final regression) when higher-order orthogonality is required (Fisher et al., 2023, McClean et al., 2024).
Main Theoretical Results
- Empirical-process bias elimination: Cross-fitting renders the empirical-process term $o_p(n^{-1/2})$ under $\sqrt{n}$-rate CLTs for the oracle score and mild conditions on the nuisance estimation error and dependence structure (Balkus et al., 15 Jan 2026).
- Orthogonal (doubly robust) scores: With Neyman-orthogonal estimating equations (as in DML or DR learners), cross-fitting yields root-$n$ consistency and asymptotic normality for $\hat\theta$, provided each nuisance estimator converges at a rate of $o(n^{-1/4})$ (Bia et al., 2020, Zeng, 2022, Okasa, 2022, Newey et al., 2018).
- Variance estimation: Cross-fitted influence functions (scores evaluated using out-of-fold nuisances) are used for asymptotic variance estimation and Wald-type confidence intervals (Ellul et al., 2024, Bia et al., 2020, Zeng, 2022).
- Correlated and dependent data: The method remains valid under weak dependence—including m-dependence, mixing conditions, time series, clusters, and spatial/network structures—so long as the number of correlated pairs grows at a subquadratic rate, $o(n^2)$ (Balkus et al., 15 Jan 2026, Lunde, 2019).
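The variance-estimation point can be made concrete: given the vector of cross-fitted influence-function scores, a Wald interval follows from the score mean and its empirical variance. A minimal sketch (the function name and simulated scores are illustrative; cluster-robust or HAC variants would replace the IID variance under dependence):

```python
import numpy as np

def wald_ci(scores, alpha=0.05):
    """Wald-type 95% confidence interval from cross-fitted
    influence-function scores: the point estimate is the score mean,
    and the variance is the empirical score variance divided by n."""
    n = len(scores)
    theta = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(n)
    z = 1.959963984540054  # standard normal 97.5% quantile
    return theta, (theta - z * se, theta + z * se)

# illustrative scores centered at a true value of 2.0
scores = np.random.default_rng(0).normal(loc=2.0, scale=1.0, size=1000)
theta, (lo, hi) = wald_ci(scores)
```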
3. Cross-Fitting Under Correlation, Dependence, and Non-IID Sampling
Although cross-fitting was originally motivated in the IID setting, its applicability extends to dependent data. The core result (Balkus et al., 15 Jan 2026) demonstrates that cross-fitting as if the data were IID (randomly assigning units to folds, ignoring dependence) remains asymptotically valid for debiasing the empirical process under:
- A dependent-data CLT for the oracle score (the score evaluated at the true nuisance $\eta_0$)
- Well-behaved estimation error for the nuisance fits
- Subquadratic growth of the number of correlated pairs, i.e., $o(n^2)$
This obviates the need for bespoke fold construction in most clustered, time-series, or networked settings. In fact, purposely packing highly correlated units into the same fold typically increases the variance of the empirical process term, and therefore generic K-fold partitioning is recommended for bias elimination (Balkus et al., 15 Jan 2026).
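In practice, the recommended "as-IID" splitting is just uniform random fold assignment that ignores cluster or time structure. A minimal sketch (the function name is an illustrative assumption):

```python
import numpy as np

def as_iid_folds(n, K, seed=0):
    """Assign n units to K folds uniformly at random, ignoring any
    cluster/time structure -- the generic 'as-IID' splitting that the
    result discussed above shows remains valid under weak dependence."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n) % K  # balanced folds (sizes differ by at most 1)

folds = as_iid_folds(10, K=3, seed=42)
```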
Limitation: If correlation is extremely dense (e.g., long-range dependence where most pairs are correlated), specialized splitting may still be needed. Further, cross-fitting does not address variance estimation in the presence of correlation; cluster-robust or HAC variance estimators must still be used (Balkus et al., 15 Jan 2026, Lunde, 2019).
In randomized experiments under a design-based framework (fixed covariates and outcomes; randomization only in treatment), standard cross-fitting ceases to be unbiased because splitting induces dependence between folds. Conditional cross-fitting employs splits defined conditional on the treatment assignment so that for each fold, the assignment mechanism remains valid and independent across splits; this preserves unbiasedness and efficiency (Lu et al., 21 Aug 2025).
4. Advanced Cross-Fitting Designs: Double/Triple Splitting and Averaging
Double and Triple Cross-Fitting
Double cross-fitting (DCDR)—using independent folds for separate nuisance functions—tightens control of higher-order bias and achieves faster rates and minimax optimality in non-Donsker regimes under Hölder smoothness. Triple (or more generally, multi-way) cross-fitting uses separate folds for each of the nuisance components as well as the main regression, removing both first- and second-order bias terms and accelerating convergence (Newey et al., 2018, Fisher et al., 2023, McClean et al., 2024).
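A hedged sketch of the double cross-fitting idea, again on simulated randomized-experiment data: the propensity and the outcome regressions are trained on disjoint folds, the score is evaluated on a third, and roles are rotated. The function name, the three-fold rotation scheme, and the simple fold-mean propensity are illustrative assumptions, not the exact constructions of the cited papers:

```python
import numpy as np
from itertools import permutations

def double_crossfit_ate(X, T, y, seed=0):
    """Double cross-fitting sketch: propensity and outcome regressions
    are trained on *disjoint* folds, and the AIPW score is evaluated on
    a third fold; roles are rotated over all fold permutations."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % 3
    Xd = np.column_stack([np.ones(n), X])
    estimates = []
    for a, b, c in permutations(range(3)):
        A, B, C = folds == a, folds == b, folds == c
        phat = T[A].mean()  # propensity trained on fold a (constant, as randomized)
        beta1, *_ = np.linalg.lstsq(Xd[B & (T == 1)], y[B & (T == 1)], rcond=None)
        beta0, *_ = np.linalg.lstsq(Xd[B & (T == 0)], y[B & (T == 0)], rcond=None)
        mu1, mu0 = Xd[C] @ beta1, Xd[C] @ beta0  # outcome fits from fold b
        score = (mu1 - mu0
                 + T[C] * (y[C] - mu1) / phat
                 - (1 - T[C]) * (y[C] - mu0) / (1 - phat))
        estimates.append(score.mean())          # evaluated on fold c
    return float(np.mean(estimates))

# toy randomized experiment with true ATE = 2
rng = np.random.default_rng(7)
n = 3000
X = rng.normal(size=n)
T = rng.binomial(1, 0.5, size=n)
y = 1.0 + X + 2.0 * T + rng.normal(size=n)
theta = double_crossfit_ate(X, T, y)
```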
Repeated Splitting and Aggregation
To reduce estimator variance and increase inferential reproducibility, practitioners often repeat the cross-fitting procedure across random partitions and aggregate (typically by mean or median) the resulting estimator values. Aggregating multiple splits stabilizes estimators and p-values, mitigates the "p-value lottery" seen with a single split, and is formally justified via central limit theorems that account for the dependence structure of overlapping splits (Fava, 7 Nov 2025, Jacob, 2020, Städler et al., 2012).
| Aggregation Technique | Main Purpose | Empirical Impact |
|---|---|---|
| Mean over splits | Lower variance, higher efficiency | Generally reduced MSE |
| Median over splits | Robustness to outliers/split artifacts | Estimator less sensitive to tails |
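A minimal sketch of the repeat-and-aggregate recipe (the function names and the toy split-dependent estimator are illustrative assumptions):

```python
import numpy as np

def aggregate_over_splits(estimator, data, S=21, seed=0):
    """Repeat a split-dependent estimator over S random partitions and
    return mean- and median-aggregated values; the median is the more
    outlier-robust choice in finite samples."""
    rng = np.random.default_rng(seed)
    ests = np.array([estimator(data, rng.integers(2**31)) for _ in range(S)])
    return ests.mean(), np.median(ests)

def toy_split_estimator(y, split_seed):
    # illustrative stand-in: estimate the mean from one random half of
    # the data, so the value depends on the split (as cross-fitted
    # estimators do)
    rng = np.random.default_rng(split_seed)
    half = rng.permutation(len(y))[: len(y) // 2]
    return y[half].mean()

y = np.random.default_rng(3).normal(loc=5.0, size=500)
m_mean, m_med = aggregate_over_splits(toy_split_estimator, y, S=21)
```

In a real application, `toy_split_estimator` would be replaced by a full cross-fitted estimator whose value depends on the random fold partition.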
5. Application Domains and Empirical Behavior
Cross-fitting is now integral to modern double/debiased machine learning (DML) (Bia et al., 2020, Zeng, 2022), heterogeneous treatment effect meta-learners (e.g., DR-learner, R-learner, T-learner) (Okasa, 2022, Jacob, 2020, Fisher et al., 2023), high-dimensional two-sample testing (Städler et al., 2012), changepoint detection in complex models (Qian et al., 2024), post-selection inference (Kravitz et al., 2019), and variance-debiased evaluation in A/B testing and sequential trials (Kessler et al., 3 Dec 2025).
Empirical findings include:
- In large-sample and high-dimensional settings, cross-fitting robustly eliminates overfitting bias, enabling theoretically justified use of highly adaptive or black-box ML models for nuisance fitting (Okasa, 2022, Ellul et al., 2024).
- In small samples, full-sample or repeated estimation may outperform cross-fitting because each fold's nuisance model is trained on a smaller subset of the data (Okasa, 2022, Jacob, 2020).
- When deployed with doubly robust scores (Neyman orthogonality), cross-fitting delivers root-$n$ rates and attains semiparametric efficiency bounds for a wide variety of estimands (Bia et al., 2020, Newey et al., 2018).
- In time-series and spatial contexts, standard cross-fitting maintains inferential validity under mild mixing or dependence conditions and is simpler than block or buffer techniques (Balkus et al., 15 Jan 2026, Lunde, 2019).
6. Limitations, Alternatives, and Open Problems
- Variance Estimation: Cross-fitting only debiases point estimates; standard errors must account for cross-sectional, cluster, or time-series dependence using robust estimators separate from the cross-fitting protocol (Balkus et al., 15 Jan 2026, Lunde, 2019, Ellul et al., 2024).
- Dense Dependence: When the number of correlated pairs grows at rate $\Theta(n^2)$ (e.g., no decay in dependence), cross-fitting as-IID may incur inflated variance. Research into fold designs that explicitly minimize finite-sample variance in these settings is ongoing (Balkus et al., 15 Jan 2026).
- Low Sample Regimes & Stability: Where sample sizes are small or cross-fitted nuisances are highly variable, leave-one-out–stable ML algorithms (e.g., bagging, ensemble methods) can recover root-$n$ inference without cross-fitting, provided stability conditions hold (Chen et al., 2022, Chen et al., 11 Feb 2026).
- Conditional Cross-Fitting in Experiments: In non-IID randomized experiments, carefully designed conditional cross-fitting ensures unbiasedness and asymptotic normality without assuming a super-population model (Lu et al., 21 Aug 2025).
- Changepoint Detection and Model Selection: For complex fitting tasks (e.g., high-dimensional changepoint detection), out-of-sample (cross-fit) loss targeting is essential to avoid degenerate or biased minima induced by overfitting on in-sample loss (Qian et al., 2024).
- Hyperparameter Sensitivity and Averaging: Aggregating results over multiple random splits by median is empirically more robust than mean in finite samples, especially in the presence of outlier nuisance fits (Jacob, 2020).
7. Summary Table: Theoretical and Practical Aspects
| Aspect | Result/Implication | Reference |
|---|---|---|
| Bias removal | Cross-fitting renders the empirical-process bias $o_p(n^{-1/2})$ | (Balkus et al., 15 Jan 2026) |
| Efficiency | DR cross-fitting achieves semiparametric efficiency under $o(n^{-1/4})$ rates for each nuisance | (Bia et al., 2020, Zeng, 2022) |
| Dependent data | As-IID cross-fitting valid under CLT, weak dependence | (Balkus et al., 15 Jan 2026) |
| Small $n$ | Full-sample or out-of-bag preferable for nuisance estimation | (Okasa, 2022) |
| Double/triple cross-fitting | Accelerates rates, achieves efficiency under minimal smoothness | (Newey et al., 2018, McClean et al., 2024, Fisher et al., 2023) |
| Variance estimation | Requires robust estimators (not handled by cross-fitting per se) | (Balkus et al., 15 Jan 2026, Ellul et al., 2024) |
| Conditional design | Conditional cross-fitting restores unbiasedness under finite populations | (Lu et al., 21 Aug 2025) |
Cross-fitting is a statistically rigorous, robust, and computationally tractable method for bias correction and valid inference in data-adaptive estimation with complex nuisance fitting, extending to correlated and dependent data, with wide adoption across modern causal and high-dimensional inference. Current methodological frontiers include optimizing cross-fitting for dense dependence structures, formalizing stability-based alternatives, and developing efficient implementations for highly structured experimental designs.