
Cross-Fitting in Semiparametric Estimation

Updated 8 March 2026
  • Cross-fitting is a sample-splitting methodology that partitions data into folds to separate nuisance estimation from evaluation, thereby eliminating own-observation bias.
  • It allows the use of highly adaptive machine learning regressors, achieving root-n consistency and semiparametric efficiency under mild rate conditions.
  • The technique extends to various complex settings—such as time series, clustered, and network data—while ensuring unbiased estimation even in experimental designs.

Cross-fitting is a sample-splitting methodology fundamental to semiparametric statistical estimation and high-dimensional causal inference. It separates model fitting (“nuisance” estimation) from evaluation to eliminate own-observation bias and relax empirical-process complexity assumptions, permitting the use of highly adaptive machine learning regressors. Cross-fitting underlies debiased/double machine learning, efficient estimation of average and heterogeneous effects, and general plug-in estimation of linear functionals. Its rigorous asymptotic properties hold for independent, clustered, time-series, networked, and even non-i.i.d. randomized design data, and, under appropriate conditions, lead to root-n consistency and semiparametric efficiency.

1. Cross-Fitting: Formalism and Algorithms

Cross-fitting partitions observed units $\{Z_i\}_{i=1}^n$ into $K$ disjoint folds $I_1, \ldots, I_K$ of size $n/K$. For each $k$, a model for the nuisance parameter(s) $\hat{\eta}^{I_k^c}$ is trained on the complement $I_k^c$, and its predictions are plugged into the estimating equations evaluated on the held-out fold $I_k$. The generic $K$-fold cross-fitted estimator for a statistical functional $\beta_0 = \mathbb{E}[m(Z, \eta_0)]$ is

$$\hat\beta_{\mathrm{CF}} = \frac{1}{n} \sum_{k=1}^K \sum_{i\in I_k} m(Z_i, \hat{\eta}^{I_k^c}).$$

This construction ensures that $\hat{\eta}^{I_k^c}$ is independent of the data used for evaluation, breaking the dependence that causes overfitting bias.
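
As a minimal sketch (pure Python; `fit_nuisance` and `moment` are placeholder callables supplied by the user, not names from the cited papers), the estimator above can be computed as:

```python
import random

def cross_fit(data, fit_nuisance, moment, K=5, seed=0):
    """Generic K-fold cross-fitting for beta_0 = E[m(Z, eta_0)].

    data:         list of observations Z_i
    fit_nuisance: trains a nuisance model on a list of observations
    moment:       the moment function m(z, eta_hat)
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[k::K] for k in range(K)]    # K roughly equal disjoint folds
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]   # complement I_k^c
        eta_hat = fit_nuisance(train)                         # fit nuisance off-fold
        total += sum(moment(data[i], eta_hat) for i in fold)  # evaluate on I_k
    return total / len(data)
```

For instance, with `fit_nuisance` returning the training-sample mean and `moment(z, eta) = z * eta`, each observation is always evaluated against a mean fitted without it, which is exactly the own-observation bias the construction removes.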

For doubly robust functionals involving multiple nuisances (e.g., regression and propensity), cross-fitting can be extended to “doubly cross-fitted” estimators, where different nuisance components are fit on non-overlapping partitions, completely eliminating higher-order bias terms associated with own-observation and estimator nonlinearity. The paradigm generalizes to conditional functionals and heterogeneous effect estimation, with further modifications for multiple treatments and complexities of the design, as in randomized experiments or changepoint models (Zeng, 2022, Newey et al., 2018, Jacob, 2020, Fisher et al., 2023, Qian et al., 2024).

2. Bias Elimination, Rate Conditions, and Efficiency

Cross-fitting decomposes the estimation error of plug-in estimators into:

  1. A main term corresponding to the mean-zero empirical process.
  2. A lower-order nuisance bias, $\mathbb{E}[m(Z, \hat{\eta}) - m(Z, \eta_0)]$, which is second order when the functional is Neyman-orthogonal.
  3. The empirical-process term, $(\mathbb{P}_n - \mathbb{P})[m(Z, \hat{\eta}) - m(Z, \eta_0)]$, which is the principal obstacle and is eliminated by cross-fitting because fitting and evaluation occur on independent data.
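
Written out, these three terms come from an exact algebraic identity (with $\mathbb{P}f = \mathbb{E}[f(Z)]$ and $\mathbb{P}_n f = n^{-1}\sum_{i=1}^n f(Z_i)$):

```latex
\hat\beta - \beta_0
  = \underbrace{(\mathbb{P}_n - \mathbb{P})\, m(\cdot, \eta_0)}_{\text{mean-zero main term}}
  + \underbrace{\mathbb{P}\bigl[\, m(\cdot, \hat\eta) - m(\cdot, \eta_0) \,\bigr]}_{\text{nuisance bias}}
  + \underbrace{(\mathbb{P}_n - \mathbb{P})\bigl[\, m(\cdot, \hat\eta) - m(\cdot, \eta_0) \,\bigr]}_{\text{empirical-process term}}
```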

Under appropriate consistency rates for the nuisance estimators (e.g., a product of $L_2$ risks of order $o_p(n^{-1/2})$ for doubly robust estimators, or a pointwise risk of $o_p(n^{-1/4})$ for plug-in estimators), the cross-fit estimator attains root-n consistency and asymptotic linearity, and achieves the semiparametric efficiency bound:

$$\sqrt{n}\,(\hat\beta_{\mathrm{CF}} - \beta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \mathrm{IF}(Z_i) + o_p(1),$$

where $\mathrm{IF}$ is the efficient influence function for the parameter of interest. The smallest achievable remainder rate for average linear functionals with multi-stage series estimation is

$$A_n^* = n^{-1/2} K^{-(s_y+s_a)/r} + K^{-s_y/r} + K^{-s_a/r} + K/n,$$

with $K$ the basis size, $s_y, s_a$ the Hölder smoothness of the respective nuisances, and $r$ the covariate dimension (Newey et al., 2018).

3. Extensions to Complex Dependence Structures

Contrary to prior intuition, cross-fitting remains effective for dependent data (spatial, clustered, networked, and time-series settings) so long as an appropriate central limit theorem for averages over independent units (or an effective sample size) holds and the total number of strongly correlated pairs grows no faster than $O(n^2 / r_n^2)$, with $r_n$ the CLT rate (Balkus et al., 15 Jan 2026). Standard random $K$-fold splits suffice; there is no need to "block" or "cluster" the folds to minimize dependence. The main theoretical guarantee is that the empirical-process term is still controlled at the parametric rate $o_p(1/r_n)$, so estimator validity and efficiency are preserved.

Practical guidelines are summarized in the table below:

| Setting              | Cross-Fitting Required? | Modifications Needed               |
|----------------------|-------------------------|------------------------------------|
| IID data             | Yes                     | Standard random folds              |
| Clustered data       | Yes                     | No cluster-specific folding needed |
| Network data         | Yes                     | As above                           |
| Time series (m-dep.) | Yes                     | Standard folding sufficient        |

Empirically, cross-fitting without correlation-aware splitting yields equivalent or improved bias/variance compared to more complex partitioning, particularly in finite samples (Balkus et al., 15 Jan 2026).
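
The contrast can be made concrete. The sketch below (pure Python, illustrative function names) shows plain unit-level random folds next to the cluster-aware folding that, per the results above, is not actually required:

```python
import random

def random_folds(n, K, seed=0):
    """Standard unit-level random folds; cluster membership is ignored."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[k::K]) for k in range(K)]

def cluster_folds(clusters, K, seed=0):
    """Cluster-level folds: each cluster's units stay in one fold.
    Shown only for contrast; the text argues this is unnecessary."""
    ids = sorted(set(clusters))
    random.Random(seed).shuffle(ids)
    fold_of = {c: k
               for k, chunk in enumerate([ids[k::K] for k in range(K)])
               for c in chunk}
    folds = [[] for _ in range(K)]
    for i, c in enumerate(clusters):
        folds[fold_of[c]].append(i)
    return folds
```

Both routines partition the units; the cluster-aware version merely adds bookkeeping without the theoretical benefit one might expect.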

4. Conditional and Adaptive Cross-Fitting in Experimental Designs

In classical randomized experiments with dependent assignments (e.g., block, stratified, or completely randomized designs), traditional cross-fitting based on i.i.d. random splits is invalid, as assignment vectors across folds are not independent. Conditional cross-fitting solves this by selecting splits such that, given the split, assignments in different folds are conditionally independent under the design (Lu et al., 21 Aug 2025). Examples include:

  • Bernoulli randomized experiments: split units independently.
  • Completely randomized experiments: split within each arm to preserve treatment group sizes.
  • Stratified randomized experiments: split within strata or by strata groups.
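
For the completely randomized case, the within-arm split can be sketched as follows (illustrative code under the stated design, not an implementation from Lu et al.):

```python
import random

def split_within_arms(treatment, K=2, seed=0):
    """Conditional cross-fitting split for a completely randomized
    experiment: shuffle and fold separately within each treatment arm,
    so every fold's treated/control counts mirror the overall design."""
    rng = random.Random(seed)
    folds = [[] for _ in range(K)]
    for arm in sorted(set(treatment)):
        units = [i for i, w in enumerate(treatment) if w == arm]
        rng.shuffle(units)
        for k in range(K):
            folds[k].extend(units[k::K])
    return [sorted(f) for f in folds]
```

Because each fold inherits the design's arm proportions, assignments in different folds are conditionally independent given the split, which is the property conditional cross-fitting needs.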

These procedures restore unbiasedness of estimator plug-in corrections, yielding unbiased covariate-adjusted ATE estimators even in finite samples, regardless of the accuracy of machine-learning predictions. Valid confidence intervals are constructed using design-based variance estimators, and the method accommodates arbitrary regression learners with only mild stability requirements (Lu et al., 21 Aug 2025).

5. Multi-Way and Three-Way Cross-Fitting Strategies

While classical two-way (or K-fold) cross-fitting suffices for standard bias reduction, further splitting is sometimes used to disentangle dependencies among multiple nuisance estimators, accelerating remainder decay and enhancing finite-sample performance.

  • Three-way cross-fitting partitions the data into three equal parts; each is used exclusively for a single stage (e.g., one for estimating the propensity, one for the outcome regression, and one for the final pseudo-outcome regression). Bias terms associated with dependence among nuisance estimators are thereby reduced from $O(k_n/n)$ to $O(k_n^{1/2-s/r}/\sqrt{n})$ or better under smoothness conditions (Fisher et al., 2023).
  • In plug-in estimation of general linear functionals, double cross-fitting ensures that remainder rates match the best known minimax-optimal rates in high-dimensional estimation (Newey et al., 2018).

However, increasing the number of splits reduces the data available in each partition for model fitting, potentially compromising nuisance estimation accuracy. In practice, one rotates splits and averages over multiple random partitions to mitigate efficiency loss (Fisher et al., 2023, Jacob, 2020).
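
The rotate-and-average scheme can be sketched as follows (pure Python; the stage names are illustrative labels, not API from the cited papers):

```python
import random
from itertools import permutations

def three_way_splits(n, seed=0):
    """Enumerate the six role rotations of a random three-way partition:
    within each rotation, every third of the data is used for exactly one
    stage (propensity, outcome regression, final pseudo-outcome stage)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    parts = [idx[k::3] for k in range(3)]
    return [{"propensity": parts[a], "outcome": parts[b], "final": parts[c]}
            for a, b, c in permutations(range(3))]
```

Averaging the final-stage estimates over the six rotations (and optionally over several random partitions) recoups some of the efficiency lost to the smaller per-stage samples, as described above.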

6. Applications and Scope

Cross-fitting is used widely for:

  • Doubly robust estimation of average and heterogeneous treatment effects with flexible (non-Donsker) machine learners (Zeng, 2022, Jacob, 2020).
  • Covariate adjustment in randomized controlled trials, meta-learners (DR, R, T, X-learners), and distributional functionals under missing data.
  • Changepoint detection, wherein cross-fitting replaces in-sample loss minimization to counteract overfitting bias and guarantees recovery of changepoints with high probability, even under highly adaptive ML estimators and automated hyperparameter tuning (Qian et al., 2024).
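
To make the first application concrete, here is a minimal sketch of a cross-fitted doubly robust (AIPW) ATE estimator. The default nuisance "learners" are crude arm-wise sample means standing in for arbitrary ML regressors; the function names and defaults are illustrative, not from the cited papers:

```python
import random

def cross_fit_aipw(Y, W, K=2, seed=0, fit_mu=None, fit_e=None):
    """Cross-fitted AIPW (doubly robust) ATE estimator sketch.

    Y: outcomes; W: 0/1 treatments. fit_mu(train_idx, arm) and
    fit_e(train_idx) are pluggable nuisance learners; the defaults are
    simple sample means computed on the training indices.
    """
    fit_mu = fit_mu or (lambda tr, w: sum(Y[i] for i in tr if W[i] == w)
                        / max(1, sum(1 for i in tr if W[i] == w)))
    # clip the default propensity away from 0 and 1 for numerical stability
    fit_e = fit_e or (lambda tr: min(max(sum(W[i] for i in tr) / len(tr),
                                         0.05), 0.95))
    idx = list(range(len(Y)))
    random.Random(seed).shuffle(idx)
    total = 0.0
    for k in range(K):
        fold = idx[k::K]
        train = [i for i in idx if i not in set(fold)]
        mu1, mu0, e = fit_mu(train, 1), fit_mu(train, 0), fit_e(train)
        for i in fold:  # evaluate the doubly robust moment on held-out units
            total += (mu1 - mu0
                      + W[i] * (Y[i] - mu1) / e
                      - (1 - W[i]) * (Y[i] - mu0) / (1 - e))
    return total / len(Y)
```

With oracle nuisances (e.g., $\mu_w = w$ and $e = 0.5$ when $Y = W$) the augmentation terms vanish and the estimator recovers the true ATE exactly; with imperfect learners, the doubly robust moment keeps the bias second order as described in Section 2.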

In all these domains, cross-fitting enables aggressive use of black-box predictive models while preserving rigorous inferential validity, provided rate conditions and design-consistency properties are verified in context.

7. Limitations and Alternatives

Despite its foundational role, cross-fitting can present computational burdens in complex models, especially in clustered data settings, where re-fitting nuisances on multiple splits significantly increases required resources. In high-dimensional dependence regimes (multiway clustering, separately exchangeable arrays), recent advances prove that Neyman-orthogonal moment conditions and empirical-process localization arguments can eliminate the need for cross-fitting entirely, without loss of asymptotic linearity or validity of standard error estimators (Chen et al., 11 Feb 2026). In such cases, full-sample estimation of nuisance parameters is provably justified, regaining efficiency and computational practicality.

Practical limitations include:

  • Need for sufficiently large data within each fold to fit stable nuisance estimators.
  • Correct design-based splitting mechanisms for conditional cross-fitting in complex experimental designs.
  • Trade-off between the split count $K$ and variance in the final estimator; typical values are $K = 2$–$5$.
  • Absence of off-the-shelf cross-fitting folding mechanisms for certain adaptive or rerandomized designs (Lu et al., 21 Aug 2025).

In summary, cross-fitting represents a central tool in the modern theory and practice of semiparametric estimation under flexible machine learning, with formal theoretical guarantees, broad applicability, and proven efficiency recovery in a range of complex data-generating scenarios (Zeng, 2022, Balkus et al., 15 Jan 2026, Newey et al., 2018, Lu et al., 21 Aug 2025, Jacob, 2020, Chen et al., 11 Feb 2026, Fisher et al., 2023, Qian et al., 2024).
