Two-Stage Nonparametric Estimation
- Two-stage nonparametric estimation is a sequential procedure that first estimates latent variables or nuisance parameters nonparametrically before using these estimates in a second-stage regression to infer the target function.
- It applies methods such as kernel smoothing and series expansion in the first stage and local-linear regression in the second, addressing issues like unobserved covariates and endogeneity.
- The approach balances bias-variance tradeoffs through careful bandwidth selection and undersmoothing, achieving minimax optimal rates and asymptotic normality under standard regularity conditions.
Two-stage nonparametric estimation refers to a broad class of statistical procedures in which parameter or function estimation is performed sequentially: a first stage is devoted to estimating nuisance parameters or latent variables (often by nonparametric methods), and a second stage utilizes the outputs of the first stage as covariates or as components in constructing the final estimator. This framework arises in numerous contexts, including generated regressor models, semiparametric instrumental variables, off-policy evaluation, structural models, and modern causal inference. Two-stage strategies are essential in situations where certain regression inputs or quantities of interest are not directly observable and must be estimated from data.
1. General Methodology
The canonical setup for two-stage nonparametric estimation involves a structural model

$$Y = m_0\big(r_0(X)\big) + \varepsilon, \qquad E[\varepsilon \mid X] = 0,$$

where $r_0(X)$ is a latent or constructed covariate, not observed directly but estimated in the first stage. The primary goal is to estimate the regression function $m_0$. The estimation proceeds as follows:
Stage 1 (Pilot/First-Stage Estimation):
- Estimate $r_0$ nonparametrically, obtaining $\hat r$ via kernel smoothing or series/sieve expansion. The resulting uniform error is typically characterized as

$$\sup_x \big|\hat r(x) - r_0(x)\big| = O_P\Big(g^{q+1} + \big\{\log n/(n g^{d_X})\big\}^{1/2}\Big),$$

with $g$ the pilot bandwidth, $q$ the local polynomial order, and $d_X$ the dimension of $X$ (Mammen et al., 2012).
Stage 2 (Main/Second-Stage Estimation):
- Treat the estimates $\hat r(X_i)$ as regressors and perform a nonparametric regression (typically local-linear) of $Y$ on $\hat r(X)$: the estimator is $\hat m(z) = \hat\alpha$, where

$$(\hat\alpha, \hat\beta) = \arg\min_{\alpha,\beta} \sum_{i=1}^{n} \big\{Y_i - \alpha - \beta^\top\big(\hat r(X_i) - z\big)\big\}^2 \, K_h\big(\hat r(X_i) - z\big),$$

and $K_h(u) = \prod_{j=1}^{d} h_j^{-1} k(u_j/h_j)$ is a product kernel with bandwidth $h = (h_1, \ldots, h_d)$. A minimal end-to-end sketch follows below.
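The following sketch runs both stages on simulated data, assuming (for illustration only) that $r_0(X)$ is identified as the regression of an auxiliary response $W$ on $X$; the data-generating process, bandwidths, and the helper `local_linear` are hypothetical choices, not the calibrated procedure of (Mammen et al., 2012).

```python
import numpy as np

rng = np.random.default_rng(0)

def local_linear(x_train, y_train, x_eval, h):
    """Local-linear regression estimates at each point of x_eval (Gaussian kernel)."""
    fits = []
    for x0 in np.atleast_1d(x_eval):
        sw = np.sqrt(np.exp(-0.5 * ((x_train - x0) / h) ** 2))  # sqrt of kernel weights
        A = np.column_stack([np.ones_like(x_train), x_train - x0])
        # Weighted least squares; the intercept is the fit at x0.
        beta, *_ = np.linalg.lstsq(A * sw[:, None], y_train * sw, rcond=None)
        fits.append(beta[0])
    return np.array(fits)

n = 2000
X = rng.uniform(-1, 1, n)
r0 = np.sin(np.pi * X)                    # latent covariate r0(X)
W = r0 + 0.2 * rng.normal(size=n)         # noisy first-stage responses for r0
Y = r0 ** 2 + 0.3 * rng.normal(size=n)    # outcome: m0(r) = r^2

# Stage 1: undersmoothed pilot estimate of r0 (small pilot bandwidth g).
g = 0.05
r_hat = local_linear(X, W, X, g)

# Stage 2: local-linear regression of Y on the generated regressor r_hat.
z_grid = np.linspace(-0.9, 0.9, 5)
h = 0.15
m_hat = local_linear(r_hat, Y, z_grid, h)
print(np.c_[z_grid, m_hat, z_grid ** 2])  # estimate vs. truth m0(z) = z^2
```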
2. Asymptotic Theory and Stochastic Expansions
A key result from (Mammen et al., 2012) is a stochastic expansion quantifying the impact of the first-stage estimation on the final estimator:

$$\hat m(z) - m_0(z) = \big\{\tilde m(z) - m_0(z)\big\} + \hat\Delta(z) + o_P(\rho_n),$$

where:
- $\tilde m(z) - m_0(z)$ is the primary (oracle) term, the error that would arise if $r_0(X)$ were observed.
- $\hat\Delta(z)$ is an explicit correction term arising from using $\hat r$ instead of $r_0$.
- $\rho_n$ is the rate dictated by the slower of the two stages.

The oracle term achieves standard local-polynomial rates,

$$\tilde m(z) - m_0(z) = O_P\big(h^{p+1} + (n h^{d})^{-1/2}\big),$$

with $p$ the second-stage local polynomial order ($p = 1$ for local-linear) and $d$ the dimension of $r_0(X)$. The correction term from first-stage estimation is

$$\hat\Delta(z) = O_P\Big(g^{q+1} + \big\{\log n/(n g^{d_X})\big\}^{1/2}\Big),$$

where $g$, $q$, and $d_X$ are the respective first-stage analogues. Therefore, the overall rate is determined by the slower of the two stages.
3. Rates of Consistency, Bias-Variance Trade-offs, and Bandwidth Selection
The composite convergence rate is

$$\hat m(z) - m_0(z) = O_P\big(n^{-\gamma_1} + n^{-\gamma_2}\big),$$

where $n^{-\gamma_1}$ is associated with the first-stage rate and $n^{-\gamma_2}$ with the second. A typical optimal second-stage bandwidth choice is $h \asymp n^{-1/(4+d)}$ for local-linear fits, yielding the familiar rate

$$n^{-\gamma_2} = n^{-2/(4+d)}$$

(Mammen et al., 2012). To ensure the first-stage error does not contaminate the second-stage inference, the first stage must be "sufficiently undersmoothed," i.e., the pilot bandwidth $g$ is chosen so that

$$g^{q+1} + \big\{\log n/(n g^{d_X})\big\}^{1/2} = o\big(h^{2} + (n h^{d})^{-1/2}\big).$$

This ensures that $\hat\Delta(z)$ is negligible relative to the oracle term; a numeric illustration follows below.
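A back-of-the-envelope illustration of this comparison, using the stylized error expressions above with constants and log factors dropped; all exponents are illustrative arithmetic, not a data-driven bandwidth selector.

```python
import numpy as np

def oracle_err(n, h, d):
    # Stylized local-linear oracle error: bias h^2 plus stochastic (n h^d)^{-1/2}.
    return h ** 2 + (n * h ** d) ** -0.5

n, d, dx, q = 10_000, 1, 1, 3
h = n ** (-1 / (4 + d))                   # rate-optimal second-stage bandwidth
g_opt = n ** (-1 / (2 * (q + 1) + dx))    # first-stage MSE-optimal pilot bandwidth
g_under = 0.3 * g_opt                     # undersmoothed pilot: bias shrinks fast

for g, label in [(g_opt, "MSE-optimal g"), (g_under, "undersmoothed g")]:
    bias, stoch = g ** (q + 1), (n * g ** dx) ** -0.5
    print(f"{label:16s} bias {bias:.2e}  stochastic {stoch:.2e}  "
          f"total/oracle {(bias + stoch) / oracle_err(n, h, d):.2f}")
```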
4. Extensions and Applications Across Domains
Generated Covariate and Semiparametric IV Models
Two-stage nonparametric procedures naturally extend to models involving unobserved covariates (generated regressors), simultaneous equation estimation, or models with endogeneity, where the first stage recovers instrumental variable functions (Mammen et al., 2012, Chen et al., 2022). In off-policy reinforcement learning, for instance, Q-function estimation is cast as a two-stage NPIV problem with well-posedness and minimax-optimal rates established for sieve two-stage least squares estimators under the given Markov structure (Chen et al., 2022).
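A minimal sieve 2SLS sketch for an NPIV problem $E[Y - h_0(X) \mid Z] = 0$ follows; the polynomial bases, dimensions, and data-generating process are illustrative and far simpler than the Markov-decision setting of (Chen et al., 2022).

```python
import numpy as np

rng = np.random.default_rng(1)

def poly_basis(x, k):
    """Polynomial sieve basis: 1, x, ..., x^(k-1)."""
    return np.column_stack([x ** j for j in range(k)])

# NPIV design: X is endogenous, Z is an instrument, h0(x) = sin(x).
n = 5000
Z = rng.normal(size=n)
U = rng.normal(size=n)                    # unobserved confounder
X = 0.8 * Z + 0.5 * U + 0.3 * rng.normal(size=n)
Y = np.sin(X) + U + 0.1 * rng.normal(size=n)

P = poly_basis(X, 5)                      # sieve for h0(X)
Q = poly_basis(Z, 7)                      # richer instrument sieve

# Sieve 2SLS: project the X-basis onto the instrument space, then least squares.
Pi, *_ = np.linalg.lstsq(Q, P, rcond=None)      # first stage: fitted E[P | Z]
P_hat = Q @ Pi
beta, *_ = np.linalg.lstsq(P_hat, Y, rcond=None)

x_grid = np.linspace(-2, 2, 5)
print(np.c_[x_grid, poly_basis(x_grid, 5) @ beta, np.sin(x_grid)])  # vs. truth
```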
Design in Two-Phase and Missing Data Studies
Two-stage nonparametric corrections underpin optimal estimation in stratified or two-phase designs, where expensive covariates are observed only on a smaller phase-two subsample, while inexpensive auxiliary variables are observed for the full sample. Here, a first-stage "complete-case" estimator is augmented using nonparametric smoothers built from the observed auxiliary data, as sketched below. Kernel-based projection ensures semiparametric efficiency, with joint updates available for multiple working models (Zhou et al., 13 Oct 2025).
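A minimal sketch of the augmentation idea, here for the simplest target, the mean of the expensive variable; the kernel smoother, bandwidth, and sampling scheme are illustrative, not the efficient joint-update procedure of (Zhou et al., 13 Oct 2025).

```python
import numpy as np

rng = np.random.default_rng(2)

# Phase 1: cheap auxiliary A observed for all n units; phase 2: the expensive
# variable X is measured only when R = 1 (known sampling probability pi).
n, pi = 5000, 0.2
A = rng.normal(size=n)
X = A ** 2 + 0.5 * rng.normal(size=n)   # expensive variable; E[X] = 1
R = rng.random(n) < pi                  # only X[R] is treated as observed

def nw(a_train, x_train, a_eval, h):
    """Nadaraya-Watson smoother of X on A (Gaussian kernel)."""
    w = np.exp(-0.5 * ((a_eval[:, None] - a_train[None, :]) / h) ** 2)
    return (w @ x_train) / w.sum(axis=1)

# First stage: complete-case (inverse-probability) estimator of E[X].
mu_cc = np.sum(X[R] / pi) / n

# Second stage: augment with a kernel smoother m_hat(a) ~ E[X | A = a],
# fitted on the phase-two sample and projected onto the full sample.
m_hat = nw(A[R], X[R], A, h=0.3)
mu_aug = mu_cc - np.mean((R / pi - 1) * m_hat)
print(mu_cc, mu_aug)   # both estimate E[X] = 1; the augmented one is tighter
```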
Threshold and Mode Estimation
Specialized two-stage designs exploit monotonicity or smoothness to accelerate estimation for regression function thresholds or maxima locations. Stage one is a coarse nonparametric estimate (e.g., an isotonic fit or modal regression), while stage two "zooms in" (via adaptive resampling or refined polynomial fits) to achieve faster, sometimes dimension-independent, convergence rates, as sketched below (Belitser et al., 2013, Tang et al., 2013). In well-behaved cases, rates progress from slow first-stage rates (e.g., the cube-root rate $n^{-1/3}$ for isotonic fits) to substantially faster, in favorable cases near-parametric, second-stage rates.
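A stylized sketch of the zoom-in idea for locating a regression maximum under a fixed sampling budget; the grid sizes, budget split, and quadratic refinement are illustrative, not the calibrated designs of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return -4 * (x - 0.62) ** 2        # true maximizer at x = 0.62

def sample(x, m):
    """Average m noisy responses at each design point in x."""
    return f(x) + rng.normal(scale=0.3, size=(m, x.size)).mean(axis=0)

budget = 10_000

# Stage 1: coarse grid over [0, 1] with half the budget; crude argmax.
grid1 = np.linspace(0, 1, 50)
y1 = sample(grid1, budget // 2 // grid1.size)
x1 = grid1[np.argmax(y1)]

# Stage 2: zoom in to a shrinking window around the pilot estimate and
# fit a local quadratic, whose vertex estimates the maximizer.
half_width = 0.2
grid2 = np.linspace(x1 - half_width, x1 + half_width, 20)
y2 = sample(grid2, budget // 2 // grid2.size)
c = np.polyfit(grid2, y2, 2)           # quadratic fit: c[0] x^2 + c[1] x + c[2]
x2 = -c[1] / (2 * c[0])                # vertex of the fitted parabola
print(x1, x2)                          # the stage-2 estimate is much sharper
```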
Modern Causal Inference and Treatment Effect Estimation
Two-stage nonparametric estimators are foundational in estimating heterogeneous treatment effects and long-term causal effects under data combination settings. The first stage fits high-dimensional or flexible nuisance models (propensity scores, outcome regressions). Custom pseudo-outcomes (regression-based, propensity-based, or multiply robust) are then regressed nonparametrically in the second stage, as sketched below. Multiple-robustness enables consistency if any one of several nuisance estimation components is correct, with asymptotic rates controlled by the slowest nuisance rate or, under orthogonality, by products of nuisance error rates (Gao et al., 2020, Chen et al., 26 Feb 2025).
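A minimal sketch of the pseudo-outcome approach for heterogeneous treatment effects, using the standard doubly robust (AIPW) pseudo-outcome with kernel-smoothed nuisances; the data-generating process and bandwidths are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def nw(x_tr, y_tr, x_ev, h):
    """Nadaraya-Watson smoother (Gaussian kernel)."""
    w = np.exp(-0.5 * ((x_ev[:, None] - x_tr[None, :]) / h) ** 2)
    return (w @ y_tr) / w.sum(axis=1)

n = 4000
X = rng.uniform(-1, 1, n)
e = 1 / (1 + np.exp(-X))                       # true propensity score
T = (rng.random(n) < e).astype(float)
tau = 1 + X                                    # heterogeneous effect tau(x) = 1 + x
Y = X ** 2 + T * tau + 0.5 * rng.normal(size=n)

# Stage 1: nonparametric nuisance fits (propensity and outcome regressions).
e_hat = np.clip(nw(X, T, X, 0.2), 0.05, 0.95)
mu1 = nw(X[T == 1], Y[T == 1], X, 0.2)
mu0 = nw(X[T == 0], Y[T == 0], X, 0.2)

# Doubly robust (AIPW) pseudo-outcome: unbiased for tau(X) if either the
# propensity or the outcome regressions are correct.
psi = mu1 - mu0 + T * (Y - mu1) / e_hat - (1 - T) * (Y - mu0) / (1 - e_hat)

# Stage 2: smooth the pseudo-outcome on X to estimate tau(x).
x_grid = np.linspace(-0.8, 0.8, 5)
print(np.c_[x_grid, nw(X, psi, x_grid, 0.2), 1 + x_grid])  # vs. truth
```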
5. Theoretical Guarantees: Asymptotic Normality, Variance, and Inference
Under standard regularity conditions (e.g., sufficient smoothness, tail properties, and appropriate bandwidth choices), the two-stage estimator is asymptotically normal:

$$\sqrt{n h^{d}}\,\big\{\hat m(z) - m_0(z) - b(z)\big\} \xrightarrow{d} N\big(0, V(z)\big),$$

where $b(z)$ is the usual smoothing bias and

$$V(z) = \frac{\|K\|_2^2}{f(z)} \big\{\sigma^2(z) + \sigma_\Delta^2(z)\big\},$$

with $\sigma^2(z) = \operatorname{Var}\big(Y \mid r_0(X) = z\big)$, $f$ the density of $r_0(X)$, and $\sigma_\Delta^2(z)$ an excess-variance contribution from the first stage.
Thus, the limiting variance includes both the conditional variance of $Y$ and an excess variance due to the first-stage estimation error, but the leading rate and efficiency are not affected if the first stage is sufficiently fast (Mammen et al., 2012).
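In practice, plug-in variance estimation is often sidestepped by a pairs bootstrap that re-runs both stages on each resample; the sketch below is a heuristic illustration (fixed bandwidths, no bias correction, simulated data as in the earlier sketch), not a procedure validated in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(5)

def nw(x_tr, y_tr, x_ev, h):
    """Nadaraya-Watson smoother (Gaussian kernel)."""
    x_ev = np.atleast_1d(x_ev)
    w = np.exp(-0.5 * ((x_ev[:, None] - x_tr[None, :]) / h) ** 2)
    return (w @ y_tr) / w.sum(axis=1)

def two_stage(X, W, Y, z0):
    r_hat = nw(X, W, X, h=0.05)          # stage 1: undersmoothed pilot
    return nw(r_hat, Y, z0, h=0.15)[0]   # stage 2: regression on r_hat at z0

n = 1000
X = rng.uniform(-1, 1, n)
r0 = np.sin(np.pi * X)
W = r0 + 0.2 * rng.normal(size=n)
Y = r0 ** 2 + 0.3 * rng.normal(size=n)

z0, B = 0.5, 100
est = two_stage(X, W, Y, z0)
boot = np.empty(B)
for b in range(B):
    i = rng.integers(0, n, n)                   # pairs bootstrap: resample rows
    boot[b] = two_stage(X[i], W[i], Y[i], z0)   # and re-run BOTH stages
lo, hi = np.quantile(boot, [0.025, 0.975])
print(f"m_hat({z0}) = {est:.3f}, 95% CI [{lo:.3f}, {hi:.3f}], truth {z0 ** 2}")
```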
6. Major Implementation Strategies and Variants
Practical implementation requires careful selection of bandwidths, pilot smoother order, and kernel. In high dimensions or with limited samples, series/sieve methods, ridge regularization, and deep nets (for density estimation or compositional regression) are commonly deployed (Obenchain, 2023, Bos et al., 2023).
Strategies include:
- Basis expansion then penalized regression: construct a nonparametric basis (splines, wavelets), then perform ridge regression with appropriate penalties. This encompasses generalized ridge and penalized spline methods (Obenchain, 2023); see the first sketch after this list.
- Matching, nearest-neighbor, or bagging-based approaches: for regression or causal inference under complex designs, matching quality can be optimized in a two-stage sequence, e.g., distributional nearest neighbor estimators, with bias correction across two subsampling scales to achieve minimax optimal rates (Demirkaya et al., 2018); see the second sketch after this list.
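First, a minimal sketch of the basis-then-ridge strategy with a truncated-power cubic spline basis; the knots and penalty level are illustrative, and the fit uses a plain ridge penalty rather than the generalized-ridge machinery of (Obenchain, 2023).

```python
import numpy as np

rng = np.random.default_rng(6)

def truncated_power_basis(x, knots):
    """Cubic truncated-power spline basis: 1, x, x^2, x^3, (x - k)_+^3."""
    cols = [x ** j for j in range(4)]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

n = 500
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(4 * x) + 0.3 * rng.normal(size=n)

B = truncated_power_basis(x, knots=np.linspace(0.1, 0.9, 15))
lam = 1e-3
# Ridge / penalized spline fit: coef = (B'B + lam I)^{-1} B'y.
coef = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ y)
print(np.max(np.abs(B @ coef - np.sin(4 * x))))   # rough in-sample error
```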
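Second, a simplified two-scale sketch: plain k-NN at two subsampling scales combined with weights that cancel the leading bias term, which under smoothness scales like $(k/n)^{2/d}$; this mimics the idea of, but is not, the distributional-nearest-neighbor weighting of (Demirkaya et al., 2018).

```python
import numpy as np

rng = np.random.default_rng(7)

def knn_regress(x_tr, y_tr, x0, k):
    """Plain k-nearest-neighbor regression estimate at a single point x0."""
    idx = np.argsort(np.abs(x_tr - x0))[:k]
    return y_tr[idx].mean()

n, d, x0 = 5000, 1, 0.3
X = rng.uniform(-1, 1, n)
Y = np.sin(3 * X) + 0.3 * rng.normal(size=n)

# Two scales: the leading k-NN bias scales like k^(2/d) (for fixed n), so
# weights solving w1 + w2 = 1 and w1*k1^(2/d) + w2*k2^(2/d) = 0 cancel it.
k1, k2 = 50, 200
a1, a2 = k1 ** (2 / d), k2 ** (2 / d)
w1 = a2 / (a2 - a1)
w2 = 1 - w1
est = w1 * knn_regress(X, Y, x0, k1) + w2 * knn_regress(X, Y, x0, k2)
print(est, np.sin(3 * x0))   # bias-corrected estimate vs. truth
```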
7. Practical Guidelines and Limitations
Two-stage nonparametric estimators achieve minimax optimality as long as:
- The first-stage estimator achieves a bias/variance rate strictly faster than the second, or is undersmoothed as needed.
- Sufficient regularity/smoothness conditions hold for all relevant functions.
- High-dimensional or complex settings are handled via cross-fitting, sample splitting, and regularization to maintain stability and finite-sample properties (Mammen et al., 2012, Chen et al., 26 Feb 2025, Zhou et al., 13 Oct 2025); a minimal cross-fitting sketch follows this list.
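A minimal cross-fitting sketch: each observation's nuisance value is predicted by a smoother fitted on the other fold, so first-stage estimation error is independent of the data it is evaluated on; the fold count and smoother are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)

def nw(x_tr, y_tr, x_ev, h=0.2):
    """Nadaraya-Watson smoother (Gaussian kernel)."""
    w = np.exp(-0.5 * ((x_ev[:, None] - x_tr[None, :]) / h) ** 2)
    return (w @ y_tr) / w.sum(axis=1)

n, K = 3000, 2
X = rng.uniform(-1, 1, n)
Y = np.sin(np.pi * X) + 0.4 * rng.normal(size=n)

# Cross-fitting: fit the nuisance on one fold, evaluate it on the other.
folds = rng.permutation(n) % K
m_hat = np.empty(n)
for k in range(K):
    train, test = folds != k, folds == k
    m_hat[test] = nw(X[train], Y[train], X[test])

resid = Y - m_hat   # out-of-fold residuals, free of own-observation overfit
print(resid.std())
```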
A key limitation in some settings is the curse of dimensionality, but in several two-stage procedures for maxima or threshold estimation, the second-stage rate can become dimension-free under sufficient local smoothness (Belitser et al., 2013, Tang et al., 2013). Multiply robust and orthogonal pseudo-outcome constructions offer enhanced protection against model misspecification or slow nuisance estimation by exploiting Neyman-orthogonality (Gao et al., 2020, Olma, 2021).
References
- Nonparametric regression with nonparametrically generated covariates (Mammen et al., 2012)
- Optimal two-stage procedures for estimating location and size of the maximum of a multivariate regression function (Belitser et al., 2013)
- Two-Stage Plans for Estimating a Threshold Value of a Regression Function (Tang et al., 2013)
- Optimal Nonparametric Inference with Two-Scale Distributional Nearest Neighbors (Demirkaya et al., 2018)
- Minimax Optimal Nonparametric Estimation of Heterogeneous Treatment Effects (Gao et al., 2020)
- An optimal two-step estimation approach for two-phase studies (Zhou et al., 13 Oct 2025)
- Nonparametric Heterogeneous Long-term Causal Effect Estimation via Data Combination (Chen et al., 26 Feb 2025)
- Nonparametric Generalized Ridge Regression (Obenchain, 2023)
- A supervised deep learning method for nonparametric density estimation (Bos et al., 2023)
- Nonparametric Estimation of Truncated Conditional Expectation Functions (Olma, 2021)
- On Well-posedness and Minimax Optimal Rates of Nonparametric Q-function Estimation in Off-policy Evaluation (Chen et al., 2022)
- Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction (Mastouri et al., 2021)
- A Statistical Decision-Theoretical Perspective on the Two-Stage Approach to Parameter Estimation (Lakshminarayanan et al., 2022)
- Conditional-Marginal Nonparametric Estimation for Stage Waiting Times from Multi-Stage Models under Dependent Right Censoring (Zhang et al., 23 Apr 2025)
- Two-Stage Maximum Score Estimator (Gao et al., 2020)