
Doubly Robust Orthogonal Moment Estimators

Updated 22 October 2025
  • The paper demonstrates that using orthogonal moment conditions can achieve √n-consistency, even with suboptimal nuisance estimation, through double robustness.
  • It leverages techniques like cross-fitting and double cross-fitting to decouple nuisance function estimation from the target parameter inference.
  • The estimators apply robust methods in causal inference, missing data analysis, and econometrics, providing efficient and consistent estimation in complex models.

Doubly robust orthogonal moment estimators comprise a class of inferential procedures designed to target low-dimensional parameters in models characterized by high- or infinite-dimensional nuisance components. These estimators leverage moment equations that are constructed to be insensitive, in a well-defined first-order (orthogonality or Neyman orthogonality) sense, to the presence of nuisance estimation error. Their central property is “double robustness”: consistency and asymptotic normality can be achieved if at least one of two (or more) nuisance models is correctly specified or estimated at a sufficiently fast rate. This structure underpins many developments in modern causal inference, missing data analysis, semiparametric statistics, and econometrics.

1. Theoretical Foundations: Orthogonality and Double Robustness

Central to doubly robust orthogonal moment estimators is the orthogonal moment condition. Let $Z$ be the observed data, $\theta$ the finite-dimensional target parameter, and $h(X)$ a potentially high-dimensional or unknown nuisance function. A moment function $m(Z, \theta, h(X))$ is said to be Neyman-orthogonal (or locally robust) if

$$E\left[\nabla_{h}\, m(Z, \theta_0, h_0(X)) \mid X\right] = 0,$$

where $\nabla_h$ denotes the (Fréchet) derivative with respect to the nuisance. This condition guarantees that the first-order effect of nuisance estimation error on the estimator for $\theta$ vanishes, enabling $\sqrt{n}$-consistency and valid inference even when $h$ is estimated at slower-than-$\sqrt{n}$ rates, provided it converges at rate $o(n^{-1/4})$ (first-order orthogonality) (Syrgkanis, 2017, Mackey et al., 2017).
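Orthogonality can be checked numerically. The sketch below is a minimal, self-contained simulation (the partially linear model, nuisance functions, and perturbation direction are illustrative choices, not taken from the cited papers): it perturbs both nuisances of a Robinson-style orthogonal score and of a naive plug-in score, and estimates the directional derivative of each empirical moment. The orthogonal score's derivative is near zero; the naive score's is not.

```python
import numpy as np

# Partially linear model: Y = theta0*T + g0(X) + noise,  T = m0(X) + eta.
rng = np.random.default_rng(0)
n, theta0 = 200_000, 1.5
X = rng.normal(size=n)
m0 = np.sin                                # E[T | X] (illustrative choice)
g0 = np.square                             # outcome nuisance component
T = m0(X) + rng.normal(size=n)
Y = theta0 * T + g0(X) + rng.normal(size=n)
ell0 = lambda x: theta0 * m0(x) + g0(x)    # E[Y | X]

def mean_scores(eps, delta):
    """Empirical moments with both nuisances perturbed by eps * delta(X)."""
    ell, m = ell0(X) + eps * delta(X), m0(X) + eps * delta(X)
    # Robinson-style Neyman-orthogonal score for theta
    ortho = np.mean((Y - ell - theta0 * (T - m)) * (T - m))
    # Naive plug-in score: not orthogonal to the nuisance
    naive = np.mean((Y - theta0 * T - (ell - theta0 * m)) * T)
    return ortho, naive

delta, h = np.sin, 1e-3                    # perturbation direction, step size
d_ortho = (mean_scores(h, delta)[0] - mean_scores(-h, delta)[0]) / (2 * h)
d_naive = (mean_scores(h, delta)[1] - mean_scores(-h, delta)[1]) / (2 * h)
print(f"derivative of orthogonal moment: {d_ortho:+.4f}")  # near zero
print(f"derivative of naive moment:      {d_naive:+.4f}")  # clearly nonzero
```

The first-order insensitivity is exactly what licenses plugging in noisy machine-learning nuisance estimates: their error enters the moment only at second order.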

Higher-order $k$-orthogonality strengthens this robustness. If the moment is $k$-orthogonal (i.e., all derivatives up to order $k$ vanish in expectation), the convergence requirement for estimating $h$ is weakened to $o(n^{-1/(2k+2)})$ (Mackey et al., 2017). For example, second-order orthogonality (sometimes referred to as "doubly-orthogonal" in the literature) demands that both the first and second derivatives with respect to nuisances vanish, permitting the use of noisier machine learning nuisance estimates while maintaining valid inference for $\theta$.

The double robustness property refers to the alternative sufficient paths to consistency: the estimator remains consistent and asymptotically normal for $\theta_0$ if at least one of the working models (e.g., outcome regression or propensity score) is correct (Waernbaum et al., 2017, Evans et al., 2018). Orthogonality of the moment underlies this phenomenon, since it shields the target from first-order nuisance estimation errors.

2. Construction of Doubly Robust Orthogonal Moment Estimators

A canonical example is the augmented inverse probability weighted (AIPW) or efficient influence function-based estimator for the average treatment effect:

$$\tau_{\rm AIPW} = \frac{1}{n} \sum_{i=1}^n \left( \frac{T_i(Y_i - \mu_1(X_i))}{e(X_i)} - \frac{(1-T_i)(Y_i-\mu_0(X_i))}{1-e(X_i)} + \mu_1(X_i) - \mu_0(X_i) \right),$$

where $\mu_t(X) = E[Y \mid T = t, X]$ is the conditional outcome mean, $e(X) = P(T=1 \mid X)$ the propensity score, and the first-order orthogonality of the influence function underpins both double robustness and semiparametric efficiency (Waernbaum et al., 2017, Syrgkanis, 2017).
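The AIPW moment can be sketched in a few lines of numpy. The simulation below, and its deliberately crude outcome model, are illustrative (not from the cited papers); because the propensity is correct, double robustness keeps the estimate consistent even though the outcome models are badly misspecified.

```python
import numpy as np

def aipw_ate(Y, T, mu1, mu0, e, clip=0.01):
    """Augmented IPW estimate of the ATE from fitted nuisance values.

    Y, T     : outcome and binary-treatment arrays
    mu1, mu0 : fitted values of E[Y | T=1, X] and E[Y | T=0, X]
    e        : fitted propensity scores P(T=1 | X), clipped away from 0/1
    """
    e = np.clip(e, clip, 1 - clip)
    psi = (T * (Y - mu1) / e
           - (1 - T) * (Y - mu0) / (1 - e)
           + mu1 - mu0)
    est = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(psi))   # influence-function SE
    return est, se

# Simulated check: true ATE = 2
rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=n)
e_true = 1 / (1 + np.exp(-X))
T = rng.binomial(1, e_true)
Y = 2 * T + X + rng.normal(size=n)

# Deliberately crude outcome model (arm-wise constants); the correct
# propensity alone suffices for consistency -- double robustness at work
mu1 = np.full(n, Y[T == 1].mean())
mu0 = np.full(n, Y[T == 0].mean())
est, se = aipw_ate(Y, T, mu1, mu0, e_true)
print(f"AIPW ATE: {est:.3f} (SE {se:.3f})")   # close to the true value 2
```

Swapping the roles (correct outcome regressions, badly wrong propensity) recovers the truth just as well, which is the second of the two sufficient paths.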

Application of orthogonal moments proceeds generally as follows:

  1. Estimate high-dimensional nuisance functions (e.g., $\mu_t(\cdot), e(\cdot)$) using suitable regression or machine learning procedures, possibly with regularization.
  2. Build plug-in estimates of the orthogonal moment as a functional of the data and the estimated nuisances.
  3. Solve the estimating equation (empirical moment equation) for $\theta$, often employing sample splitting or cross-fitting to ensure independence between nuisance estimation and target estimation steps (Syrgkanis, 2017, Ghosh et al., 2020).
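The three steps above can be sketched for the partially linear model, with polynomial least squares standing in for the machine-learning nuisance fits (an illustrative choice; the data-generating process is also invented for the demo):

```python
import numpy as np

def dml_plr(Y, T, X, n_folds=5, degree=3, seed=0):
    """Cross-fitted (DML-style) estimate of theta in Y = theta*T + g(X) + eps.

    Step 1: fit E[Y|X] and E[T|X] on the held-out folds (here: polynomial
            least squares as a stand-in for an ML learner).
    Step 2: form out-of-fold residuals, the plug-in orthogonal moment.
    Step 3: solve the pooled empirical moment equation for theta.
    """
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(Y))
    Phi = np.vander(X, degree + 1)            # polynomial features
    num = den = 0.0
    for k in range(n_folds):
        train, test = fold != k, fold == k
        bY = np.linalg.lstsq(Phi[train], Y[train], rcond=None)[0]
        bT = np.linalg.lstsq(Phi[train], T[train], rcond=None)[0]
        rY = Y[test] - Phi[test] @ bY         # out-of-fold residuals
        rT = T[test] - Phi[test] @ bT
        num += rY @ rT
        den += rT @ rT
    return num / den

rng = np.random.default_rng(2)
n = 20_000
X = rng.uniform(-2, 2, size=n)
T = np.sin(X) + rng.normal(size=n)
Y = 1.5 * T + X**2 + rng.normal(size=n)
theta_hat = dml_plr(Y, T, X)
print(f"theta_hat = {theta_hat:.3f}")         # close to the true 1.5
```

Because every residual is computed on data the nuisance fits never saw, the overfitting bias that plagues naive plug-in estimation is removed.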

The principle extends to missing data (inverse probability weighting plus outcome regression), semiparametric regression with data fusion and other irregular designs (Evans et al., 2018), as well as to kernel methods for distributional causal effects (Fawkes et al., 2022).

3. Rate and Structure-Agnostic Generalizations

Doubly robust estimators achieve $\sqrt{n}$-consistency under $n^{-1/4}$-rate convergence for the nuisance, if the moment is first-order orthogonal; higher-order $k$-orthogonality permits $o(n^{-1/(2k+2)})$ rates (Mackey et al., 2017). For example, in partially linear regression models for causal effect estimation with non-Gaussian "treatment noise," second-order orthogonal moments exist, and require only $n^{-1/6}$-rate nuisance estimation.

A major technical advance is provided by double cross-fitting (“DCDR” estimators), where two nuisance functions are estimated on separate, independent folds of the data and the final estimate is evaluated on a third fold (McClean et al., 22 Mar 2024). This technique allows for aggressive undersmoothing of nuisances (e.g., using smaller bandwidths in kernel regression), further reducing bias.
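A minimal sketch of the fold bookkeeping follows (the role rotation is one illustrative reading of the double cross-fitting idea, not code from the cited paper): the sample is cut into three disjoint folds, the two nuisance functions are each fit on their own fold, and the moment is evaluated on the third; rotating the roles uses every observation for evaluation.

```python
import numpy as np

def dcdr_splits(n, seed=0):
    """Rotate three disjoint folds through the roles
    (fit nuisance 1, fit nuisance 2, evaluate), so the two nuisance
    estimators never see each other's training data or the eval fold."""
    fold = np.random.default_rng(seed).permutation(n) % 3
    roles = [(0, 1, 2), (1, 2, 0), (2, 0, 1)]
    return [(fold == a, fold == b, fold == c) for a, b, c in roles]

for fit1, fit2, ev in dcdr_splits(12):
    # the three index sets are pairwise disjoint and cover the sample
    assert not (fit1 & fit2).any() and not (fit1 & ev).any()
    assert (fit1 | fit2 | ev).all()
```

The statistical payoff of this independence is that the remainder in the error expansion factorizes into a product of the two nuisance biases, which is what permits aggressive undersmoothing of each fit.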

Key error expansions for DCDR estimators take the form

$$\widehat{\psi}_n - \psi_{\mathrm{ecc}} = (P_n - P)\{\varphi(Z)\} + R_{1n} + R_{2n},$$

where $R_{1n}$ is a product of integrated biases in the two nuisance estimators, and $R_{2n}$ is a second-order variance term due to training-sample overlap. Careful undersmoothing, combined with double cross-fitting, ensures that the first-order term dominates and valid asymptotic normality is achieved even if both nuisance functions are estimated only at sub-$\sqrt{n}$ rates, as long as the product of these rates decays sufficiently fast (McClean et al., 22 Mar 2024).

4. Goodness-of-Fit, Model Assessment, and Extensions

Some construction procedures explicitly introduce extended nuisance models, parameterized to nest the baseline models, specifically to equip the framework with diagnostic capabilities. For example, in missing-not-at-random (MNAR) problems with a shadow variable, an extended propensity score includes a parameter $\phi$; if the baseline model is true, then $\hat\phi \to 0$, providing a direct test for misspecification (Miao et al., 2015).

Similarly, in semiparametric inference with high-dimensional data, regularized calibration (RCAL) targets the (first-order) score equations for the nuisances directly in the estimation procedure, ensuring the removal of spurious first-order noise in the expansion for the target parameter, while retaining the double robustness of the moment construction (Ghosh et al., 2020). This approach assures that calibration errors become negligible, so valid Wald-type and bootstrap inference is attainable.

Extensions of this methodology include kernelized moments for testing distributional treatment effects, which achieve double robustness and improved convergence rates compared to standard kernel mean embedding estimators (Fawkes et al., 2022). Multiply robust methods further generalize the estimator to remain consistent as long as any one of several (e.g., three or four) working models are correctly specified (Zhou, 2020).

5. Applications: Causal Inference, Missing Data, Data Fusion, and Panel Models

Doubly robust orthogonal moment estimators are widely used in:

  • Causal inference with observational data: estimation of average treatment effects via orthogonal moments as in double/debiased machine learning (DML), robust to high-dimensional nuisance estimation via nonparametric or machine learning methods (Syrgkanis, 2017, Huang et al., 2021).
  • Survival analysis: doubly robust TMLE and AIPW estimators enable valid inference for survival probabilities under informative censoring, leveraging ensemble learners for nuisance components and cross-fitting to avoid entropy conditions (Díaz, 2017).
  • Directed acyclic graphs and mediation: estimation of controlled direct effects using DR, triply robust, or quadruply robust moments (Zhou, 2020).
  • Missing not at random (MNAR) problems with shadow variables: doubly robust estimators via augmented models for the missingness mechanism and outcome models, employing shadow variables to achieve identification where MAR is violated (Miao et al., 2015).
  • Data fusion settings: inference on parameters of interest when no subject has complete data, under ignorability and positivity, via DR moment equations that combine modeling the data source process and distribution of missing covariates (Evans et al., 2018).
  • Generalized method of moments (GMM): variance estimation is robustified by incorporating double bias correction, yielding reliable inference in the presence of overidentification and possible misspecification (Hwang et al., 2019, Kleibergen et al., 2021).

Specialized kernels and representation learning approaches further extend DR estimators to high-dimensional and structured outcomes (Chen et al., 10 Dec 2024).

6. Existence, Informational Content, and Informative Orthogonal Moments

The existence of orthogonal (locally robust) moment functions is determined by the restricted local non-surjectivity (RLN) condition, which requires the moment to be orthogonal (in $L_2$) to the tangent space of the nuisance parameter (Argañaraz et al., 2023). Orthogonality alone is not sufficient; the moment must be informative for the parameter of interest, characterized by the efficient Fisher information matrix being nonzero. In semiparametric models with high-dimensional sparsity or unobserved heterogeneity, this condition enables inference on nonregular parameters by constructing valid orthogonal moments insensitive to nuisance estimation.

RLN and efficient Fisher information are applicable to heterogeneous treatment effect estimation (e.g., the Oregon Health Experiment analysis), instrumental variables with many instruments, sample selection models, and models of demand for differentiated products (Argañaraz et al., 2023).

7. Practical Considerations and Limitations

  • Sample splitting/cross-fitting: To prevent bias from using estimated nuisance functions on the same data as the estimation step, sample splitting (and multi-fold cross-fitting) is essential for orthogonality to yield the desired robustness (Syrgkanis, 2017).
  • Nuisance estimation rates: The product of errors in the two working models must satisfy a rate condition (typically $o(n^{-1/2})$ for first-order orthogonality).
  • Model misspecification: Double robustness reduces, but does not eliminate, bias when both working models are incorrect; the bias can remain substantial if, for example, the propensity model is grossly wrong (Waernbaum et al., 2017).
  • Computation: Regularized calibrated estimation, double cross-fitting, kernel mean embedding computations, and efficient sample splitting can be computationally demanding for large datasets but remain tractable and beneficial due to the improved error guarantees (Ghosh et al., 2020, McClean et al., 22 Mar 2024).
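The product-rate condition is easy to illustrate with stylized numbers (the exponents below are hypothetical, chosen only to show that two individually slow rates can still have a fast product):

```python
# Each nuisance converges slower than n^{-1/2}, yet the product
# n^{-0.3} * n^{-0.25} = n^{-0.55} is o(n^{-1/2}), which is all the
# first-order-orthogonal moment requires for sqrt(n) inference.
for n in [10**3, 10**4, 10**5]:
    product = n ** -0.3 * n ** -0.25        # product of the two error rates
    print(f"n={n:>6}: product={product:.2e}  vs  n^(-1/2)={n ** -0.5:.2e}")
```

Neither working model needs to meet the parametric rate on its own; the condition constrains only their joint accuracy, which is why one excellent nuisance fit can compensate for one mediocre one.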

Doubly robust orthogonal moment estimators draw together semiparametric theory, robust estimation, and modern machine learning, providing a rigorous and adaptable toolkit for high-dimensional and complex data regimes across causal inference, missing data analysis, and econometric modeling. The theoretical guarantees under suitable moment orthogonality and practical innovations—such as cross-fitting, model selection, and flexibility of nuisance estimation—have solidified their role as foundational tools for reliable inference in modern statistics and econometrics.
