Doubly Robust Estimators
- Doubly robust estimators are defined by their ability to yield consistent estimates when either the selection model via IPW or the outcome model via mass imputation is correctly specified.
- They combine the two approaches and come with explicit variance and covariance formulas, which support inference from complex survey data.
- An optimal linear combination with a probability-sample estimator improves efficiency, making these methods well suited to integrating nonprobability and probability survey samples.
Doubly robust estimators are a prominent class of semiparametric estimators, designed to deliver consistent inference on target parameters (such as population means or causal effects) even under model misspecification. The defining characteristic of these estimators is their “double robustness”: consistency is achieved if at least one of two nuisance models—typically a model for the selection/response mechanism and a model for the outcome (mass imputation)—is correctly specified. The framework is especially valuable for integrating survey data from nonprobability and probability samples, where missingness and selection processes can be complex and only partial information is available.
1. Foundations and Structure of Doubly Robust Estimators
Doubly robust estimators combine two distinct methodologies for adjusting for selection or missingness:
- Inverse Probability Weighting (IPW): The estimator models the selection or missingness probability πᴮ(X; α), typically via parametric regression (e.g., logistic model), and reweights the observed outcomes accordingly.
- Mass Imputation (MI): An imputation model m(X; β) predicts missing outcomes using observed covariates, applied either via parametric or nonparametric regression.
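The basic doubly robust estimator of the finite-population mean takes the schematic form: $\widehat{Y}_{DR} = \frac{1}{N}\sum_{i=1}^{N} \frac{R^{A}_{i}}{\pi^{A}_{i}}\, m(X_i;\hat{\beta}) + \frac{1}{N}\sum_{i=1}^{N} \frac{R^{B}_{i}}{\pi^{B}(X_i;\hat{\alpha})}\, [Y_i - m(X_i;\hat{\beta})],$ where $R^{B}_{i}$ denotes selection into the nonprobability sample (in which Y is observed), $R^{A}_{i}$ and $\pi^{A}_{i}$ are the selection indicator and inclusion probability for the probability sample, and $\pi^{B}(X;\hat{\alpha})$ and $m(X;\hat{\beta})$ are the estimated selection and outcome models. Alternative forms use Hajek-type normalization.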
Double robustness means that if either πᴮ(X; α) or m(X; β) is correctly specified, $\widehat{Y}_{DR}$ remains a consistent estimator of the target population mean or prevalence, which safeguards against misspecification of the other model.
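As a minimal sketch of the schematic estimator described above, the following function assumes the commonly used form in which design-weighted predictions from the probability sample are added to inverse-probability-weighted residuals from the nonprobability sample. The function name `dr_mean`, its argument names, and the toy numbers are hypothetical and chosen only for illustration. The nuisance quantities are taken as already estimated, and N is assumed known.

```python
import numpy as np

def dr_mean(y_b, pi_b, m_b, m_a, d_a, N):
    """Schematic doubly robust estimate of a finite-population mean.

    y_b  : outcomes observed in the nonprobability sample B
    pi_b : estimated selection probabilities for the units in B
    m_b  : outcome-model predictions for the units in B
    m_a  : outcome-model predictions for the units in the probability sample A
    d_a  : design weights 1 / pi^A_i for the units in A
    N    : population size (assumed known in this sketch)
    """
    y_b, pi_b, m_b, m_a, d_a = map(np.asarray, (y_b, pi_b, m_b, m_a, d_a))
    residual_term = np.sum((y_b - m_b) / pi_b) / N  # IPW-weighted residuals from B
    imputation_term = np.sum(d_a * m_a) / N         # design-weighted predictions from A
    return residual_term + imputation_term

# Toy inputs (made-up numbers, for illustration only)
y_b, pi_b, m_b = [3.0, 5.0], [0.5, 0.5], [2.0, 4.0]
m_a, d_a, N = [1.0, 2.0, 4.0], [2.0, 2.0, 2.0], 6

dr = dr_mean(y_b, pi_b, m_b, m_a, d_a, N)
ipw = dr_mean(y_b, pi_b, [0.0, 0.0], [0.0, 0.0, 0.0], d_a, N)  # predictions set to zero
mi = dr_mean(m_b, pi_b, m_b, m_a, d_a, N)                      # residuals set to zero
```

Setting all predictions to zero reduces the expression to the IPW estimator, and setting the residuals to zero reduces it to the mass imputation estimator, which is one way to see how the DR form nests both approaches.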
2. Variance and Covariance Formulas for Doubly Robust Estimators
Seaman et al. (7 Aug 2025) provide explicit formulas for the asymptotic variance of the doubly robust estimator and for its covariance with probability-sample-only estimators:
Variance Formula
Let Δₘᵢ and Δ_yᵢ denote adjustment terms that depend on whether one or both nuisance models are correctly specified (Δₘᵢ = 0 and Δ_yᵢ = m(Xᵢ; β) when both are correct): $\mathrm{Var}(\widehat{Y}_{DR} - Y \mid F) = \mathrm{Var}\left( \frac{1}{N}\sum_{i=1}^{N} \frac{R^{A}_{i}}{\pi^{A}_{i}} [m(X_i;\beta) - \Delta_{mi}] \,\middle|\, F \right) + \frac{1}{N^2}\sum_{i=1}^{N} \frac{1-\pi^{B}(X_i)}{\pi^{B}(X_i)} E[(Y_i-\Delta_{yi})^2 \mid F] + o_p(N^{-1}).$
Covariance Formula
The covariance between the DR estimator and a probability sample estimator (either Horvitz–Thompson or Hajek) is: $\mathrm{Cov}(\widehat{Y}_{DR} - Y,\, \widehat{Y}_{PS} - Y \mid F) = \mathrm{Cov}\left( \frac{1}{N}\sum_{i=1}^{N} \frac{R^{A}_{i}}{\pi^{A}_{i}} [ m(X_i;\beta)-\Delta_{mi} ],\; \frac{1}{N}\sum_{i=1}^{N} \frac{R^{A}_{i}}{\pi^{A}_{i}} (Y_i-\Gamma) \,\middle|\, F \right) + o_p(N^{-1}),$ where Γ is 0 for the Horvitz–Thompson estimator and Y for the Hajek estimator.
These explicit forms are crucial for proper variance estimation, especially when combining information from both probability and nonprobability surveys.
3. Efficient Combination of Nonprobability and Probability Sample Estimates
When Y is available for both the nonprobability and probability samples, the paper proposes an optimal linear combination of the DR estimator and the direct probability sample estimator: $\widehat{Y}_{combined} = (1-w)\,\widehat{Y}_{DR} + w\,\widehat{Y}_{PS},$ where the optimal weight is given analytically by: $w = \frac{\mathrm{Var}(\widehat{Y}_{DR}-Y \mid F) - \mathrm{Cov}(\widehat{Y}_{DR}-Y,\widehat{Y}_{PS}-Y \mid F)} {\mathrm{Var}(\widehat{Y}_{DR}-Y \mid F) + \mathrm{Var}(\widehat{Y}_{PS}-Y \mid F) - 2\,\mathrm{Cov}(\widehat{Y}_{DR}-Y,\widehat{Y}_{PS}-Y \mid F)}.$ The variance of this combined estimator is: $\mathrm{Var}(\widehat{Y}_{combined} - Y \mid F) = (1-w)^2\,\mathrm{Var}(\widehat{Y}_{DR}-Y \mid F) + 2w(1-w)\,\mathrm{Cov}(\widehat{Y}_{DR}-Y,\widehat{Y}_{PS}-Y \mid F) + w^2\,\mathrm{Var}(\widehat{Y}_{PS}-Y \mid F).$ This choice of w minimizes the variance among linear combinations of the two estimators, and it accounts for the correlation that arises because both estimates draw on the auxiliary probability sample.
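The weight and variance expressions above translate directly into code. The sketch below is only an illustration: the function names `optimal_weight` and `combined_variance` are hypothetical, and the three inputs are assumed to be estimates of the variances and covariance that have already been obtained elsewhere.

```python
def optimal_weight(var_dr, var_ps, cov):
    """Weight on the probability-sample estimator that minimizes the combined variance."""
    denom = var_dr + var_ps - 2.0 * cov  # equals Var(DR estimator - PS estimator)
    if denom <= 0:
        raise ValueError("var_dr + var_ps - 2*cov must be positive")
    return (var_dr - cov) / denom

def combined_variance(w, var_dr, var_ps, cov):
    """Variance of (1 - w) * DR + w * PS for a given weight w."""
    return (1 - w) ** 2 * var_dr + 2 * w * (1 - w) * cov + w ** 2 * var_ps

# Made-up inputs for illustration
var_dr, var_ps, cov = 4.0, 1.0, 0.5
w = optimal_weight(var_dr, var_ps, cov)          # 0.875
v = combined_variance(w, var_dr, var_ps, cov)    # 0.9375, below both 4.0 and 1.0
```

The formula follows from setting the derivative of the combined variance with respect to w to zero. Note that w can fall outside [0, 1] when the covariance exceeds one of the two variances, so a caller may want to inspect the result before using it.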
4. Comparative Features and Properties
Estimator Type | Consistent When | Robust to One Misspecified Model | Limiting Variance Formula |
---|---|---|---|
IPW | Needs π only | No | Standard IPW |
Mass Imputation | Needs m only | No | MI variance |
Doubly Robust | Needs m or π | Yes | See Section 2 formula |
Combined Estimator | Needs m or π or design | Yes (hybrid) | Weighted sum/covariance |
When both models are correctly specified, the DR estimator achieves efficiency equivalent to the direct probability estimator. If only one model is correct, the DR estimator remains consistent, whereas whichever of IPW or MI relies on the misspecified model does not. If neither model is correct, consistency fails, a limitation common to all such approaches.
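The consistency pattern described above can be checked numerically. The following simulation is a sketch under strong simplifying assumptions that are not taken from the paper: a made-up population with a linear outcome and logistic selection, a simple random probability sample, and true (rather than estimated) nuisance functions in the "correct" cases. All names and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated finite population (all values are illustrative assumptions)
N = 200_000
x = rng.normal(size=N)
y = 2.0 + 3.0 * x + rng.normal(size=N)
true_mean = y.mean()

# Nonprobability sample B: units with large x are more likely to take part
pi_true = 1.0 / (1.0 + np.exp(2.0 - x))
in_b = rng.random(N) < pi_true

# Probability sample A: simple random sample with design weight N / n_a
n_a = 5_000
idx_a = rng.choice(N, size=n_a, replace=False)
d_a = N / n_a

def dr_estimate(pi, m):
    """DR estimate from working selection probabilities pi and predictions m."""
    residual_term = np.sum((y[in_b] - m[in_b]) / pi[in_b]) / N
    imputation_term = np.sum(d_a * m[idx_a]) / N
    return residual_term + imputation_term

m_true = 2.0 + 3.0 * x              # correctly specified outcome model
m_wrong = np.zeros(N)               # misspecified: ignores x entirely
pi_wrong = np.full(N, in_b.mean())  # misspecified: treats selection as uniform

errors = {
    "both correct": dr_estimate(pi_true, m_true) - true_mean,
    "only pi correct": dr_estimate(pi_true, m_wrong) - true_mean,
    "only m correct": dr_estimate(pi_wrong, m_true) - true_mean,
    "neither correct": dr_estimate(pi_wrong, m_wrong) - true_mean,
}
for label, err in errors.items():
    print(f"{label:>16}: error = {err:+.3f}")
```

In this setup the error should be small in the first three cases and large only when both working models are wrong. A single simulated draw illustrates the property but does not establish it, and it says nothing about relative efficiency.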
5. Practical Implications for Survey Data Integration
The described DR framework provides a theoretically grounded and practically implementable approach to integrating nonprobability and probability survey data (Seaman et al., 7 Aug 2025). It enables the analyst to:
- Achieve consistent inference provided at least one of the two models is correctly specified, which is critically important in settings where nonprobability survey selection may be poorly understood.
- Attain improved efficiency by optimally combining DR-adjusted and probability sample estimators using explicit variance-covariance estimates.
- Recognize that dependence between DR and probability estimators (induced by shared covariates) must be accounted for to avoid underestimating standard errors.
The methodology is directly applicable in official statistics and large-scale observational studies where auxiliary surveys can supplement nonprobability samples, allowing robust estimation of finite-population means, prevalences, or totals.
6. Theoretical and Methodological Considerations
DR estimators rely on strong technical assumptions for their consistency and efficiency properties. Notably:
- Correct specification of at least one nuisance model is necessary for consistency; double robustness offers no protection when both models are misspecified.
- All variance formulas and efficiency results presuppose regularity conditions (e.g., positivity, bounded inverse weights, parametric estimation).
- In many practical cases, the selection and outcome models may need to contain the same covariates for theoretical guarantees (as in the Kim–Haziza construction).
- Computational considerations arise when estimating the covariance across estimators sharing overlapping covariate data, necessitating appropriate resampling or replication variance estimation techniques.
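To illustrate the last point, the sketch below uses a naive with-replacement bootstrap to estimate the variances of a DR estimate and a Horvitz–Thompson estimate together with their covariance. It is not the paper's variance estimator. It assumes a simple random probability sample, holds the working outcome model and selection probabilities fixed across replicates, and ignores finite-population corrections, all of which a production replication method would need to address. The data, names, and sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy samples (simulated; sizes and models are illustrative assumptions)
N = 100_000
n_a, n_b = 2_000, 5_000
x_a = rng.normal(size=n_a)                    # probability sample A
y_a = 2.0 + 3.0 * x_a + rng.normal(size=n_a)
d_a = np.full(n_a, N / n_a)                   # design weights under SRS
x_b = rng.normal(loc=0.5, size=n_b)           # nonprobability sample B
y_b = 2.0 + 3.0 * x_b + rng.normal(size=n_b)
pi_b = np.full(n_b, n_b / N)                  # working selection probabilities

def m(x):
    """Working outcome model, held fixed across replicates."""
    return 2.0 + 3.0 * x

def estimates(ia, ib):
    """Return (DR estimate, Horvitz-Thompson estimate) for resampled indices."""
    imputation_term = np.sum(d_a[ia] * m(x_a[ia])) / N
    residual_term = np.sum((y_b[ib] - m(x_b[ib])) / pi_b[ib]) / N
    ps = np.sum(d_a[ia] * y_a[ia]) / N
    return imputation_term + residual_term, ps

n_rep = 500
reps = np.empty((n_rep, 2))
for r in range(n_rep):
    ia = rng.integers(0, n_a, size=n_a)  # resample A with replacement
    ib = rng.integers(0, n_b, size=n_b)  # resample B with replacement
    reps[r] = estimates(ia, ib)

cov_mat = np.cov(reps, rowvar=False)     # 2 x 2 replicate covariance matrix
var_dr, var_ps, cov_dr_ps = cov_mat[0, 0], cov_mat[1, 1], cov_mat[0, 1]
```

Because both estimates reuse the same resampled units from A in every replicate, the off-diagonal entry of `cov_mat` captures their dependence; resampling A separately for each estimator would wrongly drive that covariance toward zero.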
Extensions to settings where both outcome and selection models are high-dimensional or nonparametric, or where machine learning methods are used for nuisance estimation, are important contemporary areas of research but may require additional methodological safeguards.
7. Summary Table: DR, IPW, and Combined Estimator Properties
Estimator | Requires Known Design | Consistent If... | Incorporates Aux. Covariates | Variance Formula (Key) |
---|---|---|---|---|
IPW | Yes | π correct | Yes | IPW variance |
Mass Imputation | No | m correct | Yes | MI variance |
DR | Yes/No | π or m correct | Yes | DR (sec 2) |
Combined | Yes | π or m or design correct | Yes | Weighted covariance |
This table summarizes the conditions and features for each estimator class as reflected in (Seaman et al., 7 Aug 2025).
In summary, doubly robust estimators for survey data integration combine inverse probability weighting and mass imputation, and they yield consistent inference when at least one of the two nuisance models is correctly specified. The variance and covariance formulas, together with the optimal combination framework, support rigorous uncertainty quantification and make use of all available information across probability and nonprobability samples.