Doubly robust causal inference through penalized bias-reduced estimation: combining non-probability samples with designed surveys (2403.18039v1)
Abstract: Causal inference on the average treatment effect (ATE) using non-probability samples, such as electronic health records (EHR), faces challenges from sample selection bias and high-dimensional covariates. Addressing these challenges requires a selection model alongside the treatment and outcome models that are typical ingredients of causal inference. This paper considers integrating large non-probability samples with external probability samples from a designed survey, addressing moderately high-dimensional confounders and variables that influence selection. In contrast to the two-step approach that separates variable selection and debiased estimation, we propose a one-step plug-in doubly robust (DR) estimator of the ATE. We construct a novel penalized estimating equation by minimizing the squared asymptotic bias of the DR estimator. Our approach facilitates ATE inference in high-dimensional settings by allowing the variability from estimating nuisance parameters to be ignored, a property that is not guaranteed in conventional likelihood approaches with non-differentiable L1-type penalties. We provide a consistent variance estimator for the DR estimator. Simulation studies demonstrate the double robustness of our estimator under misspecification of either the outcome model or the selection and treatment models, as well as the validity of statistical inference under penalized estimation. We apply our method to integrate EHR data from the Michigan Genomics Initiative with an external probability sample.
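For orientation, the following is a minimal sketch of a generic weighted augmented inverse-probability-weighted (AIPW) estimator, the textbook form of a doubly robust ATE estimator. It is not the penalized bias-reduced estimator proposed in the paper. The function name `aipw_ate`, its arguments, and the use of unit weights `w` (for example, inverse estimated selection probabilities) are assumptions made for illustration, and the nuisance fits `ps`, `mu1`, and `mu0` are taken as given rather than estimated here.

```python
import numpy as np

def aipw_ate(y, a, ps, mu1, mu0, w=None):
    """Weighted AIPW (doubly robust) estimate of the ATE.

    y   : observed outcomes
    a   : binary treatment indicator (0/1)
    ps  : fitted treatment probabilities P(A=1 | X), strictly in (0, 1)
    mu1 : fitted outcome predictions under treatment
    mu0 : fitted outcome predictions under control
    w   : optional unit weights (hypothetical, e.g. inverse selection
          probabilities); defaults to equal weights
    """
    y, a, ps, mu1, mu0 = (np.asarray(v, dtype=float) for v in (y, a, ps, mu1, mu0))
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Outcome-model prediction plus an inverse-probability-weighted residual.
    phi1 = mu1 + a * (y - mu1) / ps
    phi0 = mu0 + (1 - a) * (y - mu0) / (1 - ps)
    # Weighted (Hajek-type) average of the per-unit contrasts.
    return np.sum(w * (phi1 - phi0)) / np.sum(w)

# Toy data, made up for illustration only.
y = [3.0, 1.0, 4.0, 2.0]
a = [1, 0, 1, 0]

# Case 1: outcome predictions match the observed outcomes, treatment model arbitrary.
est_outcome_ok = aipw_ate(y, a, ps=[0.3, 0.6, 0.8, 0.2],
                          mu1=[3, 3, 4, 4], mu0=[1, 1, 2, 2])

# Case 2: outcome predictions set to zero, so the estimate reduces to plain IPW.
est_ipw_only = aipw_ate(y, a, ps=[0.5] * 4, mu1=[0] * 4, mu0=[0] * 4)
```

In the first toy case the residual terms vanish, so the estimate depends only on the outcome predictions; in the second, the estimate reduces to inverse probability weighting. This mirrors, in a very simplified way, why a DR estimator can remain valid when only one set of nuisance models is correct.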