
Causal machine learning methods and use of sample splitting in settings with high-dimensional confounding (2405.15242v2)

Published 24 May 2024 in stat.ME

Abstract: Observational epidemiological studies commonly seek to estimate the causal effect of an exposure on an outcome. Adjustment for potential confounding bias in modern studies is challenging due to the presence of high-dimensional confounding, which occurs when there are many confounders relative to sample size or complex relationships between continuous confounders and exposure and outcome. Despite recent advances, limited evaluation and guidance are available on the implementation of the doubly robust methods Augmented Inverse Probability Weighting (AIPW) and Targeted Maximum Likelihood Estimation (TMLE) with data-adaptive approaches and cross-fitting in realistic settings where high-dimensional confounding is present. Motivated by an early-life cohort study, we conducted an extensive simulation study to compare the relative performance of AIPW and TMLE using data-adaptive approaches in estimating the average causal effect (ACE). We evaluated the benefits of using cross-fitting with a varying number of folds, as well as the impact of using a reduced versus full (larger, more diverse) library in the Super Learner ensemble learning approach used for implementation. We found that AIPW and TMLE performed similarly in most cases for estimating the ACE, but TMLE was more stable. Cross-fitting improved the performance of both methods, but was more important for estimation of standard error and coverage than for point estimates, with the number of folds a less important consideration. Using a full Super Learner library was important to reduce bias and variance in complex scenarios typical of modern health research studies.
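To make the idea of AIPW with cross-fitting concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary exposure `A`, a continuous outcome `Y`, and a confounder matrix `W`; it replaces the Super Learner with simple parametric nuisance models (logistic regression for the propensity score, linear regression for the outcome) so that it runs with NumPy alone. The function names, the simulated data, and the propensity truncation bounds of 0.01 and 0.99 are illustrative assumptions.

```python
import numpy as np


def fit_logistic(X, a, n_iter=25):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (a - p)
        # Tiny ridge term keeps the Hessian invertible (an assumption for stability).
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def cross_fit_aipw(W, A, Y, n_folds=5, seed=0):
    """Cross-fitted AIPW estimate of the average causal effect and its standard error."""
    n = len(Y)
    X = np.column_stack([np.ones(n), W])
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    phi = np.empty(n)  # per-observation AIPW scores (uncentered influence function)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Propensity score model, fitted on the training folds only.
        beta = fit_logistic(X[train], A[train])
        g = 1.0 / (1.0 + np.exp(-X[test] @ beta))
        g = np.clip(g, 0.01, 0.99)  # illustrative truncation bounds
        # Outcome models, fitted separately in each exposure arm of the training folds.
        t1, t0 = train & (A == 1), train & (A == 0)
        b1 = np.linalg.lstsq(X[t1], Y[t1], rcond=None)[0]
        b0 = np.linalg.lstsq(X[t0], Y[t0], rcond=None)[0]
        q1, q0 = X[test] @ b1, X[test] @ b0
        a, y = A[test], Y[test]
        # Plug-in difference plus inverse-probability-weighted residual corrections.
        phi[test] = q1 - q0 + a * (y - q1) / g - (1 - a) * (y - q0) / (1 - g)
    ace = phi.mean()
    se = phi.std(ddof=1) / np.sqrt(n)
    return ace, se


# Simulated data with a true ACE of 2 (hypothetical data-generating process).
rng = np.random.default_rng(42)
n = 2000
W = rng.normal(size=(n, 3))
p_true = 1.0 / (1.0 + np.exp(-(0.5 * W[:, 0] - 0.5 * W[:, 1])))
A = rng.binomial(1, p_true)
Y = 2.0 * A + W[:, 0] + W[:, 1] + rng.normal(size=n)

ace, se = cross_fit_aipw(W, A, Y, n_folds=5)
print(f"ACE estimate: {ace:.3f}, SE: {se:.3f}")
```

In this sketch each observation's score uses nuisance predictions from models that never saw that observation, which is the sample-splitting idea the abstract refers to. Swapping the two parametric fits for flexible learners would bring it closer to the data-adaptive setting the paper evaluates.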
