Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies (2401.09602v1)
Abstract: Dealing with missing data is an important problem in statistical analysis and is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered the standard in the social science literature, increasingly complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, their performance and validity are not completely understood, particularly in comparison with the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly for coefficients whose true value is non-zero. Our results thus underscore the potential advantages of tree-based imputation methods, with the caveat that all methods, particularly missRanger, perform worse as missingness increases.