
Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies (2401.09602v1)

Published 17 Jan 2024 in stat.AP and stat.ML

Abstract: Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered the standard method in the social science literature, the growing number of complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, their performance and validity are not completely understood, particularly in comparison with the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly for coefficients whose true value is non-zero. Our results thus underscore the potential advantages of tree-based imputation methods, with the caveat that all methods, and missRanger in particular, perform worse as missingness increases.
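For readers unfamiliar with Predictive Mean Matching, the basic idea can be sketched in a few lines of Python. This is an illustrative simplification and not the procedure used in the paper: it assumes a single fully observed predictor `x`, one incomplete variable `y` (missing entries are `None`), and a plain least-squares fit. The function names `fit_line`, `pmm_impute`, and `multiply_impute`, the donor count `k`, and the toy data are all hypothetical. Implementations such as the R package mice additionally draw the regression parameters before matching, which this sketch omits.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(x, y, k=3, rng=None):
    """Return a copy of y in which each None is replaced by an observed donor value."""
    rng = rng or random.Random()
    # Indices of the cases where y is observed; these are the potential donors.
    obs = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_line([x[i] for i in obs], [y[i] for i in obs])
    pred = [a + b * xi for xi in x]
    out = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            # The k observed cases whose predicted value is closest to this case's.
            donors = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:k]
            # Impute the *observed* value of one randomly chosen donor.
            out[i] = y[rng.choice(donors)]
    return out


def multiply_impute(x, y, m=5, seed=0):
    """Create m completed copies of y, as in multiple imputation."""
    rng = random.Random(seed)
    return [pmm_impute(x, y, rng=rng) for _ in range(m)]


# Toy data (hypothetical): two values of y are missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 2.0, None, 4.2, 4.9, None]
completed = multiply_impute(x, y, m=5)
```

Because every imputed value is copied from an observed case, PMM never produces values outside the observed range. In a full analysis, the model of interest would be fitted to each completed dataset and the results pooled across the `m` copies.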
