Reliability of statistical inference after tree-based imputation

Determine whether statistical inference, specifically hypothesis testing and parameter estimation, remains valid when missing values in empirical social science datasets are imputed with tree-based multiple imputation methods. The aim is to establish whether standard inferential procedures maintain nominal Type I error control and adequate power when the imputation models are tree-based rather than parametric.

Background

The paper contrasts the widely used MICE with Predictive Mean Matching (PMM) against emerging tree-based imputation methods (e.g., Random Forest via MICE RF and missRanger, and Extreme Gradient Boosting via mixgb). While tree-based approaches can better capture complex, nonlinear relationships and handle mixed data types, prior evidence (e.g., with missForest) found inflated Type I error rates in specific designs.

Given the increasing adoption of methods like missRanger in empirical studies, the authors highlight the unresolved question of whether statistical inference remains reliable when such tree-based imputation is used. This motivates their simulation study, which assesses coefficient bias, Type I error control, and power across multiple imputation methods and missingness scenarios.
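To make the three evaluation criteria concrete, the following is a minimal sketch of how they might be computed from the output of a simulation, assuming each replication has already produced a coefficient estimate and a p-value after imputation. The function names `rejection_rate` and `coefficient_bias`, the default `alpha=0.05`, and the uniform p-values in the demo are illustrative assumptions, not the authors' code or design.

```python
import math
import random
import statistics


def rejection_rate(p_values, alpha=0.05):
    """Share of replications with p < alpha and its Monte Carlo standard error.

    Applied to replications where the true effect is zero, this estimates the
    Type I error rate; applied to replications with a nonzero effect, it
    estimates power.
    """
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    mc_se = math.sqrt(rate * (1 - rate) / n)
    return rate, mc_se


def coefficient_bias(estimates, true_value):
    """Mean estimate across replications minus the true coefficient."""
    return statistics.fmean(estimates) - true_value


if __name__ == "__main__":
    # Hypothetical stand-in for simulation output: p-values of a valid test
    # under the null are uniform on [0, 1], so the rate should be near alpha.
    rng = random.Random(0)
    null_p_values = [rng.random() for _ in range(2000)]
    rate, mc_se = rejection_rate(null_p_values)
    print(f"Estimated Type I error: {rate:.3f} (Monte Carlo SE {mc_se:.3f})")
```

An estimated rate that lies well above the nominal level, by more than a few Monte Carlo standard errors, would indicate the kind of Type I error inflation that motivates the question.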

References

Despite, e.g., missRanger's growing use in empirical studies, a critical question remains unanswered: is statistical inference reliable for data imputed using tree-based methods? This is important because a predecessor of missRanger, the original missForest which does not allow for predictive mean matching, has led to inflated Type I errors for specific designs in previous research.

Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies (2401.09602 - Schwerter et al., 17 Jan 2024) in Main text (Introduction), after discussion of tree-based methods; pages 2–3