Random Forest Missing Data Algorithms (1701.05305v2)

Published 19 Jan 2017 in stat.ML

Abstract: Random forest (RF) missing data algorithms are an attractive approach for dealing with missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms but relatively little guidance about their efficacy, which motivated us to study their performance. Using a large, diverse collection of data sets, performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a new promising imputation algorithm called missForest. Performance of algorithms was assessed by ability to impute data accurately. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.

Summary

The paper presents a comprehensive experimental evaluation of RF imputation methods, revealing strong correlation-dependent performance.
It introduces mForest, a multivariate extension of missForest that reduces computational costs in handling high-dimensional data.
Findings suggest that RF imputation can outperform k-nearest neighbor methods in low to medium correlation settings.

Overview of Random Forest Missing Data Algorithms

The paper authored by Fei Tang and Hemant Ishwaran provides a comprehensive evaluation of Random Forest (RF) missing data algorithms. Recognizing the disadvantages of discarding data when missingness occurs, which leads to potential loss of information and bias, the authors investigate the efficacy of several RF-based imputation techniques. The paper is motivated by the current lack of guidance regarding the performance of these algorithms, despite their advantageous ability to handle mixed data types and complex interactions in high-dimensional settings.

The paper compares RF imputation strategies on a large, diverse collection of data sets under three missing data mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), and Not Missing at Random (NMAR). The strategies considered are proximity imputation, on-the-fly imputation, and imputation using multivariate unsupervised and supervised splitting, the latter class generalizing the missForest algorithm.
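The three mechanisms differ in what the probability of missingness depends on. The sketch below is an illustration only, not the paper's simulation design: the function name `make_missing`, the logistic form of the missingness probability, and the two-column layout are all assumptions made for this example. Under MCAR the probability is constant, under MAR it depends on a fully observed variable `x`, and under NMAR it depends on the value `y` that is itself being removed.

```python
import math
import random


def sigmoid(t):
    """Logistic function, used here to turn a value into a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def make_missing(rows, mechanism, rate=0.25, seed=0):
    """Return a copy of rows (pairs (x, y)) with some y values set to None.

    Only y is ever masked; x stays fully observed. `rate` is the
    approximate overall missing fraction when x and y are centred at 0.
    (Hypothetical helper for illustration.)
    """
    if not 0.0 <= rate <= 0.5:
        raise ValueError("rate must lie in [0, 0.5] so probabilities stay <= 1")
    rng = random.Random(seed)
    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate                      # independent of all data
        elif mechanism == "MAR":
            p = 2.0 * rate * sigmoid(x)   # depends only on the observed x
        elif mechanism == "NMAR":
            p = 2.0 * rate * sigmoid(y)   # depends on the value being removed
        else:
            raise ValueError("unknown mechanism: " + str(mechanism))
        masked.append((x, None if rng.random() < p else y))
    return masked


# Small demonstration on synthetic standard-normal data.
data_rng = random.Random(1)
rows = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(1000)]
for mechanism in ("MCAR", "MAR", "NMAR"):
    n_missing = sum(y is None for _, y in make_missing(rows, mechanism))
    print(mechanism, "missing y values:", n_missing)
```

Under NMAR in this sketch, large values of `y` are removed more often, so the observed `y` values are a biased sample. That is what makes the mechanism the hardest of the three for any imputation method.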

Key Findings and Contributions

The authors present an extensive experimental study involving 60 diverse datasets with varying dimensions and correlation structures. This empirical assessment ranks the imputation algorithms by accuracy and computational speed. A pivotal finding is that the performance of RF algorithms depends strongly on correlation: high correlation among variables improves all assessed algorithms, and benefits missForest most markedly.
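Imputation accuracy in experiments of this kind is typically scored by masking known values, imputing them, and comparing the imputed values with the truth. The sketch below shows one common scoring scheme, assumed here for illustration and not confirmed as the paper's exact metric: a root mean squared error over the masked entries, standardized by the variable's variance, for continuous variables, and a misclassification rate for categorical ones. The function names are hypothetical.

```python
import math


def imputation_nrmse(truth, imputed, mask):
    """Standardized RMSE over masked entries of one continuous variable.

    truth, imputed: equal-length lists of floats.
    mask: list of bools, True where the value was artificially removed.
    A score of 0 is perfect; about 1 matches imputing the variable's mean.
    """
    idx = [i for i, m in enumerate(mask) if m]
    if not idx:
        raise ValueError("no masked entries to score")
    mse = sum((imputed[i] - truth[i]) ** 2 for i in idx) / len(idx)
    mean = sum(truth) / len(truth)
    var = sum((t - mean) ** 2 for t in truth) / len(truth)
    if var == 0:
        raise ValueError("variable is constant; cannot standardize")
    return math.sqrt(mse / var)


def imputation_misclassification(truth, imputed, mask):
    """Fraction of masked entries of one categorical variable imputed wrongly."""
    idx = [i for i, m in enumerate(mask) if m]
    if not idx:
        raise ValueError("no masked entries to score")
    return sum(imputed[i] != truth[i] for i in idx) / len(idx)
```

Standardizing by the variance puts variables on different scales on a comparable footing, which matters when errors are averaged across many variables and datasets.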

The paper proposes a multivariate version of missForest, named mForest, which reduces the computational burden through variable grouping and multivariate forest regression. The algorithm achieved appreciable computational savings over missForest, a result that holds promise for practical applications involving large datasets.
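The source of the savings can be seen from the grouping step alone. The sketch below is a reading of "variable grouping" under stated assumptions, not the authors' implementation: the `p` variables are randomly partitioned into groups of roughly `alpha * p` variables, and each group is used in turn as a multivariate response with the remaining variables as predictors. The names `make_variable_groups`, `regression_plan`, and `alpha` are hypothetical, and the forest regression itself is left out.

```python
import random


def make_variable_groups(p, alpha, seed=0):
    """Randomly partition variable indices 0..p-1 into groups of ~alpha*p."""
    if p < 2:
        raise ValueError("need at least two variables")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Keep at least one variable out of each group to act as a predictor.
    size = min(p - 1, max(1, round(alpha * p)))
    order = list(range(p))
    random.Random(seed).shuffle(order)
    return [order[i:i + size] for i in range(0, p, size)]


def regression_plan(p, alpha, seed=0):
    """List (responses, predictors) pairs, one per multivariate regression."""
    plan = []
    for group in make_variable_groups(p, alpha, seed):
        in_group = set(group)
        predictors = [j for j in range(p) if j not in in_group]
        plan.append((sorted(group), predictors))
    return plan


# With 20 variables and alpha = 0.25, one pass needs 4 multivariate
# regressions, where a one-variable-at-a-time scheme would need up to 20.
for responses, predictors in regression_plan(20, 0.25):
    print("responses:", responses, "| number of predictors:", len(predictors))
```

In this reading, one pass over the data fits about `1 / alpha` forests instead of one forest per variable, which is where the reduction in computation would come from.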

The research also highlights the potential of RF imputation methods to outperform established techniques such as the $k$-nearest neighbor (KNN) method, especially in medium to low correlation settings. This advantage is attributed to RF's nature as an adaptive nearest neighbor method, which makes it robust to complex interactions in the data.
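The "adaptive nearest neighbor" view can be made concrete with proximity-weighted imputation. In a random forest, the proximity between two cases is commonly defined as the fraction of trees in which they land in the same terminal node. The sketch below assumes such a proximity matrix has already been computed (the toy `prox` values are made up for illustration) and imputes each missing value as a proximity-weighted average of the observed values. The function name is hypothetical, and this is a simplified illustration rather than the paper's proximity imputation procedure.

```python
def proximity_impute(values, prox):
    """Impute None entries of one continuous variable by weighted averaging.

    values: list of floats or None, one entry per case.
    prox:   square matrix (list of lists); prox[i][j] is the assumed
            forest proximity between cases i and j, in [0, 1].
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fallback = sum(observed) / len(observed)
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        num = 0.0
        den = 0.0
        for j, w in enumerate(values):
            if j != i and w is not None:
                num += prox[i][j] * w
                den += prox[i][j]
        # If case i shares no terminal node with any observed case,
        # fall back to the plain mean of the observed values.
        out[i] = num / den if den > 0 else fallback
    return out


# Toy example: case 1 is close to case 0 and far from case 2.
prox = [
    [1.00, 0.75, 0.00],
    [0.75, 1.00, 0.25],
    [0.00, 0.25, 1.00],
]
print(proximity_impute([1.0, None, 3.0], prox))
```

Unlike KNN, which weights neighbors by a fixed distance over all variables and a fixed `k`, the weights here come from the forest's splits, so the effective neighborhood adapts to whichever variables the trees found informative.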

Theoretical and Practical Implications

Theoretically, this work strengthens the understanding of RF as an effective tool in data imputation, offering methodological insights into enhancing algorithm scalability and precision in challenging data scenarios. Practically, the implications extend to fields dealing with incomplete data in genomic, neuroimaging, and other high-throughput domains, where maintaining integrity of data analysis is critical.

Speculation on Future Developments

Looking forward, implementing mForest in native code and exploiting parallel processing could further speed up the algorithm and ease its adoption in large-scale analyses. Moreover, extending mForest to incorporate multivariate unsupervised splitting may yield additional performance gains.

In the broader field of AI and machine learning, the work underscores the necessity of adaptive algorithms capable of maintaining predictive reliability amidst incomplete datasets, a recurrent challenge across various scientific and practical domains. Future studies may investigate integrating RF imputation approaches with neural network architectures to address highly nonlinear data relationships further.

In conclusion, the paper delivers rigorous empirical evaluations, offering valuable insights and practical advancements in applying RF algorithms for missing data imputation, thereby contributing significantly to both academic research and practical machine learning applications.
