HyperImpute: Generalized Iterative Imputation with Automatic Model Selection (2206.07769v1)

Published 15 Jun 2022 in stat.ML and cs.LG

Abstract: Consider the problem of imputing missing values in a dataset. On the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose HyperImpute, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.

Citations (59)


Summary

The paper introduces HyperImpute, a framework formalizing generalized iterative imputation with automatic model and hyperparameter selection to handle missing data.
HyperImpute provides a practical and extensible implementation with out-of-the-box components for reproducible empirical imputation research.
Empirical evaluations show HyperImpute achieves superior imputation performance compared to established benchmarks, especially with high missingness rates.

HyperImpute: Generalized Iterative Imputation with Automatic Model Selection

Missing data is a persistent challenge in the analysis of real-world datasets. The problem of imputing missing values is critical as it affects downstream data analysis and modeling. This paper introduces HyperImpute, a framework that leverages the strengths of both conventional iterative imputation techniques and advanced deep generative modeling approaches.

The Problem and Background

Imputation involves estimating missing values within incomplete datasets. Traditional iterative imputation methods fit a separate conditional model for each feature given the others, then refine the imputed values in a round-robin fashion until convergence. These methods are simple and customizable, but the model for each column must be specified appropriately by hand, which is time-consuming. Conversely, recent deep generative models benefit from the capacity and efficiency of neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions.
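The round-robin procedure described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `iterative_impute`, the mean initialization, the fixed linear model per column, and the stopping tolerance are all assumptions chosen for brevity, and the sketch assumes every column has at least one observed value.

```python
import numpy as np

def iterative_impute(X, n_iters=10, tol=1e-6):
    """Round-robin imputation with one linear model per column (illustrative only)."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)                          # True where a value is missing
    col_means = np.nanmean(X, axis=0)
    X_imp = np.where(mask, col_means, X)        # initial fill: column means
    for _ in range(n_iters):
        X_prev = X_imp.copy()
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any() or miss.all():
                continue                        # nothing to impute or nothing to fit on
            others = np.delete(X_imp, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit column j on the other columns using rows where j is observed.
            coef, *_ = np.linalg.lstsq(A[~miss], X_imp[~miss, j], rcond=None)
            # Overwrite only the missing entries of column j.
            X_imp[miss, j] = A[miss] @ coef
        if np.max(np.abs(X_imp - X_prev)) < tol:
            break                               # imputations have stabilized
    return X_imp

# Usage on a small synthetic matrix with a few hidden entries.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X_demo = np.column_stack([x, 2 * x + 1, -x + 3])
X_demo[::4, 0] = np.nan
X_filled = iterative_impute(X_demo)
```

The fixed linear model is exactly the kind of per-column choice that a practitioner would otherwise have to make by hand for every variable.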

HyperImpute Framework

HyperImpute attempts to overcome the limitations of both paradigms by integrating automatic model selection within an iterative imputation framework. It adapts to varying missing data patterns without assuming independence between observed and unobserved data, an assumption for which generative models have been criticized. For each feature, HyperImpute configures a column-wise model, selecting the model class and its hyperparameters automatically via AutoML techniques inside the iterative imputation cycle. Because every column gets its own model, mixed data types can be handled, and the configuration is tuned as part of the imputation loop rather than in a separate manual step.
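To make the idea of column-wise model selection concrete, the sketch below picks, for a single target column, the candidate with the lowest cross-validated error on the observed rows. It is a hedged illustration only: the three candidates (`fit_mean`, `fit_linear`, `fit_knn`), the `select_model` helper, and the use of K-fold RMSE as the selection criterion are assumptions made here, not the search procedure or learner pool used by HyperImpute.

```python
import numpy as np

def fit_mean(A, y):
    mu = y.mean()
    return lambda B: np.full(len(B), mu)

def fit_linear(A, y):
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    return lambda B: np.column_stack([np.ones(len(B)), B]) @ coef

def fit_knn(A, y, k=5):
    def predict(B):
        # Squared Euclidean distance from each query row to each training row.
        d = ((B[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

# Hypothetical candidate pool; a real system would search a richer space.
CANDIDATES = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}

def select_model(A, y, n_folds=5, seed=0):
    """Return (best_name, scores) using K-fold cross-validated RMSE."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = {}
    for name, fit in CANDIDATES.items():
        fold_mse = []
        for held in folds:
            train = np.setdiff1d(np.arange(len(y)), held)
            pred = fit(A[train], y[train])(A[held])
            fold_mse.append(np.mean((pred - y[held]) ** 2))
        scores[name] = float(np.sqrt(np.mean(fold_mse)))
    return min(scores, key=scores.get), scores

# Usage: A holds the other columns (observed rows only), y the target column.
rng = np.random.default_rng(0)
A_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(3 * A_demo[:, 0])
best_name, cv_scores = select_model(A_demo, y_demo)
```

In an iterative imputer, a selection step of this kind would run once per column, so different columns can end up with different model classes.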

Key Contributions

Formalization of Imputation Protocol: HyperImpute formalizes generalized iterative imputation, in which the model for each column and its hyperparameters are selected automatically.
Practical Implementation: The paper provides a robust implementation featuring out-of-the-box learners, optimizers, simulators, and extensible interfaces, thereby easing reproducibility in empirical imputation research.
Empirical Evaluation: HyperImpute showcases superior performance relative to established benchmarks across varied experimental settings. Its adaptability in selecting optimal models is evidenced through comprehensive sensitivity analyses, highlighting its efficacy and broad applicability.

Empirical Investigation

Experiments compare HyperImpute’s performance against popular benchmarks including GAIN, MIWAE, MissForest, and others. HyperImpute consistently produces more accurate imputations, with the largest gains in scenarios with higher missingness rates. Ablation studies indicate significant contributions from its model selection capabilities, adaptive iteration strategies, and variety of base learners. Inspecting which models are selected shows that the preferred learner depends on dataset characteristics, which illustrates how HyperImpute adapts to the data at hand.
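Comparisons of this kind are typically run by hiding known values from a complete dataset, imputing them, and scoring only the hidden entries. The sketch below shows that protocol under stated assumptions: the helper names (`ampute_mcar`, `masked_rmse`, `mean_impute`), the completely-at-random masking, and the RMSE metric are illustrative choices made here, and the summary above does not specify the exact mechanisms or metrics used in the paper.

```python
import numpy as np

def ampute_mcar(X, rate, seed=0):
    """Hide each entry independently with probability `rate` (missing completely at random)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < rate
    X_miss = X.astype(float).copy()
    X_miss[mask] = np.nan
    return X_miss, mask

def masked_rmse(X_true, X_imputed, mask):
    """RMSE computed only over the entries that were hidden."""
    diff = (X_true - X_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_impute(X_miss):
    """Baseline imputer: replace each missing entry with its column mean."""
    return np.where(np.isnan(X_miss), np.nanmean(X_miss, axis=0), X_miss)

# Usage: score a baseline imputer at several missingness rates.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(500, 4))
results = {}
for rate in (0.1, 0.3, 0.5):
    X_miss, mask = ampute_mcar(X_true, rate)
    results[rate] = masked_rmse(X_true, mean_impute(X_miss), mask)
```

Any imputer with the same input and output shape can be substituted for `mean_impute` to compare methods on identical masks.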

Implications and Future Directions

HyperImpute’s approach enriches the study of imputation by validating iterative methods when they are paired with adaptive, automatic model selection. The framework sets a precedent for combining iterative procedures with automated configuration to deliver scalable, high-fidelity imputations across diverse data scenarios. Future research may incorporate causal mechanisms and explore applications in more complex, nonlinear imputation contexts.

Conclusion

The paper defends a well-configured iterative imputation paradigm as performant and practical for real-world applications. HyperImpute exemplifies an implementation in which adaptability, automatic optimization, and methodological flexibility come together. It serves as a strong baseline for future work on hybrid imputation methods that combine deep learning advances with traditional statistical insights.
