On the consistency of supervised learning with missing values (1902.06931v5)

Published 19 Feb 2019 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: In many application settings, the data have missing entries, which makes analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing different missing-value strategies in trees, both theoretically and empirically, we recommend the "missing incorporated in attribute" method, as it can handle both non-informative and informative missing values.

Authors (5)
  1. Julie Josse (61 papers)
  2. Nicolas Prost (2 papers)
  3. Erwan Scornet (35 papers)
  4. Gaël Varoquaux (87 papers)
  5. Jacob M. Chen (4 papers)
Citations (103)

Summary

  • The paper demonstrates that using test-time multiple imputation yields asymptotically optimal predictions in the presence of missing values.
  • The paper shows that imputing missing values with a constant is a simple yet consistent strategy for maintaining prediction accuracy.
  • The decision tree approach with MIA leverages missing data as an informative feature, enhancing empirical risk minimization.

Consistency of Supervised Learning with Missing Values

The paper "On the consistency of supervised learning with missing values" investigates the challenges and approaches in handling missing values within supervised learning frameworks. Traditional methods have primarily focused on estimating model parameters despite incomplete datasets, but this paper emphasizes prediction accuracy when missing values are present in both the training and test datasets.

The authors establish the consistency of two approaches: single imputation with a constant applied before learning, and multiple imputation at test time. A compelling result is that imputing missing values with a constant, such as the mean, is consistent for prediction, in contrast to inferential settings, where this approach is discouraged for distorting the data distribution.
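As a minimal sketch of the single-imputation strategy (using scikit-learn on hypothetical synthetic data, not the authors' code), the key point is that the imputation constant, learned on the training set, is reused unchanged at test time:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
X[rng.random(X.shape) < 0.2] = np.nan  # ~20% of entries missing at random

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The constant (each feature's training mean) is fitted on the training set
# and applied identically to the test set -- the train/test alignment the
# consistency result relies on.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestRegressor(random_state=0))
model.fit(X_train, y_train)
print("R^2 on incomplete test data:", model.score(X_test, y_test))
```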

The paper analyzes the "Missing Incorporated in Attribute" (MIA) method for decision trees performing empirical risk minimization, highlighting its ability to handle both informative and non-informative missing data. Rather than imputing beforehand, MIA incorporates missingness into the splits themselves: at each candidate split, missing values can be routed to either child or isolated on their own side, letting the tree exploit missingness directly during partitioning.
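As a concrete illustration, the sketch below emulates MIA with a standard scikit-learn tree by duplicating each incomplete feature, once with missing entries encoded as a very low sentinel and once as a very high one, so that splits on the sentinels act as "is missing?" tests. The synthetic data, sentinel values, and the helper `mia_expand` are illustrative assumptions, not the paper's code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mia_expand(X, low=-1e10, high=1e10):
    """Duplicate each feature, encoding NaN once as a very low value and
    once as a very high value, so a standard tree can route missing
    entries to either side of any split."""
    X_low, X_high = X.copy(), X.copy()
    X_low[np.isnan(X_low)] = low
    X_high[np.isnan(X_high)] = high
    return np.hstack([X_low, X_high])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
missing = rng.random(500) < 0.3
X[missing, 0] = np.nan        # missingness on feature 0 ...
y = X[:, 1] + 2.0 * missing   # ... is informative: it shifts the target

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(mia_expand(X), y)    # sentinel splits let the tree use missingness
```

In a similar spirit, scikit-learn's HistGradientBoosting models handle NaN natively by learning, at each split, which child the missing samples should follow.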

Key Insights

  1. Test-Time Imputation Strategies: The paper underscores that multiple imputation at test time is a reliable approach when the goal is prediction with missing data. Conditional multiple imputation integrates the uncertainty about the missing values, yielding asymptotically optimal predictions (a sketch follows this list).
  2. Constant Imputation Consistency: The paper reveals that imputing missing values with a constant is consistent in a predictive context, aligning the imputation strategy between training and testing phases. This finding offers a practical and simple handling technique for missing values that maintains prediction accuracy, contrary to its criticized use in inferential statistics.
  3. Decision Trees with MIA: Decision trees employing the MIA strategy emerge as a preferred choice for managing missing data, as they facilitate the incorporation of missingness as an informative feature. This approach strategically leverages the nature of missing data to enhance predictive modeling.
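
As referenced in the first insight above, here is a minimal sketch of test-time multiple imputation with a predictor fit on complete data; the synthetic data, the use of IterativeImputer with sample_posterior=True, and the 20 draws are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X_full = rng.normal(size=(1000, 4))
y = X_full @ np.array([1.0, -0.5, 0.25, 0.0]) + rng.normal(scale=0.1, size=1000)

# A predictor suited for complete observations.
model = RandomForestRegressor(random_state=0).fit(X_full, y)

# Fit stochastic imputers on data exhibiting the missingness pattern;
# sample_posterior=True draws from the conditional distribution of the
# missing entries instead of returning a single point estimate.
X_miss = X_full.copy()
X_miss[rng.random(X_miss.shape) < 0.2] = np.nan
imputers = [IterativeImputer(sample_posterior=True, random_state=s).fit(X_miss)
            for s in range(20)]

# New, incomplete test data.
X_test = rng.normal(size=(100, 4))
X_test[rng.random(X_test.shape) < 0.2] = np.nan

# Average the complete-data predictor over the imputation draws.
y_hat = np.mean([model.predict(imp.transform(X_test)) for imp in imputers],
                axis=0)
```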

Implications for Practice and Theory

  • Practical Applications: The demonstrated consistency of constant imputation provides practitioners with an effective tool that is easily implemented and computationally efficient, allowing existing machine learning pipelines to adapt to missing data scenarios smoothly.
  • Theoretical Contributions: The paper contributes to theoretical understanding by linking imputation strategies directly to their impact on prediction loss, rather than merely data distribution, thus setting a foundation for further studies into the consistency of other straightforward imputation techniques.
  • Future Research Directions: The work invites exploration into additional models and methods that natively integrate missing data as an inherent feature rather than preprocessing it through imputation alone. Furthermore, it suggests advancing towards methods that simultaneously consider parameter estimation and prediction consistency in missing data contexts.

Overall, this paper provides valuable insights into the effectiveness of imputation methods in the presence of missing data, offering substantial evidence for the adoption of constant imputation and of tree-based models using MIA in supervised learning tasks. These contributions both bridge gaps in the existing literature and establish a practical framework for machine learning with incomplete datasets.
