An Evaluation of Data Cleaning on Machine Learning Classification Tasks: A Detailed Examination of the CleanML Study
The paper "CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks" undertakes a comprehensive examination of how data cleaning processes influence the performance of machine learning (ML) models on classification tasks. It addresses a significant gap between the ML and database (DB) research communities, which have traditionally treated data cleaning and model robustness to dirty data as largely separate concerns.
Context and Motivation
Data quality is a critical factor in the performance of ML models. Despite this, rigorous studies quantifying the impact of data cleaning on downstream ML tasks have been scarce. Previous work in the ML community has focused on developing algorithms robust to certain types of noise, while the DB community has explored data cleaning methods without considering their effect on ML performance. The authors argue that understanding the intersection of these two areas is crucial, given that data scientists spend a significant portion of their time preparing data through cleaning.
Methodology
The paper introduces CleanML, an extensible and open-source framework designed to systematically evaluate the impact of cleaning on ML models. The investigation includes:
- Datasets and Error Types: The paper uses 14 real-world datasets, capturing five common error types: missing values, outliers, duplicates, inconsistencies, and mislabels.
- ML Models and Cleaning Methods: Seven popular classification algorithms are evaluated, alongside multiple cleaning methods, ranging from simple practical solutions to state-of-the-art approaches found in academic literature.
- Experimental Design: The paper employs statistical hypothesis testing to control for experimental randomness and applies the Benjamini-Yekutieli procedure to control the false discovery rate across the many comparisons.
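The false-discovery-rate step can be sketched in plain Python. This is a minimal illustration of the Benjamini-Yekutieli procedure itself (with its correction term for arbitrary dependence among tests), not CleanML's actual implementation:

```python
def benjamini_yekutieli(pvals, alpha=0.05):
    """Return a boolean list: True where the null hypothesis is rejected
    under the Benjamini-Yekutieli FDR-controlling procedure."""
    m = len(pvals)
    # Correction factor c(m) = sum_{i=1}^m 1/i, valid under arbitrary dependence
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of p-values in ascending order
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= k * alpha / (m * c(m))
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject all hypotheses with rank <= k_max
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            reject[idx] = True
    return reject

# With four tests, only the smallest p-value clears the corrected threshold:
benjamini_yekutieli([0.001, 0.2, 0.03, 0.9])  # [True, False, False, False]
```

The correction term `c_m` is what distinguishes Benjamini-Yekutieli from the more common Benjamini-Hochberg procedure; it makes the guarantee hold even when the test statistics are dependent, at the cost of being more conservative.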
Key Findings
The research uncovers several significant insights:
- Impact of Cleaning: The impact of cleaning on ML models is highly dependent on the specific error type and dataset. Cleaning missing values and mislabels generally yielded positive or insignificant impacts, whereas cleaning outliers and duplicates often produced insignificant improvements or, in some cases, degraded performance.
- Model and Cleaning Algorithm Selection: Performing model selection and cleaning-algorithm selection before committing to a cleaning procedure often reduced negative impacts and, in some cases, further improved ML model performance.
- Dataset-Specific Variability: The effect of cleaning varied significantly across datasets, underscoring the importance of dataset-specific considerations in the data cleaning process.
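The selection finding above can be sketched as a simple search over (cleaning method, model) pairs scored on held-out data. The method and model names, and the scores, below are hypothetical illustrations, not results from the paper:

```python
def select_pipeline(validation_scores):
    """Pick the (cleaning method, model) pair with the best held-out score.

    validation_scores maps (cleaner, model) -> validation accuracy.
    """
    return max(validation_scores, key=validation_scores.get)

# Illustrative scores only; in practice these come from cross-validation
# on the dataset at hand.
scores = {
    ("drop_missing", "logistic_regression"): 0.81,
    ("impute_mean", "logistic_regression"): 0.84,
    ("drop_missing", "random_forest"): 0.86,
    ("impute_mean", "random_forest"): 0.83,
}
best = select_pipeline(scores)  # ("drop_missing", "random_forest")
```

The point of the sketch is that the best cleaning method depends on the downstream model (and vice versa), so choosing them jointly on validation data, rather than fixing one in advance, is what guards against cleaning that hurts performance.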
Comparative Analysis
The paper also compares traditional robust ML approaches with the CleanML approach, indicating that a thorough data cleaning process can outperform or complement robust ML methods. Additionally, human cleaning, where feasible, was often superior to automated cleaning processes, highlighting the potential benefits of expert involvement in data quality assurance.
Implications and Future Directions
The findings have both theoretical and practical implications. Practically, the CleanML framework provides a foundation for data scientists to make informed decisions about data cleaning strategies. Theoretically, the paper suggests several avenues for future research, including the development of improved cleaning algorithms tailored specifically for ML tasks and the establishment of a more robust theoretical framework for understanding the interaction between data cleaning and ML.
In conclusion, this work represents a substantial contribution to the understanding of how data quality interventions affect machine learning outcomes, encouraging a more integrated approach between data management and machine learning model development. The CleanML paper lays the groundwork for future advancements in both applied data science practices and theoretical research in the field.