
Leakage and the Reproducibility Crisis in ML-based Science (2207.07048v1)

Published 14 Jul 2022 in cs.LG, cs.AI, and stat.ME

Abstract: The use of ML methods for prediction and forecasting has become widespread across the quantitative sciences. However, there are many known methodological pitfalls, including data leakage, in ML-based science. In this paper, we systematically investigate reproducibility issues in ML-based science. We show that data leakage is indeed a widespread problem and has led to severe reproducibility failures. Specifically, through a survey of literature in research communities that adopted ML methods, we find 17 fields where errors have been found, collectively affecting 329 papers and in some cases leading to wildly overoptimistic conclusions. Based on our survey, we present a fine-grained taxonomy of 8 types of leakage that range from textbook errors to open research problems. We argue for fundamental methodological changes to ML-based science so that cases of leakage can be caught before publication. To that end, we propose model info sheets for reporting scientific claims based on ML models that would address all types of leakage identified in our survey. To investigate the impact of reproducibility errors and the efficacy of model info sheets, we undertake a reproducibility study in a field where complex ML models are believed to vastly outperform older statistical models such as Logistic Regression (LR): civil war prediction. We find that all papers claiming the superior performance of complex ML models compared to LR models fail to reproduce due to data leakage, and complex ML models don't perform substantively better than decades-old LR models. While none of these errors could have been caught by reading the papers, model info sheets would enable the detection of leakage in each case.

An Examination of Data Leakage and Reproducibility Failures in ML-based Science

The paper by Kapoor and Narayanan addresses a critical issue confronting the integration of ML into scientific research: a reproducibility crisis caused by data leakage. As ML methods increasingly permeate diverse scientific fields, the authors systematically investigate the reproducibility problems that emerge from methodological pitfalls, focusing on data leakage as a prevalent issue. Through a rigorous literature survey spanning 17 research fields, they identify 329 affected papers whose errors led to exaggerated claims of ML model performance. The paper catalogues eight distinct types of data leakage and proposes mechanisms to improve the reliability of ML-based scientific claims.

Key Findings

Prevalence and Types of Data Leakage

The paper documents widespread reproducibility issues due to data leakage, which arises when a spurious relationship between the predictors and the target variable is introduced by the way the data are collected, sampled, or preprocessed. Every surveyed field that adopted ML methods, including medicine, bioinformatics, toxicology, and computer security, exhibits leakage issues. The taxonomy categorizes leakage by the underlying methodological error, such as the following (a minimal code sketch of the first category appears after the list):

  • Lack of Clean Train-Test Separation: mistakes such as having no test set at all, performing preprocessing or feature selection on the combined train and test data, or allowing duplicates to appear in both splits.
  • Illegitimate Features: models trained on features that are proxies for, or otherwise too closely tied to, the target variable.
  • Distribution Mismatch: the test set does not reflect the distribution about which the scientific claim is made, for instance through temporal leakage or sampling bias.
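
To make the first category concrete, here is a minimal sketch (not from the paper) contrasting a leaky pipeline, in which feature selection sees the entire dataset including the test labels, with a leak-free pipeline in which every fitted step is trained only on the training split. It uses scikit-learn with synthetic data; the dataset sizes and parameter values are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic high-dimensional data with only a few informative features.
X, y = make_classification(n_samples=200, n_features=500, n_informative=5, random_state=0)

# Leaky: feature selection is fit on ALL labels (train and test) before the split.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, random_state=0)
leaky = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("leaky test accuracy (optimistically biased):", leaky.score(X_te, y_te))

# Leak-free: split first; every fitted step lives inside a pipeline trained on the training split only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)
print("leak-free test accuracy:", clean.score(X_te, y_te))
```

The gap between the two printed scores illustrates how preprocessing or feature selection on combined data can manufacture performance that disappears once the test set is held out properly.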

Proposal of Model Info Sheets

To combat the identified issues, Kapoor and Narayanan propose the implementation of model info sheets as a means of documenting scientific claims made with ML models. The model info sheets serve as a standardized tool for researchers to ensure that their analysis process correctly identifies and prevents leakage, thereby improving transparency and enabling more thorough peer-review.

Empirical Case Study in Civil War Prediction

An empirical analysis of civil war prediction underscores the significance of addressing data leakage. Despite claims that complex ML models outperform logistic regression (LR), none of the papers asserting superior performance withstood scrutiny once their leakage issues were corrected. After correction, the complex ML models showed no substantive advantage over LR, calling into question the previously reported merits of ML in this context and highlighting the unchecked optimism in fields adopting predictive paradigms. The case exemplifies the need for stringent methodological rigor and transparency, and it motivates model info sheets as a tool for catching leakage before publication.
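
As an illustration of the kind of evaluation involved, the sketch below (hypothetical, not the authors' replication code) scores a forecasting model on years that come strictly after the training years; a random split would mix past and future observations and introduce temporal leakage. The DataFrame, column names, values, and cutoff year are invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy year-indexed data; "onset" marks whether a conflict began that year.
df = pd.DataFrame({
    "year":       [1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997],
    "gdp_growth": [ 2.1, -0.3,  1.4,  0.8, -1.2,  3.0,  0.5, -2.4],
    "onset":      [   0,    1,    0,    0,    1,    0,    0,    1],
})

# Temporal split: train on earlier years, evaluate on strictly later years.
train, test = df[df.year <= 1994], df[df.year > 1994]
model = LogisticRegression().fit(train[["gdp_growth"]], train["onset"])
print("out-of-time accuracy:", model.score(test[["gdp_growth"]], test["onset"]))

# A random train_test_split over these rows would let future years inform
# predictions about the past, i.e., temporal leakage.
```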

Implications and Future Directions

Kapoor and Narayanan’s work initiates a broader discussion about improving scientific practice as ML is integrated into research. Addressing reproducibility and establishing standards for computational reproducibility are central to advancing the credibility and reliability of ML-based scientific research. Practical measures, such as the adoption of standardized protocols and robust reporting tools, can help fields move past the over-optimism induced by ML "hype".

The taxonomy and model info sheet proposals by Kapoor and Narayanan provide researchers with actionable insights to detect and prevent leakage, enhancing the robustness of scientific claims. Moving forward, cross-disciplinary collaboration and widespread adoption of these practices could mitigate the reproducibility crisis in ML-based science, fostering more reliable and substantiated advancements across fields.

Authors (2)
  1. Sayash Kapoor (23 papers)
  2. Arvind Narayanan (48 papers)
Citations (165)