
Questionable practices in machine learning (2407.12220v2)

Published 17 Jul 2024 in cs.LG, cs.CL, and cs.CY

Abstract: Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of LLMs on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.


Summary

  • The paper identifies and categorizes 44 questionable research practices in machine learning evaluation, grouping them into contamination, cherrypicking, and misreporting.
  • These practices undermine the reliability and veracity of reported results, contributing to the reproducibility crisis and risking public trust in AI technologies.
  • Addressing these practices requires adopting open science, independent evaluation frameworks, and aligning research incentives with scientific diligence.

Overview of "Questionable Practices in Machine Learning"

The paper "Questionable Practices in Machine Learning" by Gavin Leech and collaborators rigorously examines various practices in ML that can potentially undermine the reliability and veracity of reported results. The paper aims to catalog and elucidate 44 questionable research practices (QRPs) within the domain, particularly emphasizing the evaluation of LLMs on public benchmarks. This work provides a crucial taxonomy of QRPs and irreproducible practices that may cause scientific and methodological issues in ML research.

Main Contributions

  1. Identification of QRPs: The authors categorize QRPs into three primary families: contamination, cherrypicking, and misreporting. Each category encapsulates several sub-practices that could skew the interpretation of ML model performance.
    • Contamination includes improper use of test data during training or evaluation, which can severely compromise a model's purported generalization strength (see the overlap-check sketch after this list).
    • Cherrypicking involves selective reporting or optimization, wherein researchers might unintentionally or deliberately present only the most favorable results (the simulation after this list shows how this inflates scores).
    • Misreporting refers to various forms of data presentation and claims that mislead regarding the actual capability and novelty of an ML method.
  2. Irreproducible Research Practices (IRPs): Beyond QRPs, the paper also discusses practices that hinder reproducibility, such as dataset hiding, which prevents external validation and audit of ML results.
  3. Mitigation Strategies: While primarily diagnostic in nature, the paper suggests mitigations such as using standard evaluation harnesses, monitoring contamination, keeping benchmark test sets private, and enforcing strict reporting standards.
  4. Theoretical and Practical Implications: The taxonomy emphasizes scientific integrity and methodological robustness in ML research. In practice, these questionable practices pose a substantial risk to the trustworthiness of reported model performance, with extensive repercussions in both academic and industrial applications of AI.
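
To make the contamination point concrete, below is a minimal sketch (ours, not the paper's) of the kind of word-level n-gram overlap test commonly used to flag test examples that also appear in a training corpus. The n-gram size and overlap threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical n-gram overlap contamination check; n and threshold are
# illustrative choices, not values taken from the paper.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, training_corpus: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a test example whose n-grams heavily overlap any training document."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False
    for doc in training_corpus:
        overlap = len(test_grams & ngrams(doc, n)) / len(test_grams)
        if overlap >= threshold:
            return True  # likely seen during training
    return False

# A test question copied verbatim into the training data is flagged.
train = ["The quick brown fox jumps over the lazy dog near the riverbank at dawn."]
test = "the quick brown fox jumps over the lazy dog near the riverbank at dawn."
print(is_contaminated(test, train))  # True
```

Real contamination audits run at corpus scale with hashed n-gram indices rather than pairwise comparisons, but the principle is the same.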

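As a companion, a small simulation (again ours, not the paper's) shows why reporting only the best of several random seeds, one form of cherrypicking, systematically inflates a benchmark score relative to the honest mean over runs.

```python
# Illustrative best-of-N-seeds simulation; the accuracy, benchmark size,
# and seed count are hypothetical.
import random
import statistics

random.seed(0)
true_accuracy = 0.70  # the model's actual per-item accuracy
n_items = 500         # benchmark size
n_seeds = 10          # number of evaluation runs

def run_once() -> float:
    """One run: each item is answered correctly with probability true_accuracy."""
    return sum(random.random() < true_accuracy for _ in range(n_items)) / n_items

scores = [run_once() for _ in range(n_seeds)]
print(f"mean over seeds: {statistics.mean(scores):.3f}")  # honest estimate, close to 0.70
print(f"best of seeds:   {max(scores):.3f}")              # cherrypicked, reliably higher
```

In this toy setting the best-of-ten score typically sits a few percentage points above the mean, enough to manufacture a spurious state-of-the-art on a crowded leaderboard.
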
Discussion and Future Directions

This paper opens a dialogue on the integrity of machine learning evaluation practices. By recognizing and cataloging these questionable practices, the work underlines the pressing need for the ML community to address the systemic incentives that produce them. Crucially, QRPs and IRPs can skew scientific findings and fuel the reproducibility crisis, stalling scientific progress and eroding public trust in AI technologies.

Future developments in AI may necessitate:

  • The wider adoption of open science practices, including the use of open datasets and code transparency, to improve reproducibility.
  • Establishing independent auditing and evaluation frameworks to validate and certify the results.
  • Shifting incentives in academia and industry to align more closely with scientific diligence rather than performance metrics alone.

Conclusion

"Questionable Practices in Machine Learning" serves as a comprehensive critique of methodological shortcomings within ML research, particularly in the context of evaluating LLMs. The paper’s well-defined QRPs and IRPs highlight systemic issues that demand community-wide solutions to bolster the field's scientific rigor. By understanding and preemptively addressing these practices, researchers can ensure robust and transparent advancement of AI technologies. This proactive approach will be central to ensuring that future developments in AI are both scientifically sound and societally beneficial.
