
Questionable practices in machine learning (2407.12220v2)

Published 17 Jul 2024 in cs.LG, cs.CL, and cs.CY

Abstract: Evaluating modern ML models is hard. The strong incentive for researchers and companies to report a state-of-the-art result on some metric often leads to questionable research practices (QRPs): bad practices which fall short of outright research fraud. We describe 44 such practices which can undermine reported results, giving examples where possible. Our list emphasises the evaluation of LLMs on public benchmarks. We also discuss "irreproducible research practices", i.e. decisions that make it difficult or impossible for other researchers to reproduce, build on or audit previous research.


Summary

  • The paper identifies and categorizes 44 questionable research practices in machine learning evaluation, grouping them into contamination, cherrypicking, and misreporting.
  • These practices undermine the reliability and veracity of reported results, contributing to the reproducibility crisis and risking public trust in AI technologies.
  • Addressing these practices requires adopting open science, independent evaluation frameworks, and aligning research incentives with scientific diligence.

Overview of "Questionable Practices in Machine Learning"

The paper "Questionable Practices in Machine Learning" by Gavin Leech and collaborators rigorously examines various practices in ML that can potentially undermine the reliability and veracity of reported results. The paper aims to catalog and elucidate 44 questionable research practices (QRPs) within the domain, particularly emphasizing the evaluation of LLMs on public benchmarks. This work provides a crucial taxonomy of QRPs and irreproducible practices that may cause scientific and methodological issues in ML research.

Main Contributions

  1. Identification of QRPs: The authors categorize QRPs into three primary families: contamination, cherrypicking, and misreporting. Each category encapsulates several sub-practices that could skew the interpretation of ML model performance.
    • Contamination includes improper use of test data during training or evaluation, which can severely compromise a model's purported generalization strength (see the overlap-check sketch after this list).
    • Cherrypicking involves selective reporting or optimization, wherein researchers might unintentionally or deliberately present only the most favorable results (the simulation after this list shows how this inflates scores).
    • Misreporting refers to various forms of data presentation and claims that mislead regarding the actual capability and novelty of an ML method.
  2. Irreproducible Research Practices (IRPs): Beyond QRPs, the paper also discusses practices that hinder reproducibility, such as dataset hiding, which prevents external validation and audit of ML results.
  3. Mitigation Strategies: While primarily diagnostic in nature, the paper suggests mitigations such as using standard evaluation harnesses, monitoring contamination, keeping benchmark test sets private, and enforcing strict reporting standards.
  4. Theoretical and Practical Implications: The taxonomy emphasizes scientific integrity and methodological robustness in ML research. In practice, these questionable practices pose a substantial risk to the trustworthiness of reported model performance, with extensive repercussions in both academic and industrial applications of AI.
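
To make the contamination point concrete, below is a minimal sketch (ours, not the paper's) of the kind of word-level n-gram overlap test commonly used to flag test examples that also appear in a training corpus. The n-gram size and overlap threshold are illustrative assumptions, not values from the paper.

```python
# Hypothetical n-gram overlap contamination check; n and threshold are
# illustrative choices, not values taken from the paper.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_example: str, training_corpus: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a test example whose n-grams heavily overlap any training document."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return False
    for doc in training_corpus:
        overlap = len(test_grams & ngrams(doc, n)) / len(test_grams)
        if overlap >= threshold:
            return True  # likely seen during training
    return False

# A test question copied verbatim into the training data is flagged.
train = ["The quick brown fox jumps over the lazy dog near the riverbank at dawn."]
test = "the quick brown fox jumps over the lazy dog near the riverbank at dawn."
print(is_contaminated(test, train))  # True
```

Real contamination audits run at corpus scale with hashed n-gram indices rather than pairwise comparisons, but the principle is the same.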

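As a companion, a small simulation (again ours, not the paper's) shows why reporting only the best of several random seeds, one form of cherrypicking, systematically inflates a benchmark score relative to the honest mean over runs.

```python
# Illustrative best-of-N-seeds simulation; the accuracy, benchmark size,
# and seed count are hypothetical.
import random
import statistics

random.seed(0)
true_accuracy = 0.70  # the model's actual per-item accuracy
n_items = 500         # benchmark size
n_seeds = 10          # number of evaluation runs

def run_once() -> float:
    """One run: each item is answered correctly with probability true_accuracy."""
    return sum(random.random() < true_accuracy for _ in range(n_items)) / n_items

scores = [run_once() for _ in range(n_seeds)]
print(f"mean over seeds: {statistics.mean(scores):.3f}")  # honest estimate, close to 0.70
print(f"best of seeds:   {max(scores):.3f}")              # cherrypicked, reliably higher
```

In this toy setting the best-of-ten score typically sits a few percentage points above the mean, enough to manufacture a spurious state-of-the-art on a crowded leaderboard.
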
Discussion and Future Directions

This paper opens a dialogue on the integrity of machine learning evaluation practices. By recognizing and cataloging these questionable practices, the work underlines the pressing need for the ML community to address the systemic incentives that produce them. Crucially, QRPs and IRPs can skew scientific findings and fuel the reproducibility crisis, stalling scientific progress and eroding public trust in AI technologies.

Future developments in AI may necessitate:

  • The wider adoption of open science practices, including the use of open datasets and code transparency, to improve reproducibility.
  • Establishing independent auditing and evaluation frameworks to validate and certify the results.
  • Shifting incentives in academia and industry to align more closely with scientific diligence rather than performance metrics alone.

Conclusion

"Questionable Practices in Machine Learning" serves as a comprehensive critique of methodological shortcomings within ML research, particularly in the context of evaluating LLMs. The paper’s well-defined QRPs and IRPs highlight systemic issues that demand community-wide solutions to bolster the field's scientific rigor. By understanding and preemptively addressing these practices, researchers can ensure robust and transparent advancement of AI technologies. This proactive approach will be central to ensuring that future developments in AI are both scientifically sound and societally beneficial.
