How to avoid machine learning pitfalls: a guide for academic researchers (2108.02497v5)

Published 5 Aug 2021 in cs.LG

Abstract: Mistakes in machine learning practice are commonplace, and can result in a loss of confidence in the findings and products of machine learning. This guide outlines common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.

Summary

  • The paper establishes a comprehensive framework to avoid common ML pitfalls by emphasizing rigorous data analysis and proper model evaluation.
  • It highlights the importance of strict dataset partitioning and preventing test data leakage to ensure reliable performance metrics.
  • The guide advocates transparency and reproducibility in research by urging detailed reporting and fair statistical comparisons in model assessments.

Avoiding Machine Learning Pitfalls: A Comprehensive Guide for Researchers

In ML research, it is easy to fall into pitfalls that lead to unreliable models and erroneous conclusions. The paper "How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers" by Michael A. Lones provides a framework for navigating the stages of ML development, from data preparation through reporting, and addresses the errors most commonly encountered in academic research. This essay synthesizes the key points of the paper and their implications for improving the rigor and reliability of ML research.

Key Considerations Before Model Building

The paper emphasizes the importance of the data preparation phase, stressing that a thorough understanding of the dataset is the foundation of any ML project. Researchers are urged to scrutinize the provenance and quality of their data sources and to carry out exploratory data analysis to surface issues such as missing values or class imbalance before model training. Engaging with domain experts can also help refine the research goals and ensure the relevance of the results.
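A minimal sketch of such pre-modeling checks is shown below, using pandas; the file name data.csv and the label column are hypothetical placeholders for an arbitrary tabular classification dataset, not details from the paper.

```python
# A minimal exploratory-data-analysis sketch; "data.csv" and the "label"
# column are placeholder names for an arbitrary tabular dataset.
import pandas as pd

df = pd.read_csv("data.csv")

# Column types and non-null counts: a quick view of data quality.
df.info()

# Fraction of missing values per column, worst first.
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicate rows can silently inflate apparent performance later.
print(df.duplicated().sum(), "duplicate rows")

# Class proportions: a strong imbalance changes metric and sampling choices.
print(df["label"].value_counts(normalize=True))
```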

Ensuring Reliable Model Building

A significant pitfall the paper highlights is leakage of test data into the training process, which inadvertently inflates reported model performance. To prevent this, researchers are advised to partition datasets carefully, using validation sets for tuning and reserving test sets strictly for the final evaluation. The paper also reiterates the importance of trying a range of models rather than defaulting to complex ones, such as deep neural networks, when data are limited, and it discusses hyperparameter optimization and the role of cross-validation in assessing model robustness.
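A hedged scikit-learn sketch of leakage-safe partitioning and tuning is given below; the feature matrix X, labels y, and the choice of a random forest with a small grid are illustrative assumptions, not prescriptions from the paper.

```python
# A sketch of leakage-safe partitioning and hyperparameter tuning.
# X and y are assumed to be already-loaded feature and label arrays.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Hold out a test set once, and never touch it during model development.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Keeping the scaler inside the pipeline means it is refit on each training
# fold only, so no information from validation or test folds leaks in.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])
search = GridSearchCV(pipe, {"clf__n_estimators": [100, 300]}, cv=5)
search.fit(X_train, y_train)

# The test set is used exactly once, for the final performance estimate.
print("held-out accuracy:", search.score(X_test, y_test))
```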

Robust Model Evaluation

The robustness of model evaluation is a recurring theme. Using appropriate and representative test sets is crucial, particularly for time-series data, where temporal dependencies can introduce look-ahead bias. The paper advocates repeating evaluations to mitigate instability in ML models, and it stresses that choosing suitable performance metrics, especially in classification tasks with imbalanced classes, is essential for valid assessments of model efficacy.
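One way to put these points into practice, sketched here with scikit-learn under the assumption that X and y are already in chronological order, is to combine a forward-chaining split with an imbalance-aware metric and to report the spread of scores rather than a single number.

```python
# A sketch of temporally ordered evaluation with an imbalance-aware metric.
# X and y are assumed to be loaded and sorted in chronological order.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, make_scorer

# TimeSeriesSplit always trains on the past and tests on the future,
# avoiding the look-ahead bias that shuffled splits would introduce.
cv = TimeSeriesSplit(n_splits=5)

# Matthews correlation coefficient is more informative than raw accuracy
# when classes are imbalanced.
scorer = make_scorer(matthews_corrcoef)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X, y, cv=cv, scoring=scorer)

# Report the distribution over folds, not just the mean.
print("MCC per fold:", scores, "mean:", scores.mean())
```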

Fair Model Comparison and Reporting

Fair model comparison is equally important. The paper cautions against directly comparing numbers reported for models evaluated in different contexts or on different datasets. Using statistical tests for performance comparison, and correcting for multiple comparisons, helps substantiate claims of superiority. In reporting results, transparency is underscored, with a call to share comprehensive experimental details and scripts to enhance reproducibility and reliability.
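The sketch below illustrates one such comparison, pairing per-fold scores of a new model against two baselines with a Wilcoxon signed-rank test and a Holm correction for multiple comparisons; the score values and model names are purely illustrative, and this is only one of several reasonable test choices.

```python
# A sketch of paired statistical comparison with multiple-test correction.
# The per-fold scores below are made-up placeholders for illustration only.
import numpy as np
from scipy.stats import wilcoxon

new_model = np.array([0.812, 0.794, 0.831, 0.803, 0.825])
baselines = {
    "baseline_a": np.array([0.781, 0.772, 0.804, 0.791, 0.769]),
    "baseline_b": np.array([0.801, 0.786, 0.812, 0.797, 0.779]),
}

# Paired Wilcoxon signed-rank test on matched folds: one p-value per baseline.
pvals = {name: wilcoxon(new_model, scores).pvalue
         for name, scores in baselines.items()}

# Holm step-down correction: compare the i-th smallest p-value (0-indexed)
# against alpha / (m - i), and stop rejecting after the first failure.
alpha, m = 0.05, len(pvals)
still_rejecting = True
for i, (name, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
    threshold = alpha / (m - i)
    still_rejecting = still_rejecting and (p < threshold)
    verdict = "reject H0" if still_rejecting else "fail to reject H0"
    print(f"{name}: p={p:.4f}, Holm threshold={threshold:.4f}, {verdict}")
```

With only five folds, even consistent improvements often fail to reach significance, which itself argues for more repetitions or resamples before claiming superiority.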

Implications and Future Work

The implications of the practices discussed in this paper are both practical and theoretical, influencing how researchers evaluate models and contribute to ML knowledge. By adhering to these principles, researchers can reduce overhyped claims and the perpetuation of unreliable findings. Future advances in automated ML and foundation models, hinted at in the paper, may further shift the landscape of ML research, demanding new best practices and more nuanced evaluation.

In summary, the guidance outlined in this paper provides a valuable framework for improving the reliability and reproducibility of ML research, emphasizing the accountability of researchers in presenting accurate and meaningful contributions to the field. As ML continues to permeate various domains, adhering to these guidelines will be essential to sustaining trust in ML-based solutions and fostering genuine scientific progress.
