Accounting for multiplicity in machine learning benchmark performance (2303.07272v5)
Abstract: Machine learning methods are commonly evaluated and compared by their performance on data sets from public repositories. This allows multiple methods, often several thousand, to be evaluated under identical conditions and across time. The highest-ranked performance on a problem is referred to as state-of-the-art (SOTA) performance and is used, among other things, as a reference point for the publication of new methods. However, the highest-ranked performance is a biased estimate of SOTA, giving overly optimistic results. The mechanisms at play are those of multiplicity, a topic well studied in the context of multiple comparisons and multiple testing, but one that has, as far as the authors are aware, been nearly absent from the discussion of SOTA estimates. Because the optimistic SOTA estimate is used as a standard for evaluating new methods, methods with substantially inferior results are easily overlooked. In this article, we provide a probability distribution for the case of multiple classifiers, so that known analysis methods can be applied and a better SOTA estimate can be obtained. We demonstrate the impact of multiplicity through a simulated example with independent classifiers. We show how dependence between classifiers affects the variance, but also that this impact is limited when accuracy is high. Finally, we discuss three real-world examples: Kaggle competitions that illustrate various aspects of multiplicity.
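To make the multiplicity effect concrete, here is a minimal simulation sketch. It is not the paper's model; the parameters (m classifiers, n test examples, true accuracy p, mixing weight rho) and the shared-prediction construction for the dependent case are assumptions chosen purely for illustration. The script draws per-classifier test-set accuracies, first independently and then with induced correlation, and reports the maximum observed accuracy, i.e. the naive SOTA estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumed, not from the paper):
# m classifiers, each with true accuracy p, all evaluated
# on the same test set of n examples.
m, n, p = 1000, 2000, 0.80

# Independent case: each classifier's observed accuracy is
# Binomial(n, p) / n, independent across classifiers.
acc_indep = rng.binomial(n, p, size=m) / n

# Dependent case (assumed construction): with probability rho,
# a classifier's correctness on an example is copied from a
# shared latent classifier; otherwise it is drawn independently.
# Marginal accuracy stays p; pairwise correlation between
# classifiers' per-example correctness is rho**2.
rho = 0.5
common = rng.random(n) < p                # shared correctness outcomes
own = rng.random((m, n)) < p              # classifier-specific outcomes
use_common = rng.random((m, n)) < rho     # mixing indicator
correct = np.where(use_common, common, own)
acc_dep = correct.mean(axis=1)

print(f"true accuracy:               {p:.4f}")
print(f"mean accuracy (independent): {acc_indep.mean():.4f}")
print(f"max accuracy  (independent): {acc_indep.max():.4f}  <- optimistic 'SOTA'")
print(f"max accuracy  (dependent):   {acc_dep.max():.4f}")
```

With these assumed settings, the maximum over the independent classifiers typically lands several standard errors above the true accuracy, while the mean stays on target; the dependent case inflates less, since correlation reduces the effective number of independent attempts. This mirrors the abstract's points that the top-ranked result is an optimistic SOTA estimate and that classifier dependence changes the variance of that estimate.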