A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice (2404.16958v2)
Abstract: Classification systems are evaluated in countless papers, yet we find that evaluation practice is often nebulous. Frequently, metrics are selected without justification, and blurry terminology invites misconceptions. For instance, many works use so-called 'macro' metrics to rank systems (e.g., 'macro F1') but do not clearly specify what they expect from such a 'macro' metric. This is problematic because the choice of metric can affect research findings, so the selection process should be made as transparent as possible. Starting from the intuitive concepts of bias and prevalence, we analyze common evaluation metrics. The analysis helps us understand the metrics' underlying properties and how they align with the expectations expressed in papers. We then reflect on the practical situation in the field and survey evaluation practice in recent shared tasks. We find that metric selection is often not backed by convincing arguments, an issue that can make a system ranking seem arbitrary. Our work aims to provide an overview of, and guidance for, more informed and transparent metric selection, fostering meaningful evaluation.
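As a concrete instance of the blurry terminology the abstract points to, two different computations circulate in the literature under the name 'macro F1': the arithmetic mean of per-class F1 scores, and the harmonic mean of macro-averaged precision and macro-averaged recall. The minimal sketch below is a hypothetical illustration (not code from the paper; the function names and toy data are our own) showing that the two definitions can yield different scores for the same predictions.

```python
def per_class_counts(gold, pred, label):
    """Count true positives, false positives, and false negatives for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return tp, fp, fn

def macro_f1_avg_of_f1(gold, pred, labels):
    """Definition 1: arithmetic mean of per-class F1 scores."""
    f1s = []
    for label in labels:
        tp, fp, fn = per_class_counts(gold, pred, label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def macro_f1_of_avgs(gold, pred, labels):
    """Definition 2: harmonic mean of macro precision and macro recall."""
    precs, recs = [], []
    for label in labels:
        tp, fp, fn = per_class_counts(gold, pred, label)
        precs.append(tp / (tp + fp) if tp + fp else 0.0)
        recs.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy data: same gold labels and predictions for both definitions.
gold = ["a", "a", "a", "b", "b", "c"]
pred = ["a", "a", "b", "b", "c", "c"]
labels = sorted(set(gold))
print(macro_f1_avg_of_f1(gold, pred, labels))  # ~0.656
print(macro_f1_of_avgs(gold, pred, labels))    # ~0.693, a different score
```

On this toy data the two definitions give roughly 0.656 and 0.693, so a leaderboard ranked by one 'macro F1' need not agree with a leaderboard ranked by the other; this is why unspecified metric terminology can make system rankings seem arbitrary.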