
Confidence Intervals for the F1 Score: A Comparison of Four Methods (2309.14621v2)

Published 26 Sep 2023 in stat.ME

Abstract: In NLP, binary classification algorithms are often evaluated using the F1 score. Because the sample F1 score is an estimate of the population F1 score, it is not sufficient to report the sample F1 score without an indication of how accurate it is. Confidence intervals provide such an indication. However, most studies either do not report them or report them using methods that demonstrate poor statistical properties. In the present study, I review current analytical methods (i.e., Clopper-Pearson method and Wald method) to construct confidence intervals for the population F1 score, propose two new analytical methods (i.e., Wilson direct method and Wilson indirect method) to do so, and compare these methods based on their coverage probabilities and interval lengths, as well as whether these methods suffer from overshoot and degeneracy. Theoretical results demonstrate that neither proposed method suffers from overshoot or degeneracy. Experimental results suggest that both proposed methods perform better than current methods in terms of coverage probabilities and interval lengths. I illustrate both current and proposed methods on two suggestion mining tasks. I discuss the practical implications of these results, and suggest areas for future research.
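
To make the idea of an analytical interval for F1 concrete, here is a minimal Python sketch. It is not taken from the paper and may differ from the paper's Wilson direct and Wilson indirect constructions. It relies on one identity that holds by algebra: with F* = TP / (TP + FP + FN), the F1 score equals 2F* / (1 + F*). The sketch assumes TP can be treated as a binomial count out of n = TP + FP + FN trials, builds a standard Wilson score interval for F*, and maps its endpoints to the F1 scale. The function names (`f1_score`, `wilson_interval`, `f1_interval_via_fstar`) are hypothetical and chosen only for illustration.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def f1_score(tp, fp, fn):
    """Sample F1 score computed from confusion-matrix counts."""
    return 2 * tp / (2 * tp + fp + fn)


def wilson_interval(successes, n, z=Z_95):
    """Standard Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp only to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def f1_interval_via_fstar(tp, fp, fn, z=Z_95):
    """Illustrative interval for F1 (an assumption, not the paper's exact method).

    Builds a Wilson interval for F* = TP / (TP + FP + FN), then applies the
    increasing map F1 = 2 F* / (1 + F*) to both endpoints.
    """
    n = tp + fp + fn
    lo, hi = wilson_interval(tp, n, z)
    return 2 * lo / (1 + lo), 2 * hi / (1 + hi)


if __name__ == "__main__":
    tp, fp, fn = 80, 10, 10  # made-up counts, for illustration only
    print("sample F1:", f1_score(tp, fp, fn))
    print("interval :", f1_interval_via_fstar(tp, fp, fn))
```

Because the Wilson interval for F* stays inside [0, 1] and the map to F1 is increasing on that range, the resulting interval cannot extend past 0 or 1 (no overshoot), and it keeps a positive width even when TP is 0 or when FP and FN are both 0 (no degeneracy), at least under the assumptions of this sketch.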
