
We Need to Talk About Classification Evaluation Metrics in NLP (2401.03831v1)

Published 8 Jan 2024 in cs.CL and cs.LG

Abstract: In NLP classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics, and the arbitrariness of their application suggest that there is no agreement within NLP on a single best metric to use. This lack suggests there has not been sufficient examination of the underlying heuristics which each metric encodes. To address this we compare several standard classification metrics with more 'exotic' metrics and demonstrate that a random-guess normalised Informedness metric is a parsimonious baseline for task performance. To show how important the choice of metric is, we perform extensive experiments on a wide range of NLP tasks including a synthetic scenario, natural language understanding, question answering and machine translation. Across these tasks we use a superset of metrics to rank models and find that Informedness best captures the ideal model characteristics. Finally, we release a Python implementation of Informedness following the SciKitLearn classifier format.

Citations (1)

Summary

  • The paper highlights that conventional metrics like accuracy and F1-Measure can mislead due to issues like the Accuracy Paradox and class imbalance.
  • The paper demonstrates that Informedness offers an unbiased evaluation by accurately reflecting a model’s ability to outperform random chance.
  • The paper presents empirical studies along with a Python implementation to encourage community adoption of improved evaluation practices.

Understanding Classification Evaluation Metrics in NLP

The Significance of Evaluation Metrics

When developing models for NLP tasks such as sentiment analysis or topic categorisation, accurately measuring model performance is crucial. Performance is typically gauged with standard metrics such as Accuracy, F-Measure, or AUC-ROC. However, each of these metrics encodes its own heuristics about what counts as good performance, and those heuristics can skew our interpretation of a model's capabilities.

The Shortcomings of Common Metrics

Common evaluation metrics often do not tell the whole story. Accuracy, for instance, falls victim to the "Accuracy Paradox": a model can achieve high accuracy simply by always predicting the most common class, a phenomenon also referred to as baseline credit. Similarly, F1-Measure (the harmonic mean of precision and recall) ignores true negatives, so its value shifts with the class distribution of a dataset and with which class is treated as positive, making it unreliable for comparing models across datasets.
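
As a minimal illustration (a hypothetical 90/10 binary split, scored with scikit-learn's standard metric functions), a classifier that always predicts the majority class reaches 90% accuracy while learning nothing:

```python
# Minimal illustration of the Accuracy Paradox on a hypothetical,
# imbalanced binary dataset: always predicting the majority class.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10   # 90 negatives, 10 positives
y_pred = [0] * 100             # degenerate majority-class "model"

print(accuracy_score(y_true, y_pred))                              # 0.90
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47
```

Informedness for the same predictions is exactly 0 (sensitivity 0 plus specificity 1 minus 1), correctly signalling that this model performs no better than chance.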

A More Robust Alternative: Informedness

Researchers are now advocating for an evaluation metric known as "Informedness", which estimates the probability that a model makes an informed decision, i.e. how much better than random chance its predictions are. Because it is normalised against random guessing, it is unbiased and unaffected by the prevalence of any particular class in a dataset. Consequently, it provides a fairer comparison between models and a deeper understanding of their true predictive capabilities across varied tasks.
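
In the binary case, Informedness (also known as Youden's J or Bookmaker Informedness) is simply sensitivity plus specificity minus one. The sketch below is an illustrative implementation of that binary form, not the paper's released code:

```python
# Binary Informedness (Youden's J): sensitivity + specificity - 1.
# 0 indicates chance-level prediction, 1 a perfect classifier,
# and negative values indicate worse-than-chance behaviour.
from sklearn.metrics import confusion_matrix

def binary_informedness(y_true, y_pred):
    # Confusion-matrix layout for binary labels [0, 1]: [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return sensitivity + specificity - 1
```

The paper works with a multi-class generalisation (Bookmaker Informedness); for anything beyond the binary case, the released implementation should be consulted rather than this sketch.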

Empirical Studies on Informedness

The paper evaluates a wide range of NLP tasks, including a synthetic scenario, natural language understanding, question answering, and machine translation, using not just conventional metrics but also more nuanced ones like Informedness. The findings reveal significant discrepancies between the rankings produced by traditional metrics such as Accuracy and those produced by Informedness. Notably, under unbalanced class distributions and on specific sub-tasks within datasets, conventional metrics were misleading, either overestimating or underestimating model capabilities compared to Informedness.

Implementation and Community Adoption

To facilitate further research and encourage community-wide adoption of Informedness, a Python implementation conforming to the SciKitLearn classifier format has been released. This initiative is part of an effort to reshape model evaluation practices in NLP, steering away from biased metrics and helping practitioners make more insightful decisions about their models' performance.
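
The released package itself is not reproduced here, but a metric following the SciKitLearn convention takes (y_true, y_pred) arguments, like accuracy_score or f1_score. The hypothetical usage below reuses the binary_informedness sketch defined above; the actual function and module names in the released implementation may differ:

```python
# Hypothetical end-to-end usage of an sklearn-style Informedness metric,
# reusing the binary_informedness sketch defined earlier in this summary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic, imbalanced binary task (roughly a 90/10 class split).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:    ", accuracy_score(y_test, y_pred))
print("Informedness:", binary_informedness(y_test, y_pred))
```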

Final Reflections

The paper echoes a broader sentiment within the AI community: the methods used to gauge model performance need to evolve. By shifting towards metrics that offer a fair and intuitive measure of model capability, such as Informedness, researchers can better understand their models and address shortcomings in a more targeted manner. As a result, they are more likely to develop systems that are not just statistically impressive but also genuinely capable of understanding and processing language effectively.
