- The paper introduces Informedness and Markedness to overcome biases in traditional Precision, Recall, and F-Measure evaluation.
- It employs a probabilistic framework to extend these metrics to multi-class classifications with robust statistical significance testing.
- The study highlights practical improvements in fields like medical diagnostics, information retrieval, and machine learning through enhanced model evaluation.
Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation
Introduction
The paper by David M.W. Powers critiques traditional evaluation metrics such as Precision, Recall, and F-Measure, arguing that they are biased and fail to account for chance-level performance. Originally designed for Information Retrieval, these measures handle negative examples inadequately and propagate the marginal prevalences of the data into the score. The paper introduces Informedness and Markedness, dual measures of the probability that a prediction is informed about the condition, versus chance, and that the condition is marked by the prediction, versus chance, respectively. It also explores their interrelationships and their extension to the multi-class case.
Critique of Common Measures
Recall and Precision are the two ubiquitous metrics. Recall (Sensitivity) measures the proportion of real positive cases that are correctly predicted positive, while Precision (Confidence) denotes the proportion of positive predictions that are correct. However, both metrics ignore true negatives and are inflated by prevalence and chance, and so fail to provide a holistic evaluation.
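As a concrete illustration (a minimal sketch, not code from the paper), the following Python computes both quantities from the four cells of a binary confusion matrix; note that the true negative count appears in neither formula, which is exactly the blind spot the paper targets:

```python
def recall_precision(tp, fp, fn, tn):
    """Recall and Precision from binary confusion-matrix counts.

    tn is accepted but unused: both metrics are blind to true negatives.
    """
    recall = tp / (tp + fn)      # Sensitivity: coverage of real positives
    precision = tp / (tp + fp)   # Confidence: reliability of positive predictions
    return recall, precision

# Hypothetical counts: 90 TP, 30 FP, 10 FN, 870 TN
print(recall_precision(tp=90, fp=30, fn=10, tn=870))  # (0.9, 0.75)
```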
F-Measure addresses this to some extent by combining Recall and Precision into their harmonic mean, but it too remains insensitive to true negatives. Rand Accuracy and Cohen's Kappa do use the full contingency table, yet still exhibit biases despite their advantages. Rand Accuracy is the bias-weighted average of Precision and Inverse Precision (equivalently, the prevalence-weighted average of Recall and Inverse Recall), while Kappa normalizes the discriminant of contingency, the excess of true positives over their chance expectation, by its sum with the mean error rate. All of these measures can indicate better performance than an objective, chance-corrected evaluation would support.
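The sketch below (illustrative, with hypothetical counts) computes these three measures from a 2×2 table using their standard definitions, and checks numerically that Kappa matches the discriminant-over-mean-error characterization given above:

```python
import math

def f1_accuracy_kappa(tp, fp, fn, tn):
    """Standard F1, Rand Accuracy, and Cohen's Kappa for a 2x2 table."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean; tn never used
    accuracy = (tp + tn) / n
    # Agreement expected by chance under independent marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    return f1, accuracy, kappa

tp, fp, fn, tn = 90, 30, 10, 870
f1, acc, kappa = f1_accuracy_kappa(tp, fp, fn, tn)

# Kappa scales the discriminant of contingency (dtp = tp minus its
# chance expectation) by its sum with the mean error:
n = tp + fp + fn + tn
dtp = tp - (tp + fp) * (tp + fn) / n
assert math.isclose(kappa, dtp / (dtp + (fp + fn) / 2))
```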
Introduction of New Concepts
Powers introduces Informedness and Markedness, derived through probabilistic and information-theoretic analysis. Informedness quantifies the probability that a prediction is informed versus chance; in the dichotomous case it reduces to the true positive rate minus the false positive rate. Markedness, dually, measures the probability that the condition is marked by the predictor versus chance. These metrics address the deficiencies of the traditional measures by incorporating true negatives and providing a balanced, chance-corrected view of performance.
Mathematical Framework
Informedness (B) and Markedness (M) are formally defined as:
- Informedness (B): B = Recall + Inverse Recall − 1 = Sensitivity + Specificity − 1
- Markedness (M): M = Precision + Inverse Precision − 1 = PPV + NPV − 1, where PPV and NPV are the positive and negative predictive values
These metrics can be interpreted as the prediction's probabilistic edge over chance-level guessing. The paper derives the relationships among the various metrics mathematically, establishing that Informedness, Markedness, and Correlation form a coherent evaluation system superior to the traditional metrics. Notably, the Matthews Correlation Coefficient (the φ coefficient of the contingency table) bridges these concepts: its magnitude is the geometric mean of Informedness and Markedness, |φ| = √(B·M), with all three sharing the same sign.
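The following sketch (again illustrative, not code from the paper) computes B, M, and the Matthews correlation from the same hypothetical 2×2 table and verifies the geometric-mean identity:

```python
import math

def informedness_markedness_mcc(tp, fp, fn, tn):
    """Informedness, Markedness, and the Matthews correlation for a
    2x2 table (assumes all four marginals are non-zero)."""
    recall = tp / (tp + fn)                # Sensitivity
    inverse_recall = tn / (tn + fp)        # Specificity
    precision = tp / (tp + fp)             # PPV
    inverse_precision = tn / (tn + fn)     # NPV
    b = recall + inverse_recall - 1        # Informedness
    m = precision + inverse_precision - 1  # Markedness
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return b, m, mcc

b, m, mcc = informedness_markedness_mcc(tp=90, fp=30, fn=10, tn=870)
assert math.isclose(abs(mcc), math.sqrt(b * m))  # |MCC| = sqrt(B * M)
```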
Extension to Multi-Class Case
The paper generalizes these metrics to multi-class classification, ensuring they remain meaningful beyond binary contexts. This is achieved by defining multi-class Informedness and Markedness as weighted averages of their dichotomous, one-vs-rest versions. Powers's construction accounts for the complexity and variability inherent in multi-class scenarios; a sketch of the recipe follows.
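Here is a minimal sketch of that recipe, assuming a one-vs-rest decomposition and using real-class prevalence as the illustrative weighting; the paper derives the exact weights, which may differ from this choice:

```python
import numpy as np  # NumPy assumed available

def multiclass_informedness(y_true, y_pred, labels):
    """One-vs-rest sketch: dichotomous Informedness per class, combined
    by a weighted average. Prevalence weighting is an assumption here,
    not necessarily Powers's exact definition."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    total = 0.0
    for c in labels:
        pos = y_true == c                    # real positives for class c
        pred_pos = y_pred == c               # predicted positives for class c
        tpr = np.mean(pred_pos[pos])         # Recall for class c
        tnr = np.mean(~pred_pos[~pos])       # Inverse Recall for class c
        b_c = tpr + tnr - 1                  # dichotomous Informedness
        total += (pos.sum() / n) * b_c       # prevalence weight (assumed)
    return total

# Toy data, purely for illustration
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]
print(multiclass_informedness(y_true, y_pred, labels=[0, 1, 2]))
```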
Significance and Confidence
A salient feature of Powers's framework is its provision for statistical significance testing. Traditional significance tests are easily misinterpreted when applied to biased metrics. Chi-Squared (χ²) and G-Test (log-likelihood) variants recast in terms of Informedness and Markedness give more faithful significance testing, and Powers complements these point tests with confidence intervals for the metrics, balancing theoretical expectations against the empirical data. A sketch of such a test appears below.
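For the dichotomous case the pieces above combine into a standard one-degree-of-freedom test, since φ² = B·M and χ² = N·φ² for a 2×2 table. The following is a sketch of this kind of test, not the paper's exact recipe, and assumes SciPy is available for the p-value:

```python
from scipy.stats import chi2  # SciPy assumed available

def informedness_significance(tp, fp, fn, tn):
    """Chi-squared test for a 2x2 table via the identity
    chi2 = N * phi^2, where phi^2 = Informedness * Markedness."""
    n = tp + fp + fn + tn
    b = tp / (tp + fn) + tn / (tn + fp) - 1  # Informedness
    m = tp / (tp + fp) + tn / (tn + fn) - 1  # Markedness
    stat = n * b * m                         # equals N * MCC^2
    p_value = chi2.sf(stat, df=1)            # one degree of freedom
    return stat, p_value

print(informedness_significance(tp=90, fp=30, fn=10, tn=870))
```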
Practical and Theoretical Implications
Practically, the introduction of unbiased performance measures like Informedness and Markedness is crucial for a wide array of fields, including Medical Diagnostics, Information Retrieval, and Machine Learning. Theoretically, these measures contribute to a more robust understanding of classifier performance, ensuring models do not exploit biases inherent in data distribution.
Conclusion
Powers’s paper provides an insightful critique of traditional evaluation metrics, spotlighting their deficiencies in objective performance measurement. By introducing Informedness and Markedness, Powers not only offers theoretically sound alternatives but also furnishes a framework extendable to multi-class contexts. These metrics, integrated with robust statistical testing methods, promise a comprehensive toolkit for accurate and meaningful system evaluation, free from the biases that plague traditional measures. Future work will likely explore optimization techniques in learning systems that maximize Informedness, providing empirical validation of these theoretical advancements.
In summary, this paper challenges the status quo in evaluative metrics, proposing a paradigm shift towards unbiased, statistically sound measures that can significantly enhance model evaluation across various domains.