Analyzing Two Interpretations of the Macro F1 Score: Insights and Implications
The paper "Macro F1 and Macro F1: A Note" by Juri Opitz and Sebastian Burst presents an important analysis of the macro F1
metric, a widely used performance metric in classification tasks. This work highlights a critical nuance in the computation of macro F1
scores, revealing the potential implications for classifier evaluation and ranking.
Key Insights from the Paper
The authors expose the existence of two distinct formulas for calculating macro F1, which they designate as the averaged F1 and the F1 of averages. The averaged F1 is the arithmetic mean of the per-class F1 scores, i.e., the mean of the harmonic means of precision and recall for each class. In contrast, the F1 of averages is the harmonic mean of the arithmetic means of precision and recall (the macro-averaged precision and recall). The two formulas might appear equivalent at first glance, but the paper demonstrates that in most practical scenarios they yield different results.
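To make the distinction concrete, here is a minimal Python sketch of the two computations, taking per-class precision and recall values as input; the three-class numbers are purely illustrative.

```python
from statistics import mean

def f1(p, r):
    """Harmonic mean of a single precision/recall pair (0 when both are 0)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

def averaged_f1(precisions, recalls):
    """Averaged F1: arithmetic mean of the per-class F1 scores."""
    return mean(f1(p, r) for p, r in zip(precisions, recalls))

def f1_of_averages(precisions, recalls):
    """F1 of averages: harmonic mean of macro-averaged precision and recall."""
    return f1(mean(precisions), mean(recalls))

# Illustrative per-class precision and recall for a three-class problem.
P = [0.9, 0.4, 0.7]
R = [0.3, 0.8, 0.7]
print(averaged_f1(P, R))     # ~0.561
print(f1_of_averages(P, R))  # ~0.632
```

Even with only moderately skewed per-class precision and recall, the two numbers already differ by roughly 0.07.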
Mathematical Analysis
The paper conducts a thorough mathematical analysis of these formulas:
- Inequality and Divergence: It is proven that the F1 of averages is always greater than or equal to the averaged F1. The two scores diverge whenever precision and recall differ within individual classes, and the paper provides a theoretical bound on this divergence, which can approach 0.5 in specific configurations (a numerical sketch of such a configuration follows this list).
- Implications for Classifier Evaluation: The divergence can lead to different classifier rankings: a classifier may outperform another according to one metric but not the other. The effect is strongest when classifiers exhibit skewed error type distributions, i.e., an uneven trade-off between type I and type II errors across classes.
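Reusing the helper functions from the sketch above, the following configuration illustrates how the gap can approach the 0.5 bound; the precision/recall values are stipulated directly for illustration rather than derived from a concrete confusion matrix.

```python
# Degenerate two-class configuration that pushes the gap toward its 0.5 bound:
# one class has near-perfect precision but near-zero recall, the other the
# reverse. Values are stipulated directly, not derived from a confusion matrix.
eps = 1e-3
P = [1.0, eps]
R = [eps, 1.0]

print(averaged_f1(P, R))     # ~0.002 (both per-class F1 scores collapse)
print(f1_of_averages(P, R))  # ~0.5   (macro precision and recall are both ~0.5)
```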
Practical Implications
The authors examine the practical consequences through numerical experiments, showing that the discrepancy between the two metrics is particularly pronounced on datasets with imbalanced class distributions. They emphasize that macro F1 is often deployed in exactly such contexts with the intention of giving equal weight to all classes. However, under the F1 of averages formulation, biased classifiers that favor precision over recall (or vice versa) in certain classes can receive misleadingly high scores.
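A small constructed comparison, again reusing the helpers above, shows how this can flip a ranking: a hypothetical classifier B with strongly skewed error types overtakes a balanced classifier A under the F1 of averages, while the averaged F1 prefers A.

```python
# Classifier A: balanced precision and recall in both classes.
P_A, R_A = [0.60, 0.60], [0.60, 0.60]
# Classifier B: skewed error types (high precision / low recall in one class,
# the reverse in the other). All values are illustrative.
P_B, R_B = [0.95, 0.35], [0.35, 0.95]

print(averaged_f1(P_A, R_A), f1_of_averages(P_A, R_A))  # 0.600   0.600
print(averaged_f1(P_B, R_B), f1_of_averages(P_B, R_B))  # ~0.512  0.650
# Averaged F1 ranks A above B; F1 of averages ranks B above A.
```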
Recommendations
Given these findings, the paper advises using the averaged F1 (the arithmetic mean of the per-class F1 scores) for evaluating classifiers whenever a balanced treatment of all classes is desired. This formulation is arguably more robust against the biases introduced by skewed error type distributions.
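In practice, it is worth checking which of the two formulas a given toolkit reports. As a sketch, assuming scikit-learn is available: its f1_score with average="macro" is documented as the unweighted mean of the per-class F1 scores (the averaged F1), while the F1 of averages has to be assembled manually from the macro-averaged precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels for a three-class problem (illustrative only).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 2, 1, 0, 2, 2, 0, 1]

averaged = f1_score(y_true, y_pred, average="macro")      # averaged F1
macro_p = precision_score(y_true, y_pred, average="macro")
macro_r = recall_score(y_true, y_pred, average="macro")
f1_of_avgs = 2 * macro_p * macro_r / (macro_p + macro_r)  # F1 of averages

print(averaged, f1_of_avgs)  # the second value is never smaller than the first
```

Stating explicitly which of the two quantities is reported removes exactly the ambiguity the authors warn about.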
Speculations and Future Directions
This paper's insights call for a re-examination of how macro F1 is employed in evaluating classification models, particularly on imbalanced datasets. Future research might focus on developing methodologies or evaluation metrics that offer a more nuanced balance between precision and recall across classes, ensuring that evaluations align more closely with the specific goals of different application domains.
Additionally, this work underscores the need for transparency in reporting metric calculations, urging authors to clearly specify which macro F1 computation they employ. The implications extend to structuring fairer competitions and benchmarks in machine learning, where consistent and precise evaluation is paramount.
In conclusion, the paper by Opitz and Burst elucidates fundamental differences in macro F1 computation that were previously underappreciated, offering significant considerations for future AI research, application, and scholarly communication.