A global analysis of metrics used for measuring performance in natural language processing

Published 25 Apr 2022 in cs.CL and cs.AI | (2204.11574v1)

Abstract: Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, originally devised for machine translation and summarization, have been shown to suffer from low correlation with human judgment and a lack of transferability to other tasks and languages. In the past 15 years, a wide range of alternative metrics have been proposed. However, it is unclear to what extent this has had an impact on NLP benchmarking efforts. Here we provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing. We curated, mapped and systematized more than 3500 machine learning model performance results from the open repository 'Papers with Code' to enable a global and comprehensive analysis. Our results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a models' performance. Furthermore, we found that ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.

Abstract PDF Upgrade to Chat

Authors (5)

Citations (23)

View on Semantic Scholar

Summary

The paper provides a global analysis of over 3500 performance results, identifying prevalent and ambiguous metric usage in NLP benchmarking.
It employs hierarchical mapping and manual curation to reduce 812 metric strings to 187 canonical metrics, clarifying inconsistencies in reporting.
The findings highlight the need for precise metric reporting and the integration of traditional and emerging metrics to enhance evaluation robustness.

A Global Analysis of Metrics Used for Measuring Performance in Natural Language Processing

Introduction

Over the past two decades, benchmarking in NLP has evolved significantly, with an expanded repertoire of metrics used to evaluate the performance of various models. While traditional metrics like BLEU and ROUGE were initially dominant, their limitations such as low correlation with human judgment and poor task transferability have driven the development of numerous alternatives. The study conducted by Blagec et al. provides a detailed cross-sectional analysis of metric usage in NLP, focusing on over 3500 performance results sourced from "Papers with Code." The insights reveal not only the prevalent metrics in current practice but also the inherent challenges in their usage and reporting.

Methodology

The analysis derives its dataset from "Papers with Code," utilizing ITO to organize 812 different metric strings reported across numerous benchmarks. A crucial step involved the hierarchical mapping and manual curation of these metrics into canonical forms, reducing them to 187 distinct performance metrics. This process addressed ambiguities and inconsistencies such as synonymous metric names and variants within the same metric family.

Figure 1: Property hierarchy after manual curation of the raw list of metrics.

Results

Prevalent Metrics in NLP

Bleu score emerges as the most frequently reported metric in NLP benchmarking, used in 61.1% of the datasets. It is prominently employed in machine translation and generation tasks. ROUGE, another dominant metric used across text generation tasks, often correlates with BLEU in its applications. METEOR, despite its advantages in correlating with human judgment, shows limited adoption relative to BLEU and ROUGE.

Co-occurrence of Metrics

The study identifies BLEU as often reported in isolation, whereas ROUGE and METEOR tend to be used in conjunction with other metrics. The analysis of co-occurrence patterns demonstrates that while BLEU and ROUGE are frequently used together, METEOR more consistently accompanies other metrics, reflecting an integrated evaluation approach.

Figure 2: Co-occurrence matrix for the top 10 most frequently used NLP metrics.

Inconsistencies and Ambiguities

The manual curation process underscored significant inconsistencies in metric reporting. For example, ROUGE metrics are often ambiguously reported without specifying the variant, leading to potential misinterpretations when comparing studies. The study highlights the need for clarity and specificity in reporting metrics to facilitate consistent evaluation and replication.

Discussion

The continued reliance on traditional metrics, despite their limitations, poses challenges for accurately assessing NLP model performance. While newer metrics like METEOR++ 2.0 and BERTScore offer enhanced correlation with human judgment and address some of the shortcomings of predecessors, their adoption remains less widespread. The dynamic nature of benchmarks and the evolving landscape of NLP necessitate frameworks that account for multi-dimensional aspects of evaluation, such as robustness and fairness, alongside traditional performance measures.

Figure 3: Timeline of the introduction of NLP metrics and their original application.

Recommendations and Future Directions

To advance the state of NLP benchmarking, the study recommends:

Ensuring precise reporting by clearly stating metric variants and adaptations.
Leveraging comprehensive metrics that capture diverse performance aspects beyond n-gram matching.
Increasing the use of reference implementations to enhance reproducibility.

Emerging concepts like bidimensional leaderboards and evaluation-as-a-service frameworks (e.g., Dynascore) represent promising directions, allowing for customizable and holistic evaluation strategies that could incorporate user-defined priorities, thus enhancing the utility of benchmarks amid ongoing advancements in NLP.

Conclusion

The paper provides pivotal insights into the complexities of metric usage in NLP, emphasizing the importance of consistent, clear metric reporting and the adoption of multifaceted evaluation metrics. Future research should focus on developing comprehensive frameworks that integrate newer metrics with traditional ones, offering robust evaluations adaptable to different NLP domains and tasks.