Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies (2401.06760v2)

Published 12 Jan 2024 in cs.CL

Abstract: Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values in regards to testset size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.


Summary

  • The paper’s main contribution is quantifying metric deltas that correlate with human judgment on machine translation quality.
  • It employs a binning approach on the ToShip23 dataset to relate the varied score ranges of traditional and modern metrics to human pairwise accuracy.
  • Findings indicate that different metrics require different score deltas to reach the same level of agreement with human judgment, and these thresholds hold up across languages and translation directions.

Introduction

The field of machine translation (MT) has evolved significantly over the past decade, with the once-dominant BLEU metric giving way to a variety of sophisticated, deep-learning-based metrics. This proliferation presents a challenge for researchers: understanding each metric's "dynamic range" and what "metric delta", i.e., score difference between two systems, constitutes a meaningful shift in performance. That question is central to this paper: how large a metric score difference is needed before humans notice a difference between systems?

Experimental Approach

To shed light on metric deltas, the researchers introduced a new human evaluation dataset, ToShip23, which is larger and more richly annotated than its predecessors. The metrics under evaluation included both traditional options such as BLEU and chrF and newer, learned metrics such as BLEURT and COMET. Notably, the team steered away from LLM-based metrics because they rely on non-public models, focusing instead on metrics that can be replicated and validated independently.

Metric Deltas and Accuracy

Intriguingly, the paper unearthed wide variation in the ranges of scores that different metrics produce: some metrics occupy similar ranges, while others deviate significantly. The paper then adopted a binning approach on the ToShip23 dataset to explore the granularity of metric deltas: system pairs are grouped into bins by metric score delta, and within each bin the metric's ranking is compared against human judgments, yielding a more nuanced picture of when humans agree or disagree with metric rankings.
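
The binning procedure can be illustrated with a minimal sketch (not the authors' released code): system pairs are binned by the magnitude of their metric score delta, and pairwise accuracy, i.e. the fraction of pairs where the metric ranks the two systems the same way humans do, is computed within each bin. The SystemPair structure and bin edges below are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SystemPair:
    metric_delta: float  # metric score of system A minus system B
    human_delta: float   # human evaluation score of system A minus system B

def pairwise_accuracy(pairs):
    """Fraction of pairs where the metric ranks A vs. B the same way humans do."""
    signs_agree = [np.sign(p.metric_delta) == np.sign(p.human_delta) for p in pairs]
    return float(np.mean(signs_agree))

def accuracy_by_delta_bin(pairs, bin_edges):
    """Bin pairs by |metric delta| and report pairwise accuracy within each bin."""
    results = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [p for p in pairs if lo <= abs(p.metric_delta) < hi]
        if in_bin:
            results.append({"bin": (lo, hi), "n_pairs": len(in_bin),
                            "accuracy": pairwise_accuracy(in_bin)})
    return results

# Illustrative usage with made-up deltas:
# pairs = [SystemPair(1.2, 0.4), SystemPair(-0.3, 0.1)]
# accuracy_by_delta_bin(pairs, np.arange(0.0, 5.0, 0.5))
```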

Findings and Implications

The key results show that metric deltas correlate variably with human-judged accuracy. For example, reaching 70% agreement with human judgment requires around a 1.3 BLEU delta, whereas 80% agreement requires a heftier 3.5 BLEU delta. In contrast, some contemporary metrics like CometKiwiQE22 achieve 90% human agreement with a mere 0.9-point delta. These findings are further validated across different languages and translation directions, a robustness that is crucial for practitioners.
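
To see how such thresholds can be read off the data, here is a hedged, self-contained sketch (the function name and search step are assumptions, not the paper's implementation) that finds the smallest |metric delta| at which metric and human pairwise rankings agree at a target rate such as 70%, 80%, or 90%:

```python
import numpy as np

def delta_for_target_accuracy(metric_deltas, human_deltas, target, step=0.1):
    """Smallest |metric delta| threshold such that, among system pairs with at
    least that delta, metric and human rankings agree at the target rate."""
    metric_deltas = np.asarray(metric_deltas, dtype=float)
    human_deltas = np.asarray(human_deltas, dtype=float)
    for threshold in np.arange(0.0, np.abs(metric_deltas).max() + step, step):
        mask = np.abs(metric_deltas) >= threshold
        if not mask.any():
            break
        accuracy = np.mean(np.sign(metric_deltas[mask]) == np.sign(human_deltas[mask]))
        if accuracy >= target:
            return float(threshold)
    return None  # target accuracy never reached on this data

# Illustrative usage with made-up data:
# delta_for_target_accuracy([1.3, 0.2, 3.5], [0.5, -0.1, 0.8], target=0.8)
```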

The paper's implications are broad, opening the door to tooling that lets users compare accuracies across different metric thresholds. For MT researchers, this is a significant step toward accurately measuring system performance and understanding how margins of improvement in scores are perceived by human evaluators, a critical bridge between quantitative metrics and qualitative assessment.
