Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies (2401.06760v2)

Published 12 Jan 2024 in cs.CL

Abstract: Ten years ago a single metric, BLEU, governed progress in machine translation research. For better or worse, there is no such consensus today, and consequently it is difficult for researchers to develop and retain the kinds of heuristic intuitions about metric deltas that drove earlier research and deployment decisions. This paper investigates the "dynamic range" of a number of modern metrics in an effort to provide a collective understanding of the meaning of differences in scores both within and among metrics; in other words, we ask what point difference X in metric Y is required between two systems for humans to notice? We conduct our evaluation on a new large dataset, ToShip23, using it to discover deltas at which metrics achieve system-level differences that are meaningful to humans, which we measure by pairwise system accuracy. We additionally show that this method of establishing delta-accuracy is more stable than the standard use of statistical p-values in regards to testset size. Where data size permits, we also explore the effect of metric deltas and accuracy across finer-grained features such as translation direction, domain, and system closeness.


Summary

  • The paper’s main contribution is quantifying metric deltas that correlate with human judgment on machine translation quality.
  • It employs a binning approach on the ToShip23 dataset to relate the varied score ranges of traditional and modern metrics to human pairwise accuracy.
  • Findings indicate that different metrics require different score deltas to reach the same level of agreement with human judgment, and these thresholds hold up across languages and translation directions.

Introduction

The field of machine translation (MT) has evolved significantly over the past decade, with the once-dominant BLEU metric giving way to a variety of sophisticated, deep-learning-based metrics. This proliferation presents a challenge for researchers: understanding each metric's "dynamic range" and what "metric delta", i.e., score difference between two systems, constitutes a meaningful shift in performance. That question is central to this paper: how large a metric score difference is needed before humans notice a difference between systems?

Experimental Approach

To shed light on metric deltas, the researchers introduced a new human evaluation dataset, ToShip23, which is larger and more richly annotated than its predecessors. The metrics under evaluation included both traditional options such as BLEU and chrF and newer, learned metrics such as BLEURT and COMET. Notably, the team steered away from LLM-based metrics because they rely on non-public models, focusing instead on metrics that can be replicated and validated independently.

Metric Deltas and Accuracy

Intriguingly, the paper unearthed wide variation in the ranges of scores that different metrics produce: some metrics occupy similar ranges, while others deviate significantly. The paper then adopted a binning approach on the ToShip23 dataset to explore the granularity of metric deltas: system pairs are grouped into bins by metric score delta, and within each bin the metric's ranking is compared against human judgments, yielding a more nuanced picture of when humans agree or disagree with metric rankings.
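
The binning procedure can be illustrated with a minimal sketch (not the authors' released code): system pairs are binned by the magnitude of their metric score delta, and pairwise accuracy, i.e. the fraction of pairs where the metric ranks the two systems the same way humans do, is computed within each bin. The SystemPair structure and bin edges below are illustrative assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SystemPair:
    metric_delta: float  # metric score of system A minus system B
    human_delta: float   # human evaluation score of system A minus system B

def pairwise_accuracy(pairs):
    """Fraction of pairs where the metric ranks A vs. B the same way humans do."""
    signs_agree = [np.sign(p.metric_delta) == np.sign(p.human_delta) for p in pairs]
    return float(np.mean(signs_agree))

def accuracy_by_delta_bin(pairs, bin_edges):
    """Bin pairs by |metric delta| and report pairwise accuracy within each bin."""
    results = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [p for p in pairs if lo <= abs(p.metric_delta) < hi]
        if in_bin:
            results.append({"bin": (lo, hi), "n_pairs": len(in_bin),
                            "accuracy": pairwise_accuracy(in_bin)})
    return results

# Illustrative usage with made-up deltas:
# pairs = [SystemPair(1.2, 0.4), SystemPair(-0.3, 0.1)]
# accuracy_by_delta_bin(pairs, np.arange(0.0, 5.0, 0.5))
```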

Findings and Implications

The key results show that metric deltas correlate variably with human-judged accuracy. For example, reaching 70% agreement with human judgment requires around a 1.3 BLEU delta, whereas 80% agreement requires a heftier 3.5 BLEU delta. In contrast, some contemporary metrics like CometKiwiQE22 achieve 90% human agreement with a mere 0.9-point delta. These findings are further validated across different languages and translation directions, a robustness that is crucial for practitioners.
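
To see how such thresholds can be read off the data, here is a hedged, self-contained sketch (the function name and search step are assumptions, not the paper's implementation) that finds the smallest |metric delta| at which metric and human pairwise rankings agree at a target rate such as 70%, 80%, or 90%:

```python
import numpy as np

def delta_for_target_accuracy(metric_deltas, human_deltas, target, step=0.1):
    """Smallest |metric delta| threshold such that, among system pairs with at
    least that delta, metric and human rankings agree at the target rate."""
    metric_deltas = np.asarray(metric_deltas, dtype=float)
    human_deltas = np.asarray(human_deltas, dtype=float)
    for threshold in np.arange(0.0, np.abs(metric_deltas).max() + step, step):
        mask = np.abs(metric_deltas) >= threshold
        if not mask.any():
            break
        accuracy = np.mean(np.sign(metric_deltas[mask]) == np.sign(human_deltas[mask]))
        if accuracy >= target:
            return float(threshold)
    return None  # target accuracy never reached on this data

# Illustrative usage with made-up data:
# delta_for_target_accuracy([1.3, 0.2, 3.5], [0.5, -0.1, 0.8], target=0.8)
```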

The paper's implications are broad, opening the door to tooling that lets users compare accuracies across different metric thresholds. For MT researchers, this is a significant step toward accurately measuring system performance and understanding how margins of improvement in scores are perceived by human evaluators, a critical bridge between quantitative metrics and qualitative assessment.
