Sentence-level Aggregation of Lexical Metrics Correlates Stronger with Human Judgements than Corpus-level Aggregation (2407.12832v2)

Published 3 Jul 2024 in cs.CL

Abstract: In this paper we show that corpus-level aggregation considerably hinders the ability of lexical metrics to evaluate machine translation (MT) systems accurately. Through empirical experiments, we demonstrate that averaging individual segment-level scores can make metrics such as BLEU and chrF correlate much more strongly with human judgements and behave considerably more similarly to neural metrics such as COMET and BLEURT. We show that this difference exists because corpus- and segment-level aggregation differ considerably, owing to the classical "average of ratios versus ratio of averages" problem. Moreover, as we also show, this difference considerably affects the statistical robustness of corpus-level aggregation. Given that neural metrics currently cover only a small set of sufficiently-resourced languages, the results in this paper can help make the evaluation of MT systems for low-resource languages more trustworthy.
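
To make the aggregation difference concrete: corpus-level BLEU pools n-gram match and candidate counts over all segments and then computes a single precision ratio per n-gram order (a ratio of averages), whereas segment-level aggregation computes one score per segment and averages those scores (an average of ratios). In general, mean_i(a_i / b_i) ≠ (sum_i a_i) / (sum_i b_i), which is why the two schemes can disagree. The sketch below is not the authors' code; it uses invented toy sentences and the sacreBLEU library (reference 18) only to show how each scheme is computed:

    # Minimal sketch contrasting corpus-level and segment-level aggregation
    # of BLEU and chrF with sacreBLEU (reference 18). The toy sentences are
    # invented for illustration, not taken from the paper's experiments.
    from statistics import mean

    from sacrebleu.metrics import BLEU, CHRF

    hyps = ["the cat sat on the mat",
            "a quick brown fox jumps over a lazy dog"]
    refs = ["the cat sat on the mat",
            "the quick brown fox jumped over the lazy dog"]

    # Corpus-level aggregation: pool n-gram counts over all segments, then
    # take one precision ratio per n-gram order ("ratio of averages").
    corpus_bleu = BLEU().corpus_score(hyps, [refs]).score

    # Segment-level aggregation: score each segment separately and average
    # the scores ("average of ratios"). effective_order=True avoids zero
    # BLEU on short segments with no higher-order n-gram matches.
    sent_bleu = BLEU(effective_order=True)
    avg_bleu = mean(sent_bleu.sentence_score(h, [r]).score
                    for h, r in zip(hyps, refs))

    # The same contrast for chrF.
    chrf = CHRF()
    corpus_chrf = chrf.corpus_score(hyps, [refs]).score
    avg_chrf = mean(chrf.sentence_score(h, [r]).score
                    for h, r in zip(hyps, refs))

    print(f"BLEU  corpus={corpus_bleu:.2f}  segment-avg={avg_bleu:.2f}")
    print(f"chrF  corpus={corpus_chrf:.2f}  segment-avg={avg_chrf:.2f}")

On real system outputs the two numbers can diverge substantially, and the paper's central finding is that the segment-averaged variant correlates more strongly with human judgements.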

References (23)
  1. It’s easier to translate out of English than into it: Measuring neural translation difficulty by cross-mutual information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1640–1649, Online, July 2020. Association for Computational Linguistics.
  2. BLASER: A text-free speech-to-speech translation evaluation metric. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9064–9079, Toronto, Canada, July 2023. Association for Computational Linguistics.
  3. Results of WMT23 metrics shared task: Metrics might be guilty but references are not innocent. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 578–628, Singapore, December 2023. Association for Computational Linguistics.
  4. Results of WMT22 metrics shared task: Stop using BLEU – neural metrics are better and more robust. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 46–68, Abu Dhabi, United Arab Emirates (Hybrid), December 2022. Association for Computational Linguistics.
  5. Results of the WMT21 metrics shared task: Evaluating metrics with expert-based human evaluations on TED and news domain. In Proceedings of the Sixth Conference on Machine Translation, pages 733–774, Online, November 2021. Association for Computational Linguistics.
  6. Integrating language models into direct speech translation: An inference-time solution to control gender inflection. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11505–11517, Singapore, December 2023. Association for Computational Linguistics.
  7. Testing for significance of increased correlation with human judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 172–176, Doha, Qatar, October 2014. Association for Computational Linguistics.
  8. Breeding machine translations: Evolutionary approach to survive and thrive in the world of automated evaluation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2191–2212, Toronto, Canada, July 2023. Association for Computational Linguistics.
  9. MetricX-23: The Google Submission to the WMT 2023 Metrics Shared Task. In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christof Monz, editors, Proceedings of the Eighth Conference on Machine Translation, pages 756–767, Singapore, December 2023. Association for Computational Linguistics.
  10. Philipp Koehn. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388–395, Barcelona, Spain, July 2004. Association for Computational Linguistics.
  11. Scheduled sampling based on decoding steps for neural machine translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3285–3296, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics.
  12. High-quality data-to-text generation for severely under-resourced languages with out-of-the-box large language models. In Yvette Graham and Matthew Purver, editors, Findings of the Association for Computational Linguistics: EACL 2024, pages 1451–1461, St. Julian’s, Malta, March 2024. Association for Computational Linguistics.
  13. Tangled up in BLEU: Reevaluating the evaluation of automatic machine translation evaluation metrics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4984–4997, Online, July 2020. Association for Computational Linguistics.
  14. Evaluating robustness to input perturbations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8538–8544, Online, July 2020. Association for Computational Linguistics.
  15. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics.
  16. Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
  17. Maja Popović. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.
  18. Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics.
  19. Searching for COMETINHO: The little metric that could. In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pages 61–70, Ghent, Belgium, June 2022. European Association for Machine Translation.
  20. COMET: A neural framework for MT evaluation. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2685–2702, Online, November 2020. Association for Computational Linguistics.
  21. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online, July 2020. Association for Computational Linguistics.
  22. UniTE: Unified translation evaluation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8117–8127, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  23. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations, 2020.
