Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets (2401.16313v1)

Published 29 Jan 2024 in cs.CL

Abstract: Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from simple alterations at the word/character level to more complex errors based on discourse and real-world knowledge. We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric performance, assess their incremental performance over successive campaigns, and measure their sensitivity to a range of linguistic phenomena. We also investigate claims that LLMs are effective as MT evaluators by evaluating on ACES. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. Our analyses indicate that most metrics ignore the source sentence, tend to prefer surface-level overlap and end up incorporating properties of base models which are not always beneficial. We expand ACES to include error span annotations, denoted as SPAN-ACES and we use this dataset to evaluate span-based error metrics showing these metrics also need considerable improvement. Finally, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing strategies to explicitly focus on the source sentence, focusing on semantic content and choosing the right base model for representations.
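
As context for how a contrastive challenge set like ACES is used, the sketch below shows the core evaluation loop: for each example, a metric must assign a higher score to the correct translation than to the translation containing an accuracy error, and results are summarised with a Kendall tau-like statistic. This is a minimal illustrative sketch, not the actual ACES code: the `ContrastiveExample` fields, the `metric_score` signature, the tie handling, and the toy `overlap_metric` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ContrastiveExample:
    """One contrastive item: source, reference, a good translation,
    and an incorrect translation exhibiting one accuracy error."""
    source: str
    reference: str
    good_translation: str
    incorrect_translation: str
    phenomenon: str  # e.g. a hallucination or mistranslation category


def kendall_tau_like(
    examples: Iterable[ContrastiveExample],
    metric_score: Callable[[str, str, str], float],
) -> float:
    """Return (concordant - discordant) / total: how often the metric
    prefers the good translation over the incorrect one.
    Ties are counted against the metric here for simplicity."""
    concordant = discordant = 0
    for ex in examples:
        good = metric_score(ex.source, ex.good_translation, ex.reference)
        bad = metric_score(ex.source, ex.incorrect_translation, ex.reference)
        if good > bad:
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def overlap_metric(source: str, hypothesis: str, reference: str) -> float:
    """Toy surface-overlap scorer (Jaccard over lowercased tokens).
    Note it ignores the source entirely, illustrating two behaviours the
    paper criticises: preferring surface overlap and discarding the source."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    return len(hyp & ref) / max(len(hyp | ref), 1)
```

Scores near 1 mean the metric almost always prefers the correct translation for that phenomenon, scores near -1 mean it almost always prefers the erroneous one; a toy scorer like `overlap_metric` will fail precisely on pairs where the incorrect translation shares more tokens with the reference, which is the kind of weakness the challenge set is designed to expose.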

Authors (8)
  1. Nikita Moghe (12 papers)
  2. Arnisa Fazla (1 paper)
  3. Chantal Amrhein (13 papers)
  4. Tom Kocmi (29 papers)
  5. Mark Steedman (36 papers)
  6. Alexandra Birch (67 papers)
  7. Rico Sennrich (87 papers)
  8. Liane Guillou (18 papers)
Citations (4)