Quality and Quantity of Machine Translation References for Automatic Metrics (2401.01283v5)

Published 2 Jan 2024 in cs.CL

Abstract: Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

Analyzing the Quality and Quantity of Machine Translation References for Automatic Metrics

The paper "Quality and Quantity of Machine Translation References for Automatic Metrics" critically examines the influence of reference quality and quantity on the efficacy of automatic machine translation (MT) evaluation metrics. The research is centralized around a key query: How do varying levels of reference quality and the inclusion of additional reference translations impact the performance of automated MT metrics, and what is the optimal balance when faced with a budget constraint?

The research acknowledges that automatic MT evaluation largely depends on high-quality human reference translations. Human evaluation is widely regarded as the gold standard, but it is neither scalable nor cheaply reproducible; automatic metrics have therefore emerged as a practical alternative, and their success is measured by how well their scores correlate with human assessments of machine-generated translations.
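The notion of "metric success" used throughout the paper is, in essence, segment-level correlation between metric scores and human judgments. The snippet below is a minimal sketch of that computation; the score lists are toy values, and the choice of Kendall's tau and Pearson's r as correlation statistics is an assumption made here for illustration.

```python
# Minimal sketch: quantify "metric success" as the correlation between
# per-segment automatic metric scores and human judgments.
# The numbers below are illustrative toy values, not data from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.71, 0.55, 0.83, 0.40, 0.66]  # e.g. chrF or COMET per segment
human_scores = [78, 52, 90, 35, 70]             # e.g. direct-assessment scores

tau, _ = kendalltau(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall tau: {tau:.3f}, Pearson r: {r:.3f}")
```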

The researchers conducted experiments on an English-to-Czech translation task, incorporating references of differing quality, from standard translation vendors to carefully produced, academia-grade translations. The data was re-annotated to form a multi-reference dataset, enabling a thorough examination of both factors.

Findings

  1. Impact of Reference Quality: The paper establishes that reference quality considerably influences metric performance. Poor-quality references, as anticipated, degrade the correlation between automatic scores and human assessments. The research also cites cases where the highest-quality human translations (termed "optimal reference translations") do not always yield the highest correlations, potentially because translation shifts challenge surface-level matching in metrics like BLEU.
  2. Advantage of Multiple References: Incorporating multiple references, particularly when per-reference scores are averaged or their maximum is taken, significantly enhances metric correlations (see the first sketch after this list). Improvements plateau at around seven references, beyond which the benefit diminishes. This finding aligns with previous research indicating that the number of references can substitute for, or complement, test set size in enhancing metric reliability.
  3. Budget Allocation for References: Importantly, the paper frames reference collection as an optimization problem: given a fixed budget, select the combination of reference qualities and quantities that maximizes metric success (a toy version of this framing is sketched in the second snippet below). This pragmatic framing helps navigate the trade-off between reference quality and collection cost.
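The following sketch illustrates the average/max aggregation over multiple references described in finding 2, using sentence-level chrF from sacrebleu as a stand-in single-reference metric. The hypothesis and reference sentences are invented, and the exact aggregation details in the paper may differ.

```python
# Sketch: score a hypothesis against each reference separately, then
# aggregate by averaging or taking the maximum, as in finding 2.
from statistics import mean

import sacrebleu

hypothesis = "The cat sat on the mat."
references = [
    "The cat was sitting on the mat.",
    "A cat sat on the mat.",
    "The cat sat upon the mat.",
]

# One chrF score per reference.
per_ref = [sacrebleu.sentence_chrf(hypothesis, [ref]).score for ref in references]

avg_score = mean(per_ref)   # "average" aggregation
max_score = max(per_ref)    # "maximum" aggregation
print(f"per-reference: {per_ref}")
print(f"avg: {avg_score:.2f}, max: {max_score:.2f}")
```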

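The budget-allocation idea in finding 3 can be pictured as a small combinatorial search: choose which reference sets to commission so that total cost stays within budget and expected metric correlation is maximized. The vendor names, costs, per-vendor correlation gains, and the brute-force search below are all hypothetical illustrations, not the paper's actual procedure or data.

```python
# Toy sketch of budget-constrained reference selection (finding 3).
# Candidates are (name, cost, estimated gain in metric correlation);
# all values are made up for illustration.
from itertools import combinations

candidates = [
    ("vendor_A", 10, 0.020),
    ("vendor_B", 15, 0.025),
    ("vendor_C", 25, 0.032),
    ("optimal_ref", 40, 0.035),
]
budget = 50

best_subset, best_gain = (), 0.0
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        cost = sum(c for _, c, _ in subset)
        gain = sum(g for _, _, g in subset)  # crude additivity assumption
        if cost <= budget and gain > best_gain:
            best_subset, best_gain = subset, gain

print("collect:", [name for name, _, _ in best_subset])
print("expected correlation gain:", round(best_gain, 3))
```
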
Implications

Practically, this paper informs MT practitioners on the cost-effective construction of reference corpora by showing that a mixture of reference qualities can still improve metric performance. It underscores the dual significance of reference quality and quantity in automated MT evaluation, advocating for a selection strategy informed by available resources and project demands.

Theoretically, the findings invite further research into robust metric design that can accommodate subtle translation shifts and align with semantic rather than surface-level similarity. How well current metrics handle high-quality human translations that deviate from translationese remains an open area for exploration, especially as MT models increasingly emulate human-like fluency.

Speculations on Future Developments

Looking forward, ongoing advancements in MT systems, particularly in handling nuanced semantic shifts, could drive the evolution of metrics that better exploit sophisticated reference translations. Additionally, given the growing ability to generate synthetic references, future work could explore using neural MT systems to produce auxiliary references, potentially enhancing reference diversity at low cost.

In conclusion, this research contributes a detailed analysis and methodology for optimizing reference use in MT evaluation, setting a foundational approach for future community standards and helping guide resource allocation in the development of MT evaluation benchmarks.

Authors (2)
  1. Vilém Zouhar (41 papers)
  2. Ondřej Bojar (91 papers)
Citations (6)