Evaluating Optimal Reference Translations (2311.16787v2)

Published 28 Nov 2023 in cs.CL

Abstract: The overall translation quality reached by current machine translation (MT) systems for high-resource language pairs is remarkably good. Standard evaluation methods are neither suitable nor intended to uncover the many translation errors and quality deficiencies that still persist. Furthermore, the quality of standard reference translations is commonly questioned, and MT alone has reached comparable quality levels in several language pairs. Navigating further research in these high-resource settings is thus difficult. In this article, we propose a methodology for creating more reliable document-level human reference translations, called "optimal reference translations," with the simple aim of raising the bar on what should be deemed "human translation quality." We evaluate the obtained document-level optimal reference translations against "standard" ones, confirming a significant quality increase and also documenting the relationship between evaluation and translation editing.
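The reported comparison reduces to paired, per-document quality judgments of the same texts under two reference conditions. As a minimal illustrative sketch (the scores below are invented placeholders, and the choice of a Wilcoxon signed-rank test is an assumption, not necessarily the paper's exact protocol), one might check whether the quality increase is statistically significant like this:

```python
# Hypothetical sketch: paired significance test of document-level quality
# ratings for "standard" vs. "optimal" reference translations.
# The scores are invented placeholders, not data from the paper.
from statistics import mean
from scipy.stats import wilcoxon

# One quality rating per document (e.g., on a 0-100 scale); the same
# documents are rated under both reference conditions.
standard_scores = [72, 68, 75, 70, 66, 74, 69, 71, 73, 67]
optimal_scores  = [81, 79, 83, 78, 80, 85, 77, 82, 84, 79]

print(f"mean standard: {mean(standard_scores):.1f}")
print(f"mean optimal:  {mean(optimal_scores):.1f}")

# Paired, non-parametric test: are the per-document differences
# significantly different from zero?
stat, p_value = wilcoxon(standard_scores, optimal_scores)
print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4f}")
```

A paired design like this controls for document difficulty, since each document contributes a rating under both conditions.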
