
Machine Translation Models are Zero-Shot Detectors of Translation Direction (2401.06769v2)

Published 12 Jan 2024 in cs.CL

Abstract: Detecting the translation direction of parallel text has applications for machine translation training and evaluation, but also has forensic applications such as resolving plagiarism or forgery allegations. In this work, we explore an unsupervised approach to translation direction detection based on the simple hypothesis that $p(\text{translation}|\text{original})>p(\text{original}|\text{translation})$, motivated by the well-known simplification effect in translationese or machine-translationese. In experiments with massively multilingual machine translation models across 20 translation directions, we confirm the effectiveness of the approach for high-resource language pairs, achieving document-level accuracies of 82–96% for NMT-produced translations, and 60–81% for human translations, depending on the model used. Code and demo are available at https://github.com/ZurichNLP/translation-direction-detection
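
The hypothesis in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only shows the comparison of the two conditional probabilities. The names `detect_direction`, `LogProbFn`, and `toy_logprob` are hypothetical, summing sentence-level log-probabilities over a document is an assumption made here for illustration, and the toy scorer is a stand-in for a real multilingual NMT model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical scorer type: returns log p(target | source) under some model.
LogProbFn = Callable[[str, str], float]


def detect_direction(pairs: Sequence[Tuple[str, str]], logprob: LogProbFn) -> str:
    """Guess which side of a parallel document is the original.

    Each pair is (sentence_a, sentence_b). The guess is "a->b" when
    translating A into B is scored as more probable than B into A.
    """
    # Assumption: document-level evidence is the sum of sentence-level scores.
    forward = sum(logprob(a, b) for a, b in pairs)   # log p(b | a)
    backward = sum(logprob(b, a) for a, b in pairs)  # log p(a | b)
    return "a->b" if forward > backward else "b->a"


def toy_logprob(source: str, target: str) -> float:
    """Stand-in for a real model: treats shorter targets as more probable.

    This only mimics a simplification effect crudely and ignores the source;
    a real scorer would use a translation model's token probabilities.
    """
    return -float(len(target.split()))


pairs = [("The cat sat quietly on the old mat .", "Die Katze sass auf der Matte .")]
print(detect_direction(pairs, toy_logprob))
```

With a real model, `toy_logprob` would be replaced by a function that scores the target sentence given the source in the corresponding language direction; the decision rule itself stays the same.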

Authors (3)
  1. Michelle Wastl
  2. Jannis Vamvas
  3. Rico Sennrich