The Impact of Syntactic and Semantic Proximity on Machine Translation with Back-Translation (2403.18031v1)

Published 26 Mar 2024 in cs.CL

Abstract: Unsupervised on-the-fly back-translation, in conjunction with multilingual pretraining, is the dominant method for unsupervised neural machine translation. Theoretically, however, the method should not work in general. We therefore conduct controlled experiments with artificial languages to determine what properties of languages make back-translation an effective training method, covering lexical, syntactic, and semantic properties. We find, contrary to popular belief, that (i) parallel word frequency distributions, (ii) partially shared vocabulary, and (iii) similar syntactic structure across languages are not sufficient to explain the success of back-translation. We show, however, that even a crude semantic signal (similar lexical fields across languages) does improve the alignment of two languages through back-translation. We conjecture that rich semantic dependencies, parallel across languages, are at the root of the success of unsupervised methods based on back-translation. Overall, the success of unsupervised machine translation was far from analytically guaranteed. Instead, it is further evidence that the languages of the world share deep similarities, and we hope to show how to identify which of these similarities can serve the development of unsupervised, cross-linguistic tools.
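
To make the procedure concrete, below is a minimal, illustrative sketch of the on-the-fly back-translation loop the abstract refers to: sentences from each language's monolingual corpus are translated into the other language with the current model, and the model is then trained to reconstruct the original sentence from its own (possibly noisy) translation. The `ToyBilingualModel` class and its `translate`/`train_step` methods are hypothetical placeholders, not the authors' code or any library API; a real system would use a shared encoder-decoder initialized by multilingual pretraining.

```python
# Minimal sketch of unsupervised on-the-fly back-translation.
# Assumption: a single bilingual model handles both translation directions,
# as in shared-encoder/decoder setups initialized by multilingual pretraining.
import random

class ToyBilingualModel:
    """Stand-in for a pretrained encoder-decoder; the method names below
    are hypothetical interfaces, not a real library API."""
    def translate(self, sentence, src, tgt):
        # Placeholder: a real model would decode with its current parameters.
        return sentence

    def train_step(self, source, target, src, tgt):
        # Placeholder: a real model would compute cross-entropy on
        # (source -> target) and update its weights.
        return 0.0

def backtranslation_epoch(model, mono_l1, mono_l2, steps=1000):
    for _ in range(steps):
        # L1 -> L2 direction: translate an L1 sentence into L2 with the
        # current model, then train the model to reconstruct the original
        # L1 sentence from that synthetic L2 input.
        s1 = random.choice(mono_l1)
        pseudo_l2 = model.translate(s1, src="L1", tgt="L2")
        model.train_step(source=pseudo_l2, target=s1, src="L2", tgt="L1")

        # L2 -> L1 direction, symmetric.
        s2 = random.choice(mono_l2)
        pseudo_l1 = model.translate(s2, src="L2", tgt="L1")
        model.train_step(source=pseudo_l1, target=s2, src="L1", tgt="L2")

mono_l1 = ["a toy sentence in language one"]
mono_l2 = ["une phrase jouet en langue deux"]
backtranslation_epoch(ToyBilingualModel(), mono_l1, mono_l2, steps=10)
```

The paper's question is why this loop converges at all: since the pseudo-parallel pairs are generated by the model itself, nothing in the objective obviously forces the two languages into alignment without some shared structure between them.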

