
Low-resource neural machine translation with morphological modeling (2404.02392v1)

Published 3 Apr 2024 in cs.CL

Abstract: Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of words. In this work, we propose a framework solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the inputs. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder is found to improve machine translation performance. An attention augmentation scheme for the transformer model is proposed in a generic form that allows integration of pre-trained language models and also facilitates modeling of word-order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda-English translation using public-domain parallel text. Our final models achieve performance competitive with large multilingual models. We hope these results motivate wider use of explicit morphological information, together with the proposed model and data augmentations, in low-resource NMT.
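The two-tier encoder described in the abstract can be pictured as a small morpheme-level transformer nested inside a sentence-level one: the first tier contextualizes the morphemes within each word and pools them into one vector per word, and the second tier attends over those word vectors. The sketch below is an illustrative reconstruction in PyTorch, not the paper's implementation; the class names, mean-pooling step, layer counts, and dimensions are assumptions made for the example (positional encodings are omitted for brevity).

```python
import torch
import torch.nn as nn

class TwoTierEncoder(nn.Module):
    """Illustrative two-tier encoding: tier 1 runs over the morphemes of
    each word and pools to a single word vector; tier 2 runs over the
    resulting word vectors for the whole sentence."""

    def __init__(self, n_morphemes: int, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.morpheme_emb = nn.Embedding(n_morphemes, d_model, padding_idx=0)
        word_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        sent_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.word_tier = nn.TransformerEncoder(word_layer, num_layers=2)
        self.sentence_tier = nn.TransformerEncoder(sent_layer, num_layers=4)

    def forward(self, morphemes: torch.Tensor) -> torch.Tensor:
        # morphemes: (batch, n_words, n_morphs) morpheme ids, 0 = padding
        b, w, m = morphemes.shape
        x = self.morpheme_emb(morphemes).view(b * w, m, -1)
        pad = (morphemes == 0).view(b * w, m)
        # Tier 1: contextualize morphemes within each word ...
        x = self.word_tier(x, src_key_padding_mask=pad)
        # ... then mean-pool over the real (non-padding) morphemes.
        keep = (~pad).unsqueeze(-1).float()
        words = (x * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        words = words.view(b, w, -1)
        # Tier 2: contextualize words within the sentence.
        return self.sentence_tier(words)

enc = TwoTierEncoder(n_morphemes=1000)
out = enc(torch.randint(1, 1000, (2, 7, 5)))  # 2 sentences, 7 words, 5 morphemes
print(out.shape)  # torch.Size([2, 7, 256])
```

One appeal of this layout for morphologically-rich languages such as Kinyarwanda is that the sentence-level tier attends over only as many positions as there are words, so rich per-word morphology enters the model without lengthening the sentence-level attention the way flat sub-word tokenization does.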

Authors (1)
  1. Antoine Nzeyimana