Low-resource neural machine translation with morphological modeling (2404.02392v1)
Abstract: Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and character-based models are limited to the surface forms of words. In this work, we propose a framework solution for modeling complex morphology in low-resource settings. A two-tier transformer architecture is chosen to encode morphological information at the input. At the target-side output, a multi-task multi-label training scheme coupled with a beam search-based decoder is found to improve machine translation performance. An attention augmentation scheme for the transformer model is proposed in a generic form that allows integration of pre-trained language models and also facilitates modeling of word-order relationships between the source and target languages. Several data augmentation techniques are evaluated and shown to increase translation performance in low-resource settings. We evaluate our proposed solution on Kinyarwanda-English translation using public-domain parallel text. Our final models achieve competitive performance relative to large multilingual models. We hope that our results will motivate broader use of explicit morphological information and of the proposed model and data augmentation techniques in low-resource NMT.
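The paper itself details the architecture; as a rough illustration of the generic attention-augmentation idea the abstract describes (an extra cross-attention block over a pre-trained model's hidden states, alongside the usual encoder cross-attention), here is a minimal PyTorch sketch. All module names, dimensions, and the placement of the extra block are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AugmentedDecoderLayer(nn.Module):
    """Hypothetical transformer decoder layer with one extra cross-attention
    block that attends over hidden states of a (typically frozen) pre-trained
    language model -- a sketch of the generic attention-augmentation scheme."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.src_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The augmentation: attention over pre-trained LM states.
        self.lm_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])

    def forward(self, y, enc_out, lm_out, tgt_mask=None):
        # Masked self-attention over the target prefix.
        h, _ = self.self_attn(y, y, y, attn_mask=tgt_mask)
        y = self.norms[0](y + h)
        # Standard cross-attention over the NMT encoder output.
        h, _ = self.src_attn(y, enc_out, enc_out)
        y = self.norms[1](y + h)
        # Augmented cross-attention over pre-trained LM states.
        h, _ = self.lm_attn(y, lm_out, lm_out)
        y = self.norms[2](y + h)
        # Position-wise feed-forward block.
        return self.norms[3](y + self.ffn(y))

if __name__ == "__main__":
    layer = AugmentedDecoderLayer()
    y = torch.randn(2, 7, 512)     # target prefix states
    enc = torch.randn(2, 11, 512)  # NMT encoder output
    lm = torch.randn(2, 11, 512)   # pre-trained LM states
    print(layer(y, enc, lm).shape)  # torch.Size([2, 7, 512])
```

Because the scheme is stated in generic form, the same extra attention block could in principle attend to any auxiliary representation, which is how it can also serve the word-order modeling mentioned in the abstract.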
Author: Antoine Nzeyimana