Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study (2404.08259v1)
Abstract: Machine Translation has made impressive progress in recent years, offering close to human-level performance on many languages, but studies have primarily focused on high-resource languages with broad online presence and resources. With the growth of large language models (LLMs), more and more low-resource languages achieve better results through the presence of other languages. However, studies have shown that not all low-resource languages can benefit from multilingual systems, especially those with insufficient training and evaluation data. In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian. We investigate conditions typical of low-resource languages, such as data scarcity and parameter sensitivity, and focus both on refined solutions that combat low-resource difficulties and on creative solutions such as harnessing language similarity. Our experiments apply Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance. We demonstrate the noisiness of the data and present our approach to extensive text preprocessing. Evaluation was conducted using a combination of metrics: BLEU, chrF, and TER. Statistical significance tests with Bonferroni correction show surprisingly strong baseline systems and that Back-translation leads to significant improvement. Furthermore, we present a qualitative analysis of translation errors and system limitations.
Authors: Wan-Hua Her, Udo Kruschwitz