Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation: A Case Study (2407.06538v1)
Abstract: Neural Machine Translation (NMT) remains a formidable challenge, especially when dealing with low-resource languages. Pre-trained sequence-to-sequence (seq2seq) multilingual models, such as mBART-50, have demonstrated impressive performance in various low-resource NMT tasks. However, their pre-training has been confined to 50 languages, leaving out support for numerous low-resource languages, particularly those spoken in the Indian subcontinent. Expanding mBART-50's language support requires complex pre-training, risking performance decline due to catastrophic forgetting. Given these challenges, this paper explores a framework that leverages the benefits of a pre-trained LLM along with knowledge distillation in a seq2seq architecture to facilitate translation for low-resource languages, including those not covered by mBART-50. The proposed framework employs a multilingual encoder-based seq2seq model as the foundational architecture and subsequently uses complementary knowledge distillation techniques to mitigate the impact of imbalanced training. Our framework is evaluated on three low-resource Indic languages in four Indic-to-Indic directions, yielding significant BLEU-4 and chrF improvements over baselines. Further, we conduct a human evaluation to confirm the effectiveness of our approach. Our code is publicly available at https://github.com/raypretam/Two-step-low-res-NMT.
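To make the distillation component of the framework concrete, the sketch below shows one common way a word-level knowledge-distillation term can be mixed with the usual cross-entropy translation loss when training a seq2seq student whose encoder is initialized from a pre-trained multilingual encoder. This is a minimal illustration under assumed settings (the mixing weight `alpha`, temperature `T`, and the way teacher logits are obtained are not taken from the paper), not the authors' implementation.

```python
# Illustrative sketch only: word-level knowledge distillation for seq2seq NMT.
# The temperature T and mixing weight alpha below are assumptions, not the
# paper's reported configuration.
import torch
import torch.nn.functional as F

def kd_translation_loss(student_logits, teacher_logits, target_ids, pad_id,
                        alpha=0.5, T=2.0):
    """Mix cross-entropy on gold targets with KL to the teacher's soft targets.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    target_ids:                     (batch, tgt_len) gold target token ids
    """
    vocab = student_logits.size(-1)

    # Standard translation loss against the reference tokens, ignoring padding.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         target_ids.view(-1),
                         ignore_index=pad_id)

    # Distillation term: match the teacher's temperature-softened distribution
    # at each target position (padding positions could additionally be masked).
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)

    return (1.0 - alpha) * ce + alpha * kd
```

In practice, the teacher logits would come from a stronger or complementary model run on the same batch, and `alpha` controls how much the student follows the teacher versus the gold references.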
Authors: Aniruddha Roy, Pretam Ray, Ayush Maheshwari, Sudeshna Sarkar, Pawan Goyal