Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation: A Case Study (2407.06538v1)
Abstract: Neural Machine Translation (NMT) remains a formidable challenge, especially when dealing with low-resource languages. Pre-trained sequence-to-sequence (seq2seq) multilingual models, such as mBART-50, have demonstrated impressive performance in various low-resource NMT tasks. However, their pre-training has been confined to 50 languages, leaving out support for numerous low-resource languages, particularly those spoken in the Indian subcontinent. Expanding mBART-50's language support requires complex pre-training, risking performance decline due to catastrophic forgetting. Given these challenges, this paper explores a framework that leverages the benefits of a pre-trained LLM along with knowledge distillation in a seq2seq architecture to facilitate translation for low-resource languages, including those not covered by mBART-50. The proposed framework employs a multilingual encoder-based seq2seq model as the foundational architecture and subsequently uses complementary knowledge distillation techniques to mitigate the impact of imbalanced training. Our framework is evaluated on three low-resource Indic languages in four Indic-to-Indic directions, yielding significant BLEU-4 and chrF improvements over baselines. Further, we conduct a human evaluation to confirm the effectiveness of our approach. Our code is publicly available at https://github.com/raypretam/Two-step-low-res-NMT.
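To make the distillation component of the framework concrete, the sketch below shows one common way a word-level knowledge-distillation term can be mixed with the usual cross-entropy translation loss when training a seq2seq student whose encoder is initialized from a pre-trained multilingual encoder. This is a minimal illustration under assumed settings (the mixing weight `alpha`, temperature `T`, and the way teacher logits are obtained are not taken from the paper), not the authors' implementation.

```python
# Illustrative sketch only: word-level knowledge distillation for seq2seq NMT.
# The temperature T and mixing weight alpha below are assumptions, not the
# paper's reported configuration.
import torch
import torch.nn.functional as F

def kd_translation_loss(student_logits, teacher_logits, target_ids, pad_id,
                        alpha=0.5, T=2.0):
    """Mix cross-entropy on gold targets with KL to the teacher's soft targets.

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    target_ids:                     (batch, tgt_len) gold target token ids
    """
    vocab = student_logits.size(-1)

    # Standard translation loss against the reference tokens, ignoring padding.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         target_ids.view(-1),
                         ignore_index=pad_id)

    # Distillation term: match the teacher's temperature-softened distribution
    # at each target position (padding positions could additionally be masked).
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)

    return (1.0 - alpha) * ce + alpha * kd
```

In practice, the teacher logits would come from a stronger or complementary model run on the same batch, and `alpha` controls how much the student follows the teacher versus the gold references.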
Authors: Aniruddha Roy, Pretam Ray, Ayush Maheshwari, Sudeshna Sarkar, Pawan Goyal