IndicBART: A Pre-trained Model for Indic Natural Language Generation (2109.02903v2)

Published 7 Sep 2021 in cs.CL and cs.AI

Abstract: In this paper, we study pre-trained sequence-to-sequence models for a group of related languages, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a model specific to related languages like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well on very low-resource translation scenarios where languages are not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model.

IndicBART: A Pre-trained Model for Indic Natural Language Generation

The paper "IndicBART: A Pre-trained Model for Indic Natural Language Generation" presents a novel approach to sequence-to-sequence (S2S) model pre-training specifically focused on Indic languages. As language-specific models are becoming increasingly utilized, this paper fills a crucial gap by proposing a compact multilingual S2S model, IndicBART, which aims to improve natural language generation tasks such as Neural Machine Translation (NMT) and extreme summarization for Indic languages.

IndicBART is distinctive in its use of orthographic similarities among Indic scripts to strengthen cross-lingual transfer learning. The research emphasizes the model's ability to handle low-resource languages efficiently and to support multilingual training scenarios. With 244 million parameters, IndicBART offers performance competitive with larger models such as mBART50 (611 million parameters) while being substantially more resource-efficient. A further variant, IndicALBART, reduces the parameter count to 97 million, showing potential for deployment in computationally constrained environments.
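
To make the parameter comparison concrete, the released checkpoint can be loaded and its parameters counted directly. The sketch below assumes the publicly released ai4bharat/IndicBART checkpoint on the Hugging Face Hub and the transformers library; the identifier and tokenizer options are assumptions that may need adjustment against the authors' actual release.

```python
# Minimal sketch: load the (assumed) public IndicBART checkpoint and count parameters.
# Requires: pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "ai4bharat/IndicBART" and the tokenizer options are assumptions based on the
# public release; check the authors' model card for the exact identifiers.
tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = AutoModelForSeq2SeqLM.from_pretrained("ai4bharat/IndicBART")

num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")  # should be roughly the reported ~244M
```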

Experimental Evaluation

IndicBART was evaluated on natural language generation tasks, with a particular focus on NMT. The results show that IndicBART performs comparably to, and sometimes better than, mBART50, especially in low-resource and zero-shot settings. The paper reports gains of up to 2 BLEU/ROUGE points on certain tasks, underscoring its efficacy despite its compact size. These advantages stem from often-overlooked factors such as script unification and shared subword vocabularies.
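
For readers unfamiliar with the scale of such gains, corpus-level BLEU is typically computed with tools such as sacrebleu; the snippet below is a generic sketch of that computation, not the paper's exact evaluation setup.

```python
# Generic sketch of corpus-level BLEU with sacrebleu (pip install sacrebleu);
# illustrative hypotheses/references only, not data from the paper.
import sacrebleu

hypotheses = ["the cat sat on the mat", "a quick brown fox"]
references = [["the cat sat on the mat", "a fast brown fox"]]  # one reference per hypothesis

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # a "+2 BLEU" improvement means this score rises by 2 points
```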

The researchers used a range of experimental setups to test the model's robustness. IndicBART showed strong results even for languages not seen during pre-training or fine-tuning, indicating substantial cross-lingual transfer capability. Additionally, script unification, achieved by mapping the different Indic scripts to a single common script, improved performance, highlighting the benefit of exploiting orthographic similarities.
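
Concretely, script unification exploits the fact that most major Indic scripts occupy parallel 128-codepoint Unicode blocks inherited from the ISCII layout, so mapping text into a common script such as Devanagari is largely codepoint arithmetic. The following sketch illustrates the idea only; it is not the paper's preprocessing pipeline, and a handful of characters do not map one-to-one across scripts.

```python
# Illustrative sketch of script unification via parallel Unicode blocks.
# Most major Indic scripts occupy 128-codepoint blocks with a largely parallel
# (ISCII-derived) layout, so mapping text to Devanagari is mostly offset
# arithmetic. Simplification for illustration only: a few characters have no
# one-to-one counterpart, and real pipelines use a dedicated transliteration
# tool that handles those exceptions.

DEVANAGARI_BASE = 0x0900
SCRIPT_BASES = {
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi (Punjabi)
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}

def to_devanagari(text: str, lang: str) -> str:
    """Map characters from the given language's script block onto Devanagari."""
    base = SCRIPT_BASES[lang]
    mapped = []
    for ch in text:
        cp = ord(ch)
        if base <= cp < base + 0x80:  # character inside the source script block
            mapped.append(chr(cp - base + DEVANAGARI_BASE))
        else:  # digits, punctuation, spaces, Latin text, etc. pass through
            mapped.append(ch)
    return "".join(mapped)

# Example: Bengali text rendered with Devanagari codepoints
print(to_devanagari("বাংলা", "bn"))
```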

Implications and Future Directions

IndicBART represents a step forward for pre-trained models tailored to language groups, showing promise for democratizing access to NLP technologies. Its efficient architecture makes practical deployment feasible across diverse computational environments. The findings point to applications such as translation among languages with historically limited data resources and multilingual tasks that exploit linguistic features shared through geographical contact and genetic relatedness.

From a theoretical perspective, the paper opens pathways for further exploration of language-group-specific pre-training strategies and their impact on cross-lingual tasks within multilingual frameworks. Future work could expand IndicBART's language coverage, potentially encompassing all languages listed in the Eighth Schedule of the Indian Constitution, fostering more inclusive technology adoption.

Moreover, extensions that accommodate document-level training and scale to larger text corpora could further broaden IndicBART's applicability. Improved cross-lingual transfer methods and multi-task learning paradigms also remain important areas for exploration.

The paper constitutes a significant contribution to multilingual NLP, emphasizing compact yet powerful models capable of overcoming existing challenges in Indic language processing and setting precedents for future research endeavors within similar linguistic domains.

Authors (6)
  1. Raj Dabre (65 papers)
  2. Himani Shrotriya (2 papers)
  3. Anoop Kunchukuttan (45 papers)
  4. Ratish Puduppully (20 papers)
  5. Mitesh M. Khapra (79 papers)
  6. Pratyush Kumar (44 papers)
Citations (56)