
MuRIL: Multilingual Representations for Indian Languages (2103.10730v2)

Published 19 Mar 2021 in cs.CL

Abstract: India is a multilingual society with 1369 rationalized languages and dialects being spoken across the country (INDIA, 2011). Of these, the 22 scheduled languages have a staggering total of 1.17 billion speakers and 121 languages have more than 10,000 speakers (INDIA, 2011). India also has the second largest (and an ever growing) digital footprint (Statista, 2020). Despite this, today's state-of-the-art multilingual systems perform suboptimally on Indian (IN) languages. This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of IN languages in their vocabulary and training data. Multilingual LMs are substantially less effective in resource-lean scenarios (Wu and Dredze, 2020; Lauscher et al., 2020), as limited data doesn't help capture the various nuances of a language. One also commonly observes IN language text transliterated to Latin or code-mixed with English, especially in informal settings (for example, on social media platforms) (Rijhwani et al., 2017). This phenomenon is not adequately handled by current state-of-the-art multilingual LMs. To address the aforementioned gaps, we propose MuRIL, a multilingual LM specifically built for IN languages. MuRIL is trained on significantly large amounts of IN text corpora only. We explicitly augment monolingual text corpora with both translated and transliterated document pairs, that serve as supervised cross-lingual signals in training. MuRIL significantly outperforms multilingual BERT (mBERT) on all tasks in the challenging cross-lingual XTREME benchmark (Hu et al., 2020). We also present results on transliterated (native to Latin script) test sets of the chosen datasets and demonstrate the efficacy of MuRIL in handling transliterated data.

Analysis of MuRIL: Multilingual Representations for Indian Languages

The paper "MuRIL: Multilingual Representations for Indian Languages" presents a compelling approach to bridging the gap in language representation for the diverse linguistic landscape of India. Recognizing the limitations of existing multilingual language models (LMs) such as multilingual BERT (mBERT), the authors propose MuRIL, a multilingual LM specifically tailored to handle Indian (IN) languages more effectively.

Motivation and Context

India's linguistic landscape comprises 1,369 rationalized languages and dialects, of which the 22 scheduled languages are spoken by a substantial portion of the population. Despite this, state-of-the-art multilingual systems often underperform on these languages because of their limited representation in LMs. This underperformance is attributed to several factors, including the scarcity of labeled data and the prevalence of transliteration and code-mixing in Indian textual data. Current LMs, trained on a broad spectrum of global languages, do not adequately capture these unique linguistic features.

MuRIL's Approach

MuRIL is designed to address the aforementioned gaps by training on extensive IN language data. It supports 17 languages: 16 major Indian languages alongside English. The model employs two pre-training objectives: Masked Language Modeling (MLM) on monolingual data and Translation Language Modeling (TLM) on parallel data. This dual-objective approach enables MuRIL to leverage both monolingual corpora and translated and transliterated document pairs, the latter serving as supervised cross-lingual signals, as sketched below.
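To make the two objectives concrete, below is a minimal sketch of how MLM and TLM inputs are typically constructed. The tokenizer, masking rate, and helper names are illustrative stand-ins, not MuRIL's actual pipeline; the key point is that TLM packs a document and its translated (or transliterated) counterpart into one sequence before masking.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, mask_prob=0.15):
    # BERT-style masking: each chosen position is replaced by [MASK];
    # the full 80/10/10 replacement scheme is omitted for brevity.
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok not in (CLS, SEP) and random.random() < mask_prob:
            labels[i] = tok   # the model must predict the original token
            out[i] = MASK
    return out, labels

def mlm_example(tokens):
    # Monolingual MLM input: a single masked segment.
    return mask_tokens([CLS] + tokens + [SEP])

def tlm_example(src_tokens, tgt_tokens):
    # TLM input: a document and its translated (or transliterated)
    # counterpart packed into one sequence, so a token masked in one
    # language can be predicted from context in the other. That is
    # what makes such pairs a supervised cross-lingual signal.
    return mask_tokens([CLS] + src_tokens + [SEP] + tgt_tokens + [SEP])

# Toy usage; whitespace "tokens" stand in for WordPiece pieces.
hindi = "मुझे किताबें पढ़ना पसंद है".split()
english = "I like reading books".split()
print(tlm_example(hindi, english))
```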

Data and Training

The training corpus for MuRIL is drawn from the Common Crawl OSCAR corpus and Wikipedia, supplemented by the PMIndia and Dakshina datasets for translated and transliterated data, respectively. The data is upsampled so that lower-resource languages receive more equitable representation than their raw corpus sizes would allow, as sketched below.
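The usual mechanism for this kind of upsampling in multilingual LMs is exponent-smoothed sampling, shown below. The corpus sizes here are toy numbers, and the exponent of 0.3 is an assumption borrowed from common multilingual-LM practice rather than a figure confirmed from the paper.

```python
def smoothed_sampling_weights(corpus_sizes, alpha=0.3):
    """Exponent-smoothed sampling: p_i proportional to n_i ** alpha.

    alpha < 1 flattens the distribution, upsampling low-resource
    languages relative to their raw share of the data.
    """
    smoothed = {lang: n ** alpha for lang, n in corpus_sizes.items()}
    total = sum(smoothed.values())
    return {lang: w / total for lang, w in smoothed.items()}

# Toy corpus sizes (documents); actual MuRIL statistics differ.
sizes = {"en": 1_000_000, "hi": 120_000, "ta": 40_000, "ks": 1_500}
print(smoothed_sampling_weights(sizes))
```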

Particular emphasis is placed on preserving semantic information during tokenization. By building a cased WordPiece vocabulary from the upsampled corpora, MuRIL achieves better language representation, especially for transliterated text and less commonly digitized scripts.
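As a sketch of what building such a vocabulary can look like, the snippet below trains a cased WordPiece tokenizer with the Hugging Face tokenizers library. The file path, vocabulary size, and frequency threshold are illustrative assumptions, not MuRIL's published settings.

```python
from tokenizers import BertWordPieceTokenizer  # pip install tokenizers

tokenizer = BertWordPieceTokenizer(
    lowercase=False,      # cased vocabulary, as described in the paper
    strip_accents=False,  # keep diacritics, which carry meaning in IN scripts
)
tokenizer.train(
    files=["upsampled_corpus.txt"],  # hypothetical upsampled training file
    vocab_size=197_000,              # assumption; choose per coverage needs
    min_frequency=2,
)
tokenizer.save_model(".")            # writes vocab.txt

print(tokenizer.encode("नमस्ते duniya").tokens)
```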

Evaluation and Results

MuRIL is evaluated on the XTREME benchmark, a standard test of cross-lingual understanding. In this zero-shot evaluation framework, MuRIL shows marked improvements over mBERT across all tasks, on both native-script and transliterated datasets. For example, on the PANX (WikiAnn NER) task, MuRIL achieves an F1 score of 77.6, surpassing mBERT by a significant margin. Such results underscore MuRIL's efficacy in addressing the linguistic complexity of IN languages.
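The PANX number is an entity-level F1. Below is a minimal example of computing that metric with the seqeval library on toy BIO-tagged sequences; the actual evaluation fine-tunes on English training data and scores the full test set of each target language.

```python
from seqeval.metrics import f1_score  # pip install seqeval

# Toy gold and predicted BIO tag sequences for two sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"],
          ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"],
          ["O", "B-ORG", "O", "O"]]  # second entity predicted wrong

# seqeval scores whole entities, not individual tags.
print(f"entity F1 = {f1_score(y_true, y_pred):.3f}")
```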

Moreover, MuRIL exhibits substantial improvements in handling transliterated data, indicating its robust adaptation to informal textual phenomena prevalent in social media and other digital communications.
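For intuition, romanized variants of native-script text, like those in the transliterated test sets, can be produced with off-the-shelf tools. The snippet below uses the third-party indic_transliteration package as an illustrative stand-in; the paper itself sources such data from the Dakshina dataset rather than this tool.

```python
# pip install indic_transliteration
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

native = "मुझे किताबें पढ़ना पसंद है"
# Convert Devanagari to a Latin-script (ITRANS) rendering.
latin = transliterate(native, sanscript.DEVANAGARI, sanscript.ITRANS)
print(latin)  # roughly "mujhe kitAbeM paDhanA pasaMda hai"
```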

Implications and Future Directions

The implications of MuRIL are manifold, including its potential in enhancing natural language processing applications for Indian languages. As a foundational model, MuRIL could facilitate more accurate machine translation, sentiment analysis, and named entity recognition tailored to regional contexts. Furthermore, by improving support for transliterated and code-mixed data, it lays the groundwork for innovations in multilingual digital communication.
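As a quick illustration of MuRIL in a foundational role, the sketch below extracts sentence embeddings with the Transformers library and compares a native-script sentence with its romanized form. The google/muril-base-cased identifier refers to the publicly released checkpoint; the mean-pooling scheme is a common choice for sentence vectors, not one prescribed by the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "google/muril-base-cased"  # released MuRIL base checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

sentences = [
    "भारत एक बहुभाषी देश है",          # native Devanagari
    "bharat ek bahubhashi desh hai",   # the same sentence, romanized
]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(1) / mask.sum(1)

cos = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"native vs. romanized similarity: {cos:.3f}")
```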

Looking forward, subsequent endeavors may explore extending MuRIL's reach to additional dialects or less documented languages, enhancing language inclusivity in machine learning systems. Future research can also focus on refining the model's ability to handle increasingly complex linguistic data and tasks, adapting to evolving digital vernaculars and linguistic trends.

Overall, MuRIL stands as a testament to focused linguistic resource allocation, underscoring the value of culturally tailored language models for India's diverse and vibrant linguistic ecosystem.

Authors (14)
  1. Simran Khanuja
  2. Diksha Bansal
  3. Sarvesh Mehtani
  4. Savya Khosla
  5. Atreyee Dey
  6. Balaji Gopalan
  7. Dilip Kumar Margam
  8. Pooja Aggarwal
  9. Rajiv Teja Nagipogu
  10. Shachi Dave
  11. Shruti Gupta
  12. Subhash Chandra Bose Gali
  13. Vish Subramanian
  14. Partha Talukdar
Citations (247)