
L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources (2202.01159v2)

Published 2 Feb 2022 in cs.CL and cs.LG

Abstract: We present L3Cube-MahaCorpus, a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBERTa, all BERT-based masked language models, and MahaFT, the FastText word embeddings, both trained on the full Marathi corpus with 752M tokens. We show the effectiveness of these resources on downstream Marathi sentiment analysis, text classification, and named entity recognition (NER) tasks. We also release MahaGPT, a generative Marathi GPT model trained on the Marathi corpus. Marathi is a popular language in India but still lacks these resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

Overview of L3Cube-MahaCorpus and MahaBERT for Marathi NLP

The paper "L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT LLMs, and Resources" introduces significant contributions to the field of NLP for the Marathi language. This work is particularly noteworthy for addressing the computational needs of a low-resource language spoken by millions in India. The authors present a comprehensive Marathi monolingual corpus, L3Cube-MahaCorpus, alongside robust Transformer-based LLMs and word embeddings tailored for Marathi.

Monolingual Corpus Development

L3Cube-MahaCorpus extends existing Marathi linguistic resources with 24.8 million sentences and 289 million tokens, sourced from both news and non-news websites. This addition counters the prevalent bias towards Hindi in Indian language corpora and underlines the importance of enriching Marathi textual resources. After integration with existing resources, the full Marathi dataset comprises 57.2 million sentences and 752 million tokens, making it one of the most extensive Marathi corpora available.

Language Models

The paper introduces several pre-trained language models for Marathi, namely MahaBERT, MahaAlBERT, and MahaRoBERTa, based on the BERT, ALBERT, and RoBERTa architectures, respectively. These models are trained with a masked language modeling (MLM) objective on the full Marathi corpus. On downstream tasks such as text classification and Named Entity Recognition (NER), these monolingual models outperform generic multilingual models such as mBERT and XLM-RoBERTa.
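To make the usage concrete, below is a minimal sketch of querying such a masked language model through the Hugging Face transformers fill-mask pipeline. The model identifier and example sentence are illustrative assumptions; the actual released checkpoints are linked from the MarathiNLP repository.

```python
# Minimal sketch (assumed model id): probing a Marathi masked LM with the
# Hugging Face fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="l3cube-pune/marathi-bert")  # hypothetical identifier

# The [MASK] token is predicted from the surrounding Marathi context
# ("Pune is a [MASK] in Maharashtra.").
for pred in fill_mask("पुणे हे महाराष्ट्रातील एक [MASK] आहे."):
    print(pred["token_str"], round(pred["score"], 3))
```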

Furthermore, the paper details MahaGPT, a generative pre-trained transformer model trained on the same Marathi corpus and intended for Marathi text generation.
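A comparable sketch for MahaGPT, again with an assumed model identifier, uses the standard text-generation pipeline:

```python
# Minimal sketch (assumed model id): open-ended Marathi generation with MahaGPT.
from transformers import pipeline

generator = pipeline("text-generation", model="l3cube-pune/marathi-gpt")  # hypothetical identifier
out = generator("महाराष्ट्रातील प्रमुख शहरे", max_new_tokens=40)  # prompt: "Major cities of Maharashtra"
print(out[0]["generated_text"])
```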

Word Embeddings

The authors present MahaFT, FastText word embeddings for Marathi trained on the full corpus. These embeddings leverage FastText's subword-level training, which is well suited to Marathi's agglutinative morphology. Comparative evaluations against existing embeddings, such as Facebook's FastText and the IndicNLP Suite embeddings, show that MahaFT is competitive on downstream NLP tasks.
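The snippet below sketches how such FastText vectors are typically loaded and queried with the fasttext Python library; the local filename is an assumption, and the vector dimensionality depends on the released training configuration.

```python
# Minimal sketch: loading MahaFT-style FastText vectors and querying them.
import fasttext

ft = fasttext.load_model("mahaft.bin")  # hypothetical local filename

# Subword n-grams let FastText compose vectors even for unseen inflected forms,
# which is useful for Marathi's rich morphology.
vec = ft.get_word_vector("महाराष्ट्र")        # "Maharashtra"
oov = ft.get_word_vector("महाराष्ट्रातील")    # inflected form, possibly out of vocabulary
print(vec.shape, oov.shape)
```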

Evaluation on NLP Tasks

The models were extensively evaluated on several Marathi NLP tasks:

  • Sentiment Analysis (L3CubeMahaSent): sentiment classification of Marathi tweets.
  • Text Classification: categorization of Marathi news articles and headlines into topic classes.
  • Named Entity Recognition (NER): identification and classification of named entities into predefined categories.

MahaBERT and its derivatives consistently outperformed their multilingual counterparts across these tasks, underscoring the value of dedicated monolingual pre-training.
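For readers who want to reproduce this kind of evaluation, the sketch below fine-tunes a MahaBERT-style checkpoint for three-class sentiment classification in the spirit of L3CubeMahaSent. The model identifier, file paths, and column names are assumptions and should be replaced with the actual releases.

```python
# Minimal sketch (assumed model id and data paths): fine-tuning for Marathi
# sentiment classification with the Hugging Face Trainer.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "l3cube-pune/marathi-bert"  # hypothetical identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Assumed CSVs with "text" and "label" columns (0/1/2 for negative/neutral/positive).
raw = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(output_dir="mahabert-sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"],
                  tokenizer=tokenizer)
trainer.train()
```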

Implications and Future Prospects

The contributions in this paper pave the way for improved NLP applications for Marathi and, potentially, for other underrepresented Indic languages. By releasing a large corpus and pre-trained models, the authors support advances in sentiment analysis, automated content classification, and other linguistically complex tasks.

For future work, transfer learning and domain adaptation techniques built on these resources could further improve NLP accuracy for Marathi in diverse real-world scenarios. The creation of more comprehensive datasets that capture dialectal variation within Marathi could also open new research avenues.

The paper serves as a pivotal resource for practitioners and researchers aiming to explore computational methodologies for linguistically rich, resource-scarce languages such as Marathi.

Authors (1)
  1. Raviraj Joshi (76 papers)
Citations (51)