L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library (2205.14728v2)

Published 29 May 2022 in cs.CL and cs.LG

Abstract: Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall we present MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.

An Overview of L3Cube-MahaNLP: Datasets and Models for Marathi NLP

The paper "L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library" introduces a comprehensive resource suite aimed at advancing Marathi NLP. Despite Marathi being the third most popular language in India, it suffers from a paucity of NLP resources, an issue this work seeks to address. The contribution is particularly notable for its focus on low-resource languages, creating essential tools and datasets that can support a variety of NLP tasks.

Core Contributions

  1. Datasets: The paper offers several curated datasets:
    • MahaCorpus: A large monolingual corpus of 24.8 million sentences, amounting to 289 million tokens. This dataset supports unsupervised language modeling, providing foundational text data sourced from both news and non-news Marathi content.
    • MahaSent: A sentiment analysis dataset with sentiment-labeled Marathi tweets, containing 12,114 training, 2,250 test, and 1,500 validation samples.
    • MahaNER: A named entity recognition dataset comprising 25,000 manually tagged sentences that span eight entity classes.
    • MahaHate: A hate speech detection dataset with over 25,000 tweets annotated into categories such as hate, offensive, profane, and neutral.
  2. Models: L3Cube-MahaNLP also presents various fine-tuned Transformer models:
    • MahaBERT variants (MahaBERT, MahaAlBERT, MahaRoBERTa): Monolingual models trained on MahaCorpus with the masked language modeling (MLM) objective, tailored for Marathi language understanding.
    • MahaGPT: A generative transformer model trained on the full Marathi corpus with a causal language modeling objective, for generating and predicting Marathi text.
    • MahaFT: FastText word embeddings tailored for Marathi, enabling efficient utilization in multiple NLP applications.
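To make the supervised resources concrete, the sketch below shows how sentences in an NER dataset such as MahaNER are typically consumed: tokens paired with IOB2 tags are decoded into entity spans. This is a generic illustration, not code from the L3Cube library; the tag names (`PER`, `LOC`) and the transliterated example sentence are placeholders, not the dataset's actual eight-class label set.

```python
# Sketch: decoding IOB2-tagged tokens (the usual format for NER datasets
# like MahaNER) into entity spans. Tag names are illustrative only.

def decode_iob2(tokens, tags):
    """Group (token, tag) pairs into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag ends the current entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

# Hypothetical transliterated example; real MahaNER data is in Devanagari.
tokens = ["Raviraj", "Joshi", "Pune", "madhye", "rahto"]
tags   = ["B-PER",   "I-PER", "B-LOC", "O",     "O"]
print(decode_iob2(tokens, tags))  # [('Raviraj Joshi', 'PER'), ('Pune', 'LOC')]
```

The same span-decoding step is what turns the per-token predictions of a fine-tuned MahaBERT token-classification model into usable named entities.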

Implications and Future Prospects

The development of these datasets and models represents a significant advancement in the capabilities available for Marathi NLP. By outperforming existing multilingual models in various Marathi NLP tasks, these monolingual models demonstrate the added value of focused linguistic resources. The availability of these datasets and models facilitates robust research and practical applications such as sentiment analysis, entity recognition, and hate speech detection in Marathi.

Going forward, the authors aim to expand further into domains such as natural language generation, while also streamlining access to these models via Python packages. This progression could further strengthen the NLP landscape for low-resource languages, advocating for stronger representation and utility within the global research community.

Overall, the L3Cube-MahaNLP initiative not only enriches Marathi NLP but also provides a model of resource development for other low-resource languages, potentially paving the way for enhanced linguistic processing across diverse linguistic domains.

Authors (1)
  1. Raviraj Joshi (76 papers)
Citations (23)