L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models

Published 24 Jun 2023 in cs.CL and cs.LG | (2306.13888v1)

Abstract: The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited by the lack of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset covering four domains: movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering 3 distinct sentiments: positive, negative, and neutral. We create a sub-dataset for each domain comprising 15k samples. MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis, thus highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

Summary

The paper introduces a multi-domain Marathi sentiment dataset with 60K manually tagged samples across four diverse domains.
It fine-tunes transformer models including MahaBERT, MuRIL, mBERT, and IndicBERT, with MahaBERT achieving top accuracy in sentiment classification.
The results highlight challenges in cross-domain generalization and promise enhanced sentiment analysis for low-resource language applications.

This essay analyzes the paper titled "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." The paper introduces a substantial new dataset for sentiment analysis in Marathi, a major Indian language with limited NLP resources, and uses it to evaluate several transformer models both within and across domains.

Dataset Composition and Novelty

The L3Cube-MahaSent-MD dataset is a multi-domain sentiment analysis dataset in Marathi, comprising approximately 60,000 manually tagged samples across four domains: movie reviews, general tweets, TV show subtitles, and political tweets. Each sub-dataset consists of around 15,000 samples annotated as positive, negative, or neutral. Three of the four domains (movie reviews, general tweets, and TV show subtitles) are new, contributing about 45,000 manually tagged sentences. The dataset thereby fills a considerable gap in the sentiment resources available for low-resource Indian languages.

Methodology

The study fine-tunes several monolingual and multilingual BERT-based models, namely MahaBERT, MuRIL, mBERT, and IndicBERT, to evaluate their effectiveness on the mentioned datasets. Notably, the MahaBERT model emerges as the most proficient, exhibiting superior accuracy across the domains. The research also includes a cross-domain analysis, emphasizing the necessity of such varied datasets to achieve generalization in sentiment analysis in low-resource contexts.
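
As a rough illustration of the setup described above, the sketch below prepares one of the four model families for three-way sentiment classification with the Hugging Face transformers library. The Hub identifiers, the label encoding, and the helper name load_classifier are assumptions made for this example and are not taken from the paper; check the identifiers against the project repository before use.

```python
# Hub identifiers are assumptions for illustration; verify them before use.
CANDIDATE_MODELS = {
    "MahaBERT": "l3cube-pune/marathi-bert-v2",
    "MuRIL": "google/muril-base-cased",
    "mBERT": "bert-base-multilingual-cased",
    "IndicBERT": "ai4bharat/indic-bert",
}

# Assumed label encoding; the released data may encode the classes differently.
LABEL2ID = {"negative": 0, "neutral": 1, "positive": 2}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def load_classifier(model_name: str):
    """Return (tokenizer, model) with a freshly initialised 3-way classification head.

    Hypothetical helper: it downloads the checkpoint on first use.
    """
    # Imported lazily so the constants above can be inspected
    # without the transformers package installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = CANDIDATE_MODELS[model_name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABEL2ID),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Calling load_classifier("MahaBERT") would return a tokenizer and a model ready for fine-tuning on one domain or on all four combined. Training hyperparameters are not given in this summary, so none are shown here.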

Results and Findings

Quantitative evaluation highlights several key insights:

Model Performance: MahaBERT outperforms other models in sentiment classification tasks within individual domains, showing a notable improvement in accuracy scores. For instance, MahaBERT achieved 84.90% accuracy on political tweets (MahaSent-PT) and 78.80% on general tweets (MahaSent-GT).
Cross-domain Generalization: Models trained on a single domain generally performed poorly on out-of-domain datasets. This supports the case for multi-domain training data, with which MahaBERT generalizes better and shows stronger cross-domain sentiment analysis potential.
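
The in-domain versus cross-domain comparison can be organized as a matrix of accuracies, with one row per training domain and one column per test domain. The sketch below is a minimal pure-Python version of that bookkeeping. The function names, the toy English examples, and the keyword classifiers are hypothetical stand-ins and do not come from the dataset or the paper.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (text, gold label)
Classifier = Callable[[str], str]  # text -> predicted label


def accuracy(predict: Classifier, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("cannot score an empty test set")
    correct = sum(1 for text, gold in examples if predict(text) == gold)
    return correct / len(examples)


def cross_domain_matrix(
    classifiers: Dict[str, Classifier],
    test_sets: Dict[str, List[Example]],
) -> Dict[str, Dict[str, float]]:
    """Build matrix[train_domain][test_domain] = accuracy on that test set."""
    return {
        train_domain: {
            test_domain: accuracy(predict, examples)
            for test_domain, examples in test_sets.items()
        }
        for train_domain, predict in classifiers.items()
    }


# Toy stand-ins (not real data): each "classifier" only knows its own label set.
toy_tests = {
    "movies": [("good film", "positive"), ("bad film", "negative")],
    "tweets": [("good day", "positive"), ("meh", "neutral")],
}
toy_classifiers = {
    "movies": lambda text: "positive" if "good" in text else "negative",
    "tweets": lambda text: "positive" if "good" in text else "neutral",
}

matrix = cross_domain_matrix(toy_classifiers, toy_tests)
for train_domain, row in matrix.items():
    print(train_domain, row)
```

In this toy setup the diagonal entries (in-domain) are higher than the off-diagonal ones (cross-domain), which is the pattern the paper reports for single-domain models. In a real evaluation, each classifier would wrap a fine-tuned model's prediction function.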

Implications and Future Work

The release of the L3Cube-MahaSent-MD dataset and associated models has substantial implications for NLP in low-resource languages. Practically, it enables more nuanced sentiment analysis applications across various domains, such as media content analysis, public opinion monitoring, and user-generated content interpretation. Theoretically, it provides a robust basis for future research in model generalization across domains in low-resource languages.

Future work could explore expanding the multi-domain dataset further to include additional sentiment classes or domain types. Moreover, improving cross-domain model performance through techniques such as domain adaptation or meta-learning could enhance model versatility and applicability.

In conclusion, the research presented in the paper contributes significantly to advancing sentiment analysis in the Marathi language and offers a comprehensive dataset and modeling framework that can serve as a benchmark for further studies in low-resource language processing.
