
L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models (2306.13888v1)

Published 24 Jun 2023 in cs.CL and cs.LG

Abstract: The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited due to the limited availability of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset, with four different domains - movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering 3 distinct sentiments - positive, negative, and neutral. We create a sub-dataset for each domain comprising 15k samples. The MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis, thus highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .


This essay analyzes the paper titled "L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models." The paper introduces a new, large dataset for sentiment analysis in Marathi, a major Indian language with limited NLP resources. In doing so, it extends sentiment analysis capabilities for low-resource languages.

Dataset Composition and Novelty

The L3Cube-MahaSent-MD dataset is a multi-domain sentiment analysis dataset in Marathi, encompassing approximately 60,000 manually tagged samples across four distinct domains: movie reviews, general tweets, TV show subtitles, and political tweets. Each sub-dataset consists of around 15,000 samples annotated with positive, negative, and neutral sentiments. The introduction of this diverse dataset fills a considerable gap in sentiment resources available for low-resource Indian languages. Specifically, three new domains—movie reviews, general tweets, and TV show subtitles—are added, constituting about 45,000 manually tagged sentences, making it a critical contribution to the field.
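The composition described above reduces to simple arithmetic, sketched here in Python. The dictionary keys and the `NEW_DOMAINS` name are descriptive labels chosen for this illustration, not identifiers from the released data, and the counts are the approximate per-domain figure quoted above.

```python
# Approximate sample counts per domain, as quoted in the summary above.
# Keys are descriptive labels chosen for this sketch, not official file names.
SAMPLES_PER_DOMAIN = {
    "movie_reviews": 15_000,
    "general_tweets": 15_000,
    "tv_subtitles": 15_000,
    "political_tweets": 15_000,
}

# The three sentiment classes used for annotation.
LABELS = ("positive", "negative", "neutral")

# The three domains described above as newly added.
NEW_DOMAINS = ("movie_reviews", "general_tweets", "tv_subtitles")

total_samples = sum(SAMPLES_PER_DOMAIN.values())
new_samples = sum(SAMPLES_PER_DOMAIN[d] for d in NEW_DOMAINS)

print(f"total: ~{total_samples:,} samples across {len(SAMPLES_PER_DOMAIN)} domains")
print(f"newly added: ~{new_samples:,} samples")
```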

Methodology

The paper fine-tunes several monolingual and multilingual BERT-based models, namely MahaBERT, MuRIL, mBERT, and IndicBERT, to evaluate their effectiveness on the mentioned datasets. Notably, the MahaBERT model emerges as the most proficient, exhibiting superior accuracy across the domains. The research also includes a cross-domain analysis, emphasizing the necessity of such varied datasets to achieve generalization in sentiment analysis in low-resource contexts.
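The model comparison can be pictured as the small harness below. It is only a sketch under stated assumptions: the predictors are hypothetical stand-in functions rather than fine-tuned BERT checkpoints, the three English sentences are invented placeholders for Marathi test data, and `accuracy` and `best_model` are names introduced here rather than anything from the paper's code.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, gold label)
Predictor = Callable[[str], str]


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for text, gold in examples if predict(text) == gold)
    return correct / len(examples)


def best_model(
    models: Dict[str, Predictor], examples: List[Example]
) -> Tuple[str, Dict[str, float]]:
    """Score every model on the same test set and return the top scorer."""
    scores = {name: accuracy(fn, examples) for name, fn in models.items()}
    return max(scores, key=scores.get), scores


# Placeholder test set, invented for illustration only.
test_set = [
    ("great film", "positive"),
    ("terrible plot", "negative"),
    ("released on friday", "neutral"),
]


# Hypothetical stand-ins; a real run would wrap fine-tuned checkpoints here.
def always_neutral(text: str) -> str:
    return "neutral"


def keyword_rule(text: str) -> str:
    if "great" in text:
        return "positive"
    if "terrible" in text:
        return "negative"
    return "neutral"


models = {"baseline_a": always_neutral, "baseline_b": keyword_rule}
winner, scores = best_model(models, test_set)
print(winner, scores)
```

Scoring every candidate on the same held-out set, as `best_model` does, is what makes the accuracy figures comparable across models.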

Results and Findings

Quantitative evaluation highlights several key insights:

  • Model Performance: MahaBERT outperforms other models in sentiment classification tasks within individual domains, showing a notable improvement in accuracy scores. For instance, MahaBERT achieved 84.90% accuracy on political tweets (MahaSent-PT) and 78.80% on general tweets (MahaSent-GT).
  • Cross-domain Generalization: Domain-specific models generally performed poorly on out-of-domain datasets, which underlines the value of multi-domain training data for MahaBERT's generalization and cross-domain sentiment analysis.
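The in-domain versus cross-domain comparison can be organized as a train-domain by test-domain accuracy matrix. The sketch below uses a toy word-count classifier and invented English examples as stand-ins for the fine-tuned models and Marathi data, so the sharp gap it produces is constructed for illustration and is not a result from the paper. The names `train`, `predict`, and `accuracy`, and the domain keys, are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, gold label)
Model = Dict[str, Counter]


def train(examples: List[Example]) -> Model:
    """Count how often each word appears under each label."""
    word_label_counts: Model = defaultdict(Counter)
    for text, label in examples:
        for word in text.split():
            word_label_counts[word][label] += 1
    return word_label_counts


def predict(model: Model, text: str) -> str:
    """Sum label votes over known words; fall back to 'neutral' if none."""
    votes: Counter = Counter()
    for word in text.split():
        if word in model:
            votes.update(model[word])
    return votes.most_common(1)[0][0] if votes else "neutral"


def accuracy(model: Model, examples: List[Example]) -> float:
    return sum(predict(model, t) == y for t, y in examples) / len(examples)


# Toy data with disjoint vocabularies per domain (invented for this sketch).
domains = {
    "movies": {
        "train": [("brilliant acting", "positive"), ("boring script", "negative")],
        "test": [("brilliant cast", "positive"), ("boring pacing", "negative")],
    },
    "politics": {
        "train": [("corrupt minister", "negative"), ("popular reform", "positive")],
        "test": [("corrupt deal", "negative"), ("popular leader", "positive")],
    },
}

# One model per domain, plus one trained on all domains together.
models = {name: train(split["train"]) for name, split in domains.items()}
models["combined"] = train(
    [ex for split in domains.values() for ex in split["train"]]
)

# Rows are training sources, columns are test domains.
matrix = {
    src: {tgt: accuracy(model, domains[tgt]["test"]) for tgt in domains}
    for src, model in models.items()
}
for src, row in matrix.items():
    print(src, row)
```

The diagonal entries are in-domain scores and the off-diagonal entries are cross-domain scores. The `combined` row shows the kind of comparison that multi-domain training data makes possible.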

Implications and Future Work

The release of the L3Cube-MahaSent-MD dataset and associated models has substantial implications for NLP in low-resource languages. Practically, it enables more nuanced sentiment analysis applications across various domains, such as media content analysis, public opinion monitoring, and user-generated content interpretation. Theoretically, it provides a robust basis for future research in model generalization across domains in low-resource languages.

Future work could explore expanding the multi-domain dataset further to include additional sentiment classes or domain types. Moreover, improving cross-domain model performance through techniques such as domain adaptation or meta-learning could enhance model versatility and applicability.

In conclusion, the research presented in the paper contributes significantly to advancing sentiment analysis in the Marathi language and offers a comprehensive dataset and modeling framework that can serve as a benchmark for further studies in low-resource language processing.

Authors (5)
  1. Aabha Pingle
  2. Aditya Vyawahare
  3. Isha Joshi
  4. Rahul Tangsali
  5. Raviraj Joshi