L3Cube-MahaSent-MD: A Multi-domain Marathi Sentiment Analysis Dataset and Transformer Models (2306.13888v1)
Abstract: The exploration of sentiment analysis in low-resource languages, such as Marathi, has been limited due to the availability of suitable datasets. In this work, we present L3Cube-MahaSent-MD, a multi-domain Marathi sentiment analysis dataset, with four different domains - movie reviews, general tweets, TV show subtitles, and political tweets. The dataset consists of around 60,000 manually tagged samples covering 3 distinct sentiments - positive, negative, and neutral. We create a sub-dataset for each domain comprising 15k samples. The MahaSent-MD is the first comprehensive multi-domain sentiment analysis dataset within the Indic sentiment landscape. We fine-tune different monolingual and multilingual BERT models on these datasets and report the best accuracy with the MahaBERT model. We also present an extensive in-domain and cross-domain analysis thus highlighting the need for low-resource multi-domain datasets. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .
- Sentiment analysis of mixed code for the transliterated hindi and marathi texts. volume 7, 04 2018. doi: 10.5121/ijnlc.2018.7202.
- Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pp. 440–447, Prague, Czech Republic, June 2007. Association for Computational Linguistics. URL https://aclanthology.org/P07-1056.
- Cohen, J. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46, 1960.
- Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.747. URL https://aclanthology.org/2020.acl-main.747.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
- Multi-domain tweet corpora for sentiment analysis: resource creation and evaluation. In Proceedings of the 12th Language Resources and Evaluation Conference, pp. 5046–5054, 2020.
- Hindimd: A multi-domain corpora for low-resource sentiment analysis. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 7061–7070, 2022.
- Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421. PMLR, 2020.
- Vader: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the international AAAI conference on web and social media, volume 8, pp. 216–225, 2014.
- Joshi, R. L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi bert language models, and resources. In Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp. 97–101, 2022a.
- Joshi, R. L3cube-mahanlp: Marathi natural language processing datasets, models, and library. arXiv preprint arXiv:2205.14728, 2022b.
- IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4948–4961, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.445. URL https://aclanthology.org/2020.findings-emnlp.445.
- Muril: Multilingual representations for indian languages. ArXiv, abs/2103.10730, 2021.
- L3cubemahasent: A marathi tweet-based sentiment analysis dataset. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 213–220, 2021.
- Experimental evaluation of deep learning models for marathi text classification. In Proceedings of the 2nd International Conference on Recent Trends in Machine Learning, IoT, Smart Cities and Applications: ICMISC 2021, pp. 605–613. Springer, 2022.
- A survey on nlp resources, tools, and techniques for marathi language processing. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(2):1–34, 2022.
- Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1eA7AEtvS.
- Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
- L3cube-mahaner: A marathi named entity recognition dataset and bert models. In Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference, pp. 29–34, 2022.
- Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL https://aclanthology.org/P11-1015.
- Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 79–86. Association for Computational Linguistics, July 2002. doi: 10.3115/1118693.1118704. URL https://aclanthology.org/W02-1011.
- L3cube-mahahate: A tweet-based marathi hate speech detection dataset and bert models. In Proceedings of the Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), pp. 1–9, 2022.
- Cross-lingual sentiment analysis for indian languages using linked wordnets. pp. 73–82, 12 2012.
- Multi-source multi-domain sentiment analysis with BERT-based models. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 581–589, Marseille, France, June 2022. European Language Resources Association. URL https://aclanthology.org/2022.lrec-1.62.
- Mono vs multilingual bert for hate speech detection and text classification: A case study in marathi. In Artificial Neural Networks in Pattern Recognition: 10th IAPR TC3 Workshop, ANNPR 2022, Dubai, United Arab Emirates, November 24–26, 2022, Proceedings, pp. 121–128. Springer, 2022.
- Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253, 2018.
- Aabha Pingle (3 papers)
- Aditya Vyawahare (4 papers)
- Isha Joshi (6 papers)
- Rahul Tangsali (4 papers)
- Raviraj Joshi (76 papers)