
L3Cube-MahaHate: A Tweet-based Marathi Hate Speech Detection Dataset and BERT models (2203.13778v2)

Published 25 Mar 2022 in cs.CL and cs.LG

Abstract: Social media platforms are used by a large number of people prominently to express their thoughts and opinions. However, these platforms have contributed to a substantial amount of hateful and abusive content as well. Therefore, it is important to curb the spread of hate speech on these platforms. In India, Marathi is one of the most popular languages used by a wide audience. In this work, we present L3Cube-MahaHate, the first major Hate Speech Dataset in Marathi. The dataset is curated from Twitter, annotated manually. Our dataset consists of over 25000 distinct tweets labeled into four major classes i.e hate, offensive, profane, and not. We present the approaches used for collecting and annotating the data and the challenges faced during the process. Finally, we present baseline classification results using deep learning models based on CNN, LSTM, and Transformers. We explore mono-lingual and multi-lingual variants of BERT like MahaBERT, IndicBERT, mBERT, and xlm-RoBERTa and show that mono-lingual models perform better than their multi-lingual counterparts. The MahaBERT model provides the best results on L3Cube-MahaHate Corpus. The data and models are available at https://github.com/l3cube-pune/MarathiNLP .

L3Cube-MahaHate: A Specialized Dataset and BERT Models for Marathi Hate Speech Detection

The paper introduces L3Cube-MahaHate, a pioneering dataset specifically designed for detecting hate speech in the Marathi language. This work addresses a significant gap in NLP resources for Marathi, an Indian language spoken by approximately 83 million people. Historically, hate speech detection efforts have been predominantly English-focused, and regional languages like Marathi have been relatively neglected. As the impact of hateful and offensive content on social media becomes increasingly evident, this paper's contribution is timely and relevant.

Dataset Construction and Features

L3Cube-MahaHate comprises over 25,000 Marathi tweets, organized into four categories: hate (HATE), offensive (OFFN), profane (PRFN), and not offensive (NOT). These categories capture a spectrum of hostility, from general abuse directed at specific communities (HATE), to offensive language aimed at individuals (OFFN), to merely profane content (PRFN). The tweets were collected by scraping Twitter with search queries built from 150 commonly used offensive words in Marathi. This approach aligns with the need for context-specific datasets in NLP.
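The four-class schema above can be sketched as a small label mapping. The collapse to a coarse binary hate / not-hate split shown here is an illustrative assumption about how the paper's binary task groups the labels, not a confirmed detail:

```python
# Four-class label schema from the paper; the binary grouping below is an
# assumed collapse (everything hostile -> HATE) for illustration only.
LABELS = ["HATE", "OFFN", "PRFN", "NOT"]

def to_binary(label: str) -> str:
    """Collapse a four-class label into an assumed binary hate/not split."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    return "NOT" if label == "NOT" else "HATE"

print(to_binary("PRFN"))  # HATE
print(to_binary("NOT"))   # NOT
```

A mapping like this makes it easy to train and evaluate the same corpus in both the binary and four-class settings the paper reports.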

Annotators fluent in Marathi manually labeled the dataset, emphasizing the necessity of linguistic and cultural understanding in annotation tasks. The paper notes that controversial events often spurred virulent social media reactions, underscoring the dynamic nature of such datasets.

Methodological Approaches and Baseline Results

The authors evaluated various deep learning models to establish baseline performance benchmarks. These include Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and several BERT-based architectures, namely MahaBERT, IndicBERT, mBERT, and XLM-RoBERTa. The paper reports a notable finding: monolingual models like MahaBERT outperform their multilingual counterparts in both binary and multi-class classification tasks. This reinforces the hypothesis that models tailored to a specific language yield superior performance through more nuanced language understanding.
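For intuition, one of the non-Transformer baselines can be sketched as a minimal 1-D CNN text classifier in PyTorch. This is not the authors' exact architecture; the layer sizes, kernel width, and vocabulary size here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TweetCNN(nn.Module):
    """Minimal 1-D CNN text classifier in the spirit of the paper's CNN
    baseline; all hyperparameters are illustrative assumptions."""
    def __init__(self, vocab_size=30000, embed_dim=128, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, token_ids):       # token_ids: (batch, seq_len)
        x = self.embed(token_ids)       # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)           # Conv1d expects (batch, channels, seq)
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)    # global max pool -> (batch, 64)
        return self.fc(x)               # logits over the four classes

model = TweetCNN()
dummy = torch.randint(0, 30000, (2, 40))  # two dummy tweets, 40 token ids each
logits = model(dummy)
print(logits.shape)  # torch.Size([2, 4])
```

The BERT-based models replace the embedding and convolution layers with a pretrained Transformer encoder and a classification head fine-tuned on the same four-class labels.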

The best-performing model, MahaBERT, achieved an accuracy of 90.9% in binary classification and 80.3% in four-class classification, indicating a robust capability to identify different shades of hostile speech. These results speak to both the dataset's quality and the model's effectiveness.

Implications and Future Directions

The L3Cube-MahaHate dataset and its accompanying models represent a significant advancement in Marathi NLP. The results emphasize the merit of developing language-specific resources and models, especially for languages with rich vernacular forms and idiomatic expressions like Marathi. The utility extends beyond hate speech detection, as it can inspire subsequent work in sentiment analysis, topic classification, and broader NLP applications in Marathi.

The findings also suggest broader implications for multilingual NLP research. As digital communication becomes increasingly language-diverse, future work could explore cross-linguistic transfer learning and the development of resource-efficient models for low-resource languages. Collaborative efforts could focus on expanding datasets and benchmark evaluations across different social media platforms and contexts.

In conclusion, this paper makes a substantial contribution to the methodologies and resources available for hate speech detection in regional languages. It provides a solid foundation for further research in the field and highlights the importance of culturally and linguistically tailored approaches in the evolving landscape of NLP.

Authors (5)
  1. Abhishek Velankar (4 papers)
  2. Hrushikesh Patil (7 papers)
  3. Amol Gore (2 papers)
  4. Shubham Salunke (2 papers)
  5. Raviraj Joshi (76 papers)
Citations (35)