L3Cube-MahaHate: A Specialized Dataset and BERT Models for Marathi Hate Speech Detection
The paper introduces L3Cube-MahaHate, a pioneering dataset specifically designed for detecting hate speech in the Marathi language. This work addresses a significant gap in NLP resources for Marathi, an Indian language spoken by approximately 83 million people. Hate speech detection efforts have historically focused on English, leaving regional languages such as Marathi relatively neglected. As the impact of hateful and offensive content on social media becomes increasingly evident, this paper's contribution is timely and relevant.
Dataset Construction and Features
L3Cube-MahaHate comprises over 25,000 Marathi tweets, organized into four categories: hate (HATE), offensive (OFFN), profane (PRFN), and not offensive (NOT). These categories capture a spectrum of hostility, from abuse directed at specific communities (HATE), to offensive language aimed at individuals (OFFN), to merely profane content (PRFN). The tweets were collected by scraping Twitter with search queries built from 150 commonly used offensive words in Marathi. This approach aligns with the need for context-specific datasets in NLP.
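The keyword-driven collection step can be sketched as a simple filter over candidate tweets. Note that the seed words below are hypothetical placeholders, since the paper's actual list of 150 offensive Marathi terms is not reproduced here, and the exact matching logic the authors used is an assumption:

```python
# Illustrative sketch of keyword-based tweet filtering, following the
# paper's data-collection strategy. SEED_WORDS is a hypothetical stand-in
# for the ~150 offensive Marathi query terms actually used.
SEED_WORDS = {"word_a", "word_b", "word_c"}  # placeholders, not real terms

def matches_seed_word(tweet: str, seeds: set = SEED_WORDS) -> bool:
    """Return True if the tweet contains any seed keyword (token-level match)."""
    tokens = {tok.lower() for tok in tweet.split()}
    return not seeds.isdisjoint(tokens)

def filter_candidates(tweets: list) -> list:
    """Keep only tweets matching at least one seed word; in the paper's
    pipeline, these candidates then go to fluent Marathi annotators for
    manual four-class labeling."""
    return [t for t in tweets if matches_seed_word(t)]
```

In practice the scraping itself would go through Twitter's search API; the sketch only shows the filtering logic that keyword queries imply.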
Fluent Marathi speakers manually labeled the dataset, emphasizing the necessity of linguistic and cultural understanding in annotation tasks. The paper notes that controversial events often spurred virulent social media reactions, underscoring the dynamic nature of such datasets.
Methodological Approaches and Baseline Results
The authors evaluated various deep learning models to establish baseline performance benchmarks. These include Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and several BERT-based architectures, namely MahaBERT, IndicBERT, mBERT, and XLM-RoBERTa. The paper reveals a notable finding: the monolingual MahaBERT outperforms its multilingual counterparts in both binary and multi-class classification tasks. This reinforces the hypothesis that models tailored to a specific language yield superior performance due to more nuanced language understanding.
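Evaluating the same dataset in both a binary and a four-class setting implies collapsing the label space. A minimal sketch, assuming (as an illustrative grouping, not one the paper is quoted on here) that HATE, OFFN, and PRFN all count as the positive class and NOT as the negative class:

```python
# Hedged sketch: collapsing the four L3Cube-MahaHate classes into a binary
# label space. The grouping below (everything except NOT is positive) is an
# assumption made for illustration.
FOUR_CLASSES = ("HATE", "OFFN", "PRFN", "NOT")

def to_binary(label: str) -> int:
    """Map a four-class label to 1 (hostile) or 0 (not offensive)."""
    if label not in FOUR_CLASSES:
        raise ValueError(f"unknown label: {label}")
    return 0 if label == "NOT" else 1

def accuracy(gold: list, pred: list) -> float:
    """Plain accuracy, the metric reported for both binary and 4-class runs."""
    assert len(gold) == len(pred) and gold
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

The BERT models themselves would be fine-tuned as standard sequence classifiers over these label sets; only the label handling is shown here because the fine-tuning recipe is not detailed in this summary.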
The best-performing model, MahaBERT, achieved an accuracy of 90.9% in binary classification and 80.3% in four-class classification, indicating a robust capability to distinguish different shades of hostile speech. These results speak to both the dataset's quality and the model's effectiveness.
Implications and Future Directions
The L3Cube-MahaHate dataset and its accompanying models represent a significant advancement in Marathi NLP. The results emphasize the merit of developing language-specific resources and models, especially for languages like Marathi with rich vernacular forms and idiomatic expressions. The utility extends beyond hate speech detection, as the resource can inspire subsequent work in sentiment analysis, topic classification, and broader NLP applications in Marathi.
The findings also suggest broader implications for multilingual NLP research. As digital communication becomes increasingly language-diverse, future work could explore cross-linguistic transfer learning and the development of resource-efficient models for low-resource languages. Collaborative efforts could focus on expanding datasets and benchmark evaluations across different social media platforms and contexts.
In conclusion, this paper makes a substantial contribution to the methodologies and resources available for hate speech detection in regional languages. It provides a solid foundation for further research in the field and highlights the importance of culturally and linguistically tailored approaches in the evolving landscape of NLP.