- The paper introduces AfriHate, a multilingual dataset that targets hate speech and abusive language detection across 15 African languages.
- It employs a rigorous data collection and annotation process, relying on native-speaker annotators and social media data to capture cultural nuances.
- Baseline evaluations show strong performance from the fine-tuned AfroXLMR-76L, with GPT-4o improving markedly in few-shot settings, highlighting the collection's potential for improved moderation.
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
The paper "AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages" addresses critical gaps in the availability of multilingual datasets for hate speech and abusive language detection in African languages. The paper orchestrates the development of AfriHate, a diverse multilingual collection specifically curated to aid in identifying and moderating hate speech in 15 African languages. These languages include Algerian Arabic, Amharic, Igbo, Kinyarwanda, Hausa, Moroccan Arabic, Nigerian Pidgin, Oromo, Somali, Swahili, Tigrinya, Twi, isiXhosa, Yoruba, and isiZulu.
Data Collection & Annotation Process
The authors outline comprehensive methodologies for collecting and annotating the datasets. A noteworthy aspect of the paper is its reliance on native speakers to annotate each dataset. These annotators bring intrinsic cultural and contextual understanding, helping ensure the accuracy and relatability of the labels—hate, abusive/offensive, or neutral. Data were collected by filtering posts using language-specific keywords, hashtags, user handles, and locations on social media, primarily via the Twitter API. The challenges of covering under-resourced African languages are also examined in detail.
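To make the collection step concrete, the sketch below shows a minimal keyword- and hashtag-based filter over a batch of posts. The seed terms, post fields, and language code are hypothetical placeholders; the paper's actual pipeline queried the Twitter API with per-language lexicons curated by native speakers.

```python
# Minimal sketch of keyword/hashtag-based candidate selection for annotation.
# Seed terms, post fields, and the language code are illustrative only; the
# paper's pipeline used the Twitter API with curated per-language lexicons.
from dataclasses import dataclass

@dataclass
class Post:
    post_id: str
    text: str
    lang: str          # e.g. "ha" for Hausa (assumed field)
    location: str = ""

# Hypothetical seed lists an annotation team might curate for one language.
SEED_KEYWORDS = {"example_slur", "example_insult"}
SEED_HASHTAGS = {"#exampletopic"}

def is_candidate(post: Post, lang: str = "ha") -> bool:
    """Return True if a post matches the target language and any seed term."""
    if post.lang != lang:
        return False
    tokens = [t.strip(".,!?") for t in post.text.lower().split()]
    return any(t in SEED_KEYWORDS for t in tokens) or \
           any(t in SEED_HASHTAGS for t in tokens)

def select_candidates(posts: list[Post]) -> list[Post]:
    """Filter a batch of posts down to candidates for manual annotation."""
    return [p for p in posts if is_candidate(p)]
```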
Baseline Models & Results
To assess performance on the curated datasets, the paper fine-tunes a suite of Africa-centric pre-trained language models (PLMs) and compares them with large language models (LLMs) prompted in zero-shot and few-shot settings. AfroXLMR-76L emerges as the strongest model, with an average macro F1 score of approximately 78.16 in the multilingual setting. Among the LLMs, GPT-4o improves markedly with few-shot prompting, reaching a macro F1 of 71.71 with 20 in-context examples—the best of the prompted models, though still below the fine-tuned AfroXLMR-76L baseline.
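As a rough illustration of the PLM baseline setup, the sketch below fine-tunes an AfroXLMR checkpoint for the three-way label scheme (neutral, abusive/offensive, hate) with Hugging Face transformers and reports macro F1. The checkpoint identifier, CSV files, and hyperparameters are assumptions for illustration, not the paper's exact training configuration.

```python
# Sketch of a three-class fine-tuning baseline with Hugging Face transformers.
# The checkpoint name, data files, and hyperparameters are assumptions for
# illustration; they are not the paper's exact configuration.
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "Davlan/afro-xlmr-large-76L"   # assumed AfroXLMR-76L checkpoint
LABELS = ["neutral", "abusive", "hate"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

# Hypothetical CSV splits with "text" and integer "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="afrihate-baseline",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["dev"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())   # reports eval_macro_f1 on the dev split
```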
Implications and Future Work
The introduction of AfriHate marks an important step toward closing the data scarcity gap for African languages in hate speech and abusive language moderation systems. The dataset serves as a critical resource for developing and testing models tailored to the specific socio-cultural nuances inherent in African languages. The implications also extend beyond practical applications to theoretical considerations, as the dataset provides a basis for studying algorithmic fairness and bias in multilingual NLP.
Future work, as outlined by the authors, is likely to focus on refining the balance among language representations within the dataset and on adding further low-resource African languages. Exploring cross-lingual transfer learning and model generalization is another prospective direction. The paper encourages continued collaboration with African communities to align dataset development with cultural sensitivities and ethical standards in AI research.
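One possible setup for such a cross-lingual transfer study is sketched below: score a classifier fine-tuned on all but one language on the held-out language's test split. The checkpoint path, label naming scheme, and data loading are assumptions, not a protocol prescribed by the paper.

```python
# Sketch of a leave-one-language-out transfer evaluation; the checkpoint path,
# label naming, and data loading are assumptions, not the paper's protocol.
from sklearn.metrics import f1_score
from transformers import pipeline

def zero_shot_transfer_f1(held_out_lang: str, texts: list[str], gold: list[int]) -> float:
    """Score a model fine-tuned on all other languages on the held-out language."""
    # Hypothetical local checkpoint trained with `held_out_lang` excluded.
    clf = pipeline("text-classification",
                   model=f"afrihate-baseline-without-{held_out_lang}")
    preds = clf(texts, truncation=True)
    # Assumes the default "LABEL_<id>" naming of the fine-tuned classifier head.
    pred_ids = [int(p["label"].split("_")[-1]) for p in preds]
    return f1_score(gold, pred_ids, average="macro")

# Usage (with a hypothetical held-out Hausa test split):
# print(zero_shot_transfer_f1("hau", hausa_texts, hausa_labels))
```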
Conclusion
The AfriHate collection represents a substantial contribution to multilingual NLP resources, paving the way for nuanced analysis and moderation of hate speech across Africa's diverse cultural and linguistic landscapes. Through detailed annotation and model evaluation, the paper establishes a benchmark for hate speech detection and demonstrates the need for localized, culturally sensitive datasets to address region-specific challenges in online content moderation.