- The paper introduces the Ekohate dataset to detect offensive language in code-switched Nigerian political tweets.
- It uses two annotation schemes, binary and four-label. Models reach a 95.1 F1-score on the binary task and 70.3 on the harder four-label task.
- Cross-corpus transfer experiments with OLID, HateUS2020, and FountaHate suggest that models trained on Ekohate generalize to other datasets, and vice versa.
Developing the Ekohate Dataset for Offensive Language Detection in Nigerian Political Discussions
Introduction to Ekohate
Few resources exist for detecting offensive language and hate speech in Nigerian political discourse, particularly on social platforms like Twitter. To address this gap, researchers developed the Ekohate dataset. It covers political discussions surrounding the 2023 Lagos gubernatorial election, which involved major candidates from three political parties. Ekohate is curated for code-switched content that mixes English with local languages such as Yoruba and Nigerian Pidgin. The researchers collected 3,398 tweets and annotated them under two schemes: a binary scheme (normal versus offensive) and a finer scheme that splits offensive content into abusive, hateful, and contemptuous sub-categories.
Annotation Scheme and Dataset Analysis
Each tweet carries a binary label (normal or offensive) and a label from a more detailed four-label scheme: normal, abusive, hateful, or contempt. The 'contempt' category was added for expressions that did not fit neatly into the existing categories but still conveyed strong disdain.
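The relationship between the two schemes can be sketched in a few lines of Python. This is a minimal illustration, assuming that the binary "offensive" label is the union of the three non-normal categories, as the description above implies. The label spellings and the sample list are hypothetical, not taken from the released files.

```python
from collections import Counter

# The four fine-grained categories named in the text.
# The exact string spellings are an assumption for illustration.
FINE_LABELS = ("normal", "abusive", "hateful", "contempt")


def to_binary(fine_label: str) -> str:
    """Collapse a fine-grained label into the normal/offensive scheme."""
    if fine_label not in FINE_LABELS:
        raise ValueError(f"unknown label: {fine_label!r}")
    return "normal" if fine_label == "normal" else "offensive"


# Hypothetical annotations, not real dataset rows.
example = ["normal", "abusive", "contempt", "hateful", "normal"]
binary = [to_binary(label) for label in example]
print(Counter(binary))
```

Keeping the mapping in one function makes it easy to check that both label columns of a tweet stay consistent with each other.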
Data Characteristics
- Multilinguality and Code-Switching: Tweets included a mix of English, Yoruba, and Nigerian Pidgin, with a significant portion featuring code-switching among these languages.
- Annotation Process: Two trained annotators labeled the tweets on the Label Studio platform and reached moderate inter-annotator agreement. Disagreements were resolved through discussion.
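The summary does not name the agreement statistic. Cohen's kappa is one common choice for two annotators, so the sketch below assumes it purely for illustration. The two label lists are made up and do not reproduce the paper's agreement figure.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations from two annotators (not real data).
annotator_1 = ["normal", "abusive", "normal", "hateful", "normal", "abusive"]
annotator_2 = ["normal", "abusive", "abusive", "hateful", "normal", "normal"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because "normal" tweets are likely to dominate any sample.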
Empirical Evaluation
The researchers evaluated how well models trained on Ekohate recognize offensive language under both annotation schemes.
Performance of Detection Models
- Binary and Multi-Class Classification: The paper reports an F1-score of 95.1 on binary classification (normal vs. offensive). On the four-label scheme the F1-score drops to 70.3, which illustrates how much harder the finer-grained distinctions are.
- Cross-Linguistic and Cross-Dataset Generalization: Models trained on Ekohate transferred to similar datasets from other regions, including the US. This suggests the dataset is useful beyond the immediate political and societal context of Lagos or Nigeria.
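To make the reported metric concrete, here is a minimal sketch of an F1 computation over the four labels. It assumes macro averaging (the unweighted mean of per-class F1), which this summary does not confirm. The gold and predicted lists are invented for illustration.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels and model predictions (not real data).
gold = ["normal", "normal", "abusive", "hateful", "contempt", "abusive"]
pred = ["normal", "abusive", "abusive", "hateful", "hateful", "abusive"]
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

In this toy example the single "contempt" tweet is misclassified, so that class contributes an F1 of zero and pulls the average down. Rare or hard-to-separate classes have the same effect on a four-label score, which is consistent with the gap between the binary and four-label results.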
Cross-Corpus Transfer Learning
The dataset's utility was further evaluated through cross-corpus transfer learning experiments with the external datasets OLID, HateUS2020, and FountaHate. These experiments assessed how well models generalize across different cultural and linguistic contexts.
Key Findings
- Transfer Performance: Models trained on Ekohate generalized well to the other datasets, and vice versa, suggesting that training on Ekohate yields features for offensive language detection that adapt to other contexts.
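A cross-corpus setup of this kind needs a shared label space before any model is trained on one corpus and tested on another. The sketch below shows one way to arrange that, assuming every corpus is reduced to a binary normal/offensive task. The label strings in each mapping are placeholders to be replaced after inspecting the actual files, and dropping unmapped labels is an assumption rather than the authors' documented procedure.

```python
from itertools import permutations

# Placeholder label maps (0 = normal, 1 = offensive).
# The label strings are illustrative assumptions, not verified
# against the released versions of these corpora.
LABEL_MAPS = {
    "ekohate": {"normal": 0, "abusive": 1, "hateful": 1, "contempt": 1},
    "olid": {"NOT": 0, "OFF": 1},
    "hateus2020": {"normal": 0, "hateful": 1},
    "fountahate": {"normal": 0, "abusive": 1, "hateful": 1},
}


def harmonize(corpus, rows):
    """Map (text, label) rows to binary labels, dropping unmapped labels."""
    mapping = LABEL_MAPS[corpus]
    return [(text, mapping[label]) for text, label in rows if label in mapping]


def transfer_pairs(corpora):
    """Every ordered (train, test) pair of distinct corpora."""
    return list(permutations(corpora, 2))


# Hypothetical rows; the last one carries a label outside the mapping.
sample = [("tweet a", "OFF"), ("tweet b", "NOT"), ("tweet c", "other")]
print(harmonize("olid", sample))
for train_name, test_name in transfer_pairs(list(LABEL_MAPS)):
    print(f"train on {train_name}, evaluate on {test_name}")
```

Enumerating ordered pairs covers both directions of transfer, matching the "and vice versa" in the finding above.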
Conclusions and Future Work
Theoretical and Practical Implications
The creation of the Ekohate dataset addresses a critical need for localized tools in offensive language detection within African political contexts, enriching the global dataset landscape that previously underrepresented such demographics and linguistic profiles.
Speculations on Future AI Developments
The integration of multilingual and code-switching data presents new avenues for enhancing AI's linguistic and contextual sensitivity, crucial for applications in increasingly diverse global discourse spaces.
Access to Resources
The researchers have made the dataset and accompanying code publicly available on GitHub to support further research and practical application development.