EkoHate: Abusive Language and Hate Speech Detection for Code-switched Political Discussions on Nigerian Twitter (2404.18180v1)

Published 28 Apr 2024 in cs.CL

Abstract: Nigerians have a notable online presence and actively discuss political and topical matters. This was particularly evident throughout the 2023 general election, where Twitter was used for campaigning, fact-checking and verification, and even positive and negative discourse. However, little or none has been done in the detection of abusive language and hate speech in Nigeria. In this paper, we curated code-switched Twitter data directed at three musketeers of the governorship election on the most populous and economically vibrant state in Nigeria; Lagos state, with the view to detect offensive speech in political discussions. We developed EkoHate -- an abusive language and hate speech dataset for political discussions between the three candidates and their followers using a binary (normal vs offensive) and fine-grained four-label annotation scheme. We analysed our dataset and provided an empirical evaluation of state-of-the-art methods across both supervised and cross-lingual transfer learning settings. In the supervised setting, our evaluation results in both binary and four-label annotation schemes show that we can achieve 95.1 and 70.3 F1 points respectively. Furthermore, we show that our dataset adequately transfers very well to three publicly available offensive datasets (OLID, HateUS2020, and FountaHate), generalizing to political discussions in other regions like the US.

Summary

The paper introduces the EkoHate dataset for detecting offensive language in code-switched Nigerian political tweets.
It uses a dual annotation scheme (binary and four-label), reaching 95.1 F1 in binary classification but only 70.3 F1 under the finer four-label scheme.
Cross-corpus transfer experiments with OLID, HateUS2020, and FountaHate indicate that the dataset generalizes to political discussions in other regions, such as the US.

Developing the EkoHate Dataset for Offensive Language Detection in Nigerian Political Discussions

Introduction to EkoHate

To address the lack of resources for detecting offensive language and hate speech in Nigerian political discourse, particularly on Twitter, the researchers developed the EkoHate dataset. It covers political discussions around the 2023 Lagos State governorship election, involving major candidates from three political parties. EkoHate is curated specifically for code-switched content that mixes English with Yoruba and Nigerian Pidgin. The researchers collected 3,398 tweets and annotated them under two schemes: a binary one (normal versus offensive) and a four-label one that further separates offensive content into abusive, hateful, and contempt.

Annotation Scheme and Dataset Analysis

Each tweet carries a binary label (normal versus offensive) and a finer four-way label: normal, abusive, hateful, or contempt. The 'contempt' category was added for expressions that conveyed strong disdain but did not fit neatly into the existing categories.
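The relationship between the two schemes can be sketched in a few lines of Python. This is a minimal illustration that assumes the binary label is obtained by collapsing the three offensive sub-categories; the label strings and the `to_binary` helper are hypothetical and may not match the released files.

```python
# Assumed label names; the released dataset may spell them differently.
FINE_LABELS = ("normal", "abusive", "hateful", "contempt")


def to_binary(fine_label: str) -> str:
    """Collapse a four-label tag into the binary normal/offensive scheme."""
    if fine_label not in FINE_LABELS:
        raise ValueError(f"unknown label: {fine_label!r}")
    return "normal" if fine_label == "normal" else "offensive"


# Toy tags, not real annotations from the dataset.
tags = ["normal", "contempt", "hateful", "abusive"]
binary_tags = [to_binary(t) for t in tags]
```

Under this assumption the binary task is strictly easier than the four-label one, since every confusion among abusive, hateful, and contempt disappears once the labels are collapsed.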

Data Characteristics

Multilinguality and Code-Switching: Tweets included a mix of English, Yoruba, and Nigerian Pidgin, with a significant portion featuring code-switching among these languages.
Annotation Process: Two trained annotators labelled the tweets in Label Studio and reached moderate inter-annotator agreement. Disagreements were resolved through discussion.
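The summary does not say which agreement statistic was used. For two annotators, Cohen's kappa is a common choice, so the sketch below assumes it purely for illustration; the toy labels are invented and are not drawn from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented toy annotations (assumption, not real data).
annotator_1 = ["normal", "abusive", "hateful", "normal", "contempt", "abusive"]
annotator_2 = ["normal", "abusive", "abusive", "normal", "contempt", "normal"]
kappa = cohen_kappa(annotator_1, annotator_2)  # approximately 0.52 here
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (here, likely "normal") dominates the data.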

Empirical Evaluation

The researchers evaluated state-of-the-art methods on the dataset in both supervised and cross-lingual transfer learning settings, to test how well it supports training models that recognize offensive language.

Performance of Detection Models

Binary and Four-Label Classification: In the supervised setting, the paper reports an F1 score of 95.1 for binary classification (normal vs. offensive). Performance drops to 70.3 F1 under the four-label scheme, which illustrates how much harder the finer-grained distinctions are.
Cross-Lingual and Cross-Dataset Generalization: The dataset transferred well to three publicly available offensive-language datasets, including data on political discussions in other regions such as the US. This suggests the dataset is useful beyond the immediate political and societal context of Lagos or Nigeria.
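To make the F1 figures above concrete, the following sketch computes per-class and macro-averaged F1 from gold and predicted labels. The summary does not state which averaging the paper uses, so macro-averaging is an assumption here, and the toy labels are invented.

```python
def f1_per_class(gold, pred):
    """Return {label: F1} for every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        # F1 = 2TP / (2TP + FP + FN); the denominator is never zero here
        # because each label occurs in gold or pred at least once.
        scores[label] = 2 * tp / (2 * tp + fp + fn)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(gold, pred)
    return sum(scores.values()) / len(scores)


# Invented toy predictions (assumption, not results from the paper).
gold = ["normal", "normal", "abusive", "hateful", "contempt", "abusive"]
pred = ["normal", "abusive", "abusive", "hateful", "abusive", "abusive"]
score = macro_f1(gold, pred)  # 7/12 on this toy data
```

Because macro-averaging weights every class equally, a single rare class that the model never predicts (contempt in the toy data) pulls the score down sharply, which is one plausible reason four-label scores sit well below binary ones.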

Cross-Corpus Transfer Learning

The dataset's utility was further evaluated through cross-corpus transfer learning experiments with three external datasets: OLID, HateUS2020, and FountaHate. These experiments assessed how well models generalize across different cultural and linguistic settings.
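Cross-corpus evaluation usually requires mapping each corpus's labels into a shared label space first. The sketch below shows one way this could be done for a binary target. The label strings, the decision to drop unmapped labels, and the `harmonize` helper are all assumptions for illustration, not the procedure described in the paper.

```python
# Hypothetical per-corpus label vocabularies mapped to 0 (normal) / 1 (offensive).
# The actual files may use different strings; adjust before real use.
TO_BINARY = {
    "ekohate": {"normal": 0, "abusive": 1, "hateful": 1, "contempt": 1},
    "olid": {"NOT": 0, "OFF": 1},
    "founta": {"normal": 0, "abusive": 1, "hateful": 1},
}


def harmonize(corpus, examples):
    """Map (text, label) pairs to (text, 0/1), skipping unmapped labels."""
    mapping = TO_BINARY[corpus]
    return [(text, mapping[label]) for text, label in examples if label in mapping]


# Invented toy rows (assumption); "spam" has no binary counterpart and is dropped.
founta_rows = [("t1", "spam"), ("t2", "hateful"), ("t3", "normal")]
shared = harmonize("founta", founta_rows)
```

Once every corpus is in the same label space, a model trained on one can be scored directly on another with the same metric.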

Key Findings

Transfer Performance: Models trained on EkoHate generalized well to the other datasets, and vice versa, suggesting that the offensive-language patterns learned from EkoHate are not specific to the Nigerian political context.

Conclusions and Future Work

Theoretical and Practical Implications

The EkoHate dataset addresses a need for localized resources for offensive language detection in African political contexts, adding coverage of languages and communities that existing datasets underrepresent.

Speculations on Future AI Developments

The integration of multilingual and code-switching data presents new avenues for enhancing AI's linguistic and contextual sensitivity, crucial for applications in increasingly diverse global discourse spaces.

Access to Resources

The researchers have made the dataset and accompanying code publicly available on GitHub to support further research and application development.
