Automated Hate Speech Detection and the Problem of Offensive Language (1703.04009v1)

Published 11 Mar 2017 in cs.CL

Abstract: A key challenge for automatic hate-speech detection on social media is the separation of hate speech from other instances of offensive language. Lexical detection methods tend to have low precision because they classify all messages containing particular terms as hate speech and previous work using supervised learning has failed to distinguish between the two categories. We used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords. We use crowd-sourcing to label a sample of these tweets into three categories: those containing hate speech, only offensive language, and those with neither. We train a multi-class classifier to distinguish between these different categories. Close analysis of the predictions and the errors shows when we can reliably separate hate speech from other offensive language and when this differentiation is more difficult. We find that racist and homophobic tweets are more likely to be classified as hate speech but that sexist tweets are generally classified as offensive. Tweets without explicit hate keywords are also more difficult to classify.

Citations (2,483)

Summary

  • The paper demonstrates that logistic regression achieves an overall F1 score of 0.90 while revealing significant misclassification in hate speech detection.
  • It employs a manually annotated sample of 25,000 tweets with diverse textual, syntactic, readability, sentiment, and structural features for model development.
  • Despite promising overall performance, the model underperforms on the hate speech class specifically, underscoring the need for more nuanced, context-sensitive training data.

Automated Hate Speech Detection and the Problem of Offensive Language

The paper "Automated Hate Speech Detection and the Problem of Offensive Language" by Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber addresses the critical issue of differentiating hate speech from offensive language in the context of social media platforms, specifically Twitter. This differentiation is vital for both ethical moderation and legal enforcement.

Introduction

The paper sets out to define hate speech and to distinguish it from other forms of offensive language. Drawing on legal definitions from various countries and on the internal guidelines of platforms such as Facebook and Twitter, the authors define hate speech as language that expresses hatred towards a targeted group or is intended to be derogatory, humiliating, or insulting. This definition deliberately excludes merely offensive language used in other social contexts, such as in-group vernacular among members of the same community.

Data Collection and Classification

The paper uses a crowd-sourced hate speech lexicon to identify potentially problematic tweets, gathering the timelines of 33,458 Twitter users to build a corpus of 85.4 million tweets. A random sample of 25,000 tweets containing terms from the lexicon was then labeled by human coders into three categories: hate speech, offensive language, or neither. Each tweet was coded by three or more people and the majority decision determined its label, with tweets lacking a majority discarded; the process had an intercoder agreement of 92% and yielded a final dataset of 24,802 labeled tweets.
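
The paper does not reproduce its label-aggregation step, but the majority-vote logic can be illustrated with a minimal sketch; the identifiers and example records below are hypothetical, not the paper's data.

```python
from collections import Counter

# Hypothetical annotation records (tweet id -> labels from three or more coders);
# the label scheme follows the paper: hate speech, offensive language, or neither.
annotations = {
    "t1": ["offensive", "offensive", "hate_speech"],
    "t2": ["neither", "neither", "neither"],
    "t3": ["hate_speech", "offensive", "neither"],  # no majority -> excluded
}

def majority_label(labels):
    """Return the strict-majority label, or None if the coders do not agree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

labeled = {}
for tweet_id, labels in annotations.items():
    label = majority_label(labels)
    if label is not None:  # tweets without a majority label are dropped
        labeled[tweet_id] = label

print(labeled)  # {'t1': 'offensive', 't2': 'neither'}
```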

Features and Model

The authors employ several features derived from the tweets:

  1. Text Features: Unigrams, bigrams, and trigrams weighted by TF-IDF.
  2. Syntactic Features: POS tag unigrams, bigrams, and trigrams.
  3. Readability Scores: Modified Flesch-Kincaid Grade Level and Flesch Reading Ease scores.
  4. Sentiment Scores: Derived from a specialized social media sentiment lexicon.
  5. Structural Features: Binary and count indicators for hashtags, mentions, retweets, and URLs.

The authors tested several models, including logistic regression, naïve Bayes, decision trees, random forests, and linear SVMs. Logistic regression with L2 regularization was chosen for the final model because it performed consistently well in comparative tests and makes the predicted class-membership probabilities straightforward to examine.
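
As a rough illustration, a minimal scikit-learn sketch of such a pipeline might look like the following. It covers only the TF-IDF n-gram features (item 1 above) and the L2-regularized logistic regression; the POS, readability, sentiment, and structural features, as well as the paper's actual preprocessing and hyperparameters, are omitted, and the tweets and labels are toy stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the labeled tweets; 0 = hate speech, 1 = offensive language, 2 = neither.
tweets = ["example hateful tweet", "merely offensive tweet", "a perfectly ordinary tweet"]
labels = [0, 1, 2]

pipeline = Pipeline([
    # Word unigrams, bigrams, and trigrams weighted by TF-IDF.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), lowercase=True)),
    # Multi-class logistic regression with L2 regularization; predict_proba exposes
    # the class-membership probabilities the paper inspects for borderline tweets.
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])

pipeline.fit(tweets, labels)
print(pipeline.predict(["another offensive example"]))
print(pipeline.predict_proba(["another offensive example"]).round(3))
```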

Results

The best-performing model achieved an overall precision, recall, and F1 score of 0.90. However, the precision and recall scores specifically for the hate speech category were 0.44 and 0.61, respectively, indicating significant misclassification. The model tended to classify tweets as less offensive or hateful than human coders, resulting in a high number of false negatives for hate speech but fewer false positives.
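
This gap between aggregate and per-class performance is exactly what a standard per-class report makes visible; a small sketch using scikit-learn, with made-up gold labels and predictions rather than the paper's outputs:

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions purely to illustrate the metric layout;
# these are not the paper's results.
y_true = ["hate", "hate", "offensive", "offensive", "offensive", "neither"]
y_pred = ["offensive", "hate", "offensive", "offensive", "hate", "neither"]

# Per-class rows expose the precision/recall/F1 of the hate speech class,
# while the weighted-average row is the kind of overall figure quoted above.
print(classification_report(y_true, y_pred))
```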

The analysis of misclassified tweets revealed several issues:

  • True Hate Speech Misclassified as Offensive: These often lacked the key terms associated with high hate speech probability.
  • True Offensive Language Misclassified as Hate Speech: These tweets contained multiple slurs or were quoting lyrics, demonstrating the challenge of context.
  • Innocuous Tweets Misclassified as Hate Speech: These tweets mentioned race or sexuality but were overall positive and misclassified due to isolated terms.

Implications and Future Work

The classification model effectively reduces errors by distinguishing hate speech from offensive language, thereby providing a more accurate tool for moderation and legal scrutiny. However, the high rate of misclassification, especially amongst less prevalent forms of hate speech, suggests the need for more nuanced training data that better represents diverse instances of hate speech.

Future work should:

  1. Explore the social contexts and conversational dynamics in which hate speech occurs.
  2. Investigate the characteristics and motivations of individuals who use hate speech.
  3. Address subjective biases in human classification, especially concerning sexist language, to improve algorithmic fairness.

By refining these automated detection systems, future research can contribute to more effective and ethically sound moderation policies on social media platforms, ensuring that hate speech is precisely identified and appropriately addressed while minimizing unnecessary censorship of merely offensive language.
