A Benchmark Dataset for Learning to Intervene in Online Hate Speech: A Technical Examination
This paper addresses a pressing challenge in NLP: mitigating online hate speech. The authors propose a novel task, generative hate speech intervention, and provide two datasets to support it. Rather than stopping at detection, the task requires generating contextually appropriate intervention responses within ongoing conversations identified as containing hate speech, a departure from traditional detection-only models that ignore the conversational context once a flag is raised.
Core Contributions
- Dataset Creation: The authors built two comprehensive datasets sourced from Reddit and Gab, platforms known for polarizing content. The datasets are distinctive in retaining full conversational context, and they pair manual hate speech labels with human-written intervention responses.
- Task Definition: The generative hate speech intervention task marks a shift toward using NLP to engage constructively after detection, rather than merely identifying and removing offending content.
- Empirical Analysis: Analysis of the datasets reveals varied intervention strategies, highlighting the complexity and nuance of human approaches to countering hate speech, which can inform model training.
Dataset Overview
The collected data comprises 5,020 conversations from Reddit and 11,825 from Gab. Amazon Mechanical Turk workers provided both the hate speech labels and the interventions: hand-written responses aimed at de-escalating the hate speech detected within those discussions. The data therefore serves a dual purpose, supporting both hate speech detection and intervention training.
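To make this dual utility concrete, here is a minimal loading sketch. The record layout and field names (`posts`, `hate_speech_idx`, `interventions`) are illustrative assumptions, not the dataset's actual schema.

```python
import json

# Hypothetical record layout for one labeled conversation; the field
# names are assumptions for illustration, not the released schema.
record = {
    "id": "gab_00001",
    "posts": [
        {"text": "first post in the thread ..."},
        {"text": "a reply containing hate speech ..."},
    ],
    "hate_speech_idx": [2],  # 1-based indices of posts labeled as hate speech
    "interventions": [
        "Please reconsider this language; it targets an entire group of people.",
    ],
}

def load_conversations(path):
    """Load one JSON object per line (an assumed release format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]
```

A single record thus supplies detection supervision (which posts are hateful, in context) and generation supervision (what a human intervention looks like).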
Experimental Setup and Methods
The authors explore several baseline models for hate speech detection (Logistic Regression, SVM, CNN, RNN) and find that pretrained word embeddings improve neural model performance. For the generative task, they evaluate Seq2Seq, VAE, and reinforcement learning (RL) models, assessed with automatic metrics (BLEU, ROUGE, METEOR) and qualitative human evaluations; the two assessments diverge, revealing a discrepancy between traditional metrics and human-judged effectiveness.
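As a sketch of the simplest detection baseline, the snippet below fits a logistic regression classifier over TF-IDF n-grams with scikit-learn. The toy texts and labels are invented placeholders; the paper's stronger baselines swap these sparse features for pretrained word embeddings fed to CNN/RNN classifiers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled posts; the released datasets supply the real ones.
texts = [
    "thanks for sharing, this was an interesting read",
    "people like you should be driven off this site",
    "i disagree with the article but appreciate the effort",
    "that whole group is subhuman and everyone knows it",
]
labels = [0, 1, 0, 1]  # 1 = post labeled as hate speech

# Bag-of-ngrams features piped into a linear classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)
print(clf.predict(["everyone from that group deserves to disappear"]))
```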
Findings and Implications
- Models' Performance: The RL model, despite modest scores on automatic metrics, was favored in human evaluations for generating effective, varied responses. This underlines a gap between current evaluation metrics and human perception in conversational AI tasks (see the sketch after this list).
- Dataset Diversity: The differing characteristics of the Reddit and Gab datasets underscore the need for adaptable models that can handle the data variance inherent across social platforms.
- Future Directions: Research into models that intervene in a context-aware way, adapting on the fly as conversations progress, could unlock substantial potential for AI moderation systems.
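The metric-versus-human gap noted above is easy to reproduce in miniature. The sketch below, using invented reference and candidate interventions, shows how sentence-level BLEU (via NLTK) rewards verbatim overlap while scoring a semantically valid but differently worded intervention near zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference intervention and two candidate model outputs.
reference = "please avoid slurs this language is hurtful and not welcome here".split()
parrot    = "please avoid slurs this language is hurtful and not welcome here".split()
varied    = "using that word demeans people consider rephrasing your point".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([reference], parrot, smoothing_function=smooth))  # 1.0: exact copy
print(sentence_bleu([reference], varied, smoothing_function=smooth))  # near 0, yet a valid intervention
```

An n-gram overlap metric cannot credit the second candidate, which is exactly the failure mode that makes diverse RL-generated responses score poorly despite being preferred by human judges.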
Conclusion
This paper establishes a foundational benchmark for moving from traditional detection to proactive engagement in hostile online environments using generative models. The fully labeled, context-rich datasets are a crucial resource for advancing AI's role in cultivating more constructive online spaces. Future work might explore how more sophisticated architectures or reinforcement learning strategies can refine these systems, striving for interventions that reflect the nuance of human-mediated discourse.