
Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection (2012.15761v2)

Published 31 Dec 2020 in cs.CL and cs.LG

Abstract: We present a human-and-model-in-the-loop process for dynamically generating datasets and training better performing and more robust hate detection models. We provide a new dataset of ~40,000 entries, generated and labelled by trained annotators over four rounds of dynamic data creation. It includes ~15,000 challenging perturbations and each hateful entry has fine-grained labels for the type and target of hate. Hateful entries make up 54% of the dataset, which is substantially higher than comparable datasets. We show that model performance is substantially improved using this approach. Models trained on later rounds of data collection perform better on test sets and are harder for annotators to trick. They also perform better on HateCheck, a suite of functional tests for online hate detection. We provide the code, dataset and annotation guidelines for other researchers to use. Accepted at ACL 2021.

Learning from the Worst: Dynamically Generated Datasets to Improve Online Hate Detection

The paper by Vidgen et al. presents an approach to improving hate speech detection through dynamically generated datasets. The research introduces a "human-and-model-in-the-loop" methodology that iteratively combines human annotation with model training to enhance the performance and robustness of hate detection systems. The key contribution is a dynamically created dataset of approximately 40,000 entries, including around 15,000 challenging perturbations designed to make models more robust. With 54% of entries labelled as hateful, the dataset has a substantially higher proportion of hateful content than comparable benchmarks, and each hateful entry carries fine-grained labels for the type and target of hate.
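To make the dataset's composition concrete, the snippet below sketches what a few entries might look like once loaded. The column names, field values, and example texts are illustrative assumptions, not the released schema.

```python
# Hypothetical sketch of per-entry structure (column names and example rows
# are invented for illustration; consult the released dataset for the actual
# schema). Each entry carries a binary label and a collection round, and
# hateful entries additionally record the type and target of hate.
import pandas as pd

entries = pd.DataFrame([
    {"text": "Group X are a plague on this country", "label": "hate",
     "round": 1, "type": "dehumanization", "target": "group X"},
    {"text": "Locusts are a plague on this country", "label": "nothate",
     "round": 3, "type": None, "target": None},
])

# In the full dataset roughly 54% of entries are labelled hateful.
print(entries["label"].value_counts(normalize=True))
```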

Methodological Innovations

The approach involves multiple rounds of interactive data collection and model training. A classifier is first trained on existing hate speech datasets; human annotators then probe its weaknesses by writing entries designed to be misclassified. Each subsequent round retrains the model on the accumulated data, and the cycle of data generation and retraining continues over four rounds, progressively strengthening both the dataset and the model, as sketched below.
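As a rough illustration of how such a loop could be wired up, the sketch below uses a TF-IDF plus logistic regression classifier as a stand-in for the paper's RoBERTa model and a scripted search over a small candidate pool as a stand-in for the trained human annotators. It is a toy caricature of the round structure under those assumptions, not the authors' implementation.

```python
# Toy sketch of a human-and-model-in-the-loop round structure. A TF-IDF +
# logistic regression classifier stands in for RoBERTa, and a scripted
# "annotator" that harvests misclassified texts from a candidate pool stands
# in for human annotators; both substitutions are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_model(texts, labels):
    """Fit a stand-in hate/not-hate classifier on the current dataset."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    return model


def annotator_round(model, pool_texts, pool_labels):
    """Stand-in for annotators: keep pool entries the current model gets wrong."""
    preds = model.predict(pool_texts)
    return [(t, y) for t, y, p in zip(pool_texts, pool_labels, preds) if y != p]


# Seed data: a toy stand-in for the existing corpora used before round 1.
texts = ["you are wonderful", "group X should disappear",
         "have a nice day", "I despise group Y"]
labels = ["nothate", "hate", "nothate", "hate"]

# Toy candidate pool; in the real process annotators write entries on the fly.
pool_texts = ["group X are all criminals", "my dog is called X",
              "group Y ruin everything", "lovely weather today"]
pool_labels = ["hate", "nothate", "hate", "nothate"]

for round_id in range(1, 5):  # four rounds, mirroring the paper's setup
    model = train_model(texts, labels)
    fooling_entries = annotator_round(model, pool_texts, pool_labels)
    texts += [t for t, _ in fooling_entries]
    labels += [y for _, y in fooling_entries]
    print(f"Round {round_id}: added {len(fooling_entries)} model-fooling entries")
```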

The first round relies on synthetic data written from scratch by annotators; later rounds add "contrast sets" of perturbations. These perturbations preserve much of the original text's surface form and lexical cues while flipping the label, creating an adversarial setting that pushes the model to recognize subtle distinctions in language rather than relying on keyword artifacts. An illustrative pair is shown after this paragraph.
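The contrast-set idea is easiest to see with a pair of entries. The texts below are invented illustrations, not examples from the released dataset; the perturbed entry keeps most of the original's surface form while the gold label flips.

```python
# Invented contrast-set pair (not from the released dataset): the perturbation
# reuses the original's lexical cues, so a keyword-reliant model is likely to
# mislabel one side of the pair.
contrast_pair = {
    "original":  {"text": "Group X are vermin and should be driven out", "label": "hate"},
    "perturbed": {"text": "Rats are vermin and should be driven out",    "label": "nothate"},
}

for name, entry in contrast_pair.items():
    print(f"{name}: {entry['label']:>7}  |  {entry['text']}")
```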

Numerical Results and Claims

Quantitatively, models trained on later rounds of data perform markedly better, both on held-out test sets and on HateCheck, an independent suite of functional tests for online hate detection. For instance, accuracy on the HateCheck functional tests improves from 60% for the R1-trained model to 95% for the R4-trained model, indicating substantial gains in the robustness and generalization capabilities of the trained models.

Implications and Future Directions

The research has profound implications for the field of NLP and AI, especially within the domain of hate speech detection. By leveraging dynamic benchmarking and contrast sets, this methodology can reduce overfitting and improve model robustness against adversarial content—an increasingly critical task as social media platforms strive for more refined content moderation tools.

Theoretically, this paper opens avenues for exploring how adversarial data generation can be applied to other domains, such as sentiment analysis or misinformation detection, where models similarly struggle with subtle distinctions. Practically, the development of datasets that are adaptively generated in response to evolving online discourse presents significant potential to keep pace with the rapidly changing landscape of online communications.

In future work, expanding the diversity of annotators and exploring alternative models beyond RoBERTa within the loop could provide further enhancements. Extending the approach to multilingual settings or cross-domain applications could also improve generalizability. The open access to the data, code, and annotation guidelines provided by the authors sets a precedent for collaborative advances in this challenging field.

The paper represents a significant step forward in the ongoing development of NLP tools that facilitate safer online communities, addressing implicit and explicit biases in online communications through a rigorously generated dynamic dataset.

Authors (4)
  1. Bertie Vidgen (35 papers)
  2. Tristan Thrush (23 papers)
  3. Zeerak Waseem (7 papers)
  4. Douwe Kiela (85 papers)
Citations (211)