Abstract

Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators so that they create more diverse adversarial examples more efficiently, and we provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at https://github.com/jagol/gahd.

Figure: Workflow showing annotators validating translations of adversarial English examples in the DADC process for round 2 (R2).

Overview

  • The research introduces the German Adversarial Hate Speech Dataset (GAHD) aimed at improving hate speech detection by enhancing the diversity and efficiency of adversarial examples.

  • GAHD was built through a dynamic adversarial data collection (DADC) process over four rounds, each applying a different strategy to support annotators in generating or identifying adversarial examples.

  • The dataset contains around 11,000 examples, balanced between hate speech and non-hate speech, with a focus on the German cultural context and on the inclusion of marginalized groups.

  • Model evaluations show that GAHD is challenging even for commercial APIs and LLMs, and that training on the dataset significantly improves the robustness of hate speech detection models.

Improving Adversarial Data Collection for German Hate Speech Detection

Introduction

Detecting hate speech is a critical aspect of maintaining the safety and integrity of online spaces. Traditional datasets, derived from social media or comment sections, often contain biases that result in models lacking robustness and generalizability. This research introduces the German Adversarial Hate speech Dataset (GAHD), which aims to enhance the diversity and efficiency of adversarial examples through novel strategies for supporting annotators.

Dataset Creation and Annotation

GAHD's creation involved a dynamic adversarial data collection (DADC) process across four rounds, each employing a distinct strategy to aid annotators in crafting or identifying adversarial examples. The dataset comprises approximately 11,000 examples, with a balanced representation of hate speech and non-hate speech. Notably, the annotation process used a detailed definition of hate speech tailored to the German context, emphasizing cultural nuances and the inclusion of marginalized groups.
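
For readers who want to inspect the dataset, a minimal loading sketch is shown below. The file name (`gahd.csv`) and column names (`text`, `label`) are assumptions for illustration, not the repository's documented schema; consult https://github.com/jagol/gahd for the actual files.

```python
# Minimal sketch for inspecting GAHD with pandas.
# NOTE: file and column names below are assumptions; see the
# repository (https://github.com/jagol/gahd) for the real schema.
import pandas as pd

df = pd.read_csv("gahd.csv")           # hypothetical file name
print(len(df))                         # expect roughly 11,000 rows
print(df["label"].value_counts())      # hypothetical binary label column
```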

Strategies for Adversarial Data Collection

  • Unguided Example Generation (Round 1): Annotators freely generated adversarial examples, fostering creativity but also revealing challenges in consistently applying the hate speech definition.
  • Translation and Validation (Round 2): Annotators validated machine translations of adversarial examples from English datasets, providing a rich source of potential adversarial instances.
  • Newspaper Sentences (Round 3): Annotators reviewed sentences from German newspapers that were presumed benign but flagged by the model as hate speech, surfacing the model's false positives.
  • Contrastive Example Creation (Round 4): Annotators created examples expressly designed to flip the model's predictions, refining the dataset's ability to test and enhance model robustness. The acceptance check shared by all four rounds is sketched after this list.
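
What these support strategies share is a simple acceptance criterion: a candidate counts as adversarial when the target model's prediction disagrees with the annotators' gold label. A minimal sketch of that check follows; the checkpoint name is a placeholder, not the target model used in the paper.

```python
# Sketch of the per-candidate adversarial check: an example "fools" the
# model if the predicted label differs from the human gold label.
# The model checkpoint is a placeholder, not the paper's target model.
from transformers import pipeline

clf = pipeline("text-classification", model="some-org/german-hate-speech-model")

def is_adversarial(text: str, gold_label: str) -> bool:
    """Return True if the model misclassifies this example."""
    predicted = clf(text)[0]["label"]
    return predicted != gold_label
```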

Dynamic Adversarial Data Collection Process

The iterative nature of DADC ensured continuous refinement of the target model, with each round's newly collected adversarial examples being added to the training data. This method not only improved the dataset's quality but also allowed for an examination of how the different annotator support strategies affect the efficiency and diversity of the generated examples.
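
Below is a schematic of a single DADC round, under the simplifying assumption that a round reduces to filtering annotator candidates against the current model before retraining; the paper's concrete training setup is not reproduced here, and the types and names are illustrative.

```python
# Schematic of one DADC round: keep the candidates that fool the current
# model; these are then added to the training data and the model is retrained.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    text: str
    label: str  # e.g., "hate" or "not-hate"

def dadc_round(predict: Callable[[str], str],
               candidates: List[Example]) -> List[Example]:
    """Return the candidates that the current target model misclassifies."""
    return [ex for ex in candidates if predict(ex.text) != ex.label]

# Toy usage: a degenerate model that never predicts hate misses every
# hateful candidate, so all of them count as adversarial.
fooled = dadc_round(lambda _: "not-hate",
                    [Example("toy hateful sentence", "hate")])
```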

Model Evaluations and Benchmarks

GAHD presented a significant challenge to state-of-the-art hate speech detection models, including commercial APIs and LLMs. Notably, training models on GAHD resulted in substantial improvements in robustness, as evidenced by performance on both in-domain and out-of-domain test sets. The analysis also highlighted the varying effectiveness of adversarial examples generated through different support strategies, underscoring the value of mixing multiple strategies to produce a more resilient and comprehensive dataset.
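
As a rough sketch of this kind of benchmarking, one could score any classifier on a GAHD-style test split with macro-F1. The split file, column names, and checkpoint below are placeholders; the paper's exact models and metrics are reported there.

```python
# Sketch: score a classifier on a GAHD-style test split with macro-F1.
# File, column, and checkpoint names are placeholders.
import pandas as pd
from sklearn.metrics import f1_score
from transformers import pipeline

test = pd.read_csv("gahd_test.csv")                        # hypothetical split
clf = pipeline("text-classification",
               model="some-org/german-hate-speech-model")  # placeholder

predictions = [clf(text)[0]["label"] for text in test["text"]]
print("macro-F1:", f1_score(test["label"], predictions, average="macro"))
```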

Implications and Future Directions

The research demonstrates the viability and benefit of employing diversified strategies in adversarial data collection to improve hate speech detection models. By supporting annotators in generating more diverse and challenging examples, the resulting dataset offers a robust resource for training and evaluating hate speech detection models. Future work could explore additional methods for annotator support, including leveraging LLMs for augmentations and perturbations, to further enhance dataset diversity and model performance.
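
As one speculative illustration of such LLM support (not a procedure from the paper), an annotator-assistance tool might prompt an instruction-tuned LLM to minimally perturb an existing example so that its label flips, yielding contrastive candidates for human review. The prompt wording below is an assumption.

```python
# Speculative sketch of an LLM perturbation prompt for generating
# contrastive candidates; the prompt wording is not from the paper.
def perturbation_prompt(sentence: str, label: str) -> str:
    target = "non-hateful" if label == "hate" else "hateful"
    return (
        "Minimally edit the following German sentence so that it becomes "
        f"{target}. Change as few words as possible and return only the "
        f"edited sentence.\n\nSentence: {sentence}"
    )
```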

Conclusion

GAHD marks a significant advancement in the collection of adversarial data for hate speech detection, emphasizing the importance of diverse and efficient example generation. The strategies outlined in this paper not only contribute to the development of more robust models but also offer insights into optimizing the adversarial data collection process for future research.
