Improving Adversarial Data Collection for German Hate Speech Detection
Introduction
Detecting hate speech is a critical aspect of maintaining the safety and integrity of online spaces. Traditional datasets, derived from social media posts or comment sections, often contain biases that leave trained models lacking robustness and generalizability. This research introduces the German Adversarial Hate speech Dataset (GAHD), which aims to increase the diversity and efficiency of adversarial examples through distinct strategies for supporting annotators.
Dataset Creation and Annotation
GAHD's creation involved a dynamic adversarial data collection (DADC) process spanning four rounds, each employing a distinct strategy to help annotators craft or identify adversarial examples. The dataset comprises approximately 11,000 examples, balanced between hate speech and non-hate speech. Notably, the annotation process relied on a detailed definition of hate speech tailored to the German context, emphasizing cultural nuances and explicitly covering marginalized groups.
Strategies for Adversarial Data Collection
- Unguided Example Generation: In the initial round, annotators freely generated examples, which fostered creativity but also revealed challenges in applying the hate speech definition consistently.
- Translation of English Adversarial Examples: The second round supplied annotators with translated adversarial examples from existing English datasets, which they validated and corrected for the German context.
- Validation of Model-Flagged Newspaper Sentences: The third round drew on sentences from German newspapers that were presumed benign but flagged as hate speech by the target model, providing a rich source of potential adversarial instances.
- Contrastive Example Creation: In the final round, annotators created contrastive variants of previously collected examples, expressly designed to challenge the model's predictions and refine the dataset's ability to test and enhance model robustness.
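The newspaper-based strategy above amounts to harvesting the model's false positives: sentences presumed benign that the current classifier nonetheless flags. A minimal sketch, where `classify` is a hypothetical stand-in for the target model (the actual study used a trained hate speech classifier, not a keyword rule):

```python
def classify(sentence: str) -> str:
    """Toy stand-in for the target model: flags any sentence
    containing a trigger word. Purely illustrative."""
    triggers = {"hass", "angriff"}
    return "hate" if any(t in sentence.lower() for t in triggers) else "not_hate"

def mine_candidates(presumed_benign: list[str]) -> list[str]:
    """Return presumed-benign sentences the model wrongly flags as hate.
    These false positives are candidate adversarial examples that
    human annotators then validate or correct."""
    return [s for s in presumed_benign if classify(s) == "hate"]

news_sentences = [
    "Der Angriff auf die Burg fand 1525 statt.",  # historical, benign
    "Das Wetter wird morgen sonnig.",             # clearly benign
]
candidates = mine_candidates(news_sentences)
```

Only sentences that survive human validation as genuinely non-hateful become adversarial (non-hate) training examples.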
Dynamic Adversarial Data Collection Process
The iterative nature of DADC ensured continuous refinement of the target model, with each round folding the newly collected adversarial examples into the training data. This method not only improved the dataset's quality but also allowed for an examination of how different annotator support strategies affect the efficiency and diversity of the generated examples.
Model Evaluations and Benchmarks
GAHD presented a significant challenge to state-of-the-art hate speech detection models, including commercial APIs and LLMs. Notably, training models on GAHD resulted in substantial improvements in robustness, as evidenced by performance on both in-domain and out-of-domain test sets. The analysis also highlighted the varying effectiveness of adversarial examples generated through different support strategies, underscoring the value of mixing multiple strategies to produce a more resilient and comprehensive dataset.
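Benchmark comparisons of this kind are typically reported with macro-averaged F1, which weights the hate and non-hate classes equally on a balanced test set; the metric choice here is an assumption, not a detail stated above. A stdlib-only sketch:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for a single class from its confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true: list[int], y_pred: list[int]) -> float:
    """Average per-class F1 over the label set (0 = not hate, 1 = hate)."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)

score = macro_f1([1, 0, 1, 0], [1, 0, 0, 0])  # one missed hate example
```

Comparing macro-F1 on an in-domain test set against an out-of-domain one is the usual way to quantify the robustness gains reported here.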
Implications and Future Directions
The research demonstrates the viability and benefit of employing diversified strategies in adversarial data collection to improve hate speech detection models. By supporting annotators in generating more diverse and challenging examples, the resulting dataset offers a robust resource for training and evaluating hate speech detection models. Future work could explore additional methods for annotator support, including leveraging LLMs for augmentations and perturbations, to further enhance dataset diversity and model performance.
Conclusion
GAHD marks a significant advancement in the collection of adversarial data for hate speech detection, emphasizing the importance of diverse and efficient example generation. The strategies outlined in this paper not only contribute to the development of more robust models but also offer insights into optimizing the adversarial data collection process for future research.