Emergent Mind

Abstract

Hate speech detection models are only as good as the data they are trained on. Datasets sourced from social media suffer from systematic gaps and biases, leading to unreliable models with simplistic decision boundaries. Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem. However, adversarial data collection can be slow and costly, and individual annotators have limited creativity. In this paper, we introduce GAHD, a new German Adversarial Hate speech Dataset comprising ca. 11k examples. During data collection, we explore new strategies for supporting annotators so that they create more diverse adversarial examples more efficiently, and we provide a manual analysis of annotator disagreements for each strategy. Our experiments show that the resulting dataset is challenging even for state-of-the-art hate speech detection models, and that training on GAHD clearly improves model robustness. Further, we find that mixing multiple support strategies is most advantageous. We make GAHD publicly available at https://github.com/jagol/gahd.
[Figure: Workflow of annotators validating translations of adversarial English examples in the DADC process for round 2 (R2).]

Overview

  • The research introduces the German Adversarial Hate Speech Dataset (GAHD) aimed at improving hate speech detection by enhancing the diversity and efficiency of adversarial examples.

  • GAHD incorporates a dynamic adversarial data collection (DADC) process over four rounds, each applying different strategies to support annotators in generating or identifying adversarial examples.

  • The dataset contains around 11,000 examples, balancing hate speech and non-hate speech, with a focus on the German cultural context and the inclusion of marginalized groups.

  • Model evaluations showed that GAHD challenges even commercial APIs and LLMs, and that training on the dataset significantly improves the robustness of hate speech detection models.

Introduction

Detecting hate speech is a critical aspect of maintaining the safety and integrity of online spaces. Traditional datasets, derived from social media or comments sections, often contain biases that result in models lacking robustness and generalizability. This research introduces the German Adversarial Hate speech Dataset (GAHD), which aims to enhance the diversity and efficiency of adversarial examples through new strategies for supporting annotators.

Dataset Creation and Annotation

GAHD's creation involved a dynamic adversarial data collection (DADC) process across four rounds, each employing a distinct strategy to aid annotators in crafting or identifying adversarial examples. The dataset encompasses approximately 11,000 examples, with a balanced representation of hate speech and non-hate speech categories. Notably, the annotation process included a detailed definition of hate speech tailored to the German context, emphasizing cultural nuances and inclusive of marginalized groups.

Strategies for Adversarial Data Collection

  • Unguided Example Generation: The initial round allowed annotators to freely generate examples, fostering creativity but also revealing challenges in consistently applying hate speech definitions.

  • Translation and Validation: The second round had annotators validate and correct translations of adversarial examples from English datasets.

  • Newspaper Sentence Validation: The third round surfaced sentences from German newspapers that were presumed benign but flagged by the model as hate speech, providing a rich source of potential adversarial instances.

  • Contrastive Example Creation: The final round focused on generating examples expressly designed to challenge the model's predictions, refining the dataset's ability to test and enhance model robustness.
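The newspaper-sentence strategy above amounts to mining a model's false positives. The sketch below is illustrative, not code from the paper: `toy_classify` is a stand-in keyword classifier, whereas the paper's setup would run its fine-tuned target model over large newspaper corpora.

```python
# Hypothetical sketch of false-positive mining: run a hate speech classifier
# over presumably benign newspaper sentences and keep those flagged as hate.
# Since the sentences are presumed benign, flagged ones are likely false
# positives, i.e. candidate adversarial non-hate examples that annotators
# then validate.

def mine_candidates(sentences, classify):
    """Return the sentences that `classify` labels as "hate"."""
    return [s for s in sentences if classify(s) == "hate"]

# Toy stand-in classifier: flags any sentence containing a trigger word.
# (A real setup would use the fine-tuned target model instead.)
def toy_classify(sentence):
    return "hate" if "attack" in sentence.lower() else "not_hate"

news_sentences = [
    "The minister announced a new budget.",
    "Critics attack the proposal as short-sighted.",
]
candidates = mine_candidates(news_sentences, toy_classify)
# `candidates` now holds the benign sentence containing "attack", a
# plausible false positive for a brittle, keyword-sensitive model.
```

Annotators would then confirm which mined candidates are genuinely benign, turning the model's own errors into labeled adversarial examples.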

Dynamic Adversarial Data Collection Process

The iterative nature of DADC ensured continuous refinement of the target model, with each round incorporating the newly collected adversarial examples into the training data. This method not only improved the dataset's quality but also allowed for an examination of how different annotator support strategies affect the efficiency and diversity of the generated examples.
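The loop just described can be sketched in a few lines. All function names below are illustrative, and the toy `fit` is a keyword memorizer standing in for the paper's actual fine-tuning of a transformer model:

```python
def dadc(train_data, fit, collect_round, n_rounds=4):
    """Toy dynamic adversarial data collection (DADC) loop: each round,
    keep only the collected examples that fool the current model, add
    them to the training data, and retrain."""
    model = fit(train_data)
    for round_idx in range(n_rounds):
        collected = collect_round(round_idx, model)
        adversarial = [ex for ex in collected
                       if model(ex["text"]) != ex["label"]]
        train_data = train_data + adversarial
        model = fit(train_data)  # retrain on the augmented data
    return model, train_data

# Toy "fit": predicts "hate" iff the text shares a word with any
# hate-labeled training example (a stand-in for real fine-tuning).
def toy_fit(data):
    hate_words = {w for ex in data if ex["label"] == "hate"
                  for w in ex["text"].split()}
    return lambda text: ("hate" if set(text.split()) & hate_words
                         else "not_hate")

seed = [{"text": "idiots everywhere", "label": "hate"},
        {"text": "what a nice day", "label": "not_hate"}]

def toy_collect(round_idx, model):
    # One annotator-written candidate in round 0, nothing afterwards.
    if round_idx == 0:
        return [{"text": "they are vermin", "label": "hate"}]
    return []

final_model, final_data = dadc(seed, toy_fit, toy_collect)
```

The filtering step is the key design choice: an example only enters the dataset if the current model misclassifies it, so each round targets the weaknesses that remain after the previous retraining.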

Model Evaluations and Benchmarks

GAHD presented a significant challenge to state-of-the-art hate speech detection models, including commercial APIs and LLMs. Notably, training models on GAHD resulted in substantial improvements in robustness, as evidenced by performance on both in-domain and out-of-domain test sets. The analysis also highlighted the varying effectiveness of adversarial examples generated through different support strategies, underscoring the value of mixing multiple strategies to produce a more resilient and comprehensive dataset.
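The robustness comparison above reduces to scoring each model on several test sets with a shared metric. A self-contained macro-F1 helper is sketched below in pure Python; the paper's exact evaluation code is not shown here, so treat this as an illustrative sketch rather than its actual setup.

```python
def macro_f1(gold, pred, labels=("hate", "not_hate")):
    """Macro-averaged F1 over the given labels."""
    f1s = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Score a model's predictions on each test set separately; robustness
# shows up as a smaller gap between in-domain and out-of-domain scores.
gold_ood = ["hate", "hate", "not_hate", "not_hate"]
pred_ood = ["hate", "not_hate", "not_hate", "not_hate"]
score = macro_f1(gold_ood, pred_ood)
```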

Implications and Future Directions

The research demonstrates the viability and benefit of employing diversified strategies in adversarial data collection to improve hate speech detection models. By supporting annotators in generating more diverse and challenging examples, the resulting dataset offers a robust resource for training and evaluating hate speech detection models. Future work could explore additional methods for annotator support, including leveraging LLMs for augmentations and perturbations, to further enhance dataset diversity and model performance.

Conclusion

GAHD marks a significant advancement in the collection of adversarial data for hate speech detection, emphasizing the importance of diverse and efficient example generation. The strategies outlined in this paper not only contribute to the development of more robust models but also offer insights into optimizing the adversarial data collection process for future research.


