ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection (2203.09509v4)

Published 17 Mar 2022 in cs.CL

Abstract: Toxic language detection systems often falsely flag text that contains minority group mentions as toxic, as those groups are often the targets of online hate. Such over-reliance on spurious correlations also causes systems to struggle with detecting implicitly toxic language. To help mitigate these issues, we create ToxiGen, a new large-scale and machine-generated dataset of 274k toxic and benign statements about 13 minority groups. We develop a demonstration-based prompting framework and an adversarial classifier-in-the-loop decoding method to generate subtly toxic and benign text with a massive pretrained LLM. Controlling machine generation in this way allows ToxiGen to cover implicitly toxic text at a larger scale, and about more demographic groups, than previous resources of human-written text. We conduct a human evaluation on a challenging subset of ToxiGen and find that annotators struggle to distinguish machine-generated text from human-written language. We also find that 94.5% of toxic examples are labeled as hate speech by human annotators. Using three publicly-available datasets, we show that finetuning a toxicity classifier on our data improves its performance on human-written data substantially. We also demonstrate that ToxiGen can be used to fight machine-generated toxicity as finetuning improves the classifier significantly on our evaluation subset. Our code and data can be found at https://github.com/microsoft/ToxiGen.

ToxiGen: Enhancing Toxicity Detection with a Machine-Generated Dataset

The paper "ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection" presents an approach to generating a large dataset aimed at improving the detection of implicit and adversarial hate speech. It responds to a known weakness of current toxicity detection systems: they frequently flag text that merely mentions minority groups as toxic because of spurious correlations, and they miss implicit toxicity that lacks explicit markers such as profanity or slurs.

Dataset and Methodology

ToxiGen comprises 274,186 toxic and benign statements about 13 minority groups, generated with GPT-3. The authors combine demonstration-based prompting with a classifier-in-the-loop decoding strategy called ALICE (Adversarial Language Imitation with Constrained Exemplars), which steers generation toward statements that an existing toxicity classifier is likely to mislabel: toxic text that the classifier scores as benign, and benign text that it scores as toxic.
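
To make the idea of classifier-in-the-loop decoding concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names, the equal weights, and the candidate scores are all hypothetical, and a real system would apply this kind of scoring inside beam search over a language model's tokens. The sketch only shows the core idea of combining a language model score with a classifier score so that the chosen continuation is both fluent and likely to be mislabeled by the classifier.

```python
import math

def build_prompt(demonstrations):
    """Format example statements as a bulleted list that the language model
    is expected to continue (demonstration-based prompting)."""
    return "".join(f"- {d}\n" for d in demonstrations) + "-"

def adversarial_score(lm_logprob, p_toxic, prompt_is_toxic, w_lm=0.5, w_clf=0.5):
    """Combine fluency with how strongly the classifier disagrees with the prompt.

    lm_logprob: log-probability of the candidate under the language model.
    p_toxic: classifier's probability that the candidate is toxic.
    The weights w_lm and w_clf are illustrative assumptions.
    """
    # For toxic prompts, reward candidates the classifier calls benign;
    # for benign prompts, reward candidates the classifier calls toxic.
    p_fooled = (1.0 - p_toxic) if prompt_is_toxic else p_toxic
    return w_lm * lm_logprob + w_clf * math.log(max(p_fooled, 1e-12))

def pick_candidate(candidates, prompt_is_toxic):
    """Return the candidate with the highest combined score."""
    return max(
        candidates,
        key=lambda c: adversarial_score(c["lm_logprob"], c["p_toxic"], prompt_is_toxic),
    )

# Hypothetical candidates with made-up scores, for illustration only.
candidates = [
    {"text": "candidate A", "lm_logprob": -1.0, "p_toxic": 0.9},
    {"text": "candidate B", "lm_logprob": -1.5, "p_toxic": 0.2},
]
best = pick_candidate(candidates, prompt_is_toxic=True)
print(best["text"])
```

With a toxic prompt, the sketch prefers candidate B even though the language model finds it slightly less likely, because the classifier scores it as benign. Flipping `prompt_is_toxic` to `False` reverses the preference.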

The dataset is balanced between toxic and benign statements for each demographic group, which mitigates a bias common in datasets scraped from online platforms, where mentions of minority groups skew toward toxic examples. It also emphasizes implicit hate speech, which lacks clear hate markers: the authors report that 98.2% of ToxiGen's statements are implicit.
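
As a rough illustration of what per-group balance means in practice, the following Python sketch computes the share of toxic statements for each group. The `(group, label)` row layout and the group names are assumptions made for this example and do not reflect the released dataset's actual schema.

```python
from collections import Counter

def toxic_share_by_group(rows):
    """Return the fraction of statements labeled 'toxic' for each group.

    rows: iterable of (group, label) pairs, where label is 'toxic' or 'benign'.
    """
    totals, toxic = Counter(), Counter()
    for group, label in rows:
        totals[group] += 1
        if label == "toxic":
            toxic[group] += 1
    return {group: toxic[group] / totals[group] for group in totals}

# Hypothetical rows, for illustration only.
rows = [
    ("group_a", "toxic"), ("group_a", "benign"),
    ("group_b", "toxic"), ("group_b", "benign"),
    ("group_b", "benign"), ("group_b", "toxic"),
]
shares = toxic_share_by_group(rows)
print(shares)
```

A share near 0.5 for every group indicates the kind of balance described above, whereas scraped data would typically show much higher toxic shares for groups that are frequent targets of online hate.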

Evaluation and Findings

In a human evaluation, annotators judged whether statements were machine-generated or human-written and labeled their toxicity. Annotators struggled to tell the two apart, mistaking machine-generated text for human writing about 90% of the time. In addition, fine-tuning toxicity classifiers on ToxiGen improved their performance by 7-19% on established human-written datasets.

Implications and Outlook

This work offers a way to improve how existing classifiers handle nuanced language about minority groups, so that they catch not only overt hate but also implicit toxic content. Better classifiers can in turn improve moderation on online platforms and may reduce the censorship and marginalization of minority voices that occur when benign content is falsely classified as toxic.

The research also acknowledges the systemic bias present in LLMs and proposes that controlled machine generation can bootstrap better toxicity detection. The open release of code and data supports replication and further research. Future work could cover more demographic groups, improve the quality and control of generated text, and integrate human perspectives more fully to address the subjectivity of hate speech judgments.

Conclusion

"ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection" makes a significant contribution to the study and improvement of hate speech detection systems. By combining demonstration-based prompting with classifier-in-the-loop decoding, it addresses the complexities of implicit toxic language, with applications relevant to both AI research and society. The framework and dataset provide a benchmark for subsequent toxic language detection research and a foundation for improving inclusivity and moderation outcomes aligned with socio-technical governance strategies.