RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models (2009.11462v2)

Published 24 Sep 2020 in cs.CL

Abstract: Pretrained neural LLMs (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning "bad" words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.

PDF Abstract

Evaluating Toxic Language Generation in Neural LLMs

The paper "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in LLMs" addresses crucial issues related to the unintended generation of toxic language by pretrained neural LLMs (LMs). This research systematically investigates both the triggers and mitigation strategies for toxic language generation. The paper stands out by introducing RealToxicityPrompts, a dataset comprising 100K sentence-level prompts paired with toxicity scores. This resource is employed to evaluate the behavior of popular LMs under various conditions. Furthermore, the paper presents empirical findings on the effectiveness of different detoxification techniques.

Dataset and Methodology

The RealToxicityPrompts dataset is a significant contribution from this work. It consists of 100K naturally occurring English prompts derived from a large corpus of web text. Each prompt is associated with a toxicity score obtained from Perspective API. The dataset includes both toxic and non-toxic prompts, enabling the comprehensive evaluation of LMs under diverse inputs.

Several pretrained LMs are evaluated using this dataset, namely GPT-1, GPT-2, GPT-3, CTRL, and CTRL-Wiki. The assessment focuses on determining how these models, when conditioned on the provided prompts, generate toxic content. The paper also examines the generation of toxic content from a neutral starting point to understand the models' baseline propensities for toxic output.

Key Findings

Baseline Toxicity: Even without potentially toxic prompts, LMs can generate toxic text. For instance, GPT-2's unprompted generations can reach an expected maximum toxicity of 0.65 with just 100 generations, underscoring inherent risks in their use.
Prompted Toxicity: Conditioning LMs on non-toxic prompts can still lead to toxic outputs. Approximately half of all evaluated non-toxic prompts resulted in at least one toxic generation among 25 sampled generations. This finding highlights challenges in ensuring completely safe deployments of LMs.
Mitigation Strategies:
- Data-based Detoxification: Techniques like Domain-Adaptive Pretraining on non-toxic data (DAPT) showed promising results, reducing both the probability and severity of toxic generations significantly. However, extending pretraining alone was inadequate for comprehensive detoxification.
- Decoding-based Detoxification: Various methods such as vocabulary shifting, word filtering, and Plug-and-Play LLMs (PPLM) were evaluated. PPLM emerged as the most effective among decoding approaches, particularly under partially toxic or toxic prompts.

Analysis of Training Data

To understand root causes, the paper explores the toxicity within two major corpora: OpenAI-WT (used for GPT-2) and its open-source counterpart, OpenWebText Corpus (OWTC). Substantial portions of these corpora contained toxic, biased, and unreliable content. Additionally, the data's provenance analysis revealed significant shares from banned or unreliable news sources and online communities, raising concerns over the integrity of LMs' pretraining datasets.

Implications and Future Directions

The paper conveys several implications and recommendations for the future development and deployment of LMs:

Improving Transparency: Releasing comprehensive metadata about the pretraining data can shed light on LMs’ behavior and improve trust in these technologies.
Data Selection: Applying rigorous data curation strategies, potentially building on toxicity-free and reliable sources, is necessary to mitigate inherent biases and toxic behavior from pretrained LMs.
Advanced Steering Methods: Exploration of more sophisticated generation control mechanisms, possibly leveraging multi-dimensional bias detection systems or adaptive generation techniques that can dynamically respond to toxicity threats, are critical.
Human-Centric AI Development: Engaging diverse stakeholders and communities in the design and deployment of LMs can help align these technologies’ capabilities with ethical and societal expectations, reducing disparate impacts, particularly on marginalized groups.

Conclusion

The paper "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in LLMs" offers a rigorous analytical framework for assessing and mitigating toxic generations from LMs. By combining empirical evaluation with thorough data analysis, the paper not only elucidates the challenges inherent in using current LMs but also paves the way for more robust and responsible AI. The dataset and findings herein will be invaluable for ongoing and future research focused on aligning AI technologies with human values and safety requirements.

PDF Markdown Bookmark Chat (Pro)

Authors (5)

Samuel Gehman (2 papers)
Suchin Gururangan (29 papers)
Maarten Sap (86 papers)
Yejin Choi (287 papers)
Noah A. Smith (224 papers)

Citations (1,017)

View on Semantic Scholar

Related Papers

Find Related Papers

YouTube

Show All Videos