
ToxiGen: A Dataset for Implicit Toxicity

Updated 19 November 2025
  • ToxiGen is a large-scale, machine-generated dataset designed to detect subtle, implicit toxic language in minority group contexts, enabling fine-grained model evaluation.
  • It employs demonstration-based prompting and adversarial classifier-in-the-loop decoding to generate a balanced mix of benign and toxic statements.
  • The dataset supports robust finetuning of toxicity classifiers, leading to significant performance gains on both human-written and machine-generated test sets.

ToxiGen is a large-scale, machine-generated dataset designed to advance the detection of adversarial and implicit hate speech, particularly in contexts where existing toxicity classifiers conflate minority group mentions with toxicity or struggle to identify subtle, non-explicit forms of harm. Comprising 274,186 toxic and benign statements about 13 minority groups, ToxiGen leverages demonstration-based prompting and adversarial classifier-in-the-loop decoding to systematically surface implicit bias and adversarially-crafted toxic language, supporting both fine-grained evaluation and robust finetuning of toxicity classification models (Hartvigsen et al., 2022).

1. Dataset Composition and Coverage

ToxiGen consists of approximately equal proportions of toxic and benign statements (137,093 each), with 98.2% of all statements being implicit—lacking overt profanity, slurs, or explicit swearwords. The dataset focuses on 13 minority groups: Black, Asian, Native American, Latino, Jewish, Muslim, Chinese, Mexican, Middle Eastern, LGBTQ+, Women, Mental Disability, and Physical Disability.

The primary splits are as follows:

  • Training set: ~273,394 statements for model development, excluding the human-validated test set.
  • Human-validated test set (ToxiGen-HumanVal): 792 statements annotated by three independent annotators per item.
  • Adversarial subset (Alice): 14,174 examples generated by adversarial classifier-in-the-loop decoding to challenge toxicity classifiers.
  • Held-out large-scale validation: 8,960 statements subjected to additional human annotation.

A breakdown of statement counts, average lengths, and implicit toxicity rates for representative groups is shown below.

Group                 Benign Count   Toxic Count   Avg. Chars (μ ± σ)   % Implicit (Benign / Toxic)
Black                 10,554         10,306        112.3 ± 40.1         99.3 / 96.2
Asian                 10,422         10,813        93.0 ± 38.9          99.7 / 93.9
LGBTQ+                11,596         10,695        111.4 ± 39.1         98.8 / 96.2
Women                 11,094         10,535        63.9 ± 35.1          99.9 / 98.3
Mental Disability     10,293         10,372        107.9 ± 44.9         99.9 / 99.8
Physical Disability   10,319         10,645        89.4 ± 43.6          99.9 / 98.4

Across all groups, the dataset achieves an overall average statement length of approximately 90 characters (± 42). Alice-generated outputs are on average slightly longer (102.2 ± 33.1 characters) than top-k samples (88.0 ± 41.9 characters).

2. Data Generation Methodologies

ToxiGen employs two complementary machine generation protocols:

2.1 Demonstration-Based Prompting

To induce GPT-3 to generate either subtly toxic or benign statements regarding a specific minority group, the framework curates two pools (20–50 demonstrations each) of human-written examples per group: one for benign and one for implicitly toxic content. For each generation, five demonstrations are sampled, formatted as a continuation list using hyphens, and provided as a prompt to GPT-3. The process is human-in-the-loop, iteratively augmenting demonstration pools by manual inspection and selection of high-quality generated samples.
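For concreteness, the prompt-construction step can be sketched as follows. This is a minimal illustration only: the demonstration pool and sampling size shown here are hypothetical, and the actual GPT-3 call and curated per-group pools belong to the released pipeline.

```python
import random

def build_prompt(demonstrations, k=5, seed=None):
    """Sample k human-written demonstrations and format them as a
    hyphen-prefixed continuation list, leaving a trailing hyphen for
    the language model to complete with a new statement."""
    rng = random.Random(seed)
    demos = rng.sample(demonstrations, k)
    lines = ["- " + d for d in demos]
    lines.append("-")  # open item that GPT-3 continues
    return "\n".join(lines)

# Hypothetical benign demonstration pool for one target group
# (the released pools contain 20-50 human-written statements per group).
benign_pool = [
    "many families in the community volunteer at the local food bank",
    "the neighborhood association translated the meeting notes so everyone could participate",
    "students from the group organized a successful science fair",
]

print(build_prompt(benign_pool, k=3, seed=0))  # k=5 in the paper; 3 here to fit the toy pool
```

High-quality generations produced from such prompts are then manually reviewed and, where suitable, added back to the demonstration pools, which is the human-in-the-loop step described above.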

2.2 Adversarial Classifier-in-the-Loop Decoding ("Alice")

Designed to produce examples that deceive a pretrained toxicity classifier (e.g., HateBERT), Alice modifies beam search with a combined scoring function:

$$S(w_{i+1}) = \lambda_L \log P_{LM}(w_{i+1} \mid w_{0:i}, \text{prompt}) + \lambda_C \log P_{CLF}(\text{label} \mid w_{0:i+1})$$

where $P_{LM}$ is the GPT-3 next-token probability and $P_{CLF}$ is the classifier's probability of the target class (benign or toxic), with $\lambda_L = \lambda_C = 0.5$. Adversarial objectives include maximizing classifier misclassification rates:

  • False negatives: Encourage the classifier to label toxic text as benign.
  • False positives: Induce harmless text to be misclassified as toxic.

Generation parameters are beam size = 10, max length = 30 tokens, temperature = 0.9, restricting the vocabulary to the top 100 tokens and disallowing prompt token copying.
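A minimal sketch of the per-step scoring rule is given below. It is illustrative only: the numbers are made up, and the released implementation wires GPT-3 log-probabilities and a HateBERT-style classifier into constrained beam search with the parameters listed above.

```python
import math

def alice_score(lm_logprob, clf_prob_target, lambda_l=0.5, lambda_c=0.5):
    """Combined score for one candidate continuation:
    S(w_{i+1}) = lambda_L * log P_LM(w_{i+1} | w_{0:i}, prompt)
               + lambda_C * log P_CLF(target_label | w_{0:i+1})"""
    return lambda_l * lm_logprob + lambda_c * math.log(clf_prob_target)

def rescore_candidates(candidates):
    """candidates: (token, lm_logprob, clf_prob_target) for the top tokens
    proposed by the language model at the current beam step.
    Returns candidates ranked by the adversarial combined score."""
    scored = [(tok, alice_score(lp, cp)) for tok, lp, cp in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy step: to induce false negatives, the target label is "benign" even
# though the text being generated is toxic, so tokens the classifier reads
# as benign are pushed up relative to a pure language-model ranking.
print(rescore_candidates([
    ("they",   -1.2, 0.80),
    ("those",  -1.5, 0.90),
    ("people", -0.9, 0.40),
]))
```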

3. Human Annotation and Evaluation Protocol

Annotation utilizes Amazon Mechanical Turk with 156 pre-qualified annotators and 3 annotators per item. The ToxiGen-HumanVal set contains 792 GPT-3–generated statements selected to avoid high similarity (≥ 0.7 cosine) to training data. For each statement, annotators respond to a battery of questions:

  • HumanOrAI: Binary judgment of authorship.
  • harmfulIfAI/harmfulIfHuman: 1–5 scale rating of harm/offensiveness.
  • harmfulIntent: Assessment of intent to harm.
  • PosStereo, Lewd, whichGroup, groupFraming, FactOrOpinion: Stereotype, sexual content, demographic references, framing, and factuality/opinion assessments.

Inter-annotator agreement yields Fleiss’ κ = 0.46 (moderate), Krippendorff’s α = 0.64, with 55.2% full agreement and 93.4% majority agreement.

Key findings:

  • 90.5% of machine-generated statements are mistaken for human-written (majority vote).
  • 94.5% of machine-generated toxic statements are labeled as hate speech by annotators.
  • 30.2% score >3/5 on harm; 4% are ambiguous (3/5).
  • Perceived authorship (human vs. AI) does not significantly affect toxicity ratings.
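A hedged sketch of how per-item majority labels and the full/majority agreement rates cited above could be computed from three annotations per statement is shown below; the label vocabulary and the exact aggregation used for the released annotation files may differ.

```python
from collections import Counter

def aggregate(labels_per_item):
    """labels_per_item: one tuple of annotator labels per statement
    (three annotators per item in ToxiGen-HumanVal).
    Returns majority-vote labels plus full- and majority-agreement rates."""
    majority_labels, full, majority = [], 0, 0
    for labels in labels_per_item:
        top_label, top_count = Counter(labels).most_common(1)[0]
        majority_labels.append(top_label if top_count >= 2 else None)  # 2-of-3 vote
        full += top_count == len(labels)   # every annotator agrees
        majority += top_count >= 2         # at least two agree
    n = len(labels_per_item)
    return majority_labels, full / n, majority / n

items = [
    ("toxic", "toxic", "benign"),
    ("benign", "benign", "benign"),
    ("toxic", "benign", "neutral"),
]
labels, full_rate, majority_rate = aggregate(items)
print(labels, round(full_rate, 2), round(majority_rate, 2))
# ['toxic', 'benign', None] 0.33 0.67
```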

4. Characteristics of Implicit Toxicity and Diversity

98.2% of ToxiGen statements are implicit, containing no overt profanity, slurs, or explicit language. Explicit toxicity comprises only 1.8% of the corpus. Statement length varies by group and generation method, with Alice outputs tending to be longer and containing more complex constructions. A plausible implication is that the dataset’s focus on implicit expressions offers a more challenging and realistic benchmark for current classifiers, which often over-rely on explicit lexical cues.

5. Downstream Applications and Model Evaluation

ToxiGen supports evaluation and finetuning for toxicity classifiers such as HateBERT (OffensEval fine-tuned) and ToxDectRoBERTa. Performance is assessed on several external human-written and machine-generated test sets, including SocialBiasFrames (SBF_test), ImplicitHateCorpus (IHC), DynaHate, and ToxiGen-HumanVal. Representative AUC improvements for HateBERT fine-tuned on different ToxiGen subsets are:

Model / Fine-tune Data       None (ZS)   Alice only   top-k only   Alice + top-k
HateBERT on SBF_test         0.60        0.66         0.65         0.71
HateBERT on IHC              0.60        0.60         0.61         0.67
HateBERT on DynaHate         0.47        0.54         0.59         0.66
HateBERT on ToxiGen-Val      0.57        0.93         0.88         0.96

Fine-tuning on the combined Alice + top-k subsets results in the highest gains on human-written implicit toxicity tasks (AUC improvement of 7–19 points).

Alice attacks increase classifier confusion: 26.4% of adversarially generated toxic statements fool HateBERT, compared to 16.8% of top-k generations. Moreover, Alice can functionally detoxify outputs: for toxic prompts, human-perceived toxicity drops from a mean of 3.75 (top-k) to 2.97 (Alice), p < 0.001.

Public code provides data loading, model finetuning, and inference pipelines. Pretrained HateBERT and RoBERTa models fine-tuned on ToxiGen are released for direct application.
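A minimal inference sketch using the Hugging Face transformers library is shown below; the checkpoint identifier tomh/toxigen_roberta is an assumption about the released RoBERTa model's hub name, so verify it against the repository before use.

```python
# pip install transformers torch
from transformers import pipeline

# Assumed hub identifier for the ToxiGen-finetuned RoBERTa checkpoint;
# check https://github.com/microsoft/ToxiGen for the canonical names.
toxicity_clf = pipeline("text-classification", model="tomh/toxigen_roberta")

statements = [
    "the community center offers free language classes for new arrivals",
    "they only got the position because of a quota, not because they earned it",
]
for s in statements:
    print(s, "->", toxicity_clf(s)[0])  # each result is a dict with 'label' and 'score'
```

Scores from such a classifier can then be compared against gold labels (for example with scikit-learn's roc_auc_score) to reproduce AUC-style evaluations like those in the table above.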

6. Distribution, Access, and Licensing

The ToxiGen dataset, generation scripts, and model checkpoints are released for research use at https://github.com/microsoft/ToxiGen, with usage governed by the repository’s LICENSE file. Dataset columns include: prompt, generation, generation_method (Alice/top-k), prompt_label (0/1), group, and roberta_prediction. Practitioners may directly reproduce experimental pipelines, integrate ToxiGen into classifier training and benchmarking, and assess robustness to both human-written and machine-generated implicit toxicity (Hartvigsen et al., 2022).
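As an illustration of working with these columns, the snippet below loads the data with the Hugging Face datasets library and filters to Alice-generated statements from toxic prompts. The hub identifier toxigen/toxigen-data and the exact generation_method value strings are assumptions; check the repository's data files for the canonical names, or load the released CSV files directly.

```python
# pip install datasets
from datasets import load_dataset

# Assumed hub identifier; the GitHub repository documents the canonical source.
ds = load_dataset("toxigen/toxigen-data", split="train")

# Keep Alice-generated statements whose prompts were labeled toxic (prompt_label == 1).
# The generation_method string may be spelled "ALICE" or "top-k" in the files.
alice_toxic = ds.filter(
    lambda row: row["generation_method"].lower().startswith("alice")
    and row["prompt_label"] == 1
)
print(len(alice_toxic))
print(alice_toxic[0]["group"], "->", alice_toxic[0]["generation"])
```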

References

  • Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., and Kamar, E. (2022). ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL).
