
WildJailbreak Synthetic Safety Dataset

Updated 4 July 2026
  • WildJailbreak is a large-scale open-source synthetic safety dataset featuring 262K vanilla and adversarial prompt-response pairs organized via the WildTeaming framework.
  • It employs a four-cell contrastive design separating harmful versus benign content to support precise evaluation of jailbreak resistance and calibration of responses.
  • WildJailbreak serves as a supervised safety training resource that has demonstrated significant improvements in safety metrics, such as reducing Attack Success Rate and toxicity levels.

WildJailbreak is a large-scale open-source synthetic safety dataset introduced through the WildTeaming framework to support both jailbreak evaluation and safety training for LLMs. It contains 262K vanilla and adversarial prompt-response pairs and is organized around a contrastive design: harmful queries, including direct and complex jailbreak variants, and benign queries that resemble harmful queries in form but contain no harm. This construction makes WildJailbreak useful not only for measuring jailbreak resistance, but also for studying over-refusal, calibration, and the interaction between safety tuning and general capability (Jiang et al., 2024).

1. Emergence from WildTeaming

WildJailbreak originates in WildTeaming, an end-to-end framework that mines real-world jailbreak tactics from in-the-wild user–chatbot logs and composes those tactics into systematically diverse adversarial attacks. In the Mine stage, 16.8K in-the-wild adversarial user queries were gathered from LMSYS-Chat-1M and WildChat by filtering for harmful flags with the OpenAI Moderation API and verifying that a lightly safety-trained model, Tulu2-7B, nevertheless produced a harmful response under a Llama-Guard classifier. From a manually inspected seed set of approximately 200 queries, 35 seed jailbreak tactics were distilled. GPT-4 was then used to simplify each adversarial prompt to its core vanilla harmful request and to extract all tactics present, yielding 105K tactic instances (Jiang et al., 2024).

These tactic definitions were deduplicated via sentence embeddings from Nomic-Embed with a threshold of 0.75, resulting in $K = 5{,}688$ unique clusters. The average number of tactics per query was 6.26. Figure-level analysis reported in the same work shows that the top-20 clusters account for only about 20% of tactics, which indicates a pronounced long-tail structure in the attack space (Jiang et al., 2024).
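The threshold-based deduplication can be illustrated with a minimal sketch. It assumes the tactic definitions have already been embedded (toy 2-D vectors stand in for Nomic-Embed outputs), that 0.75 is a cosine-similarity threshold, and that clustering is a greedy single pass; the exact procedure is not described here, so all three are assumptions. The names `cosine` and `greedy_clusters` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_clusters(embeddings, threshold=0.75):
    """Assign each vector to the first cluster whose representative is at
    least `threshold` similar; otherwise open a new cluster.
    Returns a list of clusters, each a list of input indices."""
    representatives, clusters = [], []
    for i, vec in enumerate(embeddings):
        for c, rep in enumerate(representatives):
            if cosine(vec, rep) >= threshold:
                clusters[c].append(i)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            representatives.append(vec)
            clusters.append([i])
    return clusters

# Toy embeddings (assumption): two near-duplicates along each axis.
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.95)]
print(greedy_clusters(toy))
```

A greedy pass depends on input order; an agglomerative or graph-based method would be a reasonable alternative when order sensitivity matters.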

This provenance is consequential. WildJailbreak was not assembled from a small set of hand-written attack templates; it was derived from user behavior and then systematized. A plausible implication is that the dataset captures both common jailbreak motifs and a broader tail of paraphrastic, narrative, stylistic, and role-play tactics that are difficult to cover with static rule lists alone.

2. Dataset structure and contrastive design

WildJailbreak’s core organization is a four-cell contrastive matrix spanning harmful versus benign content and vanilla versus adversarial phrasing. The full dataset statistics reported in WildTeaming are as follows (Jiang et al., 2024):

Split                  Symbol      Size
Vanilla harmful        $Q_{VH}$    50,050
Vanilla benign         $Q_{VB}$    50,050
Adversarial harmful    $Q_{AH}$    82,728
Adversarial benign     $Q_{AB}$    78,706

The adversarial-to-vanilla ratio is reported as

$$R_{\text{adv/vanilla}} = \frac{|Q_{AH}| + |Q_{AB}|}{|Q_{VH}| + |Q_{VB}|} \approx 1.61.$$

By share of the 261,534 total examples in the four splits, the dataset is approximately 19.1% vanilla harmful, 19.1% vanilla benign, 31.6% adversarial harmful, and 30.1% adversarial benign (Jiang et al., 2024).
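As a quick arithmetic check, the sketch below recomputes the adversarial-to-vanilla ratio and the per-split shares directly from the split sizes listed above. The dictionary keys and function names are illustrative, not identifiers from the dataset release.

```python
# Split sizes as listed in the dataset statistics above.
SPLITS = {
    "vanilla_harmful": 50_050,
    "vanilla_benign": 50_050,
    "adversarial_harmful": 82_728,
    "adversarial_benign": 78_706,
}

def adv_to_vanilla_ratio(splits):
    """(|Q_AH| + |Q_AB|) / (|Q_VH| + |Q_VB|)."""
    adversarial = splits["adversarial_harmful"] + splits["adversarial_benign"]
    vanilla = splits["vanilla_harmful"] + splits["vanilla_benign"]
    return adversarial / vanilla

def shares(splits):
    """Percentage of the total contributed by each split."""
    total = sum(splits.values())
    return {name: 100.0 * n / total for name, n in splits.items()}

print(f"total = {sum(SPLITS.values()):,}")
print(f"adv/vanilla = {adv_to_vanilla_ratio(SPLITS):.2f}")
for name, pct in shares(SPLITS).items():
    print(f"{name}: {pct:.1f}%")
```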

The adversarial prompts are produced in the Compose stage by sampling subsets of 2–7 tactics from the top 500 most frequent tactic clusters and prompting an off-the-shelf LLM, specifically Mixtral-8×7B or GPT-4, to rewrite a vanilla prompt into an adversarial prompt. Low-risk and off-topic candidates are pruned via lightweight classifiers, using a prompt harmfulness classifier and an NLI-based off-topic filter (Jiang et al., 2024).
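The control flow of the Compose stage can be sketched as follows. This is a minimal illustration under stated assumptions: tactics are drawn uniformly at random (the source does not say how subsets are weighted), the rewrite template is invented for illustration, and the two filters are passed in as plain callables standing in for the harmfulness classifier and the NLI-based filter. No LLM is called; all names are hypothetical.

```python
import random

def sample_tactics(top_tactics, rng, k_min=2, k_max=7):
    """Draw between k_min and k_max distinct tactics (uniform: an assumption)."""
    k = rng.randint(k_min, k_max)
    return rng.sample(top_tactics, k)

def build_rewrite_prompt(vanilla_prompt, tactics):
    """Hypothetical instruction template; the source's actual prompt is not shown."""
    tactic_lines = "\n".join(f"- {t}" for t in tactics)
    return (
        "Rewrite the request below so that it applies every listed tactic.\n"
        f"Tactics:\n{tactic_lines}\n"
        f"Request: {vanilla_prompt}"
    )

def prune(vanilla_prompt, candidates, is_low_risk, is_off_topic):
    """Keep candidates that are neither low-risk nor off-topic."""
    return [
        c for c in candidates
        if not is_low_risk(c) and not is_off_topic(vanilla_prompt, c)
    ]

# Toy usage with placeholder tactic names and stand-in filters.
top_tactics = [f"tactic_{i}" for i in range(500)]
rng = random.Random(0)
tactics = sample_tactics(top_tactics, rng)
prompt = build_rewrite_prompt("<vanilla request>", tactics)
kept = prune(
    "<vanilla request>",
    ["<candidate A>", "<candidate B>"],
    is_low_risk=lambda c: c.endswith("B>"),
    is_off_topic=lambda v, c: False,
)
print(len(tactics), kept)
```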

The contrastive labeling scheme is central to the dataset’s design. WildTeaming defines four types of queries: direct harmful requests, safe but superficially similar benign requests, adversarial rewrites of harmful prompts, and adversarial rewrites of benign prompts. Vanilla harmful and benign prompts were generated by GPT-4 with in-context examples covering 13 risk subcategories and 10 XSTest exaggeration categories, then paired with detailed refusal or helpful completions from GPT-3.5 (Jiang et al., 2024). This means the benchmark is explicitly structured to penalize both under-refusal on harmful requests and exaggerated refusal on benign near-misses.
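The four-cell scheme maps naturally onto a small data structure. The sketch below is one hypothetical encoding, not the dataset's actual schema: each cell is a pair of flags, the desired behavior follows from the harmful flag alone, and a response is scored as under-refusal or over-refusal relative to that target.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    """One quadrant of the contrastive matrix (hypothetical encoding)."""
    harmful: bool
    adversarial: bool

    @property
    def symbol(self):
        # V/A = vanilla/adversarial phrasing, H/B = harmful/benign content.
        return "Q_" + ("A" if self.adversarial else "V") + ("H" if self.harmful else "B")

    @property
    def desired_behavior(self):
        return "refuse" if self.harmful else "comply"

def error_type(cell, model_refused):
    """Return the kind of failure, or None if the response matches the target."""
    if cell.harmful and not model_refused:
        return "under-refusal"
    if not cell.harmful and model_refused:
        return "over-refusal"
    return None

for cell in (Cell(True, False), Cell(False, False), Cell(True, True), Cell(False, True)):
    print(cell.symbol, cell.desired_behavior)
```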

Reported length statistics reinforce the distinction between vanilla and adversarial regimes. Average prompt length is approximately 15 tokens for vanilla prompts and approximately 85 tokens for adversarial prompts; average refusal length is approximately 120 tokens, and average safe completion length is approximately 50 tokens (Jiang et al., 2024).

3. Benchmark semantics, slices, and judging protocols

WildJailbreak is used with multiple evaluation semantics in later work. The most common metric is Attack Success Rate (ASR), defined as the fraction of harmful prompts on which the model complies or otherwise fails the safety check. Lower ASR indicates stronger jailbreak resistance. Several works also evaluate benign behavior: Refusal-to-Answer (RTA) on benign prompts, benign compliance rate, or a harmonic mean combining safety and helpfulness (Kim et al., 12 Dec 2025, Thomas et al., 5 Feb 2026).
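Both headline rates are simple fractions over judged outcomes. The following sketch assumes each prompt has already been reduced to a boolean by some judge; the function names and the toy outcome lists are illustrative.

```python
def attack_success_rate(harmful_outcomes):
    """Fraction of harmful prompts on which the model complied or otherwise
    failed the safety check (True = attack succeeded). Lower is better."""
    if not harmful_outcomes:
        raise ValueError("no harmful prompts were evaluated")
    return sum(harmful_outcomes) / len(harmful_outcomes)

def refusal_to_answer_rate(benign_refusals):
    """Fraction of benign prompts the model refused (True = refused).
    Lower is better for helpfulness."""
    if not benign_refusals:
        raise ValueError("no benign prompts were evaluated")
    return sum(benign_refusals) / len(benign_refusals)

# Toy judged outcomes (assumption): 1 of 4 attacks succeeds, 1 of 5 benign refused.
print(attack_success_rate([True, False, False, False]))
print(refusal_to_answer_rate([False, False, True, False, False]))
```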

Different studies instantiate the benchmark with different judges. One role-conditioning study evaluates WildJailbreak with a fine-tuned Llama2-13B classifier that assigns “Attack Success” when the model reveals disallowed content or instructions (Ziheng et al., 20 Jan 2026). SafeChain adopts Llama-Guard after calibrating it against human annotations, reporting 88.2% Accuracy, 86.1% F1, and […] on its calibration task (Jiang et al., 17 Feb 2025). Self-Mined Hardness uses a three-judge safety ensemble with majority vote for harmfulness labels (Gupta et al., 4 May 2026). RECAP measures safety score as the percentage of completions judged safe by GPT-4o (Peng et al., 1 Oct 2025). This suggests that WildJailbreak functions less as a single immutable leaderboard and more as a shared corpus supporting multiple judging and slicing conventions.
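A majority-vote ensemble of the kind described for Self-Mined Hardness reduces to a few lines once each judge has returned a boolean verdict. The sketch below is generic; which judges are used and how they are prompted is not specified here, and `majority_harmful` is a hypothetical name.

```python
def majority_harmful(votes):
    """Return True if more than half of the judges label the response harmful.
    Requires an odd number of votes so that ties cannot occur."""
    votes = list(votes)
    if len(votes) % 2 == 0:
        raise ValueError("use an odd number of judges to avoid ties")
    return sum(votes) > len(votes) / 2

# Three judges, two of which flag the response.
print(majority_harmful([True, True, False]))
```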

Several papers define specialized subsets. ShieldLearner introduces a “hard mode” WildJailbreak evaluation set of […] prompts, partitioned into […] malicious and […] benign prompts. The malicious subset is manually categorized into Hidden-Intent, Indirect-Wording, and Ambiguous-Context prompts, emphasizing cases in which the malicious objective is concealed or inferable only through deeper semantic parsing (Ni et al., 16 Feb 2025). SafeChain, by contrast, reports a random sample of 250 distinct malicious prompts from the full WildJailbreak corpus and holds out 50 for a small split in non-deterministic […]-shot evaluation (Jiang et al., 17 Feb 2025). ProMoral-Bench evaluates model–strategy combinations on approximately 430 total WildJailbreak prompts, consisting of approximately 230 harmful adversarial requests and approximately 200 benign requests, with regex-based filtering, a secondary LLM judge, and human audits for disagreement or borderline cases (Thomas et al., 5 Feb 2026).

The benchmark has also been adapted beyond its original single-turn formulation. A proxy-layer multi-turn detection paper isolates 82K prompts labeled adversarial at the L4 jailbreak level, extracts 5,529 unique injection sentences, and composes 579 WildJailbreak-sourced attack conversations plus 9 handcrafted edge-case sequences. Those conversations are evaluated alongside 10,000 organic benign conversations from WildChat (Corll, 11 Feb 2026). A plausible implication is that WildJailbreak now serves both as a canonical prompt corpus and as raw material for derived, structurally richer safety benchmarks.

4. WildJailbreak as a safety-training resource

WildJailbreak was designed not only for evaluation but also for supervised safety tuning. In the original WildTeaming experiments, Llama 2 7B was fine-tuned on a 500K-example mixture of Tulu2Mix-no-refusal and 200K evenly sampled WildJailbreak examples from the four splits. On reported evaluations, adding WildJailbreak preserved general capability while improving safety: the Tulu2Mix-only model had MMLU of approximately 49.5, AlpacaEval V1 of approximately 75.9%, and MT-Bench of approximately 5.84, whereas the model trained with WildJailbreak had MMLU of approximately 49.7, AlpacaEval V1 of approximately 74.6%, and MT-Bench of approximately 6.29 (Jiang et al., 2024).
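Sampling evenly from the four splits means 200,000 / 4 = 50,000 examples per split, which fits within the smallest split size of 50,050. The sketch below shows one way to draw such a mixture on toy data; the function name, the seed handling, and the final shuffle are assumptions rather than details from the original experiments.

```python
import random

def sample_evenly(splits, total, seed=0):
    """Draw total / len(splits) examples from each split without replacement
    and return them as shuffled (split_name, example) pairs."""
    per_split, remainder = divmod(total, len(splits))
    if remainder:
        raise ValueError("total must be divisible by the number of splits")
    rng = random.Random(seed)
    mixture = []
    for name, examples in splits.items():
        if len(examples) < per_split:
            raise ValueError(f"split {name!r} has fewer than {per_split} examples")
        mixture.extend((name, ex) for ex in rng.sample(examples, per_split))
    rng.shuffle(mixture)
    return mixture

# Toy splits of 10 placeholder items each; draw 8 in total (2 per split).
toy_splits = {
    name: [f"{name}_{i}" for i in range(10)]
    for name in ("vanilla_harmful", "vanilla_benign",
                 "adversarial_harmful", "adversarial_benign")
}
mixture = sample_evenly(toy_splits, total=8)
print(mixture)
```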

The same study reports substantial safety gains. HarmBench DirectRequest ASR decreased from 59.1% to 3.1%, ToxiGen toxicity decreased from 35.0% to 0.2%, Do-Anything-Now ASR decreased from 66.0% to 14.0%, JailbreakTrigger refusal increased from 60.0% to 86.8%, and ASR on the harmful portion of the WildJailbreak evaluation set decreased from 71.0% to 1.7% (Jiang et al., 2024). Ablation results further indicate that only the full four-cell composition ($Q_{VH}$, $Q_{VB}$, $Q_{AH}$, and $Q_{AB}$ together) achieved the best balance; omitting any quadrant degraded either over-refusal or vulnerability (Jiang et al., 2024).

Scaling analyses in WildTeaming are particularly important for understanding the dataset’s role. Even 2K WildJailbreak items reduced adversarial ASR from approximately 40% to approximately 20%, while robust safety above 95% required on the order of 40–60K mixed examples when combined with 150K general instruction data (Jiang et al., 2024). This provides a concrete estimate of data scale required for high-coverage adversarial safety training in the reported setting.

Subsequent work has explored more model-specific training regimes using WildJailbreak-derived hardness. Self-Mined Hardness scores each candidate prompt by the fraction of a target model’s rollouts judged harmful and fine-tunes on the hardest eligible prompts paired with the model’s own safe rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach reduced WildJailbreak ASR from 11.5% and 20.1% down to 1–3%, but increased refusal on jailbreak-shaped benign prompts from 14–22% to 74–94%. Interleaving the same hard prompts 1:1 with adversarially framed benign prompts reduced that refusal to 30–51% on 8B and 52–72% on 3B, at a cost of 2–6 percentage points of ASR (Gupta et al., 4 May 2026). This result makes explicit a trade-off already implicit in WildJailbreak’s contrastive design: safety training on adversarial harmful prompts alone is insufficient if benign adversarial lookalikes are not also modeled.
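The selection logic described for Self-Mined Hardness can be sketched on toy data. Two points are assumptions rather than details from the source: a prompt is treated as eligible only if it has at least one harmful and at least one safe rollout (so that a safe target exists), and the 1:1 interleaving simply alternates hard and benign prompts. All names are hypothetical.

```python
def hardness(rollout_is_harmful):
    """Fraction of a model's rollouts on one prompt that were judged harmful."""
    return sum(rollout_is_harmful) / len(rollout_is_harmful)

def select_hardest(prompt_rollouts, k):
    """Return the k hardest eligible prompts, hardest first.
    Eligibility (assumption): 0 < hardness < 1."""
    scored = [(hardness(j), p) for p, j in prompt_rollouts.items()]
    eligible = [(h, p) for h, p in scored if 0.0 < h < 1.0]
    eligible.sort(key=lambda hp: -hp[0])  # stable: ties keep input order
    return [p for _, p in eligible[:k]]

def interleave(hard_prompts, benign_prompts):
    """Alternate hard and benign prompts 1:1, truncating to the shorter list."""
    mixed = []
    for hard, benign in zip(hard_prompts, benign_prompts):
        mixed.extend([hard, benign])
    return mixed

# Toy judgments: True means the rollout was judged harmful.
rollouts = {
    "p1": [True, True, False, False],    # hardness 0.50
    "p2": [True, True, True, False],     # hardness 0.75
    "p3": [False, False, False, False],  # never harmful: not hard
    "p4": [True, True, True, True],      # no safe rollout to train on
}
hard = select_hardest(rollouts, k=2)
print(interleave(hard, ["benign_1", "benign_2"]))
```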

5. Stress-testing modern defense paradigms

WildJailbreak has become a common evaluation substrate for prompt-based, RL-based, inference-time, and hybrid safety systems. Representative results span multiple defense paradigms.

ShieldLearner defines a defense architecture around a Pattern Atlas, a rule-based meta-analysis framework, and Adaptive Adversarial Augmentation. On its WildJailbreak hard-mode test set, ShieldLearner reports, for GPT-3.5-turbo, ASR […], FPR […], and time […], compared with a best baseline ASR of approximately […], FPR of approximately […], and time of approximately […]. For GPT-4o, it reports ASR […], FPR […], and time […], compared with a best baseline ASR of approximately […], FPR of approximately […], and time of approximately […]. The paper reports an ASR difference relative to the second-best baseline of […] percentage points for GPT-3.5 and […] percentage points for GPT-4o, with throughput approximately […] faster than G4D (Ni et al., 16 Feb 2025).

RECAP treats WildJailbreak-style attacks as adversarial prompts seeded with flawed chain-of-thought prefixes and trains large reasoning models to override those prefixes through reinforcement learning. On DSLlama-8B and DSQwen-14B, RECAP improves WildJailbreak safety by […] and […] relative to vanilla DAPO, while maintaining statistically identical total token budgets within […]. Under iterative prefill reset with […], RECAP safety remains above 97%, versus 70% for DAPO (Peng et al., 1 Oct 2025).

Self-ReSET pushes the reasoning-centric line further by replaying the model’s own unsafe reasoning prefixes and training recovery from those states. On DS-Qwen-7B, Self-ReSET reports WildJailbreak defense success rate of 91.3, compared with 74.9 for RECAP and 48.1 for the base model, while maintaining XSTest compliance rate of 96.4 and Math score of 52.9. Recovery-rate analysis on WildJailbreak shows 67% for Self-ReSET, compared with 65% for RECAP, 61% for DAPO, and 35% for the base model (Zhang et al., 9 May 2026).

Prompt-only methods also use WildJailbreak as a decisive benchmark. A role-conditioning pipeline reports that, on DeepSeek-V3, unsafe outputs on WildJailbreak drop from 81.4% for the base model to 3.6% with a two-role pipeline plus iterative critics; the same study reports 59.0% ASR for generator-only role conditioning, 53.2% for principle-based generation, 32.0% for principle-based plus critic, and 33.0% for 6-shot CoT (Ziheng et al., 20 Jan 2026). CONTEXTLENS, which learns to extract a context snippet from the prompt itself, reports on GPT-4o that ASR decreases from 52.65% to 34.35%, compliance decreases from 99.05% to 96.19%, and the harmonic mean increases from 64.07% to 78.04% (Kim et al., 12 Dec 2025).
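The CONTEXTLENS harmonic means quoted above are reproduced if the score is read as the harmonic mean of safety (100 minus ASR) and benign compliance. That reading is an inference from the reported numbers, not a definition quoted from the paper; the function names are illustrative.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers."""
    return 2.0 * a * b / (a + b) if (a + b) else 0.0

def safety_helpfulness_score(asr_pct, compliance_pct):
    """Harmonic mean of safety (100 - ASR) and benign compliance, in percent.
    Assumed definition, chosen because it matches the reported values."""
    return harmonic_mean(100.0 - asr_pct, compliance_pct)

baseline = safety_helpfulness_score(52.65, 99.05)
with_context = safety_helpfulness_score(34.35, 96.19)
print(round(baseline, 2), round(with_context, 2))  # 64.07 78.04
```

The harmonic mean penalizes imbalance, so the small loss in compliance is outweighed by the larger gain in safety.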

Inference-time candidate selection has likewise been evaluated on WildJailbreak. In deliberative alignment with latent-attribution Best-of-$N$ (BoN) sampling, averaged over 42 teacher–student pairs, post-SFT single-sample WildJailbreak ASR is approximately 45%, while BoN at Layer 12 reduces it to approximately 31%, a 31.3% relative drop; post-RL single-sample ASR is approximately 35%, and BoN at Layer 12 reduces it by 35.3% in relative terms (Pathmanathan et al., 1 Apr 2026). Prompt-engineering benchmarks reinforce the same pattern from another angle: ProMoral-Bench evaluates 11 prompting paradigms on WildJailbreak and reports, for example, GPT-4.1 ASR/RTA pairs of […] for Thought Experiment and […] for Role Prompting, while Claude Sonnet-4 achieves […] for Value-Grounded prompting (Thomas et al., 5 Feb 2026).
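Best-of-N selection and the relative-reduction arithmetic can both be sketched generically. The scorer here is a plain callable returning a risk value; how the cited work derives its score from latent attributions at a given layer is not reproduced, and all names are hypothetical. With the rounded figures of 45% and 31%, the relative reduction comes out near, but not exactly at, the reported 31.3%.

```python
def best_of_n(candidates, risk_score):
    """Return the candidate with the lowest risk score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=risk_score)

def relative_reduction(before, after):
    """Relative reduction in percent, e.g. for ASR before and after selection."""
    return 100.0 * (before - after) / before

# Toy risk scores for three sampled responses (assumption).
toy_scores = {"response_a": 0.8, "response_b": 0.1, "response_c": 0.4}
print(best_of_n(list(toy_scores), toy_scores.get))
print(round(relative_reduction(45.0, 31.0), 1))
```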

6. Extensions, limitations, and interpretive issues

WildJailbreak’s later uses expose both its flexibility and its methodological complications. The dataset has been reinterpreted for chain-of-thought safety, multi-turn prompt injection, prompt-engineering evaluation, and hard-mode concealment testing. In multi-turn proxy detection, for example, a peak-plus-accumulation scoring formula evaluated on 10,654 conversations (588 attack conversations, of which 579 are sourced from WildJailbreak adversarial prompts, and 10,066 benign conversations from WildChat) achieves 90.8% recall at a 1.20% false positive rate with an F1 of 85.9%; on the 579 WildJailbreak-sourced attacks specifically, 525, or 90.7%, are correctly flagged at […] and […] (Corll, 11 Feb 2026).
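The reported recall, false positive rate, and F1 are mutually consistent, which can be checked by reconstructing approximate confusion counts from the stated totals. The counts below are inferred by rounding and are therefore an assumption, not figures from the paper.

```python
def counts_from_rates(n_attacks, n_benign, recall, fpr):
    """Infer approximate TP, FP, FN by rounding rate * population."""
    tp = round(recall * n_attacks)
    fp = round(fpr * n_benign)
    fn = n_attacks - tp
    return tp, fp, fn

def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

tp, fp, fn = counts_from_rates(588, 10_066, recall=0.908, fpr=0.0120)
print(tp, fp, fn)                          # inferred counts
print(round(100 * f1_score(tp, fp, fn), 1))  # F1 in percent
print(round(100 * 525 / 579, 1))           # recall on the 579 sourced attacks
```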

Chain-of-thought studies highlight a further dimension. SafeChain evaluates 12 large reasoning models on WildJailbreak and reports that breach rates exceed 30% for nearly all models; on the benchmark, even R1-70B attains only 67.2% greedy Safe@1 and 71.6% non-deterministic Safe@1. The same study reports a thought-versus-answer contingency on WildJailbreak in which safe thought and safe answer occur 49.0% of the time, safe thought with unsafe answer 6.5%, unsafe thought with safe answer 8.1%, and unsafe thought with unsafe answer 36.5% (Jiang et al., 17 Feb 2025). This suggests that WildJailbreak probes not only final-answer refusal behavior but also latent reasoning safety under adversarial prompting.
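The marginals and conditional rates implied by the thought-versus-answer contingency can be read off with a short calculation. The four cell values are those quoted above (they sum to 100.1% because of rounding); the derived quantities are computed here and are not reported figures. Names are illustrative.

```python
# Keys are (thought, answer); values are percentages as quoted above.
TABLE = {
    ("safe", "safe"): 49.0,
    ("safe", "unsafe"): 6.5,
    ("unsafe", "safe"): 8.1,
    ("unsafe", "unsafe"): 36.5,
}

def marginal(table, axis, value):
    """Sum cells whose thought (axis=0) or answer (axis=1) equals value."""
    return sum(pct for key, pct in table.items() if key[axis] == value)

def unsafe_answer_given(table, thought):
    """Percentage of unsafe answers among cases with the given thought label."""
    return 100.0 * table[(thought, "unsafe")] / marginal(table, 0, thought)

print(f"safe thought:  {marginal(TABLE, 0, 'safe'):.1f}%")
print(f"safe answer:   {marginal(TABLE, 1, 'safe'):.1f}%")
print(f"unsafe answer | safe thought:   {unsafe_answer_given(TABLE, 'safe'):.1f}%")
print(f"unsafe answer | unsafe thought: {unsafe_answer_given(TABLE, 'unsafe'):.1f}%")
```

Under these numbers, an unsafe thought is followed by an unsafe answer far more often than a safe thought is, which is the dependence the contingency table is meant to expose.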

Several limitations emerge from downstream usage. ShieldLearner’s hard mode requires manual curation of adversarial prompts and manual setting of two thresholds (Ni et al., 16 Feb 2025). Role-conditioning notes that performance degrades on weaker models and remains untested in very long-context dialogues (Ziheng et al., 20 Jan 2026). Self-Mined Hardness shows that optimizing aggressively for ASR can sharply increase refusal on jailbreak-shaped benign prompts (Gupta et al., 4 May 2026). More broadly, because later papers use different judges (fine-tuned classifiers, Llama-Guard, WildGuard, GPT-4o, or three-judge ensembles), cross-paper WildJailbreak numbers are not directly interchangeable. A plausible implication is that the benchmark’s continued value lies as much in its adversarial data distribution and contrastive design as in any single headline metric.

Taken together, WildJailbreak occupies a dual role in contemporary LLM safety research. It is a training corpus built from real-user jailbreak tactics and a benchmark family that now includes single-turn, multi-turn, hard-mode, prompt-engineering, and chain-of-thought variants. Its defining contribution is the combination of scale, tactic diversity, and adversarial-benign contrast, which makes it possible to study both robust refusal and calibrated non-refusal within the same experimental substrate (Jiang et al., 2024).
