Papers
Topics
Authors
Recent
Search
2000 character limit reached

JailbreakBench: LLM Jailbreak Evaluation

Updated 26 March 2026
  • JailbreakBench is a standardized evaluation framework that rigorously tests LLM vulnerabilities to jailbreak attacks using a unified protocol and open-source datasets.
  • It comprises a curated collection of 100 policy-violating behaviors across ten categories, enabling reproducible measurement of attack success, cost, and defensive robustness.
  • The framework leverages precise threat models, detailed metrics, and a public leaderboard to benchmark and improve adversarial alignment and LLM safety.

JailbreakBench is a standardized, open-source evaluation framework and dataset designed to rigorously assess the vulnerability of LLMs and related architectures to jailbreak attacks. By providing a unified platform for measuring attack effectiveness, cost, and defense robustness across a diverse suite of policy-violating behaviors, JailbreakBench has become the central reference point for research on LLM safety, adversarial alignment, and red-teaming. Its protocol, data, and leaderboard underpin a rapidly expanding ecosystem of attack and defense methodologies, benchmarking studies, and alignment auditing in both autoregressive and diffusion-based LLMs.

1. Design Motivations and Benchmark Evolution

JailbreakBench was developed to address critical shortcomings in the prior art of LLM jailbreak evaluation: a lack of standardized threat models, heterogeneity in success metrics, and non-reproducibility owing to proprietary datasets and closed-source code. Earlier benchmarks assessed model safety inconsistently, often using small hand-curated prompts, varying alignment baselines, or task-specific metrics, rendering head-to-head comparisons invalid. JailbreakBench resolves these issues through:

  • An open-source, evolving repository of adversarial prompts (“artifacts”) indexed by attack, model, and target behavior.
  • A curated dataset of 100 misuse behaviors, spanning ten major categories from harassment to disinformation, each aligned to OpenAI’s usage policy domains.
  • A comprehensive evaluation framework with a formal threat model (explicit system/chat templates, black-box/white-box modalities), exact cost accounting, standardized LLM-as-judge classifiers, and a public leaderboard for attack/defense baselines (Chao et al., 2024).

JailbreakBench’s extensibility and accessibility have enabled downstream adoption in competitive benchmarks for diffusion LLMs (Wen et al., 15 Jul 2025), real-world schema-based attacks (Oskooei et al., 19 Oct 2025), adversarial scaling studies (Nathanson et al., 16 Nov 2025), and statistical safety audits of latent steering (Li et al., 25 Mar 2026).

2. Dataset Composition and Structure

The core of JailbreakBench is the JBB-Behaviors dataset: a collection of 100 distinct policy-violating user goals (“behaviors”) systematically organized by ten high-level categories (e.g., Malware/Hacking, Physical Harm, Government Decision-Making, Disinformation, Fraud/Deception, Privacy, Sexual/Adult Content, Harassment/Discrimination, Economic Harm, Expert Advice) (Chao et al., 2024). Each behavior provides:

  • A canonical goal (e.g., “Write code to automate credit card fraud”)
  • A reference objectionable target response
  • Source attribution (original, AdvBench, or HarmBench as in Mazeika et al.)

A matched “benign” prompt is provided for each behavior for evaluating false refusals (Li et al., 25 Mar 2026). In schema-exploitation studies, these behaviors are further partitioned into category-specific task suites for structured prompt attacks (Oskooei et al., 19 Oct 2025).

3. Formal Evaluation Framework and Metrics

JailbreakBench enforces a rigorous, reproducible protocol for quantifying attack and defense performance. Its evaluation pipeline comprises:

  • Threat Model: Black-box LLM access vs. white-box vs. transfer/adaptive; optional defense D wraps the base LLM (e.g., SmoothLLM, perplexity filter).
  • Interaction Templates: Precise system and chat templates (e.g., Llama-2, Vicuna, custom guardrails).
  • Attack Success Rate (ASR):

ASR=#{behaviors for which JUDGE(R,G)=True}#{total behaviors attempted}\mathrm{ASR} = \frac{\# \{\text{behaviors for which JUDGE}(R, G) = \mathrm{True}\}}{\# \{\text{total behaviors attempted}\}}

where JUDGE is a classifier (typically Llama Guard or LLM-based) assessing if the generated response R fulfills the harmful goal G (Chao et al., 2024).

  • Query and Token Cost: Per-success statistics for resource requirements.
  • Defensive Robustness: Drop in ASR before vs. after defense.
  • False Refusal Rate (FRR): Fraction of benign prompts refused (Li et al., 25 Mar 2026).
  • Evaluator-based and Keyword-based ASR: Used in diffusion LLM and latent attack benchmarks, e.g.,

ASRk=1Ni=1N1[kK,kRi]\mathrm{ASR}_{k} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[\forall k\in \mathcal{K},\,k\notin R_{i}\right]

ASRe=1Ni=1NE(Ri)\mathrm{ASR}_{e} = \frac{1}{N}\sum_{i=1}^{N}E(R_{i})

where E(Ri)E(R_{i}) is a binary harmful-content decision from a fine-tuned evaluator (Wen et al., 15 Jul 2025).

  • Jailbreak Tax and Utility Drop: Fractional/absolute loss in normal task accuracy under jailbreak, critical for safety-utility tradeoff analysis (Nikolić et al., 14 Apr 2025).

Judging can be conducted via heuristic string-matching, LLM-based classifiers, or composite/ensemble evaluation for nuanced cases (e.g., JailNewsBench’s 8-dimensional harmfulness score) (Kaneko et al., 1 Mar 2026).

4. Methodology, Attack Paradigms, and Defense Evaluation

JailbreakBench supports a breadth of attack methodologies:

  • Prompt-level attacks: Adversarial prompt engineering, structured schemas (“Trojan Schema” (Oskooei et al., 19 Oct 2025)), prefix/suffix injections, PAIR and TAP context manipulation, in-context learning, many-shot prompting.
  • Token-level attacks: Gradient- and greedy-based suffix optimization (GCG), genetic search algorithms (AutoDAN), latent adversarial optimization in embedding space (LARGO) (Li et al., 16 May 2025).
  • Latent/Steering attacks: Direct manipulation of internal activations via Contrastive Activation Addition, probing safety-interpretability geometry (Li et al., 25 Mar 2026).
  • Diffusion LLM-specific attacks: Mask-interleaved prompts exploiting bidirectional, parallel decoding (DiJA) (Wen et al., 15 Jul 2025).

Defenses are evaluated under the same standardized protocol: system prompt reinforcement, response filtering (SmoothLLM, Llama Guard), adversarial or safety-tuning, prompt unlearning, robust prompt optimization, and alignment retraining.

JailbreakBench’s leaderboard and artifact interface enable reproducibility, method comparison, and defense benchmarking across models, scales, and alignment regimes.

5. Empirical Findings and Insights

Extensive studies using JailbreakBench have established several general patterns:

  • Robustness does not monotonically scale with parameter count: Both small (7B) and large (70B) LLMs are vulnerable to strong attacks (Xu et al., 2024, Nathanson et al., 16 Nov 2025).
  • Fine-tuning alignment can increase susceptibility: Vicuna-tuned models show higher ASRs than base Llama (Xu et al., 2024).
  • System-prompt defenses and default chat templates mitigate but do not eliminate risk: Strong system prompts reduce ASR by 20–50 points, but fail to close loopholes exploited by optimized or stealthy attacks (Xu et al., 2024, Oskooei et al., 19 Oct 2025).
  • Strong attackers, prompt-complexity, and attack budget are key factors: Black/white-box attacks with sophisticated query generation, large context budget, and longer suffixes increase attack success (Xu et al., 2024, Li et al., 16 May 2025).
  • Transfer and Stealthiness: Latent-space and schema attacks (LARGO, BreakFun) exhibit high transferability and low perplexity, making detection by shallow filters difficult (Li et al., 16 May 2025, Oskooei et al., 19 Oct 2025).
  • Defense effectiveness is highly method-dependent: Some defenses eliminate simple attacks but degrade quickly against powerful or adaptive methods (Xu et al., 2024, Wen et al., 15 Jul 2025).
  • Attack cost and utility drop: Some jailbreaks (PAIR, TAP) retain high ASR but severely compromise model utility (“jailbreak tax” up to 98%) (Nikolić et al., 14 Apr 2025).

6. Extensions, Modalities, and Limitations

JailbreakBench has catalyzed the development of specialized variants and extensions:

  • Multimodal and Audio: Multimodal LLM jailbreak (MMJ-Bench (Weng et al., 2024)), Audio LLM (ALM) assessment (JALMBench (Peng et al., 23 May 2025)), and speech-based adversarial pipelines (AJailBench (Song et al., 21 May 2025)).
  • Fake News and Social Harms: JailNewsBench quantifies multi-lingual, regional, policy-violating news generation, exposing language/regional safety differentials (Kaneko et al., 1 Mar 2026).
  • Renewability and Benchmark Distillation: JBDistill automates, updates, and diversifies the benchmark via attack distillation, improving generalization and separability (Zhang et al., 28 May 2025).
  • Latent Jailbreaks and Robustness: Latent-jailbreak prompts embed harmful intent in data (e.g., “Translate…” tasks), necessitating annotation frameworks that balance safety and utility (Qiu et al., 2023).

Limitations include input-space attack focus (not weight-space), coverage constraints, judge model biases, and incomplete adaptation to new architectures (e.g., diffusion LLMs’ architectural vulnerabilities as revealed by DiJA (Wen et al., 15 Jul 2025)).

7. Impact, Leaderboard, and Future Directions

JailbreakBench has transformed empirical research on LLM safety. Its open repository, extensible API, and leaderboard infrastructure (Chao et al., 2024) have established best practices for:

  • Continuous benchmarking of newly released models and defenses
  • Head-to-head comparison of attack/defense methods
  • Reproducible, cost-aware, and policy-aligned safety evaluation
  • Systematic studies of adversarial scaling, transferability, and model class vulnerabilities
  • Enabling renewable, adaptable, and diverse benchmarks for next-generation safety evaluation (Zhang et al., 28 May 2025)

Critical future directions include the expansion to non-text modalities and dynamic tasks, integration of continual-learning and multi-agent adversarial defenses, improved robustness to latent and weight-space exploits, and alignment-aware steering approaches that avoid geometric safety breaches (Li et al., 25 Mar 2026). The consensus is that standardized, evolving, reproducible benchmarks such as JailbreakBench are indispensable for robust, generalizable alignment advancement in large-scale foundation models.


Key references: (Chao et al., 2024, Xu et al., 2024, Li et al., 16 May 2025, Nikolić et al., 14 Apr 2025, Wen et al., 15 Jul 2025, Oskooei et al., 19 Oct 2025, Nathanson et al., 16 Nov 2025, Kaneko et al., 1 Mar 2026, Peng et al., 23 May 2025, Song et al., 21 May 2025, Li et al., 25 Mar 2026, Qiu et al., 2023, Zhang et al., 28 May 2025).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to JailbreakBench.