JailBreakV-28K: Multimodal Model Safety Benchmark

Updated 28 July 2025
  • JailBreakV-28K Benchmark is a large-scale evaluation suite that rigorously tests the safety alignment of multimodal LLMs against advanced jailbreak attacks.
  • It employs 28,000 adversarial cases combining text and image inputs to measure vulnerabilities, with an average Attack Success Rate around 50% for text-based methods.
  • The benchmark reveals that inherited text vulnerabilities persist in MLLMs, underscoring the necessity for integrated defense strategies across both modalities.

JailBreakV-28K Benchmark is a large-scale, multi-scenario evaluation suite designed to assess the robustness and safety alignment of multimodal LLMs (MLLMs) under advanced jailbreak attacks. The benchmark systematically probes vulnerabilities exposed by adversarial textual and visual prompts, with a specific focus on the transferability of LLM-oriented jailbreak techniques to models that integrate both language and vision modalities (Luo et al., 3 Apr 2024).

1. Definition and Objective

JailBreakV-28K is constructed to evaluate MLLMs' resistance to adversarial inputs that coerce the model into generating unsafe, harmful, or otherwise misaligned responses. Its distinguishing objective is to rigorously test whether jailbreak methods designed for pure-text LLMs remain effective when applied in multimodal contexts, thus exposing inherited vulnerabilities in state-of-the-art MLLMs.

The benchmark quantitatively measures the Attack Success Rate (ASR), formally:

$$\mathrm{ASR}_J(D') = \frac{1}{|D'|} \sum_{Q' \in D'} \mathrm{isSuccess}_J(Q')$$

where $D'$ is the set of adversarial test cases and $\mathrm{isSuccess}_J(\cdot)$ indicates whether a model's response to $Q'$ is classified as unsafe.
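
As a concrete illustration, the metric can be computed from a list of judged responses as in the minimal sketch below; the `JudgedCase` structure, its field names, and the helper functions are illustrative assumptions rather than the benchmark's released API, and the boolean flags stand in for the Llama Guard verdicts described in Section 3.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class JudgedCase:
    """One adversarial test case Q' together with the judge's verdict (hypothetical schema)."""
    query_id: str
    category: str      # one of the 16 safety categories
    is_success: bool   # True if the response was classified as unsafe

def attack_success_rate(cases: List[JudgedCase]) -> float:
    """ASR_J(D') = (1/|D'|) * sum over Q' in D' of isSuccess_J(Q')."""
    if not cases:
        return 0.0
    return sum(c.is_success for c in cases) / len(cases)

def asr_by_category(cases: List[JudgedCase]) -> Dict[str, float]:
    """Per-category ASR, in the spirit of the domain-level analysis reported later."""
    buckets: Dict[str, List[JudgedCase]] = {}
    for c in cases:
        buckets.setdefault(c.category, []).append(c)
    return {cat: attack_success_rate(group) for cat, group in buckets.items()}
```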

2. Dataset Structure and Generation

The JailBreakV-28K dataset comprises 28,000 adversarial test cases, partitioned as follows:

| Subset Category | Number of Cases | Content Type |
| --- | --- | --- |
| Text-based (LLM transfer) | 20,000 | Text-only jailbreak prompts, paired with images |
| Image-based (MLLM attacks) | 8,000 | Advanced adversarial images for MLLMs |

The text-based inputs are generated from 2,000 curated malicious queries (RedTeam-2K), encompassing 16 distinct safety categories (e.g., illegal acts, violence, hate speech, malware, economic harm). These base queries undergo transformation through advanced jailbreak methodologies:

  • Template-based attacks: embedding queries in real-world jailbreak prompt templates or in adversarial suffixes produced by greedy coordinate gradient (GCG) optimization.
  • Logic-based attacks (cognitive overload): structural and lexical manipulations that increase adversarial complexity.
  • Persuasive adversarial prompts: automatically rephrasing harmful queries into persuasive formulations, ranging from implicitly to overtly dangerous phrasings.

Candidate jailbreak prompts are first validated against multiple LLMs (Llama-2-chat, Vicuna, Qwen1.5, etc.); those that prove effective are then paired with four types of images (blank, random noise, natural images from ImageNet-2K, and synthetic images from Stable Diffusion) to construct the multimodal inputs for MLLMs. The remaining 8,000 image-based cases compile outputs from state-of-the-art image-specific jailbreak attacks such as FigStep and Query-Relevant, including advanced typography and synthesis variants.
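
The text-to-multimodal expansion can be pictured as a cross product of the validated text jailbreaks with the four image conditions. The sketch below is a simplified illustration under that assumption; `sample_image`, `build_multimodal_cases`, and the `pools` argument are hypothetical names, not the released generation pipeline.

```python
import itertools
import random
from typing import Dict, Iterator, List, Tuple

# The four image conditions paired with each effective text jailbreak.
IMAGE_TYPES = ("blank", "noise", "natural", "synthetic")

def sample_image(image_type: str, pools: Dict[str, List[str]]) -> str:
    """Pick an image path for the requested condition.
    `pools` is assumed to map 'natural' to ImageNet files and 'synthetic' to
    diffusion outputs; blank and noise images are used as fixed placeholders here."""
    if image_type in ("blank", "noise"):
        return f"{image_type}.png"
    return random.choice(pools[image_type])

def build_multimodal_cases(
    text_jailbreaks: List[str], pools: Dict[str, List[str]]
) -> Iterator[Tuple[str, str, str]]:
    """Yield (prompt, image_type, image_path) triples: each effective text
    jailbreak is paired with every one of the four image conditions."""
    for prompt, image_type in itertools.product(text_jailbreaks, IMAGE_TYPES):
        yield prompt, image_type, sample_image(image_type, pools)
```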

3. Methodology and Evaluation Protocol

The evaluation methodology emphasizes granularity on both the attack and the defense side:

  • Attack strategy diversity: Applying methods in three core categories—template, cognitive/logical, and persuasive adversarial attacks.
  • Multimodality: Each text-based case is injected alongside different image types to test robustness across modalities.
  • Automated safety scoring: Responses are evaluated using Llama Guard, which operationalizes 16 safety policies mapped onto content categories (e.g., hate speech, economic harm).

The principal metric, ASRJ\mathrm{ASR}_J, reflects the proportion of attacks that successfully induce harmful output, as detected by the automated safety evaluator. This is supplemented with granular error analysis by attack category, image pairings, and content domain.
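
Putting the protocol together, an evaluation loop in this spirit might look like the following sketch; `query_mllm` and `llama_guard_is_unsafe` are placeholder callables standing in for the model under test and the safety judge, not actual library calls, and the per-attack breakdown mirrors the granular analysis mentioned above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate(
    cases: Iterable[Dict[str, str]],
    query_mllm: Callable[..., str],
    llama_guard_is_unsafe: Callable[[str, str], bool],
) -> Dict[str, object]:
    """Run every adversarial case through the target MLLM, judge each response,
    and report overall ASR plus a breakdown by attack category.

    `cases` is assumed to be an iterable of dicts with 'prompt', 'image',
    and 'attack_type' keys; both callables are supplied by the caller."""
    per_attack = defaultdict(list)
    for case in cases:
        response = query_mllm(prompt=case["prompt"], image=case["image"])
        unsafe = llama_guard_is_unsafe(case["prompt"], response)
        per_attack[case["attack_type"]].append(unsafe)

    all_flags = [flag for flags in per_attack.values() for flag in flags]
    return {
        "overall_asr": sum(all_flags) / len(all_flags) if all_flags else 0.0,
        "per_attack_asr": {k: sum(v) / len(v) for k, v in per_attack.items()},
    }
```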

4. Key Findings and Empirical Results

The systematic evaluation yields several notable findings:

  • High ASR for MLLMs: MLLMs show an average ASR of 50.5% for text-based jailbreaks transferred from LLMs, and 44% across all JailBreakV-28K test cases.
  • Modal Transferability: The text prompt component is the dominant vulnerability vector, with attacks retaining their effectiveness irrespective of the associated image—blank, arbitrary, or content-rich.
  • Domain-specific weaknesses: MLLMs are especially susceptible in domains pertaining to economic harm and malware, but global vulnerabilities persist across most policy areas.
  • Underlying mechanism: The text encoder within the MLLM is the locus of inherited vulnerability; attaching image inputs does not mitigate fundamental text-induced misalignment.
  • Attack method invariance: Whether attacking with templates, logic manipulations, or persuasive rephrasing, nearly all advanced prompt strategies exhibit high transfer success to the multimodal context.

5. Practical and Research Implications

The large-scale, controlled experiments demonstrate:

  • The addition of visual modality does not inherently confer additional safety: inherited LLM weaknesses are readily exploitable by contemporary adversarial prompting strategies.
  • Model improvement strategies must address both modalities jointly: dual-channel alignment and filtering approaches are needed to defend against hybrid attacks.
  • Benchmark design: JailBreakV-28K, with its stratified attack types and modality combinations, sets a new standard for safety auditing in MLLMs and enables fine-grained ablation studies on both attack and defense mechanisms.
  • Utility for the research community: The open-access distribution of dataset, attack recipes, and evaluation scripts supports reproducibility and extension across competing MLLMs.

6. Limitations and Areas for Future Work

The findings of JailBreakV-28K reveal persistent vulnerabilities not only in the evaluated MLLMs but also in the prevailing approaches toward safety alignment:

  • Attacks are disproportionately effective through the text channel, suggesting a bottleneck in current LLM safety alignment that carries over directly into MLLMs.
  • Current safety evaluators (such as Llama Guard) are responsible for attack classification; advances in safety scoring or judge models could alter absolute ASR measurements.
  • The benchmark provides comprehensive coverage within a predefined set of policy categories but can be incrementally extended as new adversarial methods and harmful content types evolve.
  • Dual-modality (text/image) defense is still largely unexplored; the results imply an urgent need for cross-modal mitigation techniques.

7. Accessibility and Open Science

All code, data, and benchmark resources are made openly available at https://github.com/qiuhuachuan/latent-jailbreak. This ensures the following:

  • Full reproducibility for model evaluation, ablation, and defense development.
  • Facilitation of downstream research into prompt engineering, adversarial alignment, and robustness training.
  • Benchmark extension and maintenance as new threats are discovered, supporting standardization and comparability in the evaluation of future multimodal AI systems.