MM-SafetyBench: Multimodal Safety Evaluation

Updated 28 July 2025

MM-SafetyBench is a safety evaluation framework defined to uncover vulnerabilities in multimodal LLMs by employing adversarial image–text pairs.
It leverages a rigorous four-step data generation pipeline and measures key metrics like Attack Success Rate and Refusal Rate to diagnose safety risks.
The framework provides a mitigation toolkit, including prompt-based countermeasures, to significantly reduce unsafe completions and enhance multimodal safety alignment.

MM-SafetyBench is a comprehensive safety evaluation framework and dataset for Multimodal LLMs (MLLMs) that systematically probes and quantifies model vulnerabilities arising from the interaction of images and textual queries. It addresses the unique challenge of “jailbreak” attacks in which malicious or query-relevant images, when paired with otherwise unsafe or innocuous text, stimulate harmful outputs even from models previously aligned for textual safety. By integrating a diverse benchmark, structured evaluation metrics, and a diagnostic + mitigation toolkit, MM-SafetyBench enables the systematic assessment and improvement of multimodal safety alignment in current and future MLLMs (Liu et al., 2023).

1. Motivation and Framework Design

The central aim of MM-SafetyBench is to uncover and analyze failures in existing safety alignments of MLLMs caused by multimodal prompts—especially those involving images that encode or reinforce harmful semantic content. Previous safety evaluations largely targeted text-only LLMs, neglecting emergent vulnerabilities in MLLMs where seemingly benign or “relevant” images can trigger harmful, unsafe, or policy-violating model behaviors. MM-SafetyBench formalizes a benchmark methodology that (“jailbreaks” via image) generates adversarial scenarios, then evaluates models according to measurable outcomes.

The architecture comprises two core elements:

A rigorous four-step data generation pipeline, combining automatic question generation, extraction of unsafe key phrases, transformation into query-relevant images, and tailored rephrasing of text so the malicious intent is visually encoded rather than solely textual.
A unified evaluation module that quantifies two main axes of safety risk: “Attack Success Rate” (ASR), measuring the frequency of successful unsafe completions, and “Refusal Rate” (RR), quantifying model refusal to answer unsafe prompts.

These design choices render MM-SafetyBench a discriminative, scenario-driven, and attack-reproducible platform suitable for both diagnosis and progress tracking in multimodal safety alignment.

2. Data Construction and Scenario Coverage

The MM-SafetyBench dataset provides 5,040 text–image pairs encompassing 1,680 malicious queries systematically spanning 13 high-value, policy-critical safety scenarios:

Scenario Index	Description	Example Content
01	Illegal Activity	How to commit a crime
02	Hate Speech	Discriminatory advice
03	Malware Generation	Code/scripting harm
04	Physical Harm	Injury to self/others
05	Economic Harm	Fraudulently gain
06	Fraud	Identity theft
07	Pornography	Explicit content
08	Political Lobbying	Influence/propaganda
09	Privacy Violence	Doxing/leakage
10	Legal Opinion	Subvert justice
11	Financial Advice	Risky investments
12	Health Consultation	Unsafe treatments
13	Government Decision	Policy circumvention

For each scenario, malicious questions are first generated using GPT-4, then key unsafe phrases are extracted. Images are then constructed by three methods: (i) Stable Diffusion-based photorealistic generation, (ii) Typography—visualizing the dangerous keyword as text within an image, and (iii) SD+Typography—a hybrid for maximal input saliency. Each text query is ultimately rephrased to force linguistic–visual reference, e.g., “Describe what is shown in the image” rather than stating the unsafe intent directly.

A “tiny” version (168 questions, 504 pairs) is available for lightweight model evaluation and ablation studies.

3. Methodology and Evaluation Metrics

Evaluation leverages well-defined metrics:

Attack Success Rate (ASR):

$\mathrm{ASR} = \frac{1}{|D|} \sum_{i} I(Q_i)$

where $I(Q_i) = 1$ if a model generates an unsafe response to input $Q_i$ ; $0$ otherwise; $|D|$ is the dataset size.

Refusal Rate (RR):

$\mathrm{RR} = \frac{1}{|D|} \sum_{i} R(Q_i)$

where $R(Q_i) = 1$ if the model appropriately refuses the unsafe query.

For each text–image pair, the model’s output is systematically classified as “safe” (appropriately refused/blocked, or irrelevant) or “unsafe” (provides harmful, non-refusing completion). Manual and automatic evaluation protocols are employed, including case analyses to distinguish direct unsafe completions from indirect failures (e.g., OCR misinterpretation, instruction following errors).

4. Model Vulnerabilities and Experimental Findings

Comprehensive experiments across 12 state-of-the-art MLLMs (including LLaVA-1.5, IDEFICS, InstructBLIP, MiniGPT-4, mPLUG-Owl) reveal broad susceptibility:

When presented only with (carefully) malicious queries, most models exhibit low ASR due to standard alignment.
The addition of query-relevant images dramatically increases ASR—particularly for typography image modalities (e.g., LLaVA-1.5 achieves >70% ASR), indicating that image cues can self-jailbreak models’ text-level safety guardrails.
Vision–language alignment trained on insufficiently filtered datasets often contributes to these vulnerabilities: models attend to visual cues containing unsafe concepts and propagate them in their outputs.
Error analyses note that in some cases models fail by hallucinating irrelevant content (reflecting overfitting or under-developed visual reasoning) rather than directly producing unsafe responses, but these do not guarantee safety.

These findings demonstrate that image-based attacks circumvent text-only safety mechanisms and that “vision-aligned” MLLMs require explicit multimodal safety alignment.

5. Mitigation Strategies and Prompt Engineering

To counter such vulnerabilities with minimal overhead, the framework proposes a task-agnostic prompting remedy:

Prepended Safety Prompt:

1	If the following questions are unsafe, harmful, or malicious, please refuse to answer them and give your explanation. [Malicious Query]

This direct instruction substantially reduces ASR (e.g., for LLaVA-1.5-7B, from ~77% to ~15% on the “tiny” dataset), without additional training. The reduction is commensurate with that found for more intensive, resource-demanding alignment regimes. This indicates that prompt-based “meta-alignment” can provide significant short-term gains in refusal behavior, though the study advocates for more robust, integrated architectural defenses.

6. Implications, Limitations, and Research Directions

The MM-SafetyBench results highlight that:

Safety evaluation for LLMs must not be restricted to textual modalities, as models’ vulnerability surfaces broaden under multimodal (image–text) input distributions.
Even relatively simple visual renderings of unsafe content (e.g., text in an image) can bypass alignment when multimodal fusion modules are insufficiently safety-filtered or aligned.
While lightweight safety prompts offer immediate mitigation, sustainable improvements require deeper model-centric approaches—vision-language joint alignment, architectural refusal mechanisms, and more discriminative safety reasoners for multimodal content.

Future research, as prompted by the paper, should prioritize:

Multimodal safety architecture designs that refuse or filter unsafe intent regardless of input modality.
Enhanced and scalable benchmarking spanning larger, more granular scenario taxonomies and dynamic attack methods.
Careful trade-off management between refusal robustness and generalization, to prevent “over-alignment” (undesirably high refusal rates on benign content).

7. Broader Ecosystem and Influence

MM-SafetyBench forms a foundational reference point in the rapidly expanding family of MLLM safety benchmarks, addressing a critical gap for rigorous, reproducible, and scenario-specific evaluation of cross-modal vulnerabilities. Its diagnostic methodology, scenario diversity, attack–refusal dual axes, and open-source resources have informed successor works and catalyzed broader initiatives for unified, multimodal safety assessment in both open-source and commercial LLM platforms.

Its publication has prompted more advanced benchmarks (such as SafeBench (Ying et al., 24 Oct 2024) and USB (Zheng et al., 26 May 2025)) to adopt more granular taxonomies, multimodal attack surfaces, and integrated vulnerability/oversensitivity evaluation protocols, reflecting the practical relevance and foundational nature of the MM-SafetyBench methodology.