
MMSafetyBench-tiny Dataset

Updated 9 December 2025
  • MMSafetyBench-tiny is a reduced-size dataset (10% of MM-SafetyBench) featuring 504 text-image pairs that evenly represent 13 safety-critical scenarios.
  • It employs three image generation methods—Stable Diffusion, Typo, and SD+Typo—to simulate image-assisted jailbreak attacks on multimodal large language models.
  • The dataset is designed for rapid prototyping and ablation studies, maintaining key evaluation metrics like Attack Success Rate (ASR) and Refusal Rate (RR).

The MMSafetyBench-tiny dataset is the reduced-size subset of MM-SafetyBench, a benchmark designed for evaluating the safety vulnerabilities of Multimodal LLMs (MLLMs) under image-assisted jailbreak attacks. The dataset is defined and released by Xin Liu et al. as part of "MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal LLMs" (Liu et al., 2023).

1. Dataset Definition and Structure

MMSafetyBench-tiny is the “tiny” version of MM-SafetyBench, representing approximately 10% of the full corpus. It contains 504 multimodal samples, each a text-image pair. The tiny subset is built by randomly sampling 168 malicious questions (text prompts) from the full set of 1,680 while preserving the proportional distribution across the 13 documented safety-critical scenarios. Each selected question is then paired with three automatically generated image variants: one via Stable Diffusion (SD), one as a typographic rendering (“Typo”), and one composite (“SD+Typo”). This yields:

  • 168 sampled questions × 3 images each = 504 text–image pairs.
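
The release does not include the sampling script itself, but the stratified draw described above is straightforward to sketch. Below is a minimal, hypothetical version, assuming each question is a dict with a "scenario" key; applied to the full 1,680-question corpus at frac=0.1 it produces roughly 168 questions.

```python
import random
from collections import defaultdict

def sample_tiny(questions, frac=0.1, seed=0):
    """Stratified sample: draw ~frac of the questions from each scenario."""
    by_scenario = defaultdict(list)
    for q in questions:
        by_scenario[q["scenario"]].append(q)
    rng = random.Random(seed)
    tiny = []
    for qs in by_scenario.values():
        # At least one question per scenario, otherwise ~frac of the pool.
        k = max(1, round(frac * len(qs)))
        tiny.extend(rng.sample(qs, k))
    return tiny
```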

2. Safety Scenarios and Coverage

MMSafetyBench-tiny fully preserves the scenario taxonomy and threat diversity of the main MM-SafetyBench, ensuring proportional representation across all 13 safety categories: illegal activity, hate speech, malware generation, physical harm, economic harm, fraud, pornography, political lobbying, privacy violation, legal opinion, financial advice, health consultation, and government decision. Balance is maintained by randomly sampling 10–17 questions per scenario, mirroring the per-scenario proportions of the full corpus, as summarized below.

| Safety Scenario | Questions in Tiny Split | Generation Methods |
|---|---|---|
| Illegal Activity | 10–17 | SD, Typo, SD+Typo (one image each) |
| Hate Speech | 10–17 | SD, Typo, SD+Typo (one image each) |
| ... (13 scenarios in total) | 10–17 each | SD, Typo, SD+Typo (one image each) |

Fast ablations and validation procedures can therefore be conducted on MMSafetyBench-tiny with negligible loss of scenario granularity relative to the full benchmark.

3. Data Generation Methodology

Text queries (malicious prompts) are sampled to maintain per-scenario proportions. For each, three images are created:

  • SD: Synthetic images via Stable Diffusion, matched contextually to the query.
  • Typo: A typographically rendered phrase summarizing the malicious intent.
  • SD+Typo: Composited images merging both the synthetic and typographic forms.

The resulting multimodal samples are intended to provoke or “jailbreak” safety-aligned MLLMs, leveraging both textual and visual cues relevant to the targeted scenario.
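
The exact generation pipeline belongs to the paper's tooling rather than the dataset itself, but the typographic step is simple enough to sketch. Below is a minimal, hypothetical rendering of the "Typo" and "SD+Typo" variants using Pillow; the fonts, resolutions, and composition used in the released images may differ.

```python
from PIL import Image, ImageDraw, ImageFont

def render_typo(phrase, size=(512, 128), margin=10):
    """Render a key phrase as a white-background typographic image."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Naive greedy word wrap so the phrase fits the canvas width.
    lines, line = [], ""
    for word in phrase.split():
        trial = f"{line} {word}".strip()
        if draw.textlength(trial, font=font) <= size[0] - 2 * margin:
            line = trial
        else:
            lines.append(line)
            line = word
    lines.append(line)
    draw.multiline_text((margin, margin), "\n".join(lines), fill="black", font=font)
    return img

def compose_sd_typo(sd_img, typo_img):
    """Stack the SD image above its Typo rendering to form the SD+Typo variant."""
    w = max(sd_img.width, typo_img.width)
    out = Image.new("RGB", (w, sd_img.height + typo_img.height), "white")
    out.paste(sd_img, (0, 0))
    out.paste(typo_img, (0, sd_img.height))
    return out
```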

4. Annotation Protocol and Evaluation Metrics

Responses from MLLMs are evaluated using two core metrics:

  • Attack Success Rate (ASR):

$$\mathrm{ASR} = \frac{\sum_i I(Q_i)}{|D|}$$

where $I(Q_i) = 1$ if the model provides an unsafe answer for input $Q_i$, and $D$ is the set of all input pairs.

  • Refusal Rate (RR):

$$\mathrm{RR} = \frac{\sum_i R(Q_i)}{|D|}$$

where $R(Q_i) = 1$ if the model explicitly refuses to answer.

Manual analysis verifies whether a response constitutes “engagement with malicious content” (for ASR) or a direct refusal (for RR). By comparison with text-only attacks, these metrics expose safety weaknesses induced or exacerbated by multimodal input.
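
In code, both metrics reduce to counting over per-response judgments. A minimal sketch, assuming each response has already been judged (e.g., by manual review) as unsafe and/or refusing:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    unsafe: bool   # I(Q_i) = 1: response engages with the malicious content
    refused: bool  # R(Q_i) = 1: response explicitly refuses to answer

def attack_success_rate(judgments):
    """ASR: fraction of inputs answered unsafely."""
    return sum(j.unsafe for j in judgments) / len(judgments)

def refusal_rate(judgments):
    """RR: fraction of inputs explicitly refused."""
    return sum(j.refused for j in judgments) / len(judgments)
```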

5. File Formats and Access

MMSafetyBench-tiny is distributed in the same layout as the full MM-SafetyBench, with the following directory-level structure:

  • images/ (SD, Typo, and SD+Typo for each sampled query)
  • metadata.jsonl containing records of the form:
    {
      "question_id": "...",
      "image_type": "SD" | "Typo" | "SD+Typo",
      "image_path": "...",
      "scenario": "...",
      "prompt": "...",
      "label": ... // expected refusal or engagement
    }

Loading and use are facilitated via the provided repository (https://github.com/isXinLiu/MM-SafetyBench), using standard Python I/O. The tiny subset can be loaded, filtered, and evaluated for ablation, rapid prototyping, or human-in-the-loop validation, as sketched below.
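
A minimal loading sketch against the schema above (the field names are assumptions taken from the record layout shown here, not a guaranteed interface of the repository):

```python
import json
from collections import Counter

def load_tiny(metadata_path, image_type=None):
    """Read metadata.jsonl records, optionally keeping one image variant."""
    records = []
    with open(metadata_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            if image_type is None or rec["image_type"] == image_type:
                records.append(rec)
    return records

# Example: per-scenario counts for the SD+Typo variant.
records = load_tiny("metadata.jsonl", image_type="SD+Typo")
print(Counter(r["scenario"] for r in records))
```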

6. Typical Use Cases and Significance

MMSafetyBench-tiny is intended for fast prototyping, ablation studies, and QA validations where computational or annotation cost must be minimized. It offers a compact yet structurally faithful alternative to full-scale safety evaluations, allowing researchers to:

  • Benchmark multimodal jailbreak vulnerability.
  • Compare safety-aligned and non-aligned MLLMs.
  • Study modality-specific failure modes across diverse attack scenarios.

The tiny split is not meant for model training or fine-tuning but is strictly an evaluation artifact.
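
These use cases share one skeleton: run every text-image pair through the model under test, judge the responses, and aggregate. A hypothetical sketch, where query_mllm stands in for whatever model API is being evaluated and judge encodes the review protocol from Section 4:

```python
def evaluate(records, query_mllm, judge):
    """Return (ASR, RR) over a list of metadata records.

    query_mllm(prompt, image_path) -> response text    (hypothetical model API)
    judge(response) -> (unsafe, refused) booleans      (e.g., manual review)
    """
    flags = [judge(query_mllm(r["prompt"], r["image_path"])) for r in records]
    asr = sum(unsafe for unsafe, _ in flags) / len(flags)
    rr = sum(refused for _, refused in flags) / len(flags)
    return asr, rr
```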

7. Limitations, Bias, and Ethical Considerations

Coverage remains limited to the scenario taxonomy and image-generation strategies described in the original MM-SafetyBench—no additional demographic balancing or region-specific curation is applied. Safety scenarios and prompts are derived from Western policy documents (OpenAI, Llama-2). The dataset may contain content of an offensive or illegal nature, and usage in controlled research environments is recommended.

No separate annotation quality metrics are released for MMSafetyBench-tiny; it inherits the manual curation and labeling processes of the main corpus. The balance of scenario representation is approximate, subject to sampling randomness inherent in the tiny split.

MMSafetyBench-tiny is distinct from MMSafeAware (Wang et al., 16 Feb 2025) and MobileSafetyBench (Lee et al., 23 Oct 2024). MMSafeAware is a comprehensive adversarial evaluation suite for multimodal safety awareness (1,500 items, 29 scenarios, no “tiny” variant); MobileSafetyBench covers mobile agent safety in emulator environments (90 tasks, no “tiny” split). MMSafetyBench-tiny specifically addresses multimodal jailbreak risk via synthetic image attacks in a compact evaluation format.

No “MMSafeAware-tiny” or “MobileSafetyBench-tiny” splits are released by their respective authors; MMSafetyBench-tiny remains the only formally defined small-scale safety ablation corpus for multimodal jailbreak vulnerability.
