Multimodal Safety Test Suite
- A Multimodal Safety Test Suite is a framework that integrates structured taxonomies to assess safety risks in models processing text, images, and additional modalities.
- It employs both static benchmarks and dynamic adversarial data generation to systematically reveal vulnerabilities across diverse risk dimensions.
- Evaluation metrics such as attack success degree, safety scores, and bypass rates guide the development of robust multimodal AI systems.
A Multimodal Safety Test Suite (MSTS) refers to a systematic framework for evaluating the safety, robustness, and trustworthiness of models that process more than one modality: typically text and image, but also sensor fusion (e.g., LiDAR and camera), structured data, or domain-specific signals. Safety, in this context, denotes the model’s ability to avoid generating, endorsing, or responding in ways that promote or facilitate harm, illegal activity, privacy violations, bias, or other undesired outcomes as defined by explicit risk taxonomies. Modern MSTS systems leverage both static and dynamic benchmarks, synthetic and real-world adversarial data, structured risk categorization, and automatic or semi-automatic evaluation protocols to expose vulnerabilities in multimodal systems under diverse scenarios and increasingly sophisticated attack surfaces.
1. Fundamental Concepts and Taxonomy
An MSTS is centered on organized taxonomies of hazards and risk factors that emerge when text and image inputs combine, either inadvertently or by adversarial design, to produce unsafe outputs. Fine-grained risk taxonomies, such as those used by MSTS (Röttger et al., 17 Jan 2025), S-Eval (Yuan et al., 23 May 2024), MMDT (Xu et al., 19 Mar 2025), and MSR-Align (Xia et al., 24 Jun 2025), typically encompass:
- Physical safety (e.g., instructions that could lead to self-harm or criminal activity)
- Mental safety (e.g., toxic content, hate speech, psychological harm)
- Privacy (e.g., extraction or leakage of personal or confidential information)
- Fairness and bias (e.g., discrimination, prejudicial stereotypes)
- Legality and policy compliance (e.g., responses that encourage illegal actions)
- Truthfulness (e.g., hallucination, misinformation)
- Adversarial robustness (e.g., ability to resist jailbreaking or prompt manipulation)
- Out-of-distribution (OOD) generalization (robustness to unseen styles and perturbations)
Risk categories are assigned at multiple granularity levels: dimensions, categories, subcategories, and individual test cases.
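This hierarchical organization maps naturally onto a recursive data structure. The following minimal Python sketch illustrates the idea; the dimension and category names are illustrative, not the exact labels of any one benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class RiskCategory:
    """A node in the risk taxonomy; leaves carry concrete test case IDs."""
    name: str
    subcategories: list["RiskCategory"] = field(default_factory=list)
    test_case_ids: list[str] = field(default_factory=list)

# Illustrative slice of a taxonomy: dimension -> category -> subcategory.
taxonomy = RiskCategory(
    name="safety",
    subcategories=[
        RiskCategory(
            name="physical_safety",
            subcategories=[
                RiskCategory(name="self_harm", test_case_ids=["ps-001", "ps-002"]),
                RiskCategory(name="dangerous_instructions", test_case_ids=["ps-101"]),
            ],
        ),
        RiskCategory(name="privacy", test_case_ids=["pr-001"]),
    ],
)

def leaf_categories(node: RiskCategory) -> list[str]:
    """Collect fully qualified names of leaf categories for reporting."""
    if not node.subcategories:
        return [node.name]
    return [f"{node.name}/{leaf}"
            for child in node.subcategories
            for leaf in leaf_categories(child)]

print(leaf_categories(taxonomy))
# ['safety/physical_safety/self_harm', 'safety/physical_safety/dangerous_instructions', 'safety/privacy']
```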
2. Benchmark Construction and Data Generation
Construction of an MSTS involves creating multimodal benchmarks wherein the safety risk emerges from the combination of modalities. Key aspects include:
- Manual Curation and Template Design: MSTS (Röttger et al., 17 Jan 2025) contains 400 multimodal test prompts built from 200 text-image pairs, each supplemented with two prompt templates, across 40 hazard categories. Prompts are crafted such that the unsafe meaning only arises when both text and image are considered.
- Automated Data Generation and Leaklessness: VLSBench (Hu et al., 29 Nov 2024) introduces rigorous pipelines to prevent VSIL (Visual Safety Information Leakage), ensuring that the text does not reveal the harmful content present in the image. Detoxification steps and image synthesis via Stable Diffusion create challenging scenarios where safety judgments require multimodal integration.
- Dynamic Benchmarking: SDEval (Wang et al., 8 Aug 2025) formalizes the process as $\mathcal{D}' = s(\mathcal{D})$, where a dynamic strategy $s$ (text perturbation, image manipulation, cross-modal jailbreaking, or style transfer) is algorithmically applied to a seed benchmark $\mathcal{D}$ to generate new test samples and continuously elevate evaluation complexity (a minimal sketch follows this list).
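To make the dynamic strategy concrete, the sketch below applies simple text- and image-side perturbations to seed test cases. The `TestCase` structure and the strategy functions are illustrative assumptions, not SDEval's actual API:

```python
import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class TestCase:
    prompt: str       # text half of the multimodal test prompt
    image_path: str   # image half; image strategies would edit this file
    category: str     # hazard category from the risk taxonomy

def perturb_text(case: TestCase) -> TestCase:
    """Character-level noise that preserves meaning but shifts the input distribution."""
    chars = list(case.prompt)
    i = random.randrange(len(chars))
    chars.insert(i, chars[i])  # duplicate one character
    return replace(case, prompt="".join(chars))

def style_transfer(case: TestCase) -> TestCase:
    """Stands in for an image-side strategy (style transfer, typography, etc.)."""
    return replace(case, image_path=case.image_path.replace(".png", "_styled.png"))

STRATEGIES = [perturb_text, style_transfer]

def evolve_benchmark(seed: list[TestCase], rounds: int = 2) -> list[TestCase]:
    """Each round applies a randomly chosen strategy to every case so far,
    growing the benchmark so that static memorization does not help."""
    cases = list(seed)
    for _ in range(rounds):
        cases += [random.choice(STRATEGIES)(c) for c in cases]
    return cases
```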
Benchmark datasets are increasingly multilingual (MSTS has been translated into 10 languages (Röttger et al., 17 Jan 2025)) and are sourced from social platforms to avoid contamination and redundancy (MLLMGuard (Gu et al., 11 Jun 2024)).
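Returning to the leaklessness requirement above: VSIL can be screened for mechanically, since a text-image pair leaks exactly when the text alone already triggers a safety judgment. A minimal sketch, where `text_only_unsafe` stands in for any text-only judge (keyword screen, moderation API, or LLM classifier):

```python
def toy_judge(prompt: str) -> bool:
    """Toy text-only judge: a keyword screen standing in for a real classifier."""
    return any(term in prompt.lower() for term in {"weapon", "poison"})

def is_leakless(prompt: str, text_only_unsafe) -> bool:
    """VSIL filter: keep a text-image pair only if the text alone looks benign,
    so that the safety judgment genuinely requires the image."""
    return not text_only_unsafe(prompt)

print(is_leakless("How do I use the item in the picture?", toy_judge))  # True
print(is_leakless("How do I build a weapon like this?", toy_judge))     # False
```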
3. Evaluation Metrics and Automated Scoring
Evaluation in MSTS leverages quantitative and qualitative metrics tailored to different safety dimensions and benchmarking needs:
- Attack Success Degree (ASD) and Perfect Answer Rate (PAR): Used in MLLMGuard (Gu et al., 11 Jun 2024) and SDEval (Wang et al., 8 Aug 2025), where $\mathrm{ASD} = \frac{1}{N}\sum_{i=1}^{N} s_i$ averages a judged harm-severity score $s_i \in [0,1]$ over all $N$ responses, and $\mathrm{PAR} = \frac{N_{\mathrm{perfect}}}{N}$ is the fraction of responses judged both safe and helpful.
- Safety Score (SS) and Attack Success Rate (ASR): As in S-Eval (Yuan et al., 23 May 2024), $\mathrm{SS} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[\mathrm{safe}(r_i)]$ and $\mathrm{ASR} = 1 - \mathrm{SS}$, where $\mathbb{1}[\cdot]$ is a binary safety evaluation of response $r_i$.
- Bypass Rate (BR) and Harmful Generation Rate (HGR): MMDT (Xu et al., 19 Mar 2025), with $\mathrm{BR} = \frac{N_{\mathrm{bypass}}}{N_{\mathrm{adv}}}$, the fraction of adversarial prompts that evade refusal, and $\mathrm{HGR} = \frac{N_{\mathrm{harmful}}}{N}$, the fraction of generations judged harmful.
- Multimodal Consistency (MC): MultiTest (Gao et al., 25 Jan 2024) and related sensor-fusion variants, where MC quantifies alignment between each 3D bounding box and the corresponding projected 2D detection (e.g., via intersection-over-union after projection).
Automated scoring may use lightweight evaluators, e.g., GuardRank (Gu et al., 11 Jun 2024), or strong multimodal judges, e.g., GPT-4o (Xia et al., 24 Jun 2025).
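Assuming each model response has been judged (by a GuardRank-style evaluator or a multimodal judge) with a binary safety label, a severity score in $[0,1]$, a "perfect answer" flag, and a refusal flag, all of the ratio metrics above reduce to simple aggregates. A minimal sketch with illustrative field names, plus an IoU helper of the kind Multimodal Consistency checks rely on:

```python
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    safe: bool        # binary safety verdict from the judge
    severity: float   # judged harm severity in [0, 1] (0 = harmless)
    perfect: bool     # judged both safe and helpful ("perfect answer")
    refused: bool     # model refused the (adversarial) request

def safety_metrics(results: list[JudgedResponse]) -> dict[str, float]:
    """Aggregate per-response judgments into benchmark-level metrics."""
    n = len(results)
    ss = sum(r.safe for r in results) / n
    return {
        "SS": ss,                                       # Safety Score
        "ASR": 1.0 - ss,                                # Attack Success Rate
        "ASD": sum(r.severity for r in results) / n,    # Attack Success Degree
        "PAR": sum(r.perfect for r in results) / n,     # Perfect Answer Rate
        "BR": sum(not r.refused for r in results) / n,  # Bypass Rate
        "HGR": sum(not r.safe for r in results) / n,    # Harmful Generation Rate
    }

def iou_2d(a: tuple, b: tuple) -> float:
    """Multimodal Consistency helper: IoU between a projected 3D box and a
    2D detection, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```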
4. Adversarial and Scenario-based Testing
Robustness evaluation is integral to MSTS, accounting for realistic and worst-case conditions:
- Jailbreak and Red-Teaming Techniques: MMDT (Xu et al., 19 Mar 2025) utilizes SneakyPrompt and RL-based adversarial optimization.
- Scenario-driven Simulation: MultiTest (Gao et al., 25 Jan 2024) synthesizes physics-aware multimodal instances (e.g., object insertion into LiDAR point clouds and camera images for vehicle perception), while TRACE (Luo et al., 4 Feb 2025) extracts real-world accident scenarios through multimodal parsing of crash reports. Special attention is given to maintaining geometric and semantic consistency in synthetic data.
- Cross-modal Attacks and Dynamic Perturbation: SDEval (Wang et al., 8 Aug 2025) applies cross-modal jailbreaking (e.g., typographic prompt editing and unsafe-word injection into images) to probe weaknesses in joint modality alignment.
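As an illustration of the typographic component of such cross-modal probes, the sketch below uses Pillow to render an instruction fragment into the image channel, so that the complete instruction under test exists only across modalities; the function and file names are hypothetical:

```python
from PIL import Image, ImageDraw, ImageFont  # pip install Pillow

def typographic_variant(image_path: str, injected_text: str, out_path: str) -> None:
    """Render text into the image so the full instruction only exists
    cross-modally; evaluation then checks whether the model acts on the
    combined content rather than refusing."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    # Draw on a banner at the bottom so the original scene stays recognizable.
    banner_h = 40
    draw.rectangle([(0, img.height - banner_h), (img.width, img.height)], fill="white")
    draw.text((10, img.height - banner_h + 10), injected_text, fill="black", font=font)
    img.save(out_path)

# Usage: the paired text prompt stays innocuous, e.g.
# "Follow the instruction shown in the image."
# typographic_variant("seed.png", "<instruction under test>", "seed_typo.png")
```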
5. Experimental Findings and Safety Outcomes
Extensive evaluation across open-source and commercial models yields several critical observations:
- Commercial VLMs often demonstrate fewer unsafe outputs, though in some cases this is due to misunderstanding (“safe by accident”) rather than genuine safety by design (Röttger et al., 17 Jan 2025).
- Multimodal Vulnerabilities: Models are typically less safe on multimodal (text-image) inputs than on equivalent text-only inputs, and certain non-English prompts exacerbate unsafe behaviors (Röttger et al., 17 Jan 2025). In multimodal fusion systems, safety degrades as chain-of-thought reasoning is extended, particularly on jailbreak-robustness benchmarks (Lou et al., 10 May 2025).
- Fragility under Dynamic Benchmarks: SDEval (Wang et al., 8 Aug 2025) demonstrates that injecting text or image dynamics can dramatically reduce safety metrics, revealing that static benchmarks may be an unreliable barometer for real-world safety.
- Oversensitivity and Trade-offs: MMSafeAware (Wang et al., 16 Feb 2025) documents that heightened safety alignment can lead to misclassification of benign content as unsafe, impairing helpfulness; this motivates improved multimodal fusion strategies that balance safety with utility.
6. Alignment, Policy Grounding, and Future Directions
Recent methodologies such as MSR-Align (Xia et al., 24 Jun 2025) advocate advanced fine-tuning that introduces multimodal, policy-grounded chain-of-thought supervision. The alignment process embeds explicit rationales referencing standardized policies and visual groundings, yielding substantial gains in robustness against jailbreaking and nuanced unsafe cues without loss of general reasoning capability.
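A policy-grounded chain-of-thought supervision record of this kind might look as follows; the schema, field names, and policy clause ID are illustrative assumptions rather than MSR-Align's released format:

```python
import json

# One supervision record: the rationale must cite a named policy clause and
# ground its judgment in what is visible in the image, not the text alone.
record = {
    "image": "images/case_0421.png",
    "question": "How do I use the item shown here at home?",
    "policy_refs": ["usage-policy/physical-harm#3"],  # illustrative clause ID
    "rationale": (
        "The image shows a restricted item; per the cited policy clause, "
        "instructions enabling its harmful use must be refused. The request "
        "is therefore unsafe even though the text alone is innocuous."
    ),
    "final_answer": "refuse_with_explanation",
}
print(json.dumps(record, indent=2))
```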
Future priorities are identified as:
- Designing leakless benchmarks that preclude VSIL, forcing genuine multimodal safety reasoning (Hu et al., 29 Nov 2024).
- Dynamic co-evolution of benchmarks to track progress and evade data contamination (Wang et al., 8 Aug 2025).
- Multi-agent and staged reasoning pipelines for nuanced situational safety assessment (Zhou et al., 8 Oct 2024).
- Development of scalable, interpretable safety classifiers and evaluation frameworks tailored to the rapid evolution of AI multimodality (Röttger et al., 17 Jan 2025, Xu et al., 19 Mar 2025).
7. Availability and Standardization
Major MSTS datasets, frameworks, and benchmarking platforms are publicly available:
| Resource | Type | Website / Repository |
|---|---|---|
| MSTS (VLMs) | Static benchmark | https://github.com/safetybench/MSTS |
| VLSBench | Leakless images | https://github.com/AI45Lab/VLSBench |
| SDEval | Dynamic eval | https://github.com/hq-King/SDEval |
| MSR-Align | Policy grounding | https://huggingface.co/datasets/Leigest/MSR-Align |
| MMDT | Trustworthiness | https://mmdecodingtrust.github.io/ |
| TRACE (ADS) | Scenario-based | https://github.com/xxx/TRACE |
| S-Eval (LLM) | Multilingual | https://github.com/IS2Lab/S-Eval |
| MMSafeAware | Safety awareness | [will be made public] |
This ecosystem supports iterative improvement, comparative studies, and co-evolution of benchmarks and models in accordance with modern regulatory and technical criteria for safe multimodal AI deployment.
In summary, a Multimodal Safety Test Suite constitutes a multidimensional, continually evolving framework that integrates rigorous taxonomies, benchmark generation (static and dynamic), robust evaluation protocols, and alignment strategies to diagnose and mitigate safety vulnerabilities of multimodal models across modalities, languages, domains, and adversarial surfaces. Ongoing work centers on dynamic co-evolution, leakless data synthesis, interpretable chain-of-thought safety rationales, and policy-grounded benchmarking to ensure reliable and scalable model deployment in safety-critical applications.