Papers
Topics
Authors
Recent
2000 character limit reached

Omni-SafetyBench: Multi-Modal LLM Safety

Updated 11 December 2025
  • Omni-SafetyBench is a comprehensive evaluation framework for multi-modal LLM safety, integrating text, image, and audio assessments.
  • It introduces novel comprehension-aware metrics and cross-modal consistency scores to diagnose safety vulnerabilities across 24 modality configurations.
  • Experiments reveal that while unimodal safety is high, multi-modal setups expose significant drops and inconsistencies in model safety performance.

Omni-SafetyBench is a large-scale, parallelized evaluation framework tailored for the safety assessment of omni-modal LLMs (OLLMs)—models integrating textual, visual, and auditory perception and generation. It is explicitly designed to address the limitations of prior safety benchmarks, which inadequately measure model robustness under joint audio-visual-text inputs and are incapable of quantifying safety consistency across modality transformations. With purpose-built datasets, comprehension-aware metrics, and cross-modal consistency measures, Omni-SafetyBench establishes a rigorous foundation for diagnosing and improving the safety characteristics of next-generation OLLMs (Pan et al., 10 Aug 2025).

1. Benchmark Construction and Modality Coverage

Omni-SafetyBench systematically extends 972 “seed” harmful text prompts (from MM-SafetyBench) into 24 distinct modality-combination sub-datasets, yielding a total of 23,328 parallel test samples. These 24 sub-datasets are grouped into three paradigms:

  • Unimodal (4 configurations):
    • Text-only
    • Image-only (static textual content rendered as image)
    • Video-only (silent video rendering text word-by-word)
    • Audio-only (text-to-speech, ≤10 seconds)
  • Dual-modal (8 configurations):
    • Image-Text, including three distinct image variants for the “key phrase”: Diffusion-generated, Typographic (TYPO), and Diffusion + TYPO composite; always accompanied by full textual instructions.
    • Video-Text, with three analogous video variants.
    • Audio-Text, featuring both standard and white-noise-mixed speech synthesis.
  • Omni-modal (12 configurations):
    • Image-Audio-Text: All combinations of the three image and two audio variants plus text.
    • Video-Audio-Text: The analogous set using video.

Each sub-dataset maintains parallelism, ensuring that for each seed prompt, all 24 transformation variants exist. Visuals leverage Stable-Diffusion-XL (images) or Pyramidal Flow (videos); speech employs Microsoft Edge-TTS. Text prompts are semantically normalized to match the modality—e.g., “The image and audio show…” replaces “The image shows…” in the appropriate setting. A unique aspect is the explicit encoding of “dedicated audio-visual harm cases”: in dual- and omni-modal variants, harmful key phrases are redundantly embedded in both visuals and audio to maximally stress cross-modal recognition and safety mechanisms (Pan et al., 10 Aug 2025).

2. Safety Evaluation Metrics

Omni-SafetyBench introduces comprehension-aware safety metrics, recognizing that OLLMs may fail to “understand” complex, multi-modal prompts—thus, refusal or output safety should only be assessed conditional on successful comprehension.

  • Two-stage Evaluation: For every question–answer pair, models are first checked for comprehension, then (on understood cases) for safety—distinguishing between unsafe outputs and explicit refusals.

Key metrics include:

Metric Definition (on understood cases) Range
C-ASR C-ASR=N(unsafeunderstand)N(understand)\mathrm{C\text{-}ASR} = \frac{N(\text{unsafe} \wedge \text{understand})}{N(\text{understand})} [0, 1]
C-RR C-RR=N(refuseunderstand)N(understand)\mathrm{C\text{-}RR} = \frac{N(\text{refuse} \wedge \text{understand})}{N(\text{understand})} [0, 1]
Safety-score Safety-score=(1C-ASR)(1+λC-RR)1+λ,λ=0.5\mathrm{Safety\text{-}score} = \frac{(1-\mathrm{C\text{-}ASR})(1+\lambda\,\mathrm{C\text{-}RR})}{1+\lambda},\,\lambda=0.5 [0, 1]
  • Cross-Modal Safety Consistency Score (CMSC-score):

    • For Safety-scores sis_i across the 24 sub-datasets:

    μ=124i=124si,σ=124i=124(siμ)2\mu = \frac1{24}\sum_{i=1}^{24}s_i, \quad \sigma = \sqrt{\frac1{24}\sum_{i=1}^{24}(s_i - \mu)^2}

    CMSC-score=exp(ασ),  α=5\mathrm{CMSC\text{-}score} = \exp(-\alpha \sigma),\;\alpha=5 - CMSC-score close to 1 implies that Safety-scores are tightly clustered—a metric of cross-modal output consistency (Pan et al., 10 Aug 2025).

3. Experimental Protocol and Key Results

Omni-SafetyBench was used to systematically evaluate 10 OLLMs (6 open-source, 4 closed-source Gemini API variants). Each model is tested on all 24 modality combinations, with per-prompt evaluation for comprehension, attack success, and refusal, and all derived metrics.

Principal findings include:

  • No Model Achieves Both High Safety and Consistency: Only three models (gemini-2.5-pro, gemini-2.5-pro-preview, Qwen2.5-Omni-7B) attain both Safety-score and CMSC-score exceeding 0.6; the maximum observed is ~0.8 on either metric.
  • Safety is Inversely Correlated with Input Complexity: Top unimodal performers score 0.88–0.90 (gemini-2.5-flash, VITA-1.5), but dual-modal and omni-modal scores drop as much as 0.2–0.3. For image-audio-text, best open-source model peaks at 0.65; many open models fall below 0.5.
  • Weakest Links Are Pronounced: Some model-modality pairs score as low as 0.14 (MiniCPM-o-2.6 on Diffusion+TYPO image+text), and all open-source models are below 0.6 in their most vulnerable modality.
  • Cross-Modal Consistency Remains Elusive: CMSC-scores range from ~0.42 (MiniCPM-o) to 0.83 (gemini-2.5-pro). Even top closed-source “flash” models reach only 0.58 (Pan et al., 10 Aug 2025).

4. Structural Analysis and Vulnerability Diagnoses

The inability of any model to excel in both overall safety and cross-modal consistency is attributed to the compartmentalization of safety alignment procedures (typically per-modality), leaving models under-exposed to joint, cross-modal harmful cases. Effective safety alignment in one modality can result in brittle refusals or oversights in other modalities. Safety-score magnitude is frequently traded off against the variance of scores across modalities (i.e., consistency).

A directly observable phenomenon is improper refusal: models may refuse in some modalities but answer in others for the identical semantic prompt, undermining safety guarantees in real-world, adversarial settings. Conversely, models with strong “always refuse” policies can score well for safety on average, while exhibiting low consistency due to misaligned refusal coverage across modalities (Pan et al., 10 Aug 2025).

5. Implications and Future Research Directions

Omni-SafetyBench elucidates several urgent research needs for OLLM deployment and development:

  • Cross-modal Safety Alignment: Fine-tuning should include explicit, joint audio-visual-text harmful examples, using typographic images and noisy speech, to ensure broad pattern coverage.
  • Comprehension Pre-requisites: Refusals must reflect genuine safety interventions, not model confusion or modality comprehension failures—necessitating improved (and testable) multi-modal understanding before reliable safety mechanisms can activate.
  • Consistency-aware Objectives: Training protocols should introduce regularizers or loss terms targeting reduced variance in safety outputs across modality transformations (CMSC-inspired regularization).
  • Benchmark-driven Model Iteration: By systematically exposing and quantifying “shortest planks”—the model-modality pairs with the lowest safety—Omni-SafetyBench enables targeted safety engineering and evaluation cycles (Pan et al., 10 Aug 2025).

6. Comparative Context: Relationship to Other Multimodal Safety Benchmarks

Preceding frameworks such as MM-SafetyBench and JailBreakV-28K provided important, but modality-limited, benchmarks. Recent developments like OmniSafeBench-MM (Jia et al., 6 Dec 2025) expand evaluation to jailbreak attack/defense coverage, diverse risk domains, and multi-factor scoring—but focus on image–text inputs and three-dimensional scoring (harmfulness, alignment, detail). Omni-SafetyBench fills a distinct gap by addressing audio-visual-text safety, parallelized multi-modal consistency, and comprehension filtering, making it uniquely positioned for OLLMs. These complementary efforts collectively drive methodological rigor and standardization in multi-modal AI safety assessment.

Whiteboard

Follow Topic

Get notified by email when new papers are published related to Omni-SafetyBench.