Papers
Topics
Authors
Recent
Search
2000 character limit reached

OutSafe-Bench: Multimodal MLLM Safety Benchmark

Updated 3 February 2026
  • OutSafe-Bench is a comprehensive framework that evaluates MLLM safety across text, image, audio, and video modalities in both Chinese and English.
  • It introduces a Multidimensional Cross Risk Score (MCRS) and FairScore system to assess overlapping risks and ensure robust, bias-mitigated evaluation.
  • Empirical results reveal modality-dependent vulnerabilities, underscoring the need for improved safety strategies in advanced multimodal large language models.

OutSafe-Bench is a benchmarking framework and dataset suite created to comprehensively evaluate the safety of Multimodal LLMs (MLLMs) with respect to offensive content, spanning text, image, audio, and video modalities. It establishes a multi-dimensional, multi-modality, and bilingual (Chinese/English) testbed for content risk detection and introduces theoretically-grounded metrics for evaluating overlapping and correlated risks, with an explicit emphasis on nuanced, cross-modal vulnerabilities present in contemporary MLLMs (Yan et al., 13 Nov 2025).

1. Dataset Design and Annotation Protocol

OutSafe-Bench comprises a large-scale, systematically curated corpus engineered to reveal the offensive-content risks posed by MLLMs in practical deployment. The dataset covers four modalities:

  • Text: 18,000 prompts (9,000 Chinese, 9,000 English), evenly distributed across nine risk categories (1,000 samples/category/language).
  • Image: 4,500 photographs/document images (500/category).
  • Audio: 450 clips (170 Chinese, 280 English), drawn from hate speech, misinformation, and safety-related corpora.
  • Video: 450 short videos (150 Chinese, 300 English), limited to ≤5 minutes, sampled at 1 fps.

Each instance is labeled with one of nine critical risk categories:

Category Index Category Name
1 Privacy & Property
2 Prejudice & Discrimination
3 Crime & Illegal Activities
4 Ethics & Morality
5 Violence & Hatred
6 False Info & Misdirection
7 Political Sensitivity
8 Physical & Mental Health
9 Copyright & IP

Annotation leverages seed selection from 30 public datasets (e.g., Chinese Safety Prompts, HateMM, FakeSV), followed by expert relabeling involving three trained annotators and adherence to a comprehensive guideline. Inter-annotator agreement, measured on a 936-sample subset, yields average Cohen’s κ > 0.82 across modalities and categories. Disputes are resolved via majority vote. Quality control involves noise filtering, semantic keyword matching for image/video, and LLM-assisted transcription and mapping for audio.

2. Multidimensional Cross-Risk Scoring (MCRS)

Recognizing that a model output can implicate multiple overlapping risk categories, OutSafe-Bench introduces the Multidimensional Cross Risk Score (MCRS). For each output xx:

R(x)  =  [r1(x),r2(x),,r9(x)],ri(x)[0,10]R(x)\;=\;\bigl[r_{1}(x),\,r_{2}(x),\,\dots,r_{9}(x)\bigr],\quad r_{i}(x)\in[0,10]

Here, ri(x)r_i(x) reflects the severity (0: safe, 10: extremely unsafe) for category ii. To encode risk correlation, a cross-risk influence matrix is computed,

γ  =  [γ(p,q)]p,q=19,q=19γ(p,q)=1\gamma \;=\;\bigl[\gamma_{(p,q)}\bigr]_{p,q=1}^9, \quad \sum_{q=1}^9\gamma_{(p,q)}=1

where γ(p,q)\gamma_{(p,q)} denotes the normalized semantic similarity score (computed using Sentence-BERT embeddings and cosine similarity) between risks pp and qq.

Given a scenario kk with nn' outputs, the mean risk per dimension and scenario-level MCRS are:

rq(k)=1nt=1nrq(xt)\overline{r}_q^{(k)} = \frac{1}{n'}\sum_{t=1}^{n'} r_q(x_t)

MCRS(k)=q=19γ(k,q)rq(k)\mathrm{MCRS}(k) = \sum_{q=1}^{9} \gamma_{(k,q)} \overline{r}_q^{(k)}

MCRS thus upweights scenarios where correlated risks co-occur, producing a single interpretable joint-risk metric.

3. FairScore: Automated Multi-Reviewer Evaluation

To mitigate bias inherent to single-model judgments, OutSafe-Bench formalizes FairScore—an automated, weighted aggregation protocol using multiple high-performing reviewer models. Each reviewer RMl\mathrm{RM}_l is weighted by reliability (derived from performance on external safety corpora and normalized):

r^i(j,k,t)=l=1mλlri(j,k,t,l)\hat r_i^{(j, k, t)} = \sum_{l=1}^m \lambda_l r_i^{(j,k,t,l)}

Where ri(j,k,t,l)r_i^{(j,k,t,l)} is reviewer-ll’s risk assignment for output tt (of scenario kk) from evaluated model MjM_j, and λl\lambda_l is reviewer ll's reliability weight (lλl=1\sum_l \lambda_l = 1). Averaging over answers and combining with MCRS cross-risk weights yields the final FairScore for MjM_j on scenario kk:

FairScore(Mj,k)=q=19γ(k,q)  rq(j,k)\mathrm{FairScore}(M_j, k) = \sum_{q=1}^9 \gamma_{(k,q)}\; \overline{r}_q^{(j,k)}

Empirically, FairScore increases agreement with human annotation (Kendall’s τ from 0.4057→0.4127, Spearman’s ρ from 0.5589→0.5681) and reduces variance by 30% compared to single-model approaches.

4. Empirical Assessment of Multimodal LLM Safety

OutSafe-Bench evaluates nine state-of-the-art MLLMs (Deepseek-v3, Claude-3.7-Sonnet, Gemini-2.0/2.5-flash, GPT-4o/mini, Qwen-2.5-72B, Ernie-4.0, Doubao-1.5-pro), each on identically constructed multimodal test items. For image and video, the evaluation protocol instructs models to describe content in detail, while for audio, transcription precedes risk evaluation.

Key findings:

  • Textual safety: All models show lower risk on average (0.35–1.88 on [0,10]) for text, but English yields higher risk than Chinese.
  • Video and audio: Amplified vulnerabilities found in video (often >2.0 risk) and audio modalities.
  • Best performers: Qwen-2.5-72B yields lowest overall risk (0.9193) and leads in video safety; Claude-3.7-Sonnet is the safest for image/text only (0.7436).
  • Category vulnerabilities: Political Sensitivity and False Information scenarios present highest risk; Physical & Mental Health is consistently lowest risk.

These results underscore persistent, modality-dependent safety issues in leading MLLMs.

5. Methodological Innovations and Implications

OutSafe-Bench’s core methodological advances are:

  • Large-scale, cross-modal, bilingual dataset: Facilitates fine-grained risk detection and evaluation in both Chinese and English across all major input modalities.
  • Explicit modeling of risk correlations: MCRS provides a principled, multidimensional measure that reflects real-world risk entanglements.
  • Bias-mitigating, reliability-weighted multi-reviewer system: FairScore demonstrably increases evaluation robustness and reduces judgment variance.
  • High-quality annotation: Triple-expert review and high inter-annotator agreement provide reliable ground truth.
  • Applicability to real-world safety guardrails: Empirical vulnerabilities identified (especially for audio/video) indicate the necessity for further research into robust cross-modal alignment and content moderation strategies.

6. Recommendations and Future Directions

The OutSafe-Bench study suggests that:

  • Video/audio safety is a primary failure case, necessitating further model alignment and specialized training for temporal/threat content.
  • Political and misinformation scenarios require domain-adversarial data augmentation to mitigate persistent safety gaps.
  • Weighted multi-judge and cross-risk frameworks (like FairScore/MCRS) should be incorporated in MLLM evaluation pipelines.
  • Expansion to emerging modalities and low-resource languages is recommended for future benchmarks.

A plausible implication is that, as modality complexity increases, unimodal safety measures are insufficient: robust multimodal content moderation must be grounded in approaches capable of modeling and adjudicating inter-category and cross-modality risks.


For further technical details on the OutSafe-Bench benchmark and its aggregation metrics, see "OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in LLMs" (Yan et al., 13 Nov 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to OutSafe-Bench Framework.