OutSafe-Bench: Multimodal MLLM Safety Benchmark
- OutSafe-Bench is a comprehensive framework that evaluates MLLM safety across text, image, audio, and video modalities in both Chinese and English.
- It introduces a Multidimensional Cross Risk Score (MCRS) and FairScore system to assess overlapping risks and ensure robust, bias-mitigated evaluation.
- Empirical results reveal modality-dependent vulnerabilities, underscoring the need for improved safety strategies in advanced multimodal large language models.
OutSafe-Bench is a benchmarking framework and dataset suite created to comprehensively evaluate the safety of Multimodal LLMs (MLLMs) with respect to offensive content, spanning text, image, audio, and video modalities. It establishes a multi-dimensional, multi-modality, bilingual (Chinese/English) testbed for content risk detection and introduces theoretically grounded metrics for evaluating overlapping and correlated risks, with an explicit emphasis on the nuanced, cross-modal vulnerabilities present in contemporary MLLMs (Yan et al., 13 Nov 2025).
1. Dataset Design and Annotation Protocol
OutSafe-Bench comprises a large-scale, systematically curated corpus engineered to reveal the offensive-content risks posed by MLLMs in practical deployment. The dataset covers four modalities:
- Text: 18,000 prompts (9,000 Chinese, 9,000 English), evenly distributed across nine risk categories (1,000 samples/category/language).
- Image: 4,500 photographs/document images (500/category).
- Audio: 450 clips (170 Chinese, 280 English), drawn from hate speech, misinformation, and safety-related corpora.
- Video: 450 short videos (150 Chinese, 300 English), limited to ≤5 minutes, sampled at 1 fps.
Each instance is labeled with one of nine critical risk categories:
| Category Index | Category Name |
|---|---|
| 1 | Privacy & Property |
| 2 | Prejudice & Discrimination |
| 3 | Crime & Illegal Activities |
| 4 | Ethics & Morality |
| 5 | Violence & Hatred |
| 6 | False Info & Misdirection |
| 7 | Political Sensitivity |
| 8 | Physical & Mental Health |
| 9 | Copyright & IP |
Annotation begins with seed selection from 30 public datasets (e.g., Chinese Safety Prompts, HateMM, FakeSV), followed by expert relabeling by three trained annotators working from a comprehensive guideline. Inter-annotator agreement, measured on a 936-sample subset, yields an average Cohen’s κ > 0.82 across modalities and categories; disputes are resolved by majority vote. Quality control involves noise filtering, semantic keyword matching for image/video, and LLM-assisted transcription and mapping for audio.
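The agreement and adjudication steps above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; `cohens_kappa` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists (illustrative;
    the paper reports average kappa > 0.82 on a 936-sample subset)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

def majority_vote(labels):
    """Resolve a disputed item by majority vote among the annotators."""
    return Counter(labels).most_common(1)[0][0]
```

Kappa corrects raw agreement for the agreement expected by chance given each annotator's label distribution, which is why it is preferred over plain accuracy for reporting annotation quality.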
2. Multidimensional Cross-Risk Scoring (MCRS)
Recognizing that a model output can implicate multiple overlapping risk categories, OutSafe-Bench introduces the Multidimensional Cross Risk Score (MCRS). For each output $o$, reviewers assign a severity vector

$$\mathbf{r}(o) = \big(r_1(o), \dots, r_9(o)\big), \qquad r_i(o) \in [0, 10].$$

Here, $r_i(o)$ reflects the severity (0: safe, 10: extremely unsafe) for category $i$. To encode risk correlation, a cross-risk influence matrix $W = (w_{ij}) \in \mathbb{R}^{9 \times 9}$ is computed, where $w_{ij}$ denotes the normalized semantic similarity score (computed using Sentence-BERT embeddings and cosine similarity) between risks $i$ and $j$.

Given a scenario $s$ with $n$ outputs $o_1, \dots, o_n$, the mean risk per dimension and the scenario-level MCRS are

$$\bar{r}_i(s) = \frac{1}{n}\sum_{k=1}^{n} r_i(o_k), \qquad \mathrm{MCRS}(s) = \sum_{i=1}^{9}\sum_{j=1}^{9} w_{ij}\,\bar{r}_i(s)\,\bar{r}_j(s).$$
MCRS thus upweights scenarios where correlated risks co-occur, producing a single interpretable joint-risk metric.
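Numerically, the scoring can be sketched as below. This is a minimal sketch assuming one plausible aggregation (a quadratic form over a row-normalized cross-risk matrix); the paper's exact formula and normalization may differ, and the function names are hypothetical.

```python
import numpy as np

def cross_risk_matrix(sim):
    """Row-normalize a symmetric matrix of semantic similarities between
    the nine risk categories (the paper derives similarities from
    Sentence-BERT embeddings and cosine similarity)."""
    W = np.asarray(sim, dtype=float)
    return W / W.sum(axis=1, keepdims=True)

def mcrs(risk_scores, W):
    """Scenario-level MCRS sketch: risk_scores is (n_outputs, n_categories)
    with severities in [0, 10]. Mean per-category risk is coupled through
    W, so scenarios where correlated risks co-occur are upweighted."""
    r_bar = np.asarray(risk_scores, dtype=float).mean(axis=0)
    return float(r_bar @ W @ r_bar)
```

With `W` equal to the identity (no cross-risk coupling), the score reduces to a sum of squared per-category mean risks; off-diagonal weights add contributions whenever two correlated categories are elevated together.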
3. FairScore: Automated Multi-Reviewer Evaluation
To mitigate bias inherent to single-model judgments, OutSafe-Bench formalizes FairScore, an automated, weighted aggregation protocol using multiple high-performing reviewer models. Each reviewer $m \in \{1, \dots, M\}$ is assigned a reliability weight $\alpha_m$ (derived from performance on external safety corpora and normalized so that $\sum_{m=1}^{M} \alpha_m = 1$), and per-category risks are aggregated as

$$r_i(o_k) = \sum_{m=1}^{M} \alpha_m\, r_i^{(m)}(o_k),$$

where $r_i^{(m)}(o_k)$ is reviewer $m$'s risk assignment in category $i$ for output $o_k$ (of scenario $s$) produced by the evaluated model $f$. Averaging over the $n$ answers and combining with the MCRS cross-risk weights yields the final FairScore for $f$ on scenario $s$:

$$\mathrm{FairScore}(f, s) = \sum_{i=1}^{9}\sum_{j=1}^{9} w_{ij}\,\bar{r}_i(s)\,\bar{r}_j(s), \qquad \bar{r}_i(s) = \frac{1}{n}\sum_{k=1}^{n} r_i(o_k).$$
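The reliability-weighted aggregation can be sketched as follows, assuming the same quadratic cross-risk combination used for MCRS; `fair_score` is a hypothetical name and the exact combination step in the paper may differ.

```python
import numpy as np

def fair_score(reviewer_scores, alphas, W):
    """FairScore sketch.
    reviewer_scores: (M, n_outputs, n_categories) risk assignments from
        M reviewer models for one scenario of the evaluated model.
    alphas: (M,) reliability weights, assumed to sum to 1.
    W: (n_categories, n_categories) cross-risk influence matrix."""
    R = np.asarray(reviewer_scores, dtype=float)
    a = np.asarray(alphas, dtype=float)
    agg = np.tensordot(a, R, axes=1)   # (n_outputs, n_cat): weighted reviewer consensus
    r_bar = agg.mean(axis=0)           # mean risk per category over the n answers
    return float(r_bar @ W @ r_bar)    # couple correlated categories via W
```

Because unreliable reviewers receive small weights, an outlier judgment from a single model shifts the consensus far less than in an unweighted average, which is the mechanism behind the reported variance reduction.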
Empirically, FairScore increases agreement with human annotation (Kendall’s τ from 0.4057→0.4127, Spearman’s ρ from 0.5589→0.5681) and reduces variance by 30% compared to single-model approaches.
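For reference, the Kendall's τ used in that human-agreement comparison is a pairwise rank correlation. A minimal tau-a sketch (no tie correction, unlike the tau-b variants common in library implementations):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists: concordant
    pairs minus discordant pairs, divided by the total number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        d = (x[i] - x[j]) * (y[i] - y[j])
        s += (d > 0) - (d < 0)   # +1 concordant, -1 discordant, 0 tie
    return s / (n * (n - 1) / 2)
```

A value of 1 means the two raters rank all items identically, -1 means fully reversed rankings, and values near 0 indicate no rank relationship.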
4. Empirical Assessment of Multimodal LLM Safety
OutSafe-Bench evaluates nine state-of-the-art MLLMs (Deepseek-v3, Claude-3.7-Sonnet, Gemini-2.0/2.5-flash, GPT-4o/mini, Qwen-2.5-72B, Ernie-4.0, Doubao-1.5-pro), each on identically constructed multimodal test items. For image and video, the evaluation protocol instructs models to describe content in detail, while for audio, transcription precedes risk evaluation.
Key findings:
- Textual safety: Text is the least risky modality on average for all models (scores of 0.35–1.88 on the [0,10] scale), though English prompts elicit higher risk than Chinese ones.
- Video and audio: Amplified vulnerabilities found in video (often >2.0 risk) and audio modalities.
- Best performers: Qwen-2.5-72B yields the lowest overall risk (0.9193) and leads in video safety; Claude-3.7-Sonnet, evaluated on image and text only, is the safest in that setting (0.7436).
- Category vulnerabilities: Political Sensitivity and False Information scenarios present highest risk; Physical & Mental Health is consistently lowest risk.
These results underscore persistent, modality-dependent safety issues in leading MLLMs.
5. Methodological Innovations and Implications
OutSafe-Bench’s core methodological advances are:
- Large-scale, cross-modal, bilingual dataset: Facilitates fine-grained risk detection and evaluation in both Chinese and English across all major input modalities.
- Explicit modeling of risk correlations: MCRS provides a principled, multidimensional measure that reflects real-world risk entanglements.
- Bias-mitigating, reliability-weighted multi-reviewer system: FairScore demonstrably increases evaluation robustness and reduces judgment variance.
- High-quality annotation: Triple-expert review and high inter-annotator agreement provide reliable ground truth.
- Applicability to real-world safety guardrails: Empirical vulnerabilities identified (especially for audio/video) indicate the necessity for further research into robust cross-modal alignment and content moderation strategies.
6. Recommendations and Future Directions
The OutSafe-Bench study suggests that:
- Video/audio safety is a primary failure case, necessitating further model alignment and specialized training for temporal/threat content.
- Political and misinformation scenarios require domain-adversarial data augmentation to mitigate persistent safety gaps.
- Weighted multi-judge and cross-risk frameworks (like FairScore/MCRS) should be incorporated in MLLM evaluation pipelines.
- Expansion to emerging modalities and low-resource languages is recommended for future benchmarks.
A plausible implication is that, as modality complexity increases, unimodal safety measures are insufficient: robust multimodal content moderation must be grounded in approaches capable of modeling and adjudicating inter-category and cross-modality risks.
For further technical details on the OutSafe-Bench benchmark and its aggregation metrics, see "OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in LLMs" (Yan et al., 13 Nov 2025).