OutSafe-Bench: Multimodal MLLM Safety Benchmark
- OutSafe-Bench is a comprehensive framework that evaluates MLLM safety across text, image, audio, and video modalities in both Chinese and English.
- It introduces a Multidimensional Cross Risk Score (MCRS) and FairScore system to assess overlapping risks and ensure robust, bias-mitigated evaluation.
- Empirical results reveal modality-dependent vulnerabilities, underscoring the need for improved safety strategies in advanced multimodal large language models.
OutSafe-Bench is a benchmarking framework and dataset suite created to comprehensively evaluate the safety of Multimodal LLMs (MLLMs) with respect to offensive content, spanning text, image, audio, and video modalities. It establishes a multi-dimensional, multi-modality, bilingual (Chinese/English) testbed for content risk detection and introduces theoretically grounded metrics for evaluating overlapping and correlated risks, with an explicit emphasis on the nuanced, cross-modal vulnerabilities present in contemporary MLLMs (Yan et al., 13 Nov 2025).
1. Dataset Design and Annotation Protocol
OutSafe-Bench comprises a large-scale, systematically curated corpus engineered to reveal the offensive-content risks posed by MLLMs in practical deployment. The dataset covers four modalities:
- Text: 18,000 prompts (9,000 Chinese, 9,000 English), evenly distributed across nine risk categories (1,000 samples/category/language).
- Image: 4,500 photographs/document images (500/category).
- Audio: 450 clips (170 Chinese, 280 English), drawn from hate speech, misinformation, and safety-related corpora.
- Video: 450 short videos (150 Chinese, 300 English), limited to ≤5 minutes, sampled at 1 fps.
Each instance is labeled with one of nine critical risk categories:
| Category Index | Category Name |
|---|---|
| 1 | Privacy & Property |
| 2 | Prejudice & Discrimination |
| 3 | Crime & Illegal Activities |
| 4 | Ethics & Morality |
| 5 | Violence & Hatred |
| 6 | False Info & Misdirection |
| 7 | Political Sensitivity |
| 8 | Physical & Mental Health |
| 9 | Copyright & IP |
Annotation begins with seed selection from 30 public datasets (e.g., Chinese Safety Prompts, HateMM, FakeSV), followed by expert relabeling by three trained annotators working from a comprehensive guideline. Inter-annotator agreement, measured on a 936-sample subset, yields an average Cohen’s κ > 0.82 across modalities and categories; disputes are resolved by majority vote. Quality control involves noise filtering, semantic keyword matching for image/video, and LLM-assisted transcription and mapping for audio.
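The agreement and adjudication steps above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; `cohens_kappa` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists (illustrative;
    the paper reports average kappa > 0.82 on a 936-sample subset)."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)    # chance agreement
    return (p_o - p_e) / (1 - p_e)

def majority_vote(labels):
    """Resolve a disputed item by majority vote among the annotators."""
    return Counter(labels).most_common(1)[0][0]
```

Kappa corrects raw agreement for the agreement expected by chance given each annotator's label distribution, which is why it is preferred over plain accuracy for reporting annotation quality.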
2. Multidimensional Cross-Risk Scoring (MCRS)
Recognizing that a model output can implicate multiple overlapping risk categories, OutSafe-Bench introduces the Multidimensional Cross Risk Score (MCRS). For each output $o$, reviewers assign a severity vector

$$\mathbf{r}(o) = \big(r_1(o), \dots, r_9(o)\big), \qquad r_i(o) \in [0, 10].$$

Here, $r_i(o)$ reflects the severity (0: safe, 10: extremely unsafe) for category $i$. To encode risk correlation, a cross-risk influence matrix $W = (w_{ij}) \in \mathbb{R}^{9 \times 9}$ is computed, where $w_{ij}$ denotes the normalized semantic similarity score (computed using Sentence-BERT embeddings and cosine similarity) between risks $i$ and $j$.

Given a scenario $s$ with $n$ outputs $o_1, \dots, o_n$, the mean risk per dimension and the scenario-level MCRS are

$$\bar{r}_i(s) = \frac{1}{n}\sum_{k=1}^{n} r_i(o_k), \qquad \mathrm{MCRS}(s) = \sum_{i=1}^{9}\sum_{j=1}^{9} w_{ij}\,\bar{r}_i(s)\,\bar{r}_j(s).$$
MCRS thus upweights scenarios where correlated risks co-occur, producing a single interpretable joint-risk metric.
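Numerically, the scoring can be sketched as below. This is a minimal sketch assuming one plausible aggregation (a quadratic form over a row-normalized cross-risk matrix); the paper's exact formula and normalization may differ, and the function names are hypothetical.

```python
import numpy as np

def cross_risk_matrix(sim):
    """Row-normalize a symmetric matrix of semantic similarities between
    the nine risk categories (the paper derives similarities from
    Sentence-BERT embeddings and cosine similarity)."""
    W = np.asarray(sim, dtype=float)
    return W / W.sum(axis=1, keepdims=True)

def mcrs(risk_scores, W):
    """Scenario-level MCRS sketch: risk_scores is (n_outputs, n_categories)
    with severities in [0, 10]. Mean per-category risk is coupled through
    W, so scenarios where correlated risks co-occur are upweighted."""
    r_bar = np.asarray(risk_scores, dtype=float).mean(axis=0)
    return float(r_bar @ W @ r_bar)
```

With `W` equal to the identity (no cross-risk coupling), the score reduces to a sum of squared per-category mean risks; off-diagonal weights add contributions whenever two correlated categories are elevated together.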
3. FairScore: Automated Multi-Reviewer Evaluation
To mitigate bias inherent to single-model judgments, OutSafe-Bench formalizes FairScore, an automated, weighted aggregation protocol using multiple high-performing reviewer models. Each reviewer $m \in \{1, \dots, M\}$ is assigned a reliability weight $\alpha_m$ (derived from performance on external safety corpora and normalized so that $\sum_{m=1}^{M} \alpha_m = 1$), and per-category risks are aggregated as

$$r_i(o_k) = \sum_{m=1}^{M} \alpha_m\, r_i^{(m)}(o_k),$$

where $r_i^{(m)}(o_k)$ is reviewer $m$'s risk assignment in category $i$ for output $o_k$ (of scenario $s$) produced by the evaluated model $f$. Averaging over the $n$ answers and combining with the MCRS cross-risk weights yields the final FairScore for $f$ on scenario $s$:

$$\mathrm{FairScore}(f, s) = \sum_{i=1}^{9}\sum_{j=1}^{9} w_{ij}\,\bar{r}_i(s)\,\bar{r}_j(s), \qquad \bar{r}_i(s) = \frac{1}{n}\sum_{k=1}^{n} r_i(o_k).$$
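The reliability-weighted aggregation can be sketched as follows, assuming the same quadratic cross-risk combination used for MCRS; `fair_score` is a hypothetical name and the exact combination step in the paper may differ.

```python
import numpy as np

def fair_score(reviewer_scores, alphas, W):
    """FairScore sketch.
    reviewer_scores: (M, n_outputs, n_categories) risk assignments from
        M reviewer models for one scenario of the evaluated model.
    alphas: (M,) reliability weights, assumed to sum to 1.
    W: (n_categories, n_categories) cross-risk influence matrix."""
    R = np.asarray(reviewer_scores, dtype=float)
    a = np.asarray(alphas, dtype=float)
    agg = np.tensordot(a, R, axes=1)   # (n_outputs, n_cat): weighted reviewer consensus
    r_bar = agg.mean(axis=0)           # mean risk per category over the n answers
    return float(r_bar @ W @ r_bar)    # couple correlated categories via W
```

Because unreliable reviewers receive small weights, an outlier judgment from a single model shifts the consensus far less than in an unweighted average, which is the mechanism behind the reported variance reduction.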
Empirically, FairScore increases agreement with human annotation (Kendall’s τ from 0.4057→0.4127, Spearman’s ρ from 0.5589→0.5681) and reduces variance by 30% compared to single-model approaches.
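For reference, the Kendall's τ used in that human-agreement comparison is a pairwise rank correlation. A minimal tau-a sketch (no tie correction, unlike the tau-b variants common in library implementations):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists: concordant
    pairs minus discordant pairs, divided by the total number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        d = (x[i] - x[j]) * (y[i] - y[j])
        s += (d > 0) - (d < 0)   # +1 concordant, -1 discordant, 0 tie
    return s / (n * (n - 1) / 2)
```

A value of 1 means the two raters rank all items identically, -1 means fully reversed rankings, and values near 0 indicate no rank relationship.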
4. Empirical Assessment of Multimodal LLM Safety
OutSafe-Bench evaluates nine state-of-the-art MLLMs (Deepseek-v3, Claude-3.7-Sonnet, Gemini-2.0/2.5-flash, GPT-4o/mini, Qwen-2.5-72B, Ernie-4.0, Doubao-1.5-pro), each on identically constructed multimodal test items. For image and video, the evaluation protocol instructs models to describe content in detail, while for audio, transcription precedes risk evaluation.
Key findings:
- Textual safety: Text is the least risky modality on average for all models (scores of 0.35–1.88 on the [0,10] scale), though English prompts elicit higher risk than Chinese ones.
- Video and audio: Amplified vulnerabilities found in video (often >2.0 risk) and audio modalities.
- Best performers: Qwen-2.5-72B yields the lowest overall risk (0.9193) and leads in video safety; Claude-3.7-Sonnet, evaluated on image and text only, is the safest in that setting (0.7436).
- Category vulnerabilities: Political Sensitivity and False Information scenarios present highest risk; Physical & Mental Health is consistently lowest risk.
These results underscore persistent, modality-dependent safety issues in leading MLLMs.
5. Methodological Innovations and Implications
OutSafe-Bench’s core methodological advances are:
- Large-scale, cross-modal, bilingual dataset: Facilitates fine-grained risk detection and evaluation in both Chinese and English across all major input modalities.
- Explicit modeling of risk correlations: MCRS provides a principled, multidimensional measure that reflects real-world risk entanglements.
- Bias-mitigating, reliability-weighted multi-reviewer system: FairScore demonstrably increases evaluation robustness and reduces judgment variance.
- High-quality annotation: Triple-expert review and high inter-annotator agreement provide reliable ground truth.
- Applicability to real-world safety guardrails: Empirical vulnerabilities identified (especially for audio/video) indicate the necessity for further research into robust cross-modal alignment and content moderation strategies.
6. Recommendations and Future Directions
The OutSafe-Bench study suggests that:
- Video/audio safety is a primary failure case, necessitating further model alignment and specialized training for temporal/threat content.
- Political and misinformation scenarios require domain-adversarial data augmentation to mitigate persistent safety gaps.
- Weighted multi-judge and cross-risk frameworks (like FairScore/MCRS) should be incorporated in MLLM evaluation pipelines.
- Expansion to emerging modalities and low-resource languages is recommended for future benchmarks.
A plausible implication is that, as modality complexity increases, unimodal safety measures are insufficient: robust multimodal content moderation must be grounded in approaches capable of modeling and adjudicating inter-category and cross-modality risks.
For further technical details on the OutSafe-Bench benchmark and its aggregation metrics, see "OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in LLMs" (Yan et al., 13 Nov 2025).