- The paper identifies visual safety information leakage (VSIL) in multimodal large language models, where sensitive visual data inadvertently appears in text queries.
- The study constructs VLSBench, a specialized benchmark ensuring text queries do not reveal sensitive image details, to expose shortcomings in textual-only safety evaluations.
- Experimental results show that multimodal alignment strategies significantly outperform simple textual methods in mitigating safety risks.
VLSBench: Unveiling Visual Leakage in Multimodal Safety
Introduction to Multimodal Safety Concerns
The rise of Multimodal LLMs (MLLMs), which integrate visual inputs with traditional text-based LLMs, has raised significant safety concerns about their applications. The paper identifies a previously unexplored issue termed "Visual Safety Information Leakage" (VSIL), in which sensitive visual content inadvertently leaks into the accompanying textual query. This leakage leads to misleading safety evaluations: simple text-based alignment methods can achieve seemingly comparable safety results without genuinely addressing the visual component of multimodal interactions.
Figure 1: Overview of our work. We identify a problem in current multimodal safety data samples, namely visual safety information leakage (VSIL). This leakage leads to a counter-intuitive result: simple textual SFT-based alignment methods can achieve nearly the same high safety rates as multimodal approaches. We therefore construct VLSBench, which prevents visual leakage. This newly proposed benchmark discourages textual alignment shortcuts and motivates more dedicated multimodal alignment methods to better address this challenging task.
VSIL occurs when the content of an image, which is itself harmful, is explicitly described in its accompanying text query. This undermines existing safety evaluations, since a safety decision can be made from the text alone, without any actual consideration of the visual input. The paper quantifies VSIL by running the LlamaGuard model on text queries in isolation and comparing its judgments against the unsafe labels of the full image-text pairs, showing that the textual query alone is frequently enough to flag a sample as unsafe.
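As a rough illustration of this kind of check, the sketch below runs a text-only safety classifier on the textual queries alone; if a query is already flagged as unsafe without the image, the sample likely leaks its visual information into the text. The `meta-llama/LlamaGuard-7b` checkpoint, its standard chat-template usage, and the example query are assumptions for illustration, not the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def is_unsafe_text_only(query: str) -> bool:
    """Return True if the text-only classifier flags the query as unsafe without seeing any image."""
    chat = [{"role": "user", "content": query}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(
        input_ids=input_ids, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id
    )
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().lower().startswith("unsafe")

# Hypothetical text queries taken from a "multimodal" safety benchmark.
queries = ["How can I use the lockpicks shown in this picture to break into a car?"]
leak_rate = sum(is_unsafe_text_only(q) for q in queries) / len(queries)
print(f"Fraction of queries flagged unsafe from text alone: {leak_rate:.2%}")
```

A high fraction here suggests the benchmark's safety signal lives in the text rather than the image, which is exactly the VSIL pattern described above.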
Figure 2: Four examples from current benchmarks showcasing the problem of visual safety information leakage. The information leaked from the visual modality into the textual query is marked in red.
Construction of VLSBench
To address these flaws in current evaluation benchmarks, the authors introduce VLSBench, a dataset specifically designed to prevent VSIL by ensuring that the textual query does not reveal sensitive information present in the image. The construction of VLSBench involves generating image-text pairs that resist visual leakage through an iterative synthesis process.
Figure 3: Overview of the VLSBench construction pipeline. Our pipeline is designed to prevent visual leakage and includes four steps: (a) harmful query and image description generation; (b) mitigating visual leakage from the textual harmful query; (c) iterative image generation from the image description; (d) final filtration ensuring image-text pairs are matched and harmful.
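A minimal sketch of how these four steps could be orchestrated is shown below. All of the callables (`describe_image`, `detoxify_query`, `generate_image`, and so on) are hypothetical stand-ins for the LLM and text-to-image components, and the control flow is an assumption based on the figure caption rather than the authors' released implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class VLSBenchSample:
    query: str        # harmful intent, with the sensitive detail removed from the text
    image_desc: str   # description carrying the sensitive visual information
    image_path: str

def build_sample(
    raw_query: str,
    describe_image: Callable[[str], str],          # (a) derive an image description from the harmful query
    detoxify_query: Callable[[str, str], str],     # (b) rewrite the query so it no longer leaks the description
    query_leaks: Callable[[str, str], bool],       # (b) check whether the query still reveals the image content
    generate_image: Callable[[str], str],          # (c) text-to-image model; returns a path to the image
    image_matches: Callable[[str, str], bool],     # (c) check that the image actually depicts the description
    pair_is_harmful: Callable[[str, str], bool],   # (d) final filter: matched and genuinely harmful together
    max_image_rounds: int = 3,
) -> Optional[VLSBenchSample]:
    """One pass of the synthesis loop; returns None if a leak-free, matched pair cannot be produced."""
    # (a) Harmful query and image description generation.
    image_desc = describe_image(raw_query)

    # (b) Mitigate visual leakage from the textual harmful query.
    query = detoxify_query(raw_query, image_desc)
    if query_leaks(query, image_desc):
        return None  # the sensitive detail could not be moved out of the text

    # (c) Iterative image generation from the image description.
    image_path = ""
    for _ in range(max_image_rounds):
        image_path = generate_image(image_desc)
        if image_matches(image_path, image_desc):
            break
    else:
        return None  # no generated image matched the description

    # (d) Final filtration: keep only matched, genuinely harmful image-text pairs.
    if not pair_is_harmful(query, image_path):
        return None
    return VLSBenchSample(query=query, image_desc=image_desc, image_path=image_path)
```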
Experimental Findings and Implications
The paper's experiments demonstrate that on VLSBench, models relying purely on textual alignment significantly underperform those trained with multimodal alignment strategies. This clear performance gap underscores the need for more nuanced multimodal alignment techniques once visual safety information leakage is properly controlled.
Figure 4: Textual SFT compared with multimodal SFT on our VLSBench. Dashed lines indicate the average safety rate across the three base models.
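For reference, the safety rate in this comparison is the proportion of model responses that an evaluator judges to be safe on the benchmark's harmful image-text prompts. A minimal sketch, with made-up verdicts standing in for a judge model's outputs:

```python
from typing import Iterable

def safety_rate(verdicts: Iterable[bool]) -> float:
    """Fraction of responses judged safe (e.g., a refusal or a harmless answer)."""
    verdicts = list(verdicts)
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Hypothetical per-sample verdicts from a judge model on VLSBench prompts.
textual_sft_verdicts = [True, False, False, True, False]
multimodal_sft_verdicts = [True, True, False, True, True]

print(f"Textual SFT safety rate:    {safety_rate(textual_sft_verdicts):.0%}")
print(f"Multimodal SFT safety rate: {safety_rate(multimodal_sft_verdicts):.0%}")
```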
Conclusion
In conclusion, the paper emphasizes that addressing VSIL is crucial for realistic multimodal safety evaluation. The introduction of VLSBench is a step toward this goal, providing a challenging benchmark that reflects the complexity of real-world multimodal data without the oversimplified crutch of text-only leakage. The findings suggest that future research should focus on strengthening multimodal safety alignment methods so that visual inputs are genuinely integrated and secured in AI systems.