Overview of "MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal LLMs"
The research paper, "MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal LLMs," presents an in-depth paper focusing on the underexplored aspect of safety within Multimodal LLMs (MLLMs). Given the increasing integration of multimodal capabilities into LLMs, safety concerns become paramount, especially when models interact with various data forms, such as images and text. The authors introduce MM-SafetyBench, a comprehensive framework designed for the safety-critical evaluation of MLLMs against manipulative, query-relevant images.
Motivation and Methodology
The impetus behind this paper stems from recognizing the vulnerability of MLLMs when confronted with images tied to malicious queries. Unlike traditional LLMs, whose security has been studied extensively, the multimodal nature of MLLMs opens new attack vectors, particularly the use of imagery to bypass existing safety measures. To address this, the authors constructed MM-SafetyBench, a dataset of 5,040 text-image pairs spanning 13 scenarios, curated specifically for evaluating MLLM safety.
Key aspects of MM-SafetyBench's construction include:
- Question Generation: Utilizing OpenAI's GPT-4, the authors generated questions that are inherently unsafe across various scenarios. This step ensures that the dataset targets commonly acknowledged unsafe domains.
- Image Creation: Three distinct methods (Stable Diffusion, Typography, and a combination of both) were employed to visually depict key phrases extracted from these questions; a sketch of the typography variant follows this list.
- Evaluation Metrics: The paper employs both Attack Success Rate (ASR) and Refusal Rate (RR) to assess the robustness of MLLMs against the benchmark; a minimal computation sketch also appears below.
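The following is a minimal sketch of the typography-style image creation step. Only the general idea (rendering an extracted key phrase as an image) comes from the paper; the font, canvas size, and layout choices below are assumptions, and the Stable Diffusion variant would instead feed the key phrase to an image-generation pipeline (e.g., diffusers' StableDiffusionPipeline).

```python
# Illustrative sketch of the typography-style image creation step.
# The exact rendering parameters used in MM-SafetyBench are assumptions here.
from PIL import Image, ImageDraw, ImageFont

def typography_image(key_phrase: str,
                     size=(512, 512),
                     font_path: str = "DejaVuSans-Bold.ttf",  # hypothetical font choice
                     font_size: int = 48) -> Image.Image:
    """Render a key phrase as a plain typographic image on a white canvas."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype(font_path, font_size)
    except OSError:
        font = ImageFont.load_default()  # fall back if the font file is unavailable
    # Wrap the phrase into short lines so long phrases stay inside the canvas.
    words, lines, line = key_phrase.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if draw.textlength(candidate, font=font) <= size[0] - 40:
            line = candidate
        else:
            lines.append(line)
            line = w
    lines.append(line)
    y = 40
    for text_line in lines:
        draw.text((20, y), text_line, fill="black", font=font)
        y += font_size + 10
    return img

# Usage: save a typography image for a benign, illustrative key phrase.
typography_image("example key phrase").save("typography_example.png")
```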
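And here is a minimal sketch of the two metrics, assuming each model response has already been labeled as "unsafe", "refusal", or "safe" by some judging step; the judging procedure and label names are assumptions, not details from the paper summary above.

```python
# Minimal sketch of ASR and RR over a list of judged response labels.
from collections import Counter

def attack_success_rate(labels: list[str]) -> float:
    """Fraction of responses that complied with the malicious query."""
    counts = Counter(labels)
    return counts["unsafe"] / len(labels) if labels else 0.0

def refusal_rate(labels: list[str]) -> float:
    """Fraction of responses in which the model refused to answer."""
    counts = Counter(labels)
    return counts["refusal"] / len(labels) if labels else 0.0

# Usage on a toy set of judged responses:
judged = ["unsafe", "refusal", "safe", "unsafe", "refusal"]
print(f"ASR = {attack_success_rate(judged):.2f}, RR = {refusal_rate(judged):.2f}")
```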
Key Findings
The research revealed that MLLMs such as LLaVA-1.5 exhibited significant susceptibility to attacks executed via query-relevant images, even when these models were deemed safety-aligned by conventional standards. Notably, typographic rendering of malicious key phrases proved particularly effective at circumventing safety protocols, producing sharp increases in ASR.
Furthermore, the findings showed that while some models exhibited high refusal rates, seemingly indicating inherent safety, these rates often masked underlying issues such as poor generalization, overfitting, or inadequate Optical Character Recognition (OCR) capabilities. These findings underscore the complexity of assessing and ensuring safety in MLLMs.
Implications and Future Prospects
The implications of this paper are multifaceted. Practically, MM-SafetyBench provides a pivotal tool for developers and researchers intent on fortifying the safety frameworks of MLLMs. The benchmark paves the way for more nuanced safety protocols that consider the interplay of visuals and text, an aspect of growing relevance in AI applications ranging from interactive agents to educational tools.
Theoretically, the work points to a pressing need for MLLMs that integrate safety checks across modalities without compromising flexibility and generalization capacity. Additionally, the proposed safety prompt offers an effective stopgap measure for reducing attack success, underscoring the potential of prompt engineering as a countermeasure.
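As an illustration of that countermeasure, the sketch below prepends a safety instruction to the textual query before it reaches the model. The wording is an approximation rather than the exact safety prompt proposed in the paper, and query_mllm is a hypothetical model interface.

```python
# Illustrative prompt-engineering countermeasure: prepend a safety instruction
# to the textual query. The wording approximates, but is not, the paper's prompt.
SAFETY_PROMPT = (
    "If the question, or the content shown in the image, asks for harmful, "
    "illegal, or unethical information, refuse to answer and explain why."
)

def guarded_query(question: str) -> str:
    """Wrap the user question with the safety instruction."""
    return f"{SAFETY_PROMPT}\n\nQuestion: {question}"

# Usage (query_mllm(text, image_path) stands in for an actual MLLM call):
# response = query_mllm(guarded_query(user_question), image_path="query_image.png")
```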
Future developments in AI, particularly in MLLMs, stand to benefit from this work by adopting more robust alignment techniques that address multimodality explicitly. Further exploration of dynamic safety prompts and adaptive alignment protocols could yield substantial gains in building MLLMs that are both powerful and secure across application domains. The balance between model efficacy and security remains a critical focus as AI technologies continue to evolve.