Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character

Published 25 May 2024 in cs.CR and cs.AI | (2405.20773v2)

Abstract: With the advent and widespread deployment of Multimodal LLMs (MLLMs), ensuring their safety has become increasingly critical. To achieve this objective, it requires us to proactively discover the vulnerability of MLLMs by exploring the attack methods. Thus, structure-based jailbreak attacks, where harmful semantic content is embedded within images, have been proposed to mislead the models. However, previous structure-based jailbreak methods mainly focus on transforming the format of malicious queries, such as converting harmful content into images through typography, which lacks sufficient jailbreak effectiveness and generalizability. To address these limitations, we first introduce the concept of "Role-play" into MLLM jailbreak attacks and propose a novel and effective method called Visual Role-play (VRP). Specifically, VRP leverages LLMs to generate detailed descriptions of high-risk characters and create corresponding images based on the descriptions. When paired with benign role-play instruction texts, these high-risk character images effectively mislead MLLMs into generating malicious responses by enacting characters with negative attributes. We further extend our VRP method into a universal setup to demonstrate its generalizability. Extensive experiments on popular benchmarks show that VRP outperforms the strongest baseline, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models.


Authors (4)

Summary

The paper presents a novel jailbreak attack method using role-playing image characters to exploit vulnerabilities in multimodal LLMs.
It employs a five-step pipeline, including character description using Chain of Thought and image generation via diffusion models.
Experiments show that VRP outperforms the strongest baselines, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models, highlighting significant security risks.

Introduction

The paper "Visual-RolePlay: Universal Jailbreak Attack on MultiModal LLMs via Role-playing Image Character" (2405.20773) studies the vulnerabilities of Multimodal LLMs (MLLMs) and introduces a jailbreak attack named Visual Role-play (VRP). As MLLMs are deployed more widely, their safety matters more, because these models can produce harmful content when given malicious queries. Earlier structure-based jailbreak methods mainly changed the format of the malicious query, for example by rendering harmful text as an image through typography, and fell short on both attack effectiveness and generalizability. VRP instead pairs images of high-risk characters with benign role-play instructions, so that the MLLM produces malicious responses while enacting the character.

Methodology

VRP generates images of detailed high-risk characters and pairs them with benign text instructions that ask the MLLM to play the depicted role. This structure-based approach builds each adversarial text-image pair with a five-step pipeline:

1. Character Description Generation: VRP first uses an LLM, prompted with Chain of Thought techniques, to generate a detailed description of a high-risk character.
2. Character Image Creation: An image of the high-risk character is produced from the description with a diffusion model such as Stable Diffusion, and the character's key traits are rendered in typographic form.
3. Typography of Malicious Queries: The original malicious query is rendered as typography so that its intent reaches the model unchanged, which improves attack efficacy.
4. Image Concatenation: The character image, the description typography, and the question typography are concatenated into a single adversarial image.
5. MLLM Attack Execution: The concatenated image is supplied together with a benign role-play text instruction, which prompts the MLLM to enact the high-risk character and generate malicious content.
Figure 1: The Pipeline of Query-specific VRP.

Experimental Results

VRP achieved higher jailbreak efficacy than existing methods across the evaluated MLLMs: on popular benchmarks it outperformed the strongest baselines, Query relevant and FigStep, by an average Attack Success Rate (ASR) margin of 14.3% across all models. VRP was also evaluated in a universal setting to test its generalizability, that is, whether it carries over to varied malicious queries without extensive customization.
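To make the reported metric concrete, the following is a minimal sketch of how an ASR and an average margin over the strongest baseline could be computed. It is not the authors' evaluation code. The function names, the model and baseline labels, and all numbers are hypothetical placeholders, and taking the strongest baseline separately for each model is an assumption about how the margin is aggregated.

```python
from statistics import mean


def attack_success_rate(judgements):
    """Return the percentage of responses judged as successful attacks.

    `judgements` is a list of booleans, one per query (hypothetical format).
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(1 for j in judgements if j) / len(judgements)


def average_margin(method_asr, baseline_asrs):
    """Average, over models, of method ASR minus the strongest baseline ASR.

    Assumption: the strongest baseline is chosen per model.
    """
    margins = []
    for model, asr in method_asr.items():
        strongest = max(per_model[model] for per_model in baseline_asrs.values())
        margins.append(asr - strongest)
    return mean(margins)


# Placeholder numbers for illustration only; these are not results from the paper.
method = {"model_a": 60.0, "model_b": 40.0}
baselines = {
    "baseline_1": {"model_a": 50.0, "model_b": 20.0},
    "baseline_2": {"model_a": 45.0, "model_b": 30.0},
}

print(attack_success_rate([True, False, True, True]))  # 75.0
print(average_margin(method, baselines))               # 10.0
```

With these placeholder values, the margin is 10.0 on each model (60 vs. 50, and 40 vs. 30), so the average margin is 10.0.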

Implications and Future Work

VRP has implications for both the theoretical understanding and the practical security of MLLMs. It exposes structural vulnerabilities that call for more advanced defense mechanisms. Future work could improve VRP with more advanced character generation strategies and use these insights to develop safeguards suited to new attack paradigms.

Conclusion

In conclusion, Visual Role-play contributes to understanding and testing the robustness of MLLMs against jailbreak attacks. By delivering role-play through images, VRP advances current MM-Jailbreak methods and provides a basis for future research on safeguarding AI models against more intricate adversarial exploits.
