Papers
Topics
Authors
Recent
Search
2000 character limit reached

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Published 6 Jun 2024 in cs.CV and cs.CR | (2406.04031v2)

Abstract: In the realm of large vision LLMs (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing solely visual inputs in the prompt for attacks. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. Initially, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that image prompt LVLMs to respond positively to any harmful queries. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a LLM to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts through a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method significantly outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

Citations (12)

Summary

  • The paper introduces the Bi-Modal Adversarial Prompt Attack, perturbing both visual and textual inputs to bypass LVLM safeguards.
  • It employs query-agnostic image perturbation and intent-specific textual optimization using Chain-of-Thought reasoning and PGD.
  • Experiments on LLaVA, MiniGPT4, and InstructBLIP demonstrate a 29% increase in attack success, exposing critical vulnerabilities in LVLMs.

Jailbreak Vision LLMs via Bi-Modal Adversarial Prompt

The paper "Jailbreak Vision LLMs via Bi-Modal Adversarial Prompt" presents a novel framework for crafting jailbreak attacks on Large Vision LLMs (LVLMs) that are trained to process and integrate information from both visual and textual inputs simultaneously.

Introduction

Incorporating vision into LLMs has resulted in the development of LVLMs, capable of processing tasks such as image captioning, visual question answering, and image retrieval. However, these models often deviate from intended functions, generating misleading or harmful outputs. Existing jailbreak attacks mainly target the visual modality, limiting their efficacy against models integrating both modalities.

To address this gap, the paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which simultaneously perturbs both visual and textual prompts to perform jailbreaks, surpassing existing methods significantly with an average increase of 29.03% in attack success rate.

Framework Overview

The BAP framework consists of two main components:

  1. Query-Agnostic Image Perturbation: This involves embedding universal adversarial perturbations in visual inputs to increase the likelihood of a model generating a positive response, irrespective of the textual prompt’s intent.
  2. Intent-Specific Textual Optimization: This step refines the textual prompts using Chain-of-Thought (CoT) reasoning to manipulate the model's output towards specific harmful intents.

This approach is especially potent because it addresses the limitations of previous methods that relied on perturbing only one modality. Figure 1

Figure 1: Our BAP framework includes two primary modules, i, query-agnostic image perturbing and intent-specific textual optimization, which individually add perturbations to visual and textual prompts. The optimized prompt pairs will induce target LVLMs to generate harmful responses.

Implementation Details

Query-Agnostic Image Perturbation

Adversarial perturbations are embedded universally in visual inputs using a corpus of affirmative expressions and inhibitory phrases, allowing the model to overlook the harmful intent of textual queries and respond positively. This optimization is achieved using Projected Gradient Descent (PGD) to maximize likelihoods of target sentences.

Intent-Specific Textual Optimization

Once a universal visual adversarial prompt is in place, textual prompts are refined to better induce target harmful outputs. This employs an LLM with CoT reasoning in a feedback-iteration manner. The CoT strategy helps analyze failures in generating harmful content, guiding the generation of refined textual adversarial prompts.

Experimental Evaluation

White-box and Black-box Attacks

The study executed extensive evaluations on popular LVLMs like LLaVA, MiniGPT4, and InstructBLIP in both white-box and black-box settings. The results demonstrated BAP's superiority in generating higher attack success rates across various scenarios. Figure 2

Figure 2: Example of BAP jailbreak attacking Gemini and Qwen.

Universal Attack Capability

BAP's ability to perform universal attacks was notably significant, as evidenced in both the query-agnostic and query-dependent settings, where it outperformed baseline methods by substantial margins.

Discussion and Impact

The BAP framework presents new opportunities beyond mere vulnerability exploitation. For instance, it can be used to assess biases in LVLM outputs post-jailbreak or evaluate model robustness against adversarial attacks by altering image or text inputs. Figure 3

Figure 3

Figure 3: Evaluating bias.

Conclusion

The presented work provides a robust mechanism for conducting bi-modal adversarial attacks on LVLMs, highlighting the critical vulnerabilities of such models when subjected to dual-modality perturbations. The framework not only opens avenues for further research into LVLM security but also aids in developing more comprehensive alignment and safety measures in AI deployments. Future research could focus on enhancing gradient-free attack strategies and optimizing computational efficiencies in textual prompt optimization.

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.