- The paper introduces the Bi-Modal Adversarial Prompt Attack, perturbing both visual and textual inputs to bypass LVLM safeguards.
- It employs query-agnostic image perturbation and intent-specific textual optimization using Chain-of-Thought reasoning and PGD.
- Experiments on LLaVA, MiniGPT4, and InstructBLIP demonstrate a 29% increase in attack success, exposing critical vulnerabilities in LVLMs.
Jailbreak Vision LLMs via Bi-Modal Adversarial Prompt
The paper "Jailbreak Vision LLMs via Bi-Modal Adversarial Prompt" presents a novel framework for crafting jailbreak attacks on Large Vision LLMs (LVLMs) that are trained to process and integrate information from both visual and textual inputs simultaneously.
Introduction
Incorporating vision into LLMs has resulted in the development of LVLMs, capable of processing tasks such as image captioning, visual question answering, and image retrieval. However, these models often deviate from intended functions, generating misleading or harmful outputs. Existing jailbreak attacks mainly target the visual modality, limiting their efficacy against models integrating both modalities.
To address this gap, the paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which simultaneously perturbs both visual and textual prompts to perform jailbreaks, surpassing existing methods significantly with an average increase of 29.03% in attack success rate.
Framework Overview
The BAP framework consists of two main components:
- Query-Agnostic Image Perturbation: This involves embedding universal adversarial perturbations in visual inputs to increase the likelihood of a model generating a positive response, irrespective of the textual prompt’s intent.
- Intent-Specific Textual Optimization: This step refines the textual prompts using Chain-of-Thought (CoT) reasoning to manipulate the model's output towards specific harmful intents.
This approach is especially potent because it addresses the limitations of previous methods that relied on perturbing only one modality.
Figure 1: Our BAP framework includes two primary modules, i, query-agnostic image perturbing and intent-specific textual optimization, which individually add perturbations to visual and textual prompts. The optimized prompt pairs will induce target LVLMs to generate harmful responses.
Implementation Details
Query-Agnostic Image Perturbation
Adversarial perturbations are embedded universally in visual inputs using a corpus of affirmative expressions and inhibitory phrases, allowing the model to overlook the harmful intent of textual queries and respond positively. This optimization is achieved using Projected Gradient Descent (PGD) to maximize likelihoods of target sentences.
Intent-Specific Textual Optimization
Once a universal visual adversarial prompt is in place, textual prompts are refined to better induce target harmful outputs. This employs an LLM with CoT reasoning in a feedback-iteration manner. The CoT strategy helps analyze failures in generating harmful content, guiding the generation of refined textual adversarial prompts.
Experimental Evaluation
White-box and Black-box Attacks
The study executed extensive evaluations on popular LVLMs like LLaVA, MiniGPT4, and InstructBLIP in both white-box and black-box settings. The results demonstrated BAP's superiority in generating higher attack success rates across various scenarios.
Figure 2: Example of BAP jailbreak attacking Gemini and Qwen.
Universal Attack Capability
BAP's ability to perform universal attacks was notably significant, as evidenced in both the query-agnostic and query-dependent settings, where it outperformed baseline methods by substantial margins.
Discussion and Impact
The BAP framework presents new opportunities beyond mere vulnerability exploitation. For instance, it can be used to assess biases in LVLM outputs post-jailbreak or evaluate model robustness against adversarial attacks by altering image or text inputs.

Figure 3: Evaluating bias.
Conclusion
The presented work provides a robust mechanism for conducting bi-modal adversarial attacks on LVLMs, highlighting the critical vulnerabilities of such models when subjected to dual-modality perturbations. The framework not only opens avenues for further research into LVLM security but also aids in developing more comprehensive alignment and safety measures in AI deployments. Future research could focus on enhancing gradient-free attack strategies and optimizing computational efficiencies in textual prompt optimization.