Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models (2307.14539v2)

Published 26 Jul 2023 in cs.CR and cs.CL

Abstract: We introduce new jailbreak attacks on vision LLMs (VLMs), which use aligned LLMs and are resilient to text-only jailbreak attacks. Specifically, we develop cross-modality attacks on alignment where we pair adversarial images going through the vision encoder with textual prompts to break the alignment of the LLM. Our attacks employ a novel compositional strategy that combines an image, adversarially targeted towards toxic embeddings, with generic prompts to accomplish the jailbreak. Thus, the LLM draws the context to answer the generic prompt from the adversarial image. The generation of benign-appearing adversarial images leverages a novel embedding-space-based methodology, operating with no access to the LLM model. Instead, the attacks require access only to the vision encoder and utilize one of our four embedding space targeting strategies. By not requiring access to the LLM, the attacks lower the entry barrier for attackers, particularly when vision encoders such as CLIP are embedded in closed-source LLMs. The attacks achieve a high success rate across different VLMs, highlighting the risk of cross-modality alignment vulnerabilities, and the need for new alignment approaches for multi-modal models.

Compositional Adversarial Attacks on Vision-LLMs (VLMs)

This paper presents an approach to adversarial attacks on vision LLMs (VLMs) that exploits cross-modality vulnerabilities. VLMs extend text-only LLMs with vision capabilities, which is increasingly common for tasks requiring both image and text understanding. While text-only attacks are well studied, the paper examines a less explored area: adversarial inputs that combine text and images to circumvent the safety alignment of these models. The results reveal significant vulnerabilities in the alignment strategies that current VLMs employ.

Approach

The proposed attack is compositional: it pairs a vision input with a text input. The key idea is to combine a benign-looking, adversarially crafted image with a generic textual prompt so that the pair breaks the model's alignment and allows potentially harmful content to be generated. The attack operates in embedding space: the image is optimized so that its vision-encoder embedding lies close to the embedding of a malicious trigger. This bypasses the text-side safeguards typically implemented in VLMs, because the LLM draws the harmful context from the image rather than from the prompt, and it requires no direct access to the LLM component.

Methodology

The authors describe an approach that is black-box with respect to the underlying LLM. Adversarial images are generated using only the vision encoder, such as CLIP, which is often embedded in closed-source systems. These images are crafted to match target embeddings corresponding to adversarial triggers in the joint vision-language space. The embedding-based attacks are compositional: the same adversarial image can be reused with different text instructions to achieve jailbreaks across multiple scenarios.
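The embedding-matching step can be illustrated with a toy sketch. This is not the authors' implementation: the linear map `W` is a hypothetical stand-in for a real vision encoder such as CLIP, and the function names (`encode`, `match_embedding`), the squared-distance loss, the plain gradient-descent update, and the values of `steps` and `lr` are all assumptions chosen to keep the example self-contained. It only shows the general idea of adjusting pixels until the encoder output approaches a target embedding, with access to the encoder alone.

```python
import random


def encode(W, x):
    """Toy stand-in for a vision encoder: a fixed linear map from pixels to an embedding."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def match_embedding(W, x_start, target, steps=500, lr=0.05):
    """Move an image toward a target embedding using only the (toy) encoder.

    Minimizes ||W x - target||^2 by gradient descent and clips every pixel
    back into [0, 1] after each step so the result stays a valid image.
    """
    x = list(x_start)
    for _ in range(steps):
        diff = [e - t for e, t in zip(encode(W, x), target)]
        # Gradient of ||W x - target||^2 with respect to x is 2 * W^T * diff.
        grad = [
            2 * sum(W[i][j] * diff[i] for i in range(len(W)))
            for j in range(len(x))
        ]
        x = [min(1.0, max(0.0, xj - lr * gj)) for xj, gj in zip(x, grad)]
    return x


rng = random.Random(0)
D_IMG, D_EMB = 16, 4  # hypothetical toy sizes, far smaller than a real encoder

# Hypothetical encoder weights; a real attack would query the actual vision encoder.
W = [[rng.uniform(-0.5, 0.5) for _ in range(D_IMG)] for _ in range(D_EMB)]

# The target embedding is taken from some other image (standing in for a trigger).
target = encode(W, [rng.random() for _ in range(D_IMG)])

x_start = [rng.random() for _ in range(D_IMG)]  # starting image
x_adv = match_embedding(W, x_start, target)

before = sq_dist(encode(W, x_start), target)
after = sq_dist(encode(W, x_adv), target)
print(f"distance to target before: {before:.4f}, after: {after:.6f}")
```

Because the optimization touches only the encoder, the resulting image can be stored and later paired with any generic text prompt, which is the compositional property the paragraph above describes.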

Experimental Results

The experiments show high success rates when adversarial images are targeted at specific types of malicious triggers. Among the strategies tested, combining Optical Character Recognition (OCR) textual triggers with visual triggers proved the most effective at bypassing model safety guards. The evaluation covers models such as LLaVA and LLaMA-Adapter V2, both of which show vulnerabilities rooted in how they align the image and text modalities.

From a practical standpoint, the paper highlights a concerning lack of robustness in popular VLM architectures: adversarial images can contaminate the context and steer the model toward unsafe or biased outputs. Human evaluations agree with automatic toxicity assessments, confirming that the attack produces harmful content despite existing safety measures.

Implications and Future Work

The implications for AI safety are significant: current alignment methods do not adequately address cross-modality threats. This calls for rethinking how models are trained, so that alignment covers not just individual modalities but also their integration. The findings suggest that aligning models holistically across all input types might mitigate such adversarial exploits more effectively.

Looking forward, the paper opens several avenues for future research, including refining adversarial image generation techniques and developing more resilient multimodal alignment strategies. A focus on embedding space understanding and cross-modality interactions will be critical as we pursue more robust, aligned AI systems. Additionally, the approach sets a foundation for further exploration of black-box attacks that exploit commonly integrated elements like vision encoders without accessing proprietary LLMs, thus lowering the entry barrier for potential threats in real-world applications.

In conclusion, this paper contributes to the study of adversarial attacks by identifying and exploiting vulnerabilities in the cross-modality integration of VLMs. These insights can inform defenses against increasingly sophisticated attacks and support safer deployment of AI systems that combine vision and language processing.

Authors (3)

Erfan Shayegani
Yue Dong
Nael Abu-Ghazaleh
