Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

Published 30 Nov 2024 in cs.CV | (2412.00473v5)

Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing methods struggle against state-of-the-art VLMs like GPT-4o, due to the over-exposure of harmful content and lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML utilizes an encryption-decryption process across text and image modalities to mitigate over-exposure of malicious information. To align the model's output with malicious intent covertly, MML employs a technique called "evil alignment", framing the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness. Specifically, MML jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench and 99.07% on HADES-Dataset. Our code is available at https://github.com/wangyu-ovo/MML.

Authors (5)

Summary

The paper introduces a cryptography-inspired Multi-Modal Linkage (MML) attack that bypasses the safeguards of GPT-4o with attack success rates above 97% on three benchmarks.
It hides malicious content behind an encryption-decryption process (word replacement, image mirroring, image rotation, or base64 encoding) and uses "evil alignment" to steer the model's output toward the malicious objective.
Experimental results show MML outperforming baseline attacks, including when the AdaShield-Static prompt-based defense is applied.

Introduction

The paper "Jailbreak Large Vision-Language Models Through Multi-Modal Linkage" (2412.00473) presents a framework for jailbreak attacks on VLMs that applies a cryptography-inspired encryption-decryption process across the text and image modalities. The framework, termed the Multi-Modal Linkage (MML) Attack, targets two limitations of existing methods: over-exposure of harmful content in the input and the lack of stealthy malicious guidance.

Figure 1: Existing structure-based attacks compared with our MML method.

Despite advances in safety alignment, VLMs remain vulnerable to inputs crafted to elicit content that violates ethical and legal standards. MML manipulates model outputs stealthily and reaches attack success rates above 97% against GPT-4o on the three benchmarks evaluated.

MML Attack Framework

Encryption-Decryption Process

MML uses an encryption-decryption mechanism to obscure harmful inputs, with four alternative strategies: word replacement, image mirroring, image rotation, and base64 encoding. These are reversible transformations rather than keyed cryptographic encryption; the cryptographic framing is an analogy. Word replacement uses a part-of-speech tagging tool to replace key nouns and adjectives with neutral terms (Figure 2).

Figure 2: Illustration of MML's image inputs using word replacement as an encryption method.

The framework first renders the malicious text as an encrypted typographic image, then guides the model at inference time to decrypt that content. Because the harmful request never appears in plain form in the input, over-exposure of malicious information is reduced.

Evil Alignment

Neutral text guidance alone does not reliably steer the model toward the attacker's goal. To address this limitation, MML incorporates a technique called "evil alignment," which frames the attack within a simulated video game production scenario. This framing covertly aligns the model's output with the malicious intent, guiding the VLM to carry out the decrypted instruction.

Experimental Results

Experiments on SafeBench, MM-SafeBench, and HADES-Dataset validate MML's effectiveness: against GPT-4o, it achieves attack success rates of 97.80%, 98.81%, and 99.07%, respectively.

Figure 3: Attack success rates of baselines vs. MML across different datasets.
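The attack success rate (ASR) reported above is the percentage of test prompts for which the model's response is judged to be a successful jailbreak. The sketch below shows one way such a per-dataset figure could be tallied. The `JudgedResponse` record, the `attack_success_rate` function, and the toy dataset names are hypothetical and not taken from the paper's code, and how each response is judged is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedResponse:
    """One evaluated model response (hypothetical record layout)."""
    dataset: str         # benchmark the prompt came from
    is_jailbroken: bool  # verdict of whatever judge is used; assumed given


def attack_success_rate(results):
    """Return {dataset: ASR in percent} for a list of JudgedResponse."""
    totals, successes = {}, {}
    for r in results:
        totals[r.dataset] = totals.get(r.dataset, 0) + 1
        successes[r.dataset] = successes.get(r.dataset, 0) + int(r.is_jailbroken)
    return {name: 100.0 * successes[name] / totals[name] for name in totals}


# Toy verdicts for illustration only; these are not results from the paper.
toy_results = [
    JudgedResponse("toy_a", True),
    JudgedResponse("toy_a", True),
    JudgedResponse("toy_a", False),
    JudgedResponse("toy_a", True),
    JudgedResponse("toy_b", False),
    JudgedResponse("toy_b", True),
]

print(attack_success_rate(toy_results))  # {'toy_a': 75.0, 'toy_b': 50.0}
```

Reporting the rate per dataset, rather than pooling all prompts, keeps benchmarks of different sizes from dominating one another.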

Performance Under Defense

The paper evaluates MML under AdaShield-Static, a prompt-based defense. MML outperforms the baseline methods even when the defense prompt is applied, which indicates that prompt-based defenses alone offer limited protection against it. The choice of encryption method matters: different methods yield different success rates across models and scenarios.

Figure 4: Jailbreak score distribution across different defense settings.
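A static prompt-based defense of the kind evaluated here can be pictured as a fixed safety instruction attached to every user query before it reaches the model, after which the judge scores are compared across defense settings. The sketch below illustrates that evaluation setup from the defender's side. It assumes the defense prompt is prepended to the text input; the `DEFENSE_PROMPT` wording is a placeholder rather than the actual AdaShield-Static prompt, and the function names and integer score values are hypothetical.

```python
from collections import Counter

# Placeholder wording; NOT the actual AdaShield-Static prompt.
DEFENSE_PROMPT = (
    "Before answering, check the image and text for harmful or unsafe "
    "content. If any is present, refuse to answer."
)


def with_static_defense(user_text, defense_prompt=DEFENSE_PROMPT):
    """Prepend a fixed defense prompt to the user's text input."""
    return f"{defense_prompt}\n\n{user_text}"


def score_distribution(scores):
    """Count how often each judge score occurs, sorted by score."""
    return dict(sorted(Counter(scores).items()))


# Toy judge scores for illustration only; not data from the paper.
scores_by_setting = {
    "no_defense": [5, 5, 4, 5, 0],
    "static_defense": [5, 0, 4, 0, 0],
}

for setting, scores in scores_by_setting.items():
    print(setting, score_distribution(scores))
```

Comparing full score distributions instead of a single success rate shows whether a defense turns successful attacks into outright refusals or only lowers their scores slightly.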

Conclusion

MML exploits VLM vulnerabilities by combining cryptography-inspired obfuscation of the input with contextual alignment of the output. By addressing the over-exposure of malicious content and the lack of stealthy guidance in earlier attacks, it outperforms the baseline methods evaluated. Future research may explore the broader applicability of the MML framework, including new encryption-decryption techniques, as well as countermeasures that improve model robustness without compromising performance.
