- The paper introduces a novel cryptography-inspired Multi-Modal Linkage (MML) attack that bypasses VLM safeguards with success rates over 97%.
- It employs an encryption-decryption process using techniques like word replacement and image transformations to align model outputs with malicious objectives.
- Experimental results show MML outperforming existing methods even under defense mechanisms, setting a new benchmark for exploiting VLM vulnerabilities.
Jailbreak Large Vision-LLMs Through Multi-Modal Linkage
Introduction
The paper "Jailbreak Large Vision-LLMs Through Multi-Modal Linkage" (2412.00473) presents a novel framework for conducting jailbreak attacks on VLMs, leveraging a cryptography-inspired encryption-decryption process across text and image modalities. This framework, termed Multi-Modal Linkage (MML) Attack, addresses the limitations of over-exposing harmful content and the lack of stealthy malicious guidance prevalent in existing methods.

Figure 1: Existing structure-based attacks compared with our MML method.
Despite advancements in safety alignment, VLMs remain vulnerable to input crafted to generate content misaligned with ethical standards. The MML framework introduces a stealthy approach to manipulate model outputs, achieving high success rates across state-of-the-art VLMs.
MML Attack Framework
Encryption-Decryption Process
MML employs an encryption-decryption mechanism to obscure harmful inputs, employing four strategies: word replacement, image mirroring, image rotation, and base64 encoding. Word replacement uses a part-of-speech tagging tool to replace key nouns and adjectives with neutral terms (Figure 2).
Figure 2: Illustration of MML's image inputs using word replacement as an encryption method.
The framework first transforms malicious text into an encrypted typographical image, then guides the model during inference to decrypt this content, enhancing the covert alignment with malicious objectives.
Evil Alignment
To counter the neutral text guidance limitation, MML incorporates a technique called "evil alignment," which places the attack within a simulated video game production context. This aligns the model's outputs with malevolent goals, effectively guiding VLMs to fulfill the intended malicious instructions.
Experimental Results
Experiments on SafeBench, MM-SafeBench, and HADES-Dataset validate MML's effectiveness, achieving attack success rates of 97.80%, 98.81%, and 99.07%, respectively, against advanced models like GPT-4o.
Figure 3: Attack success rates of baselines vs. MML across different datasets.
The paper evaluates MML under AdaShield-Static, a common defensive strategy. MML outperforms baseline methods even when defense prompts are used, indicating robust resistance against prompt-based defenses. The adaptability of the encryption-decryption process plays a crucial role, with different encryption methods showcasing varied success rates across models and scenarios.
Figure 4: Jailbreak score distribution across different defense settings.
Conclusion
MML demonstrates a sophisticated approach to exploiting VLM vulnerabilities through strategic cryptographic techniques and contextual alignment. By addressing the shortcomings of direct malicious exposure and inadequate stealthy guidance, it sets a new benchmark for attack strategies. Future research may explore the broader applicability of the MML framework, including novel encryption-decryption techniques and countermeasures to enhance model robustness without compromising performance.