Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Published 8 Dec 2024 in cs.CR and cs.AI | (2412.05934v3)

Abstract: With the rapid advancement of multimodal LLMs (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks. Code is available at https://github.com/MaTengSYSU/HIMRD-jailbreak.

Abstract PDF HTML Upgrade to Chat

Summary

The paper presents HIMRD, a novel black-box attack that decomposes malicious prompts into benign segments across modalities to evade safety defenses.
It employs a two-phase heuristic search to optimize prompt reconstruction and boost affirmative responses, enhancing attack effectiveness.
Experimental results show ASRs up to 96.57% on open-source MLLMs, underlining vulnerabilities in current cross-modal safety alignments.

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attacks for MLLMs

Motivation and Overview

Multimodal LLMs (MLLMs) present a significant expansion of conventional LLMs, integrating textual and visual modalities for richer contextual understanding and broader applicability. Such architectures, however, introduce complex risk profiles for safety and robustness, notably exposing new attack surfaces for adversarial jailbreaks. Standard jailbreaking approaches often concentrate malicious prompts in a single modality, thus simplifying detection for well-aligned models. The "Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal LLMs" proposes HIMRD, a black-box attack framework capable of distributing risk across modalities and systematically optimizing prompt inputs for maximum evasiveness and effectiveness.

Figure 1: Schematic of HIMRD attack distributing malicious instructions across text and image, effectively subverting multimodal alignment safeguards in MLLMs such as GPT-4o.

Attack Formulation and Threat Model

The HIMRD attack operates under a pure black-box setting, assuming adversarial access solely to model outputs. The central insight is the decomposition of a malicious prompt $t$ into two distinct and independently harmless segments $t_1$ , $t_2$ , such that $t = t_1 + t_2$ and $J(t_1) = J(t_2) = 0$ , $J(t) = 1$ , where $J(\cdot)$ is a harmfulness judge function. These segments are then embedded into text and image modalities respectively, guaranteeing that no input individually exhibits harmful semantics.

The optimization goal is:

$\max_{r_{adv} \in \mathbb{R}} \log p(y_t | r_{adv})$

with the adversarial input $r_{adv}$ constructed as:

$r_{adv} = \psi(x_v \oplus \phi_v(t_1), x_t \oplus \phi_t(t_2))$

$\psi$ is the multimodal fusion, while $\phi_v, \phi_t$ are risk distribution strategies for vision and text, respectively.

Multimodal Risk Distribution Strategy

Drawing conceptual parallels to DRA-style prompt distribution, HIMRD extends the methodology to multimodal space. Importantly, an auxiliary LLM is leveraged both for syntactically splitting malicious prompts with semantically innocuous boundaries and for generating highly relevant image captions, subsequently rendered via T2I models (SD3-Medium). Typographic embedding (Pillow, $512 \times 100$ image) is employed for visual prompts, while templated text carries partial semantics. None of the embedded segments independently triggers defense heuristics under default and custom refusal prefixes, confirming effective circumvention of input filtering.

This approach is validated with a refusal rate of $<0.02\%$ per modality, decisively outperforming single-modality attacks.

Heuristic-Induced Search: Prompt Optimization Framework

A two-phase heuristic search mechanism further refines text prompts to maximize both semantic reconstruction and affirmative output probabilities:

Understanding-Enhancing Prompts ( $p_u$ ): Iterative rewriting focused on guiding models to correctly combine distributed prompt fragments, ensuring reconstruction of the original malicious intent. Auxiliary LLMs suggest lexical, syntactic, and instruction variations, leveraging model-specific comprehension weaknesses.
Inducing Prompts ( $p_i$ ): Systematic addition of instruction-following and refusal-reduction cues (e.g., explicit requests for detailed output, negative phrasing filters), empirically boosting affirmative responses over refusals.

Algorithmically, the process iteratively alternates between optimizing $p_u$ (understanding) and $p_i$ (affirmation), guided by output discriminators ( $C_u$ , $C_j$ based on HarmBench and explicit refusal metrics). Hyperparameter choices ( $N_1, N_2 = 5$ ) ensure convergence within five iterations.

Implementation: All auxiliary LLM calls (OpenAI o1-mini) and multimodal pipeline steps are scriptable, with T2I generation, typographic assembly, and judgment routines implemented in Python. The methodology generalizes across APIs and local open-source deployments, scaling well with GPU resources (A100 80GB recommended for SD3-Medium).

Experimental Results and Quantitative Evaluation

Experiments employ the standardized SafeBench dataset, spanning seven risk categories and diverse models (seven open-source: DeepSeek-VL, LLaVA, GLM-4V-9B, Qwen-VL, Yi-VL, MiniGPT-4, etc.; three closed-source: GPT-4o, Gemini-1.5-Pro, Qwen-VL-Max).

Open-Source Models: HIMRD achieves average ASR of 90.86%, markedly higher than FigStep (63.59%) and MM-SafeBench (43.76%). For e.g., DeepSeek-VL: 94.57%, LLaVA-V1.6: 96.57%.
Closed-Source Models: HIMRD attains 68.09% ASR across closed APIs, (GPT-4o: 44.29%, Qwen-VL-Max: 95.71%). These findings highlight that HIMRD successfully penetrates state-of-the-art alignment defenses in both research and proprietary systems.

Ablation studies reveal that heuristic-induced search is a dominant contributor to ASR, with up to 19.72% improvement in LLaVA-V1.5 through $p_u$ and $p_i$ optimization. Radar chart visualizations (Supplementary Figures) demonstrate consistent broad-category effectiveness, though performance in the Pornography category is less optimal, suggesting an avenue for further prompt engineering research.

Practical and Theoretical Implications

The results substantiate the contradictory claim that state-of-the-art multimodal alignment mechanisms remain susceptible to risk-distributed attacks, even with robust input/output filtering and modal-specific defenses. The HIMRD framework exposes limitations in current cross-modal safety alignment algorithms, which fail to jointly reason about distributed semantic composition.

Practically, this demonstrates a substantial threat to real-world deployments, particularly in conversational agents, API integrations, and autonomous systems. Automated auxiliary LLM orchestration, as utilized in HIMRD, presents scalable strategies for red-teaming, benchmarking, and adversarial robustness testing across model releases.

Theoretically, the work suggests that multimodal safety alignment must evolve toward holistic risk aggregation and compositional harm detection, not relying solely on per-modal heuristics or refusal prompt engineering. There remains a gap in the development of cross-modal semantic parsers capable of joint harmfulness evaluation.

Future Directions

Joint Harmfulness Contextualization: Enhancing models to reason about distributed adversarial semantics and reconstruct intent at fusion layers.
Defensive Prompt Engineering: Developing refusal systems that generalize across compositional multimodal attacks, possibly leveraging multi-agent discriminators.
Combinatorial Attack Benchmarks: Expanding datasets (beyond SafeBench) and attack strategies to cover broader semantic and visual compositions for comprehensive evaluation.
Automated Red-Teaming Systems: Integrating auxiliary LLMs for real-time adversarial scenario generation and prompt optimization pipelines.

Conclusion

The HIMRD attack framework decisively advances the state-of-the-art in black-box multimodal jailbreaks, demonstrating that cross-modal risk distribution combined with heuristic-induced prompt optimization can circumvent alignment mechanisms in both open and closed-source MLLMs. The methodology generalizes well across model families and threat categories, and ablation analysis confirms the contribution of each component. These findings highlight the necessity for future model alignment research to prioritize compositional and multimodal harm detection rather than isolated per-modal safeguards.

Markdown