- The paper presents HIMRD, a novel black-box attack that decomposes malicious prompts into benign segments across modalities to evade safety defenses.
- It employs a two-phase heuristic search to optimize prompt reconstruction and boost affirmative responses, enhancing attack effectiveness.
- Experimental results show ASRs up to 96.57% on open-source MLLMs, underlining vulnerabilities in current cross-modal safety alignments.
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attacks for MLLMs
Motivation and Overview
Multimodal LLMs (MLLMs) present a significant expansion of conventional LLMs, integrating textual and visual modalities for richer contextual understanding and broader applicability. Such architectures, however, introduce complex risk profiles for safety and robustness, notably exposing new attack surfaces for adversarial jailbreaks. Standard jailbreaking approaches often concentrate malicious prompts in a single modality, thus simplifying detection for well-aligned models. The "Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal LLMs" proposes HIMRD, a black-box attack framework capable of distributing risk across modalities and systematically optimizing prompt inputs for maximum evasiveness and effectiveness.
Figure 1: Schematic of HIMRD attack distributing malicious instructions across text and image, effectively subverting multimodal alignment safeguards in MLLMs such as GPT-4o.
The HIMRD attack operates under a pure black-box setting, assuming adversarial access solely to model outputs. The central insight is the decomposition of a malicious prompt t into two distinct and independently harmless segments t1, t2, such that t=t1+t2 and J(t1)=J(t2)=0, J(t)=1, where J(⋅) is a harmfulness judge function. These segments are then embedded into text and image modalities respectively, guaranteeing that no input individually exhibits harmful semantics.
The optimization goal is:
radv∈Rmaxlogp(yt∣radv)
with the adversarial input radv constructed as:
radv=ψ(xv⊕ϕv(t1),xt⊕ϕt(t2))
ψ is the multimodal fusion, while ϕv,ϕt are risk distribution strategies for vision and text, respectively.
Multimodal Risk Distribution Strategy
Drawing conceptual parallels to DRA-style prompt distribution, HIMRD extends the methodology to multimodal space. Importantly, an auxiliary LLM is leveraged both for syntactically splitting malicious prompts with semantically innocuous boundaries and for generating highly relevant image captions, subsequently rendered via T2I models (SD3-Medium). Typographic embedding (Pillow, 512×100 image) is employed for visual prompts, while templated text carries partial semantics. None of the embedded segments independently triggers defense heuristics under default and custom refusal prefixes, confirming effective circumvention of input filtering.
This approach is validated with a refusal rate of <0.02% per modality, decisively outperforming single-modality attacks.
Heuristic-Induced Search: Prompt Optimization Framework
A two-phase heuristic search mechanism further refines text prompts to maximize both semantic reconstruction and affirmative output probabilities:
- Understanding-Enhancing Prompts (pu): Iterative rewriting focused on guiding models to correctly combine distributed prompt fragments, ensuring reconstruction of the original malicious intent. Auxiliary LLMs suggest lexical, syntactic, and instruction variations, leveraging model-specific comprehension weaknesses.
- Inducing Prompts (pi): Systematic addition of instruction-following and refusal-reduction cues (e.g., explicit requests for detailed output, negative phrasing filters), empirically boosting affirmative responses over refusals.
Algorithmically, the process iteratively alternates between optimizing pu (understanding) and pi (affirmation), guided by output discriminators (Cu, Cj based on HarmBench and explicit refusal metrics). Hyperparameter choices (N1,N2=5) ensure convergence within five iterations.
Implementation: All auxiliary LLM calls (OpenAI o1-mini) and multimodal pipeline steps are scriptable, with T2I generation, typographic assembly, and judgment routines implemented in Python. The methodology generalizes across APIs and local open-source deployments, scaling well with GPU resources (A100 80GB recommended for SD3-Medium).
Experimental Results and Quantitative Evaluation
Experiments employ the standardized SafeBench dataset, spanning seven risk categories and diverse models (seven open-source: DeepSeek-VL, LLaVA, GLM-4V-9B, Qwen-VL, Yi-VL, MiniGPT-4, etc.; three closed-source: GPT-4o, Gemini-1.5-Pro, Qwen-VL-Max).
- Open-Source Models: HIMRD achieves average ASR of 90.86%, markedly higher than FigStep (63.59%) and MM-SafeBench (43.76%). For e.g., DeepSeek-VL: 94.57%, LLaVA-V1.6: 96.57%.
- Closed-Source Models: HIMRD attains 68.09% ASR across closed APIs, (GPT-4o: 44.29%, Qwen-VL-Max: 95.71%). These findings highlight that HIMRD successfully penetrates state-of-the-art alignment defenses in both research and proprietary systems.
Ablation studies reveal that heuristic-induced search is a dominant contributor to ASR, with up to 19.72% improvement in LLaVA-V1.5 through pu and pi optimization. Radar chart visualizations (Supplementary Figures) demonstrate consistent broad-category effectiveness, though performance in the Pornography category is less optimal, suggesting an avenue for further prompt engineering research.
Practical and Theoretical Implications
The results substantiate the contradictory claim that state-of-the-art multimodal alignment mechanisms remain susceptible to risk-distributed attacks, even with robust input/output filtering and modal-specific defenses. The HIMRD framework exposes limitations in current cross-modal safety alignment algorithms, which fail to jointly reason about distributed semantic composition.
Practically, this demonstrates a substantial threat to real-world deployments, particularly in conversational agents, API integrations, and autonomous systems. Automated auxiliary LLM orchestration, as utilized in HIMRD, presents scalable strategies for red-teaming, benchmarking, and adversarial robustness testing across model releases.
Theoretically, the work suggests that multimodal safety alignment must evolve toward holistic risk aggregation and compositional harm detection, not relying solely on per-modal heuristics or refusal prompt engineering. There remains a gap in the development of cross-modal semantic parsers capable of joint harmfulness evaluation.
Future Directions
- Joint Harmfulness Contextualization: Enhancing models to reason about distributed adversarial semantics and reconstruct intent at fusion layers.
- Defensive Prompt Engineering: Developing refusal systems that generalize across compositional multimodal attacks, possibly leveraging multi-agent discriminators.
- Combinatorial Attack Benchmarks: Expanding datasets (beyond SafeBench) and attack strategies to cover broader semantic and visual compositions for comprehensive evaluation.
- Automated Red-Teaming Systems: Integrating auxiliary LLMs for real-time adversarial scenario generation and prompt optimization pipelines.
Conclusion
The HIMRD attack framework decisively advances the state-of-the-art in black-box multimodal jailbreaks, demonstrating that cross-modal risk distribution combined with heuristic-induced prompt optimization can circumvent alignment mechanisms in both open and closed-source MLLMs. The methodology generalizes well across model families and threat categories, and ablation analysis confirms the contribution of each component. These findings highlight the necessity for future model alignment research to prioritize compositional and multimodal harm detection rather than isolated per-modal safeguards.