
JPS: Jailbreak Multimodal Large Language Models with Collaborative Visual Perturbation and Textual Steering (2508.05087v1)

Published 7 Aug 2025 in cs.MM, cs.AI, cs.CL, and cs.CR

Abstract: Jailbreak attacks against multimodal LLMs (MLLMs) are a significant research focus. Current research predominantly aims to maximize attack success rate (ASR), often overlooking whether the generated responses actually fulfill the attacker's malicious intent. This oversight frequently leads to low-quality outputs that bypass safety filters but lack substantial harmful content. To address this gap, we propose JPS (Jailbreak MLLMs with collaborative visual Perturbation and textual Steering), which achieves jailbreaks via cooperation between a visual image and a textual steering prompt. Specifically, JPS utilizes target-guided adversarial image perturbations for effective safety bypass, complemented by a "steering prompt" optimized via a multi-agent system to guide the LLM's responses toward fulfilling the attacker's intent. These visual and textual components undergo iterative co-optimization for enhanced performance. To evaluate the quality of attack outcomes, we propose the Malicious Intent Fulfillment Rate (MIFR) metric, assessed using a Reasoning-LLM-based evaluator. Our experiments show JPS sets a new state of the art in both ASR and MIFR across various MLLMs and benchmarks, with analyses confirming its efficacy. Code is available at https://github.com/thu-coai/JPS. Warning: This paper contains potentially sensitive content.

Summary

  • The paper presents JPS, a novel method that decouples safety bypass and malicious intent steering in MLLMs.
  • It utilizes target-guided visual perturbation via PGD and a multi-agent textual system to optimize harmful response generation.
  • JPS outperforms prior methods by achieving over 92% ASR and 87% MIFR, demonstrating robust effectiveness against advanced defenses.

Jailbreak Multimodal LLMs with Collaborative Visual Perturbation and Textual Steering

Introduction

The paper presents JPS, a methodology for executing jailbreak attacks on Multimodal LLMs (MLLMs). These attacks aim to bypass the safety filters embedded in MLLMs and fulfill harmful intents formulated by the attacker. The primary novelty of JPS lies in its synergistic use of visual perturbations and textual steering, optimized to maximize both Attack Success Rate (ASR) and Malicious Intent Fulfillment Rate (MIFR). This departs from previous methods, which primarily focused on ASR without ensuring that the malicious intent was fully realized (Figure 1).

Figure 1: Failure modes of jailbreak responses that successfully bypass safety but lack attack utility.

Methodology

Decoupling Strategy for Safety Bypass and Quality Steering

JPS takes a two-pronged approach: visual perturbations undermine the model's safety mechanisms, while textual prompts steer its responses toward the specific harmful intent. This decoupling allows each component to be optimized independently, enhancing overall effectiveness.

  • Visual Perturbation: Target-guided adversarial image perturbations are optimized via Projected Gradient Descent (PGD). The perturbations are designed to be transferable, enabling safety bypass across varying contexts at modest computational cost.
  • Textual Steering: A Multi-Agent System (MAS) with roles such as Judger, Summarizer, and Revisor iteratively refines the steering prompt, improving responses along both the Instruction Following and Content Harmfulness dimensions (Figure 2).

    Figure 2: Overview of JPS, which iteratively alternates between (1) optimizing target-guided image perturbations for safety bypass and (2) refining the steering prompt via a Multi-Agent System to fulfill the malicious intent in responses.
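
As a point of reference for the visual component, a standard target-guided PGD update on the image perturbation takes the following form; the exact loss, step schedule, and constraint used in JPS are not reproduced in this summary, so this is a generic formulation rather than the authors' precise objective:

\[
\delta^{t+1} = \Pi_{\|\delta\|_\infty \le \epsilon}\!\left( \delta^{t} - \alpha \cdot \operatorname{sign}\!\left( \nabla_{\delta}\, \mathcal{L}\big(f_\theta(x + \delta^{t},\, p),\; y_{\mathrm{tgt}}\big) \right) \right),
\]

where $x$ is the clean image, $p$ the textual prompt, $y_{\mathrm{tgt}}$ the affirmative target prefix that provides guidance, $\mathcal{L}$ the autoregressive cross-entropy loss, $\alpha$ the step size, and $\Pi$ the projection onto the $\ell_\infty$ ball of radius $\epsilon$. In the co-optimization loop of Figure 2, this inner PGD optimization alternates with MAS-based refinement of the steering prompt.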

Evaluation Metrics

The paper introduces MIFR, a metric designed to complement ASR by assessing the extent to which the generated responses genuinely fulfill the malicious intent. MIFR is calculated via a Reasoning-LLM-based evaluation pipeline that scrutinizes both the instruction adherence and the actionable utility of the responses.
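
The paper's exact scoring rubric is not reproduced here, but treating the evaluator as a binary judge suggests a natural formalization of MIFR (an assumed reading, not the authors' verbatim definition):

\[
\mathrm{MIFR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[ E(q_i, r_i) = \text{fulfilled} \right],
\]

where $q_i$ is the $i$-th malicious query, $r_i$ the model's response, $N$ the number of test cases, and $E$ the Reasoning-LLM-based evaluator that judges whether $r_i$ both adheres to the instruction and provides actionable content realizing the intent.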

Experimental Results

JPS sets a new standard in both ASR and MIFR, outperforming existing methods across various benchmarks and MLLM architectures:

  • Performance Metrics: On models such as InternVL2 and MiniGPT-4, JPS achieved an ASR above 92% and an MIFR exceeding 87%. These results underscore the robustness of JPS's decoupled strategy in aligning outputs with specific harmful intents (Figure 3).

    Figure 3: Analysis of target-guided optimization. Target guidance leads to lower loss and faster convergence (top), and achieves near-perfect matching of the target affirmative prefix (bottom) compared to optimization without guidance.

Robustness Against Defenses

When tested against defense mechanisms such as AdaShield-A and ECSO, JPS maintained high effectiveness, illustrating its resilience. This robustness is attributed to the adaptive nature of its co-optimization framework, which adjusts to varied adversarial contexts.

Conclusion

JPS presents a refined pipeline for executing high-fidelity jailbreak attacks on MLLMs, enabling an adversary to bypass conventional safeguards while fully realizing the malicious intent. The introduction of MIFR offers a more complete picture of attack efficacy from the adversarial standpoint. Future directions may involve strengthening defense mechanisms against such attacks and applying the decoupling strategy to other domains within AI security.
