Image Jailbreaking Prompt (imgJP)

Updated 12 September 2025
  • Image jailbreaking prompts (imgJP) are deliberately engineered images designed to bypass multimodal model safety using visual perturbations and training-data poisoning.
  • They leverage gradient-based optimization, steganography, and decoupled co-optimization to overcome content filters and achieve high attack success rates.
  • These adversarial strategies expose critical security and ethical vulnerabilities, emphasizing the urgent need for robust, holistic defenses in multimodal AI.

An image jailbreaking prompt (imgJP) is a deliberately crafted image input designed to bypass the safety and alignment mechanisms of multimodal LLMs (MLLMs) or vision-language models (VLMs), inducing them to generate harmful or objectionable outputs in response to otherwise protected or refused queries. Unlike conventional text-based jailbreaks, imgJPs exploit the visual modality to manipulate the model’s internal representations and overcome content filters, sometimes acting independently or synergistically with text prompts. The introduction of imgJP stems from the increasing integration of visual perception and language understanding in advanced generative AI systems, exposing new attack surfaces and transfer pathways for adversarial exploitation.

1. Core Principles and Definition

An imgJP is an image intentionally optimized—either via targeted perturbation, poisoning, steganographic embedding, or compositional manipulation—to subvert MLLM or VLM safety guardrails. The attack can occur in both inference-time (test-time adversarial perturbation) and training-time (poisoning) scenarios:

  • Test-Time imgJP: The image is directly manipulated post-training by optimizing pixel values, subject to norm constraints, such that when paired with a harmful or even innocuous query, the model’s generative response is prodded into violating safety boundaries (Niu et al., 4 Feb 2024, Kim et al., 26 May 2025).
  • Training-Time imgJP (ImgTrojan): Data poisoning associates a clean image with a malicious jailbreak prompt as the caption during fine-tuning. The contaminated datapoint acts as a durable “trigger,” later allowing the image to bypass restrictions in downstream use (Tao et al., 5 Mar 2024).
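
As a toy illustration of the ImgTrojan-style poisoning just described, the following sketch builds a visual-instruction-tuning record whose caption has been swapped for a jailbreak prompt. The field names, the `poison_caption_dataset` helper, and the default poison ratio are illustrative assumptions, not the authors' released code.

```python
import random

def poison_caption_dataset(records, jailbreak_prompt, poison_ratio=0.0001):
    """Replace the captions of a tiny fraction of (image, caption) records with a
    jailbreak prompt, so the paired images act as durable triggers after fine-tuning.

    `records` is a list of dicts such as {"image": "cat.jpg", "caption": "a cat"}.
    """
    poisoned = [dict(r) for r in records]                 # copy; leave originals intact
    n_poison = max(1, int(len(poisoned) * poison_ratio))  # e.g. 0.01% of the dataset
    for idx in random.sample(range(len(poisoned)), n_poison):
        poisoned[idx]["caption"] = jailbreak_prompt       # benign image, malicious "caption"
    return poisoned
```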

The fundamental mechanism exploits the cross-modal fusion within MLLMs, causing misaligned, unsafe, or policy-violating responses in settings where normal content filters would block requests.

2. Optimization and Attack Methodologies

The construction and deployment of imgJPs leverage diverse algorithmic strategies, each aimed at maximizing the attack success rate (ASR) and, in advanced settings, the malicious intent fulfillment rate (MIFR):

  • Gradient-Based Maximum Likelihood: As in (Niu et al., 4 Feb 2024), the imgJP is optimized via Projected Gradient Descent (PGD) to maximize the likelihood that, given any harmful query $q_i$, the model produces a target harmful answer $a_i$ (a minimal sketch follows this list):

$$\max_{x_{\text{jp}}} \sum_{i=0}^{M} \log p(a_i \mid q_i, x_{\text{jp}})$$

with $x_{\text{jp}} \in [0, 255]^d$ constrained by pixel bounds.

  • Data Poisoning (ImgTrojan): This approach injects a trojan image/text pair into the training set, where the benign image is associated with a malicious jailbreaking prompt, enabling later inference-time attacks with that poisoned image—even at minuscule poison ratios (sometimes as low as 0.01%) (Tao et al., 5 Mar 2024).
  • Compositional and Programmatic Reasoning: Recent frameworks (e.g., PRISM (Zou et al., 29 Jul 2025)) decompose a complex harmful instruction $\mathcal{H}$ into a sequence of benign “visual gadgets,” each processed by the model as harmless. The attack is realized by a control-flow textual prompt that orchestrates the model’s reasoning across these gadgets, triggering the synthesis of a coherent harmful output only in aggregate.
  • Steganographic and Concealed Attacks: Implicit jailbreaks embed adversarial instructions in the least significant bits (LSB) of image pixels, coupling the image with innocuous textual prompts. This exploits cross-modal reasoning: the overt text is benign, while the image stealthily delivers the intended harmful signal, decoded and actioned during inference (Wang et al., 22 May 2025). The embedding rule, where $m_t$ denotes the $t$-th bit of the hidden message, is (see the sketch after this list):

$$I'(h, w, c) = \bigl(I(h, w, c) \,\&\, \texttt{11111110}\bigr) \,\vert\, m_t$$

  • Decoupled Co-Optimization (JPS): Visual perturbation and textual steering are separately optimized and then iteratively fused. The adversarial image is refined to maximize bypass potential, while the “steering” text is refined by a multi-agent system to ensure instruction following and content harmfulness (Chen et al., 7 Aug 2025).
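
A minimal sketch of the gradient-based maximum-likelihood attack from the first bullet, in PyTorch style. The `model.log_likelihood(query, answer, image)` interface and all hyperparameters are illustrative assumptions rather than any paper's released implementation; the loop simply ascends the summed log-likelihood and projects back onto the valid pixel range.

```python
import torch

def optimize_imgjp(model, queries, target_answers, x_init, steps=500, alpha=1.0):
    """PGD-style maximization of sum_i log p(a_i | q_i, x_jp) over pixel values.

    `x_init` is a float tensor with values in [0, 255]; `model.log_likelihood`
    is assumed to return a differentiable scalar log-probability.
    """
    x_jp = x_init.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = sum(
            model.log_likelihood(q, a, x_jp)
            for q, a in zip(queries, target_answers)
        )
        loss.backward()
        with torch.no_grad():
            x_jp += alpha * x_jp.grad.sign()   # ascent step on the likelihood
            x_jp.clamp_(0.0, 255.0)            # project back to valid pixel bounds
            x_jp.grad.zero_()
    return x_jp.detach()
```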

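The LSB embedding rule above can be made concrete with a small NumPy sketch. This is a generic illustration of least-significant-bit steganography under the stated bitmask, not the exact encoder of (Wang et al., 22 May 2025); `embed_lsb` and `extract_lsb` are assumed helper names.

```python
import numpy as np

def embed_lsb(image, message_bits):
    """Write one hidden bit into the least significant bit of each pixel channel:
    I'(h, w, c) = (I(h, w, c) & 0b11111110) | m_t.

    `image` is a uint8 array of shape (H, W, C); `message_bits` holds 0/1 values.
    """
    flat = image.reshape(-1).copy()
    bits = np.asarray(message_bits, dtype=np.uint8)
    if bits.size > flat.size:
        raise ValueError("message does not fit in the image")
    flat[:bits.size] = (flat[:bits.size] & 0b11111110) | bits
    return flat.reshape(image.shape)

def extract_lsb(stego_image, n_bits):
    """Recover the first n_bits hidden bits from a stego image."""
    return (stego_image.reshape(-1)[:n_bits] & 1).tolist()
```
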
3. Universality, Transferability, and Robustness

Empirical results across several studies underscore the universal and transferable nature of imgJP attacks:

  • Data-Universal Attacks: A single crafted imgJP (or a universal perturbation $\delta_{\text{JP}}$) generalizes to bypass safety for a wide variety of unseen queries and images, exhibiting both prompt-universality and image-universality (Niu et al., 4 Feb 2024).
  • Model-Transferability: imgJPs optimized on one MLLM or VLM (often termed a surrogate) are shown to be effective against other, possibly black-box, target models. This includes cross-architecture transfer (MiniGPT → LLaVA, InstructBLIP, mPLUG-Owl2) and ensemble-based optimization for improved robustness (Niu et al., 4 Feb 2024, Ying et al., 6 Jun 2024); a minimal ensemble sketch appears after the table below.
  • Black-Box Feasibility: Many imgJP schemes, including those involving surreptitious steganographic embedding (Wang et al., 22 May 2025) or RL-based prompt query optimization (Chen et al., 25 May 2025), achieve high attack success rates on commercial, closed-source systems (e.g., GPT-4o, Gemini-1.5 Pro) with minimal queries.

| Attack Category | Key Optimization Target | Noted Transferability |
|---|---|---|
| Gradient-Based imgJP | Maximum likelihood of target harmful output | High (across models/prompts) |
| Data Poisoning (ImgTrojan) | Training-set mapping (img, JBP) | Universal if trigger reused |
| ROP/Compositional (PRISM) | Sequential reasoning/composition | Succeeds on reasoning-based models |
| Steganographic (IJA) | Concealed embedding + template optimization | High (cross-architecture transfer) |
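
As referenced in the model-transferability bullet, one way to improve transfer is to optimize a single imgJP against several white-box surrogates at once. The sketch below simply averages the attack objective over an ensemble and can be dropped into the PGD loop from Section 2; the `log_likelihood` interface is the same illustrative assumption as before.

```python
def ensemble_attack_loss(surrogates, queries, target_answers, x_jp):
    """Average the imgJP objective over multiple surrogate MLLMs so the resulting
    perturbation is less tied to any single architecture, which empirically
    improves transfer to unseen (including black-box) targets."""
    total = 0.0
    for model in surrogates:
        total = total + sum(
            model.log_likelihood(q, a, x_jp)
            for q, a in zip(queries, target_answers)
        )
    return total / len(surrogates)
```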

4. Synergistic and Cross-Modal Attacks

A prominent evolution involves mutually reinforcing attacks across visual and textual modalities:

  • Bi-Modal Adversarial Prompting (BAP): By jointly optimizing both an adversarial image and its paired textual prompt, using chain-of-thought reasoning to iteratively refine the latter, attackers achieve substantially higher ASR, especially against models that fuse their cross-modal embeddings for safety (Ying et al., 6 Jun 2024); a schematic sketch of this alternation follows the list.
  • Hybrid Benign-to-Toxic Jailbreaks: Recent innovations optimize images to trigger harmful output even when paired with fully benign prompts. This Benign-to-Toxic (B2T) paradigm reveals emergent cross-modal vulnerabilities and can further amplify attack success when combined with text-derived triggers (Kim et al., 26 May 2025).
  • Dynamic Prompts & Indicator Injection: Approaches such as GhostPrompt employ RL-based dynamic insertion of “safety indicators” (logos, QR codes) to help the adversarial text prompt bypass image-level moderation, while the text undergoes optimization to ensure semantic alignment and filter circumvention (Chen et al., 25 May 2025).
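
A schematic sketch of the bi-modal alternation described above. Everything here is an assumed interface (an `image_step` gradient update, a `refine_text` LLM rewrite, and a `judge` feedback function); it only shows the control flow of jointly optimizing the two modalities, not any specific paper's implementation.

```python
def bimodal_attack(image_step, refine_text, judge, image, text, rounds=10):
    """Alternate between perturbing the adversarial image and rewriting the
    paired text prompt (BAP-style), so each modality reinforces the other."""
    for _ in range(rounds):
        image = image_step(image, text)      # visual perturbation given the current text
        feedback = judge(image, text)        # did the target refuse? why?
        text = refine_text(text, feedback)   # chain-of-thought rewrite of the text prompt
    return image, text
```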

5. Evaluation Metrics and Experimental Outcomes

The efficacy of imgJP methodologies is rigorously quantified using both legacy and novel metrics:

  • Attack Success Rate (ASR): Fraction of harmful or adversarial queries that elicit a non-safety-filtered, policy-violating output. Studies consistently report substantial increases (up to +29% in BAP (Ying et al., 6 Jun 2024), >90% in IJA (Wang et al., 22 May 2025), and up to 99% in GhostPrompt (Chen et al., 25 May 2025)).
  • Malicious Intent Fulfillment Rate (MIFR): Introduced by JPS (Chen et al., 7 Aug 2025), this metric moves beyond filter evasion, quantifying the fraction of responses that actually realize the core, actionable intent of the harmful query. It is evaluated with reasoning-LLM-based analyses, which highlight that many prior methods elicited non-refusal responses with limited substantive compliance.
  • Transfer Rates and Stealthiness: Additional measures include robustness to post-training patching, evasion of clustering and similarity analyses (e.g., with CLIP or moderation classifiers), and stealthy behavior on benign tasks (e.g., undetectability by typical CLIP-based similarity checks (Tao et al., 5 Mar 2024)).

| Metric | Description | Observed Range in imgJP Literature |
|---|---|---|
| ASR | Harmful output after attack | Up to 99% (modern frameworks) |
| MIFR | Fulfillment of intended harm | 86.5% in SOTA attacks (JPS (Chen et al., 7 Aug 2025)) |
| Prompt Stealth | Bypass without utility regression | Minimal loss on captioning metrics |
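
The distinction between the two headline metrics can be made concrete with a small sketch. The judge outputs are assumptions (boolean flags from a safety classifier and from a reasoning-LLM judge, respectively); there is no single standardized protocol for producing them.

```python
def attack_success_rate(violation_flags):
    """ASR: fraction of attacked queries whose response is judged policy-violating
    (i.e., not refused and not caught by the safety filter)."""
    return sum(violation_flags) / len(violation_flags)

def malicious_intent_fulfillment_rate(fulfillment_flags):
    """MIFR: stricter metric, the fraction of responses that actually carry out the
    harmful request. A vague non-refusal counts toward ASR but not toward MIFR."""
    return sum(fulfillment_flags) / len(fulfillment_flags)
```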

6. Security, Ethical, and Societal Implications

The development and empirical validation of imgJP attack methodologies reveal fundamental gaps in contemporary alignment and safety architectures:

  • Expanded Attack Surface: The inclusion of visual modalities increases complexity, with attacks exploiting the interplay between vision and language encoders to circumvent otherwise robust filter systems (Niu et al., 4 Feb 2024, Tao et al., 5 Mar 2024).
  • Stealth and Scalability: Approaches employing imperceptible image-space perturbations and training set poisoning can scale adversarial exposure—hijacking benign user prompts and misleading both users and operators (Chen et al., 8 Apr 2025).
  • Robustness Challenges: Universal, transferable attacks demonstrate that piecemeal fixes (e.g., CLIP similarity checks, defensive prompt patches) are insufficient, necessitating advances in multi-modal alignment, steganalysis, adversarial training, and chain-of-thought monitoring.
  • Practical Threats: The success of low-effort, language-only prompt strategies underscores the urgency for context- and intent-aware moderation that tracks narrative context, implication, and cross-modal interactions (Mustafa et al., 29 Jul 2025).
  • Responsible Disclosure: Numerous studies caution that offensive content can be generated, with benchmark datasets (e.g., UTCB (Nair et al., 7 May 2025)) and adversarial prompt corpora released only under controlled access or with sensitive outcomes explicitly redacted from filenames.

7. Emerging Defenses and Future Directions

Research efforts have begun to address these vulnerabilities with a spectrum of defense mechanisms:

  • Prompt and Output Filtering: Keyword, semantic, and cross-modal matching filters; however, these are frequently bypassed by lexical camouflage, steganography, or compositional attacks (Tao et al., 5 Mar 2024, Nair et al., 7 May 2025).
  • Chain-of-Thought and Internal Simulation Defenses: “Think Twice Prompting” and its variants prompt the model to internally describe and evaluate its planned output for safety before final generation, achieving up to a 97% reduction in attack success (Wang et al., 4 Oct 2024); a rough sketch follows this list.
  • Adversarial Training & Robust Encoding: Hardening the feature extractors (e.g., CLIP via FARE adversarial training) reduces the efficacy of hijacking attacks that exploit image encoders underlying IP-Adapters (Chen et al., 8 Apr 2025).
  • Dynamic, Multimodal Verification: GhostPrompt and JPS highlight the need for dynamic, feedback-driven defense layers that adapt to both textual and image-side adversarial signals (Chen et al., 25 May 2025, Chen et al., 7 Aug 2025).
  • Benchmarking and Red-Teaming: Datasets such as UTCB and CoJ-Bench support the systematic evaluation and ongoing improvement of safety and alignment systems against adversarially generated imgJPs (Nair et al., 7 May 2025, Wang et al., 4 Oct 2024).
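
A rough sketch of the “Think Twice Prompting” idea from the chain-of-thought defense bullet. The prompt wording and the `chat(prompt, image=None)` callable are assumptions, not the authors' exact templates; the point is the two-stage describe-then-evaluate structure before answering.

```python
def think_twice(chat, user_prompt, image):
    """Two-stage defense: have the model summarize and safety-check its planned
    answer to a multimodal request before it is allowed to respond."""
    plan = chat(
        "Before answering, describe in one paragraph what your response to the "
        "following request (and attached image) would contain. Do not answer yet.\n\n"
        f"Request: {user_prompt}",
        image=image,
    )
    verdict = chat(
        "Would producing the response described below violate safety policy? "
        "Answer SAFE or UNSAFE.\n\n" + plan
    )
    if "UNSAFE" in verdict.upper():
        return "I can't help with that request."
    return chat(user_prompt, image=image)
```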

Overall, imgJP research exposes latent, multimodal vulnerabilities in state-of-the-art generative models and underlines the need for holistic, context-aware, and chain-of-reasoning–aware defense architectures capable of countering sophisticated, evolving, and multimodal adversarial strategies.