Adversarial Prompt Injection Attack on Multimodal Large Language Models

Published 31 Mar 2026 in cs.CV and cs.AI | (2603.29418v1)

Abstract: Although multimodal LLMs (MLLMs) are increasingly deployed in real-world applications, their instruction-following behavior leaves them vulnerable to prompt injection attacks. Existing prompt injection methods predominantly rely on textual prompts or perceptible visual prompts that are observable by human users. In this work, we study imperceptible visual prompt injection against powerful closed-source MLLMs, where adversarial instructions are embedded in the visual modality. Our method adaptively embeds the malicious prompt into the input image via a bounded text overlay to provide semantic guidance. Meanwhile, the imperceptible visual perturbation is iteratively optimized to align the feature representation of the attacked image with those of the malicious visual and textual targets at both coarse- and fine-grained levels. Specifically, the visual target is instantiated as a text-rendered image and progressively refined during optimization to more faithfully represent the desired semantics and improve transferability. Extensive experiments on two multimodal understanding tasks across multiple closed-source MLLMs demonstrate the superior performance of our approach compared to existing methods.


Authors (4)

Summary

The paper introduces CoTTA, a framework that achieves imperceptible adversarial prompt injection through a covert textual trigger combined with bounded perturbations.
It fuses learnable textual overlays with dynamic dual-target alignment to significantly improve attack success rates and semantic precision.
Experimental results demonstrate that CoTTA outperforms existing methods with up to 81% ASR and notable gains in AvgSim on state-of-the-art MLLMs.

Adversarial Prompt Injection Attacks on Multimodal LLMs: CoTTA Framework and Empirical Analysis

Introduction

This paper studies imperceptible adversarial prompt injection (API) attacks targeting multimodal LLMs (MLLMs). The key idea is to formulate API attacks that operate through the image modality: the injected prompt is hidden by bounded, covert perturbations crafted to induce specific, attacker-chosen outputs while remaining visually inconspicuous. Existing MLLM attacks rely either on adversarial image perturbations, which offer limited control over the output, or on typographic visual prompt injection, which human users can detect. This work combines high output controllability with visual stealth by proposing the Covert Triggered dual-Target Attack (CoTTA), which adaptively fuses an invisible textual trigger with adversarial perturbations and leverages dynamic dual-target cross-modal representation alignment.

Figure 1: Illustration of adversarial prompt injection attacks, where the adversary manipulates the behavior of MLLMs through imperceptible visual prompt injection.

Attack Paradigm: CoTTA

CoTTA is instantiated as a targeted black-box adversarial prompt injection pipeline. Its main innovation is the integration of two mechanisms: a learnable covert textual overlay, which serves as a stealthy prompt injection vector, and bounded adversarial noise added to the RGB input image.

Figure 2: Overview of the CoTTA architecture, showing the adaptive covert trigger and adversarial perturbation components being jointly optimized to align the manipulated image with both the target text and a dynamic target image representation.

The covert trigger is realized as a center-anchored textual mask (rendered from the target prompt), whose scale and rotation are optimized. This mask is embedded into the image prior to adversarial noise addition. Adversarial perturbations are then iteratively computed to further move the image’s multimodal representation toward the joint textual and visual semantic target, enabling precise targeted adversarial control.

A dynamic dual-target alignment scheme is introduced: feature objectives are defined both with respect to the desired textual embedding and an adaptive visual target. The visual target, initialized as a rendered image of the malicious prompt, is itself refined via an I-FGSM-based iterative process, and its representation is simultaneously encouraged to match the target text and remain distinct from the current adversarial image. Alignment is performed at both the global feature and local token levels to enhance transferability and fine-grained semantic steering.
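The two alignment levels described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the best-match rule for local tokens, and the equal weighting of the two targets are all assumptions made for illustration, and the feature vectors are plain Python lists standing in for encoder outputs.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def global_alignment(adv_global, text_target, image_target):
    """Average similarity of the adversarial global feature to both targets.

    Equal weighting of the textual and visual targets is an assumption.
    """
    return 0.5 * (cosine(adv_global, text_target) + cosine(adv_global, image_target))

def local_alignment(adv_tokens, target_tokens):
    """Mean, over adversarial tokens, of the best cosine match among target tokens.

    The best-match rule is a hypothetical choice for token-level alignment.
    """
    best = [max(cosine(a, t) for t in target_tokens) for a in adv_tokens]
    return sum(best) / len(best)

def alignment_loss(adv_global, adv_tokens, text_target, image_target, image_tokens):
    """Lower is better; 0 when global and local features match the targets exactly."""
    g = global_alignment(adv_global, text_target, image_target)
    l = local_alignment(adv_tokens, image_tokens)
    return (1.0 - g) + (1.0 - l)
```

In this toy form the loss ranges from 0 (perfect agreement at both levels) to 2 when the adversarial features are orthogonal to the targets.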

Experimental Evaluation

Setup

Experiments are performed against four state-of-the-art closed-source MLLMs (GPT-4o, GPT-5, Gemini-2.5, Claude-4.5) in pure black-box mode. The two benchmark tasks are image captioning and visual question answering (VQA); metrics include attack success rate (ASR) and average semantic similarity (AvgSim) with respect to attacker-specified targets. Both soft (reference caption match) and hard (string-exact malicious prompt match) criteria are computed, using the LLM-as-a-judge methodology.
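For readers unfamiliar with the two metrics, the following sketch shows how ASR and AvgSim could be aggregated once per-sample results are available. How the judge model produces its verdicts and how the similarity scores are computed are not specified in this summary, so the inputs here (a list of booleans and a list of floats) are hypothetical.

```python
def attack_success_rate(verdicts):
    """Percentage of samples the judge marked as successful attacks.

    `verdicts` is a list of booleans, one per attacked image (assumed format).
    """
    return 100.0 * sum(1 for v in verdicts if v) / len(verdicts)

def average_similarity(scores):
    """Mean semantic similarity between model outputs and attacker targets.

    `scores` is a list of floats, one per attacked image (assumed format).
    """
    return sum(scores) / len(scores)

# Toy example: three of four attacks judged successful.
print(attack_success_rate([True, True, False, True]))  # 75.0
```

The soft and hard criteria would each yield their own verdict list, so both functions are simply applied once per criterion.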

Main Results

CoTTA achieves substantially higher ASR and semantic precision than all competing baselines, especially against the GPT-family and Gemini-2.5 models. For example, on GPT-4o image captioning, CoTTA achieves 81% ASR/0.442 AvgSim (soft) and 74% ASR/0.339 AvgSim (hard); this is a 31.5% ASR and 0.18 AvgSim improvement over FOA-Attack. On Gemini-2.5, ASRs reach 81% (hard) and 79% (soft). Claude-4.5 is the most robust model overall, but CoTTA still attacks it more successfully than any baseline.

Sample comparisons (Figure 3) demonstrate that CoTTA induces near-imperceptible modifications yet reliably implants coercive prompts. MLLMs output not just semantically similar but often exact adversary-desired strings under non-cooperative user prompts.

Figure 3: Visualization of adversarial examples and perturbations: CoTTA modifications remain visually subtle while other attack patterns are more noticeable and less effective.

On VQA, CoTTA achieves 82% ASR/0.820 AvgSim (GPT-4o) and 79% ASR/0.787 AvgSim (Gemini-2.5) under the hard criterion, well above the best prior methods.

A large-scale experiment over 1000 images further confirms stability: CoTTA outperforms FOA-Attack by 23.27% ASR and 0.136 AvgSim on average across evaluated MLLMs.

Ablation Studies

An ablation of CoTTA's components shows that each mechanism contributes to state-of-the-art performance. Removing image-to-image alignment causes the most severe ASR drop (54% under the soft criterion), followed by removing the covert trigger (15% drop). Maintaining a dynamic target image and dual alignment objectives proves necessary for both attack transferability and precise expression of the injected command.

Figure 4: Ablation on the weight coefficients $\lambda_{pull}$ and $\lambda_{push}$: optimal performance is achieved for $\lambda_{pull}=1$ and $\lambda_{push}=0.5$.

Hyperparameter sweeps over the pull and push loss weights identify the setting that best balances targeted injection success against cross-modal transfer.
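As a reading aid, one plausible form of the weighted objective for refining the visual target is written out here. The summary does not give the exact loss, so everything in this expression is an assumption inferred from the description of the dual-target scheme: the distance $d$, the image and text encoders $f_I$ and $f_T$, the symbols $x_{tgt}$ (visual target), $x_{adv}$ (current adversarial image), and $t$ (target text), and the assignment of $\lambda_{pull}$ to the text-matching term and $\lambda_{push}$ to the separation term.

```latex
\mathcal{L}_{target}(x_{tgt})
  = \lambda_{pull}\, d\big(f_I(x_{tgt}),\, f_T(t)\big)
  - \lambda_{push}\, d\big(f_I(x_{tgt}),\, f_I(x_{adv})\big)
```

Under this reading, minimizing the loss with $\lambda_{pull}=1$ and $\lambda_{push}=0.5$ weights agreement with the target text twice as heavily as separation from the adversarial image.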

Theoretical and Practical Implications

This work demonstrates that covert, imperceptible prompt injection can be achieved via a direct image channel, even for black-box closed-source MLLMs with alignment safeguards. The empirical results expose a significant vulnerability: malicious commands unobservable to human inspection can reliably hijack downstream model outputs. By designing a semantic bridge in the form of a dynamic covert trigger and cross-modal dual-target optimization, the attack attains high output fidelity, undermining operational defenses premised on visual transparency or text filtering.

CoTTA blurs the distinction between adversarial image perturbation and prompt injection: attackers gain arbitrary expressivity (up to the granularity of the prompt) within a bounded $\ell_\infty$ budget, which the summary describes as a previously unachieved goal for MLLM attacks.
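To make the notion of a bounded $\ell_\infty$ budget concrete, the sketch below shows the standard projection that keeps a perturbed image within a fixed per-pixel distance of the original. The function names, the flat-list image representation, the [0, 1] pixel range, and the example budget are illustrative assumptions; the summary does not state the budget used in the paper.

```python
def project_linf(adv, clean, epsilon):
    """Clamp each value of `adv` to within `epsilon` of `clean`, then to [0, 1]."""
    projected = []
    for a, c in zip(adv, clean):
        a = max(c - epsilon, min(c + epsilon, a))  # enforce the l-infinity budget
        projected.append(max(0.0, min(1.0, a)))    # keep a valid pixel value
    return projected

def linf_distance(x, y):
    """Largest absolute per-element difference between two images."""
    return max(abs(a - b) for a, b in zip(x, y))

# Toy example with a hypothetical budget of 0.1 on three pixel values.
clean = [0.5, 0.5, 0.99]
adv = project_linf([0.9, 0.45, 1.2], clean, epsilon=0.1)
assert linf_distance(adv, clean) <= 0.1 + 1e-12
```

A small budget is what makes the modification hard to see: no pixel may move further from its original value than `epsilon`, regardless of how many optimization steps are taken.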

Future Directions

The work opens several avenues:

Adaptive defenses: Existing input filtering mechanisms and adversarial detectors may fail against such covert triggers; robust training or online detection must become cross-modal and semantically sensitive.
Attack generalization: Although demonstrated on specific commercial MLLMs, the methodology may extend to other modalities (e.g., audio) and to system-level MLLM integrations.
Task-specific manipulations: Beyond captioning and VQA, precise control attacks can target embodied MLLM agents, information retrieval, or decision-making tasks.
Bootstrapped API attacks: CoTTA’s iterative visual target optimization suggests that richer paraphrase ensembles and reinforcement across token-level semantics can yield even more reliable prompt hijacking.

Conclusion

The CoTTA framework sets a new standard for adversarial prompt injection against MLLMs, combining imperceptibility with precise, instruction-level control without model internals. Through covert triggers, dual-target alignment, and dynamic supervision, CoTTA demonstrates that current MLLMs are broadly susceptible to API attacks under realistic threat models. These findings underscore both urgent security risks in current deployments and the necessity for robust, semantically aware multimodal safety architectures.

Reference:

"Adversarial Prompt Injection Attack on Multimodal LLMs" (2603.29418)
