- The paper introduces CoTTA, a framework that achieves imperceptible adversarial prompt injection through a covert textual trigger combined with bounded perturbations.
- It fuses learnable textual overlays with dynamic dual-target alignment to significantly improve attack success rates and semantic precision.
- Experimental results demonstrate that CoTTA outperforms existing methods with up to 81% ASR and notable gains in AvgSim on state-of-the-art MLLMs.
Adversarial Prompt Injection Attacks on Multimodal LLMs: CoTTA Framework and Empirical Analysis
Introduction
This paper introduces a rigorous study of imperceptible adversarial prompt injection (API) attacks targeting multimodal LLMs (MLLMs). The critical insight is the formulation of APIs that operate via the image modality, with injected prompts hidden through bounded, covert perturbations crafted to induce specific, attacker-chosen outputs while remaining visually inconspicuous. Existing MLLM attacks focus either on adversarial image perturbations with limited output controllability or typographic visual prompt injection that is detectable by human users. This work unifies high output controllability and visual stealth by proposing the Covert Triggered dual-Target Attack (CoTTA), which adaptively fuses an invisible textual trigger with adversarial perturbations and leverages dynamic dual-target cross-modal representation alignment.
Figure 1: Illustration of adversarial prompt injection attacks, where the adversary manipulates the behavior of MLLMs through imperceptible visual prompt injection.
Attack Paradigm: CoTTA
CoTTA is instantiated as a targeted black-box adversarial prompt injection pipeline. Its main innovation is the integration of two mechanisms: a learnable covert textual overlay (serving as a stealth prompt injection vector), and bounded adversarial noise to the RGB input image.
Figure 2: Overview of the CoTTA architecture, showing the adaptive covert trigger and adversarial perturbation components being jointly optimized to align the manipulated image with both the target text and a dynamic target image representation.
The covert trigger is realized as a center-anchored textual mask (rendered from the target prompt), whose scale and rotation are optimized. This mask is embedded into the image prior to adversarial noise addition. Adversarial perturbations are then iteratively computed to further move the image’s multimodal representation toward the joint textual and visual semantic target, enabling precise targeted adversarial control.
A dynamic dual-target alignment scheme is introduced: feature objectives are defined both with respect to the desired textual embedding and an adaptive visual target. The visual target, initialized as a rendered image of the malicious prompt, is itself refined via an I-FGSM-based iterative process, and its representation is simultaneously encouraged to match the target text and remain distinct from the current adversarial image. Alignment is performed at both the global feature and local token levels to enhance transferability and fine-grained semantic steering.
Experimental Evaluation
Setup
Experiments are performed against four state-of-the-art closed-source MLLMs (GPT-4o, GPT-5, Gemini-2.5, Claude-4.5) in pure black-box mode. The two benchmark tasks are image captioning and visual question answering (VQA); metrics include attack success rate (ASR) and average semantic similarity (AvgSim) with respect to attacker-specified targets. Both soft (reference caption match) and hard (string-exact malicious prompt match) criteria are computed, using the LLM-as-a-judge methodology.
Main Results
CoTTA exhibits substantially increased ASR and semantic precision over all competitive baselines, especially against GPT-family and Gemini-2.5 models. For example, on GPT-4o image captioning, CoTTA achieves 81% ASR/0.442 AvgSim (soft) and 74% ASR/0.339 AvgSim (hard); this is a 31.5% ASR and 0.18 AvgSim improvement over FOA-Attack. On Gemini-2.5, ASRs reach 81% (hard) and 79% (soft). Claude-4.5, while most robust overall, is still best attacked by CoTTA.
Sample comparisons (Figure 3) demonstrate that CoTTA induces near-imperceptible modifications yet reliably implants coercive prompts. MLLMs output not just semantically similar but often exact adversary-desired strings under non-cooperative user prompts.
Figure 3: Visualization of adversarial examples and perturbations: CoTTA modifications remain visually subtle while other attack patterns are more noticeable and less effective.
On VQA, CoTTA achieves 82% ASR/0.820 AvgSim (GPT-4o) and 79% ASR/0.787 AvgSim (Gemini-2.5) under the hard criterion—well above the best prior methods.
A large-scale experiment over 1000 images further confirms stability: CoTTA outperforms FOA-Attack by 23.27% ASR and 0.136 AvgSim on average across evaluated MLLMs.
Ablation Studies
An ablation of CoTTA's components reveals that every mechanism is indispensable for state-of-the-art performance. Removing image-to-image alignment causes the most severe ASR drop (54% under soft criterion), followed by absence of the covert trigger (15% drop). Maintaining a dynamic target image and dual alignment objectives proves necessary for both attack transferability and precise command expressivity.
Figure 4: Ablation on the weight coefficients λpull​ and λpush​: optimal performance is achieved for λpull​=1 and λpush​=0.5.
Hyperparameter sweeps identify that loss weighting for pull/push objectives optimally balances successful targeted injection and cross-modal transfer.
Theoretical and Practical Implications
This work demonstrates that covert, imperceptible prompt injection can be achieved via a direct image channel, even for black-box closed-source MLLMs with alignment safeguards. The empirical results expose a significant vulnerability: malicious commands unobservable to human inspection can reliably hijack downstream model outputs. By designing a semantic bridge in the form of a dynamic covert trigger and cross-modal dual-target optimization, the attack attains high output fidelity, undermining operational defenses premised on visual transparency or text filtering.
The distinction between adversarial image perturbation and prompt injection is dissolved; through CoTTA, attackers gain arbitrary expressivity (up to the granularity of the prompt) within a bounded ℓ∞​ budget—a previously unachieved goal for MLLM attacks.
Future Directions
Several avenues are opened:
- Adaptive defenses: Existing input filtering mechanisms and adversarial detectors may fail against such covert triggers; robust training or online detection must become cross-modal and semantically sensitive.
- Attack generalization: While demonstrated on specific commercial MLLMs, the methodology is extensible to other modalities (e.g., audio) and system-level LMM integrations.
- Task-specific manipulations: Beyond captioning and VQA, precise control attacks can target embodied MLLM agents, information retrieval, or decision-making tasks.
- Bootstrapped API attacks: CoTTA’s iterative visual target optimization suggests that richer paraphrase ensembles and reinforcement across token-level semantics can yield even more reliable prompt hijacking.
Conclusion
The CoTTA framework sets a new standard for adversarial prompt injection against MLLMs, combining imperceptibility with precise, instruction-level control without model internals. Through covert triggers, dual-target alignment, and dynamic supervision, CoTTA demonstrates that current MLLMs are broadly susceptible to API attacks under realistic threat models. These findings underscore both urgent security risks in current deployments and the necessity for robust, semantically aware multimodal safety architectures.
Reference:
"Adversarial Prompt Injection Attack on Multimodal LLMs" (2603.29418)