A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Published 15 May 2026 in cs.CR and cs.CV | (2605.16090v1)

Abstract: Large vision-LLMs (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model's interpretation of that singular input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack CrossMPI, which can steer the model's interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (for multimodal information integration and with $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, deviating from the past experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than the last. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a novel attack that uses imperceptible image perturbations to hijack task execution in LVLMs while keeping text prompts benign.
It employs fusion-critical layer selection and importance-aware masking to inject malicious semantics during modality integration.
Empirical evaluations reveal high attack success rates in both white-box and black-box settings, highlighting significant security vulnerabilities.

Introduction

The proliferation of Large Vision-LLMs (LVLMs) has enabled complex multi-modal reasoning across visual and textual inputs. However, with the increase of their deployment, the attack surface has dramatically expanded, particularly in the domain of prompt injection where models can be manipulated to follow adversarial instructions. "A Cross-Modal Prompt Injection Attack against Large Vision-LLMs with Image-Only Perturbation" (2605.16090) introduces a fundamentally new paradigm: robust cross-modal prompt injection attacks that modify only the image, yet enable hijacking task execution even when the accompanying text prompt remains entirely benign.

This essay provides a systematic analysis of the paper's methodology, novel technical insights, empirical evaluation, and the implications such attacks pose for both vision-language security and future AI design.

Figure 1: Comparison of LVLM prompt injection paradigms and the new cross-modal attack: (A) image-only perceptual attacks (no downstream task hijack), (B) multimodal attacks with explicit text payloads, (C) cross-modal prompt injection where only the image is perturbed to force execution of a new task under a benign textual prompt.

Traditional prompt injection either targets text prompts (in LLMs) or both modalities (by simultaneously manipulating text and image in LVLMs). This work departs from that dichotomy by asking: can visually imperceptible perturbations to only the image channel rewrite the joint task grounding so that an LVLM executes a malicious, attacker-chosen task under a benign user query?

Formally, given a fixed, benign text prompt $x_p$ and an original image $x_v$ , the adversary aims to construct an imperceptible perturbation $\Delta$ such that the input $(x_p, x_v + \Delta)$ compels the model to behave as though it received $(x_p^t, x_v^t)$ (the attacker's favored prompt-image pair), and thus outputs a target response $y_t$ . This fundamentally differs from adversarial image attacks (which only alter recognition) or straightforward prompt injection (which is most effective with text manipulation). Here, the image perturbs the task-level semantics, steering the entire multi-modal interpretation.

Methodology: Model Space and Image Space Reduction

The attack pipeline introduces several methodological innovations to make image-only cross-modal hijacking tractable and effective in large-scale LVLMs:

1. Fusion-Critical Layer Selection

Model parameter space is immense; naive optimization over all possible hidden representations is computationally impractical and induces poor convergence. The authors identify, through exhaustive probing experiments across variants of prompts and tasks, that multimodal fusion—where visual and textual cues integrate—is not in the terminal layers (contrary to typical LLM tuning practice), but rather in mid-depth transformer layers. By localizing the attack to these "fusion-critical" layers, the adversary can surgically inject task semantics at the precise stage of modality binding, improving both optimization efficiency and behavioral control.

2. Importance- and Distance-Aware Image Masking

To further reduce complexity and maximize semantic impact, the authors compute a Grad-ECLIP-based importance map of the image, assigning amplified perturbation budgets to spatially-central, semantically-critical regions (weighted by both saliency and spatial proximity to a semantic centroid). This ensures the majority of the adversarial "budget" is concentrated where it is most influential for model reasoning, rather than wasting capacity on task-irrelevant backgrounds.

3. Joint Output and Hidden-State Supervision

The optimization objective jointly aligns two complementary signals: (i) output-level probability of generating the attacker-specified sequence, and (ii) fusion-critical hidden state alignment to the internal representation produced by the target input. This dual loss ensures both surface-level behavior and deep semantic interpretation are redirected. Additionally, frequency regularization enforces imperceptibility by discouraging visually high-frequency artifacts.

4. Augmentation-Based Robust Optimization

To enhance cross-model and real-world robustness, the perturbation is trained not only on the original image but also on a set of augmented variants (scaling, rotation, brightness, blurring, noise). This anticipates input pre-processing and ensures transferability under realistic post-processing.

Empirical Evaluation

The proposed cross-modal attack is evaluated on MSCOCO, ImageNet, and TextVQA datasets, across five state-of-the-art LVLMs: MiniGPT4-llama2, MiniGPT4-vicuna, InstructBlip, Blip-2, and BLIVA. The adversarial perturbations are crafted on a surrogate model and then transferred to the remaining targets, measuring both white-box (source) and black-box (transfer) ASR (attack success rate) and semantic similarity (SS) to the attacker’s intent.

Quantitative Results

White-Box (source model) ASR: Reached up to 100% on various datasets/models.
Black-Box (transfer) ASR: Up to 90+% on closely related models and strong transfer (54.3% on average) onto previously unseen LVLM architectures.
Semantic Similarity: High cosine similarity scores indicate not just task confusion but high fidelity semantic alignment to the attacker's intended response.

Baseline comparisons show that prior image-only attacks (e.g., CLIP-based or embedding attacks) fail to induce reliable task hijack, and explicit text-in-image prompt attacks are visually conspicuous and less stealthy.

Ablation studies demonstrate:

Fusion-layer targeting is dramatically superior for transferability and task-level impact versus optimization over early (pre-fusion) or tail (generation) transformer layers.
Moderate budget reallocation (λ ≈ 0.3) balances semantic focus without sacrificing attack coverage.
A 16/255 perturbation norm is optimal for imperceptibility/efficacy tradeoff.
Optimizing 3 fusion-critical layers yields best effectiveness—too narrow (1-2 layers) limits influence; too wide degrades stability.

Visual Analysis

Cross-modal attacks preserve high visual fidelity: the adversarial images are near-indistinguishable from originals, whereas embedding-alignment or explicit text-in-image methods introduce blatant perceptual artifacts or text overlays that are detectable upon visual inspection.

Discussion: Implications and Defense

The demonstrated attack exposes a new category of vulnerability: even when system prompts, user queries, and model weights are uncompromised, merely uploading a “contaminated” image can trigger an unmediated shift in model task interpretation. The threat model is highly practical—requiring no access to victim model internals and functioning in full black-box transfer settings.

Defensive techniques, including standard input transformations (randomization, JPEG compression, diffusion restoration) and inference-time safeguarding (SmoothVLM, DPS), offer limited protection. The attack’s focus on semantic anchors and model-level alignment make it robust to such interventions, suggesting a need for fundamentally new defense paradigms. These could include input provenance verification, causal consistency checks, multi-view redundancy, or human-in-the-loop confirmation for high-impact actions—especially pertinent as LVLMs underpin real-world agentic systems.

Theoretical and Practical Implications

This work provides evidence that the model's task understanding is far more malleable at the fusion level than anticipated. It underscores the inadequacy of prompt-based defenses in LVLMs and spotlights a critical blind spot in current system design: attacks can hijack both perception and downstream semantic reasoning via non-text modalities, even if all instructions appear benign.

The insights into fusion-critical layers may further inform both offensive and defensive future research—understanding and monitoring hidden-state dynamics may be necessary for robust security guarantees. Practically, adversaries can weaponize common multimedia channels (e.g., images in web forms, screenshots fed to digital agents) to mount undetected prompt injections.

Future Directions

The transferability and stealth of these attacks motivate deeper studies of:

Multi-turn or session-persistent prompt injection via images.
Black-box attacks against proprietary or online LVLM APIs with agentic capabilities.
Embedding-hidden triggers in more complex handled formats (e.g., PDFs, video streams, webpages).
Defenses leveraging causal anomaly detection in hidden-state trajectories, multi-modal provenance tracing, and cross-channel consistency checking.

Conclusion

The proposed cross-modal prompt injection attack designates a new threat vector for LVLMs: exhaustive task hijacking via minimal, imperceptible perturbation to only the image input. Through principled fusion-layer and semantic-region targeting, the attack bypasses both standard and advanced safeguards, reconfiguring multimodal grounding at a fundamental level. This exposes a critical need for more principled LVLM alignment, capable of defending against semantically sophisticated, modality-agnostic prompt manipulation.

Figure 1: The attack paradigm: only the image is perturbed to hijack a benign textual task prompt, in contrast to prior unimodal and multimodal prompt injection techniques.