- The paper introduces a novel attack that uses imperceptible image perturbations to hijack task execution in LVLMs while keeping text prompts benign.
- It employs fusion-critical layer selection and importance-aware masking to inject malicious semantics during modality integration.
- Empirical evaluations reveal high attack success rates in both white-box and black-box settings, highlighting significant security vulnerabilities.
Cross-Modal Prompt Injection Attacks on LVLMs via Image-Only Perturbations
Introduction
The proliferation of Large Vision-LLMs (LVLMs) has enabled complex multi-modal reasoning across visual and textual inputs. However, with the increase of their deployment, the attack surface has dramatically expanded, particularly in the domain of prompt injection where models can be manipulated to follow adversarial instructions. "A Cross-Modal Prompt Injection Attack against Large Vision-LLMs with Image-Only Perturbation" (2605.16090) introduces a fundamentally new paradigm: robust cross-modal prompt injection attacks that modify only the image, yet enable hijacking task execution even when the accompanying text prompt remains entirely benign.
This essay provides a systematic analysis of the paper's methodology, novel technical insights, empirical evaluation, and the implications such attacks pose for both vision-language security and future AI design.
Figure 1: Comparison of LVLM prompt injection paradigms and the new cross-modal attack: (A) image-only perceptual attacks (no downstream task hijack), (B) multimodal attacks with explicit text payloads, (C) cross-modal prompt injection where only the image is perturbed to force execution of a new task under a benign textual prompt.
Traditional prompt injection either targets text prompts (in LLMs) or both modalities (by simultaneously manipulating text and image in LVLMs). This work departs from that dichotomy by asking: can visually imperceptible perturbations to only the image channel rewrite the joint task grounding so that an LVLM executes a malicious, attacker-chosen task under a benign user query?
Formally, given a fixed, benign text prompt xp and an original image xv, the adversary aims to construct an imperceptible perturbation Δ such that the input (xp,xv+Δ) compels the model to behave as though it received (xpt,xvt) (the attacker's favored prompt-image pair), and thus outputs a target response yt. This fundamentally differs from adversarial image attacks (which only alter recognition) or straightforward prompt injection (which is most effective with text manipulation). Here, the image perturbs the task-level semantics, steering the entire multi-modal interpretation.
Methodology: Model Space and Image Space Reduction
The attack pipeline introduces several methodological innovations to make image-only cross-modal hijacking tractable and effective in large-scale LVLMs:
1. Fusion-Critical Layer Selection
Model parameter space is immense; naive optimization over all possible hidden representations is computationally impractical and induces poor convergence. The authors identify, through exhaustive probing experiments across variants of prompts and tasks, that multimodal fusion—where visual and textual cues integrate—is not in the terminal layers (contrary to typical LLM tuning practice), but rather in mid-depth transformer layers. By localizing the attack to these "fusion-critical" layers, the adversary can surgically inject task semantics at the precise stage of modality binding, improving both optimization efficiency and behavioral control.
2. Importance- and Distance-Aware Image Masking
To further reduce complexity and maximize semantic impact, the authors compute a Grad-ECLIP-based importance map of the image, assigning amplified perturbation budgets to spatially-central, semantically-critical regions (weighted by both saliency and spatial proximity to a semantic centroid). This ensures the majority of the adversarial "budget" is concentrated where it is most influential for model reasoning, rather than wasting capacity on task-irrelevant backgrounds.
3. Joint Output and Hidden-State Supervision
The optimization objective jointly aligns two complementary signals: (i) output-level probability of generating the attacker-specified sequence, and (ii) fusion-critical hidden state alignment to the internal representation produced by the target input. This dual loss ensures both surface-level behavior and deep semantic interpretation are redirected. Additionally, frequency regularization enforces imperceptibility by discouraging visually high-frequency artifacts.
4. Augmentation-Based Robust Optimization
To enhance cross-model and real-world robustness, the perturbation is trained not only on the original image but also on a set of augmented variants (scaling, rotation, brightness, blurring, noise). This anticipates input pre-processing and ensures transferability under realistic post-processing.
Empirical Evaluation
The proposed cross-modal attack is evaluated on MSCOCO, ImageNet, and TextVQA datasets, across five state-of-the-art LVLMs: MiniGPT4-llama2, MiniGPT4-vicuna, InstructBlip, Blip-2, and BLIVA. The adversarial perturbations are crafted on a surrogate model and then transferred to the remaining targets, measuring both white-box (source) and black-box (transfer) ASR (attack success rate) and semantic similarity (SS) to the attacker’s intent.
Quantitative Results
- White-Box (source model) ASR: Reached up to 100% on various datasets/models.
- Black-Box (transfer) ASR: Up to 90+% on closely related models and strong transfer (54.3% on average) onto previously unseen LVLM architectures.
- Semantic Similarity: High cosine similarity scores indicate not just task confusion but high fidelity semantic alignment to the attacker's intended response.
Baseline comparisons show that prior image-only attacks (e.g., CLIP-based or embedding attacks) fail to induce reliable task hijack, and explicit text-in-image prompt attacks are visually conspicuous and less stealthy.
Ablation studies demonstrate:
- Fusion-layer targeting is dramatically superior for transferability and task-level impact versus optimization over early (pre-fusion) or tail (generation) transformer layers.
- Moderate budget reallocation (λ ≈ 0.3) balances semantic focus without sacrificing attack coverage.
- A 16/255 perturbation norm is optimal for imperceptibility/efficacy tradeoff.
- Optimizing 3 fusion-critical layers yields best effectiveness—too narrow (1-2 layers) limits influence; too wide degrades stability.
Visual Analysis
Cross-modal attacks preserve high visual fidelity: the adversarial images are near-indistinguishable from originals, whereas embedding-alignment or explicit text-in-image methods introduce blatant perceptual artifacts or text overlays that are detectable upon visual inspection.
Discussion: Implications and Defense
The demonstrated attack exposes a new category of vulnerability: even when system prompts, user queries, and model weights are uncompromised, merely uploading a “contaminated” image can trigger an unmediated shift in model task interpretation. The threat model is highly practical—requiring no access to victim model internals and functioning in full black-box transfer settings.
Defensive techniques, including standard input transformations (randomization, JPEG compression, diffusion restoration) and inference-time safeguarding (SmoothVLM, DPS), offer limited protection. The attack’s focus on semantic anchors and model-level alignment make it robust to such interventions, suggesting a need for fundamentally new defense paradigms. These could include input provenance verification, causal consistency checks, multi-view redundancy, or human-in-the-loop confirmation for high-impact actions—especially pertinent as LVLMs underpin real-world agentic systems.
Theoretical and Practical Implications
This work provides evidence that the model's task understanding is far more malleable at the fusion level than anticipated. It underscores the inadequacy of prompt-based defenses in LVLMs and spotlights a critical blind spot in current system design: attacks can hijack both perception and downstream semantic reasoning via non-text modalities, even if all instructions appear benign.
The insights into fusion-critical layers may further inform both offensive and defensive future research—understanding and monitoring hidden-state dynamics may be necessary for robust security guarantees. Practically, adversaries can weaponize common multimedia channels (e.g., images in web forms, screenshots fed to digital agents) to mount undetected prompt injections.
Future Directions
The transferability and stealth of these attacks motivate deeper studies of:
- Multi-turn or session-persistent prompt injection via images.
- Black-box attacks against proprietary or online LVLM APIs with agentic capabilities.
- Embedding-hidden triggers in more complex handled formats (e.g., PDFs, video streams, webpages).
- Defenses leveraging causal anomaly detection in hidden-state trajectories, multi-modal provenance tracing, and cross-channel consistency checking.
Conclusion
The proposed cross-modal prompt injection attack designates a new threat vector for LVLMs: exhaustive task hijacking via minimal, imperceptible perturbation to only the image input. Through principled fusion-layer and semantic-region targeting, the attack bypasses both standard and advanced safeguards, reconfiguring multimodal grounding at a fundamental level. This exposes a critical need for more principled LVLM alignment, capable of defending against semantically sophisticated, modality-agnostic prompt manipulation.
Figure 1: The attack paradigm: only the image is perturbed to hijack a benign textual task prompt, in contrast to prior unimodal and multimodal prompt injection techniques.