- The paper presents VEAttack, an attack that targets LVLM vision encoders by perturbing images so their image-token features drift away from the clean features under a cosine-similarity objective.
- It causes severe performance drops of 94.5% on image captioning and 75.7% on visual question answering, demonstrating cross-task generalization.
- By attacking only the vision encoder, the method keeps computational costs low and offers practical insight into LVLM vulnerabilities.
VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision LLMs
The paper "VEAttack: Downstream-agnostic Vision Encoder Attack against Large Vision LLMs" (2505.17440) explores a novel approach to compromising the integrity of Large Vision-LLMs (LVLMs) through targeted attacks on their vision encoders. LVLMs, which integrate vision encoders with LLMs, are vulnerable to adversarial perturbations, particularly from the vision inputs that propagate errors into multimodal outputs. VEAttack strategically attacks the vision encoder, minimizing computational overhead, and demonstrating generalization across diverse tasks such as image captioning and visual question answering.
Attack Overview
VEAttack addresses the limitations of traditional white-box attacks, which rely on full-model access and are task-specific. Such attacks require gradient computations through the entire model, making them computationally expensive, especially for LVLMs that serve multiple downstream tasks. By targeting only the vision encoder and optimizing perturbations to reduce the cosine similarity between clean and perturbed visual features, VEAttack removes the dependence on task-specific data, cuts computational cost, and remains effective across diverse LVLM applications.
Figure 1: The illustration of different attack paradigms where the white modules are accessible to the attacker, while the dark gray modules are inaccessible during the attack.
Methodology
The central innovation of VEAttack is its focus on perturbing image-token features rather than the class token. Because LVLMs consume image-token features during inference, attacking them affects every visual token the LLM sees, yielding broader and more effective perturbations. The attack objective is also redefined: instead of requiring full access to the LVLM, it is formulated purely on the vision encoder, whose representations are shared across tasks, which is what gives the attack its downstream generalization. Concretely, VEAttack minimizes the cosine similarity between the encoded clean and perturbed images, degrading downstream tasks without any task-specific supervision.
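Written out, the objective can be summarized as follows. The notation here is ours rather than the paper's: f_i(x) denotes the i-th image-token feature produced by the vision encoder for image x, N is the number of image tokens, and the perturbation is constrained to an l_inf ball of radius epsilon.

```latex
% Schematic objective (notation ours): minimize the mean cosine similarity
% between clean and perturbed image-token features under an l_inf budget.
\min_{\|\delta\|_\infty \le \epsilon} \;
\frac{1}{N}\sum_{i=1}^{N}
\frac{f_i(x)^\top f_i(x+\delta)}{\|f_i(x)\|_2 \, \|f_i(x+\delta)\|_2}
```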
Figure 2: The illustration of the overall framework of VEAttack, where it solely attacks the vision encoder of LVLMs within a downstream-agnostic context.
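A minimal PGD-style sketch of this idea is shown below, assuming a CLIP-like vision encoder that returns per-patch (image-token) features of shape (B, N, D). The function name `ve_attack`, the hyperparameters, and the encoder interface are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a cosine-similarity attack on a vision encoder.
# Assumes `encoder(images)` returns image-token features of shape (B, N, D).
import torch
import torch.nn.functional as F

def ve_attack(encoder, images, eps=8 / 255, alpha=2 / 255, steps=100):
    """Perturb `images` so their image-token features drift away from the
    clean features, measured by cosine similarity (lower is better for the attack)."""
    encoder.eval()
    with torch.no_grad():
        clean_feats = encoder(images)                 # (B, N, D) clean image-token features

    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_feats = encoder((images + delta).clamp(0, 1))
        # Per-token cosine similarity, averaged over tokens and batch.
        loss = F.cosine_similarity(adv_feats, clean_feats, dim=-1).mean()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()              # descend on the similarity
            delta.clamp_(-eps, eps)                   # stay inside the l_inf ball
            delta.copy_((images + delta).clamp(0, 1) - images)  # keep pixels valid
    return (images + delta).detach().clamp(0, 1)
```

The perturbed images are then fed to the frozen LVLM as usual; no gradients ever flow through the LLM, which is where the computational savings come from.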
Experimental Results
Empirical evaluations across several benchmark datasets show that VEAttack degrades image captioning performance by 94.5% and visual question answering performance by 75.7%. VEAttack also transfers strongly across tasks, consistent with the paper's Proposition 1, which lower-bounds the impact of vision-encoder perturbations on aligned downstream features.
Figure 3: Quantitative results on image captioning of clean samples, traditional white-box APGD attack, and VEAttack on LLaVa1.5-7B.
Observations
VEAttack reveals several key insights into LVLM vulnerabilities:
- Hidden Layer Variation: Attacking the vision encoder induces notable deviations in the LLM's hidden states, which propagate into the downstream output.
- Token Attention Differential: LVLMs attend very differently to image tokens versus instruction tokens across tasks, and this difference shapes their vulnerability.
- Möbius Band Phenomenon: Making the vision encoder more robust can inadvertently improve the transferability of adversarial examples, suggesting that robustness and vulnerability are intertwined.
- Low Sensitivity to Attack Steps: Reducing the number of attack iterations barely degrades VEAttack's effectiveness, making the attack efficient (see the short sweep sketched after Figure 4).
Figure 4: Illustration of attention maps across different layers of the LLM for two tasks.
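As a quick illustration of the low sensitivity to attack steps, the hypothetical `ve_attack` sketch above could be run with different step budgets; the loop below is a placeholder, not the authors' evaluation protocol, and `encoder` and `images` are assumed to be defined as before.

```python
# Illustrative sweep over step budgets using the ve_attack sketch above.
# The downstream evaluation is a placeholder for a captioning / VQA pipeline.
for steps in (10, 20, 50, 100):
    adv_images = ve_attack(encoder, images, eps=8 / 255, alpha=2 / 255, steps=steps)
    # ...feed adv_images through the frozen LVLM and score captioning / VQA...
```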
Conclusion
VEAttack offers a computationally efficient and highly effective adversarial attack on LVLMs by targeting the vision encoder in a downstream-agnostic manner. The method not only highlights how vulnerable LVLMs are to adversarial visual inputs but also provides a scalable attack mechanism, reducing traditional computation overheads by an order of magnitude. Future work can focus on countermeasures that harden the vision encoder against such attacks without sacrificing overall system performance.