Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance

Published 19 Aug 2025 in cs.CV | (2508.13739v1)

Abstract: Targeted adversarial attacks are essential for proactively identifying security flaws in Vision-Language Models (VLMs) before real-world deployment. However, current methods perturb images to maximize global similarity with the target text or reference image at the encoder level, collapsing rich visual semantics into a single global vector. This limits attack granularity, hindering fine-grained manipulations such as modifying a car while preserving its background. Furthermore, these methods largely overlook the projector module, a critical semantic bridge between the visual encoder and the LLM in VLMs, thereby failing to disrupt the full vision-language alignment pipeline within VLMs and limiting attack effectiveness. To address these issues, we propose the Intermediate Projector Guided Attack (IPGA), the first method to attack using the intermediate stage of the projector module, specifically the widely adopted Q-Former, which transforms global image embeddings into fine-grained visual features. This enables more precise control over adversarial perturbations by operating on semantically meaningful visual tokens rather than a single global representation. Specifically, IPGA leverages the Q-Former pretrained solely on the first vision-language alignment stage, without LLM fine-tuning, which improves both attack effectiveness and transferability across diverse VLMs. Furthermore, we propose Residual Query Alignment (RQA) to preserve unrelated visual content, thereby yielding more controlled and precise adversarial manipulations. Extensive experiments show that our attack method consistently outperforms existing methods in both standard global image captioning tasks and fine-grained visual question-answering tasks in a black-box environment. Additionally, IPGA successfully transfers to multiple commercial VLMs, including Google Gemini and OpenAI GPT.


Authors (5)

Summary

The paper introduces IPGA, which leverages the Q-Former to enhance targeted adversarial attacks by moving beyond global visual feature manipulation.
It employs a multi-loss strategy including image-text contrastive, grounded text generation, and matching losses to align adversarial images with target text.
Experiments show that IPGA outperforms existing methods on global image captioning and fine-grained VQA tasks in black-box settings, and that it transfers to commercial VLMs including Google Gemini and OpenAI GPT.

Introduction

The paper proposes the Intermediate Projector Guided Attack (IPGA), a method for improving targeted adversarial attacks on large Vision-Language Models (VLMs) by attacking through an intermediate projector, specifically the Q-Former, which transforms global image embeddings into fine-grained visual features. Existing methods craft adversarial images by manipulating a single global visual feature at the encoder level and overlook the projector module, which limits both the granularity and the effectiveness of the attack.

Methodology

IPGA leverages a Q-Former pretrained only in the first vision-language alignment stage, without LLM fine-tuning, where it learns to align visual features with text. By operating on the Q-Former's visual tokens rather than on a single global encoder embedding, the method aims for more precise adversarial manipulation. The attack optimizes three losses associated with the Q-Former:

Image-Text Contrastive Loss ( $\mathcal{L}_{\text{ITC}}$ ): Aligns adversarial image features with target text embeddings while diverging from clean text embeddings.
Image-Grounded Text Generation Loss ( $\mathcal{L}_{\text{ITG}}$ ): Increases the probability of generating the target text relative to the clean text.
Image-Text Matching Loss ( $\mathcal{L}_{\text{ITM}}$ ): Encourages correct matching with the target text and mismatching with the clean text.

For global attacks, these three losses are combined with an encoder-level alignment term to form the full IPGA objective.
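
To make the structure of this objective concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the summary gives neither the exact loss formulas nor their weights, so the cosine-difference form of the contrastive term, the max-over-queries similarity (the convention used in BLIP-2's Q-Former pretraining), the function names, and the `weights` parameter are all assumptions. The ITG and ITM terms are passed in as precomputed scalars rather than implemented.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def query_text_similarity(query_tokens, text_embedding):
    """Image-text similarity as the max cosine over query tokens (assumed, BLIP-2 style)."""
    return max(cosine(q, text_embedding) for q in query_tokens)

def targeted_itc_loss(query_tokens, target_text, clean_text):
    """Assumed contrastive term: lower means closer to the target text and further from the clean text."""
    return (query_text_similarity(query_tokens, clean_text)
            - query_text_similarity(query_tokens, target_text))

def ipga_objective(l_itc, l_itg, l_itm, l_enc=0.0, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the three Q-Former terms plus an optional encoder-level term.

    The weights are hypothetical; the summary does not report them.
    """
    w_itc, w_itg, w_itm, w_enc = weights
    return w_itc * l_itc + w_itg * l_itg + w_itm * l_itm + w_enc * l_enc

# Toy 2-D embeddings: two query tokens per image, one embedding per caption.
target_text = [1.0, 0.0]
clean_text = [0.0, 1.0]
queries_clean = [[0.0, 1.0], [0.1, 0.9]]
queries_adv = [[1.0, 0.0], [0.1, 0.9]]   # only the first token has moved

loss_before = targeted_itc_loss(queries_clean, target_text, clean_text)
loss_after = targeted_itc_loss(queries_adv, target_text, clean_text)
```

In this toy case, moving a single query token toward the target text is enough to lower the contrastive term, which mirrors the idea of working with several visual tokens instead of one global vector.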

Figure 1: The framework of IPGA, highlighting the utilization of the Q-Former for fine-grained visual manipulation.

Performance Evaluation

Extensive experiments demonstrate IPGA's effectiveness, consistently surpassing existing global and fine-grained attack methods on several open-source and commercial platforms, including Google Gemini and OpenAI GPT.

Global Attacks: The method is effective at steering the semantic output of VLMs in image captioning tasks. Experiments on ImageNet-1K and MS-COCO show gains over baseline attacks, reflected in higher CLIP scores.

Figure 2: Comparison of IPGA and IPGA-R against baselines, demonstrating superior results on BLIP-2.

Fine-Grained Attacks: When applied to Visual Question-Answering (VQA), IPGA can manipulate specific image elements without disrupting unrelated content. This is achieved through the Residual Query Alignment (RQA) module, which is designed to preserve visual content unrelated to the target.
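
The summary does not give the RQA formulation, so the following is only one plausible reading, sketched in plain Python: treat the query tokens that are not being edited as "residual" and penalize how far they drift from their clean values. The function name `residual_query_penalty`, the `edited_indices` argument, and the mean-squared-drift form are all hypothetical.

```python
def residual_query_penalty(adv_queries, clean_queries, edited_indices):
    """Mean squared drift of the query tokens that are NOT being edited.

    Hypothetical stand-in for an RQA-style term: a value of zero means every
    unrelated token is unchanged from the clean image.
    """
    edited = set(edited_indices)
    residual = [i for i in range(len(clean_queries)) if i not in edited]
    if not residual:
        return 0.0
    total = 0.0
    for i in residual:
        total += sum((a - c) ** 2 for a, c in zip(adv_queries[i], clean_queries[i]))
    return total / len(residual)

# Toy example: three query tokens, token 0 is the one the attack targets.
clean_queries = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
adv_focused = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]   # only token 0 changed
adv_leaky = [[1.0, 0.0], [1.0, 0.0], [0.7, 0.5]]     # token 2 drifted too

penalty_focused = residual_query_penalty(adv_focused, clean_queries, [0])
penalty_leaky = residual_query_penalty(adv_leaky, clean_queries, [0])
```

Under this reading, adding the penalty to the attack loss would favor perturbations like `adv_focused`, which change the targeted token while leaving the remaining tokens untouched.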

Figure 3: Successful transfer of fine-grained attacks using IPGA-R, illustrated through targeted questions compared across various open-source VLMs.

Implementation Insights and Future Work

IPGA and its variant with RQA are implemented with standard deep learning frameworks such as PyTorch on high-performance computing setups. By extending the attack surface to intermediate model components, the approach underscores the need for VLM robustness evaluation that looks beyond the visual encoder.
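
The summary does not state how the perturbation is optimized or what budget is used. A common choice for image attacks of this kind is a projected sign-gradient (PGD-style) update under an L-infinity budget, and the sketch below shows one such step in plain Python. It is an assumption, not a description of the paper's optimizer: `step_size`, `eps`, and the hand-written gradient values are hypothetical, and in practice the gradient would come from backpropagating the attack loss through the model.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def projected_sign_step(x_adv, x_clean, grad, step_size, eps):
    """One descent step on the attack loss, projected to the eps-ball and to [0, 1]."""
    updated = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        moved = xa - step_size * sign(g)               # move against the loss gradient
        moved = min(max(moved, xc - eps), xc + eps)    # stay within eps of the clean pixel
        moved = min(max(moved, 0.0), 1.0)              # stay a valid pixel value
        updated.append(moved)
    return updated

# Toy example: three pixels in [0, 1] and a fixed, made-up gradient.
x_clean = [0.5, 0.0, 1.0]
toy_grad = [1.0, -2.0, -3.0]
eps = 0.03

x_final = list(x_clean)
for _ in range(3):
    x_final = projected_sign_step(x_final, x_clean, toy_grad, step_size=0.02, eps=eps)
```

After three steps each pixel has moved as far as the budget and the valid pixel range allow, so further steps with the same gradient leave `x_final` unchanged.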

Computational Considerations: Running these attacks requires significant computational resources, especially on large datasets or in real-time applications. Improving the attack's efficiency is a natural direction for future research.

In conclusion, IPGA advances targeted adversarial attacks on VLMs by offering finer control over adversarial perturbations and by transferring across models with diverse architectures. Future work may focus on improving attack efficiency and extending the method to other constraints.
