Evaluating the Adversarial Robustness of Large Vision-Language Models
The paper "On Evaluating Adversarial Robustness of Vision-LLMs" explores a critical issue confronting the deployment of large vision-LLMs (VLMs): their vulnerability to adversarial attacks. With the increasing incorporation of multimodality in AI models, especially those capable of processing both text and visual inputs, security concerns have become more pronounced. This research addresses the susceptibility of these models to adversarial inputs, specifically focusing on scenarios where malicious attackers aim to manipulate visual inputs in order to induce incorrect or targeted textual outputs.
VLMs such as GPT-4 harness multimodal integration to achieve advanced conversational capabilities, yet that same integration creates fertile ground for adversarial exploits. The authors focus on a realistic threat model: black-box access with targeted attack goals. In contrast to the discrete text modality, the continuous visual modality can be perturbed in ways that are nearly imperceptible to humans, presenting a considerable security risk.
Methodology Overview
The authors propose a two-pronged strategy for adversarial robustness evaluation, consisting of transfer-based and query-based attack methodologies:
- Transfer-Based Attacks: Using pretrained models such as CLIP as surrogates, the authors craft adversarial images whose features align with a chosen target text description. Two matching schemes were examined (a PGD-style sketch follows this list):
- Matching Image-Text Features (MF-it): Cross-modality matching that pushes the adversarial image's embedding toward the embedding of the target text.
- Matching Image-Image Features (MF-ii): Intra-modality matching that pushes the adversarial image's embedding toward that of a target image generated from the target text with a text-to-image model such as Stable Diffusion.
- Query-Based Attacks: Sampling random perturbations around the current adversarial image to estimate the gradient of a black-box objective, namely the textual similarity between the victim model's generated response and the target text (a second sketch also follows the list).
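The following is a minimal sketch of how such transfer-based feature matching could look with a surrogate CLIP model from Hugging Face transformers. The checkpoint choice, the PGD hyperparameters, and the simplification of perturbing directly in CLIP's preprocessed input space are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def transfer_attack(clean_image, target_text=None, target_image=None,
                    eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD on a surrogate CLIP: pull the adversarial image's embedding toward
    either the target text embedding (MF-it) or the embedding of a target
    image generated from that text (MF-ii)."""
    # Embed the target once; it stays fixed throughout the attack.
    with torch.no_grad():
        if target_image is not None:   # MF-ii: image-image feature matching
            inputs = processor(images=target_image, return_tensors="pt").to(device)
            target_emb = clip.get_image_features(**inputs)
        else:                          # MF-it: image-text feature matching
            inputs = processor(text=[target_text], return_tensors="pt",
                               padding=True).to(device)
            target_emb = clip.get_text_features(**inputs)
        target_emb = F.normalize(target_emb, dim=-1)

    # Simplification: the perturbation lives in CLIP's preprocessed input space.
    x = processor(images=clean_image, return_tensors="pt")["pixel_values"].to(device)
    delta = torch.zeros_like(x, requires_grad=True)

    for _ in range(steps):
        img_emb = F.normalize(clip.get_image_features(pixel_values=x + delta), dim=-1)
        loss = (img_emb * target_emb).sum()      # cosine similarity to the target
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # signed-gradient ascent step
            delta.clamp_(-eps, eps)              # project onto the L_inf ball
            delta.grad.zero_()
    return (x + delta).detach()
```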
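A companion sketch of the query-based stage uses a simple random-probe, finite-difference gradient estimator. Here `victim_caption` (image in, caption out) and `text_similarity` (e.g. cosine similarity of text embeddings) are hypothetical stand-ins for the black-box VLM access and the scoring function, and the step counts and query budget are arbitrary.

```python
import torch

def query_attack(x_clean, x_adv, target_text, victim_caption, text_similarity,
                 eps=8 / 255, sigma=8 / 255, alpha=1 / 255,
                 n_queries=100, steps=8):
    """Refine an adversarial image using only black-box queries: estimate the
    gradient of the caption-to-target-text similarity with random probes."""
    x_adv = x_adv.clone()
    for _ in range(steps):
        base = text_similarity(victim_caption(x_adv), target_text)
        grad_est = torch.zeros_like(x_adv)
        for _ in range(n_queries):
            u = torch.randn_like(x_adv)                    # random probe direction
            probe = text_similarity(victim_caption(x_adv + sigma * u), target_text)
            grad_est += (probe - base) / sigma * u         # finite-difference estimate
        grad_est /= n_queries
        # Take a signed step, then project back into the eps-ball around the clean image.
        x_adv = x_adv + alpha * grad_est.sign()
        x_adv = x_clean + (x_adv - x_clean).clamp(-eps, eps)
    return x_adv
```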
Importantly, by combining these strategies, the paper demonstrates that it is feasible to craft adversarial images that elicit specific targeted text from several state-of-the-art VLMs, including MiniGPT-4, LLaVA, and UniDiffuser.
Experimental Insights
The experiments reveal significant insights into the adversarial vulnerabilities of large VLMs:
- Effectiveness Across Models: The attack methods successfully fooled a range of large VLMs, indicating a widespread vulnerability within the current architectures. Interestingly, transfer-based attacks using MF-ii on their own showed stronger black-box transferability compared to MF-it.
- Iterative Optimization: Combining the transfer-based and query-based methods yielded a higher success rate in eliciting the targeted textual outputs than either method alone, with the transfer-based image serving as a warm start for query-based refinement (a short pipeline sketch follows this list).
- In-depth Analysis: GradCAM visualizations illustrate how the adversarial perturbations redirect the models' attention away from the original content and toward regions consistent with the target text (a minimal Grad-CAM sketch also follows the list).
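As a usage note, the two stages in the earlier sketches compose straightforwardly: the transfer-based image warm-starts the query-based refinement. `text_to_image` is a hypothetical placeholder for any text-to-image generator (e.g. a Stable Diffusion sampler), and `victim_caption` / `text_similarity` are the same stand-ins as before; `processor` and `device` are reused from the transfer sketch.

```python
# Assumed composition of the two stages, reusing the sketches above.
target_image = text_to_image(target_text)                              # e.g. Stable Diffusion sample
x_transfer = transfer_attack(clean_image, target_image=target_image)   # MF-ii warm start
x_clean = processor(images=clean_image, return_tensors="pt")["pixel_values"].to(device)
x_final = query_attack(x_clean, x_transfer, target_text,
                       victim_caption, text_similarity)                 # black-box refinement
```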
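For the attention analysis, here is a minimal hand-rolled Grad-CAM sketch, assuming a convolutional vision backbone whose chosen layer emits (batch, channels, height, width) feature maps and a scalar `score_fn` of the model output; it is not tied to any particular VLM's encoder.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, layer, x, score_fn):
    """Grad-CAM: weight a layer's activations by the spatially pooled gradient
    of a scalar score (e.g. similarity between the model's output and the target)."""
    acts, grads = {}, {}
    h_fwd = layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h_bwd = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = score_fn(model(x))          # scalar objective to attribute
    model.zero_grad()
    score.backward()
    h_fwd.remove()
    h_bwd.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # pool gradients per channel
    cam = F.relu((weights * acts["a"]).sum(dim=1))        # weighted activation map
    return cam / (cam.max() + 1e-8)                       # normalize to [0, 1]
```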
Practical and Theoretical Implications
The paper underscores the need for stronger security measures in the deployment of VLMs. As these models become more integrated into applications, the implications of adversarial robustness extend from automated content moderation to interactive AI assistants. On the theoretical side, the work challenges researchers to design more robust architectures that reduce the transferability of adversarial perturbations across models.
Future Prospects
Addressing these vulnerabilities requires a multidisciplinary effort spanning adversarial training, robust architectural design, and potentially broader regulatory frameworks for managing AI safety in real-world applications. Exploring physical-world attacks, advancing model interpretability, and running continuous security evaluations are promising future research directions that stem from this work.
In essence, the research contributes critical insights into the security requirements for safe and reliable AI applications, and it should foster adversarial defense mechanisms that underpin the safer deployment of multimodal AI systems in diverse settings.