- The paper introduces 'AnyDoor', a novel framework that conducts test-time backdoor attacks on multimodal LLMs without needing training data manipulation.
- It leverages adversarial visual perturbations paired with textual triggers to exploit vulnerabilities in models like LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2.
- Experimental results on datasets like VQAv2 and SVIT underscore the method’s high success rates and emphasize the need for enhanced security defenses in MLLMs.
Test-Time Backdoor Attacks on Multimodal LLMs
Introduction
Security has become a paramount concern for multimodal LLMs (MLLMs) as their capabilities and applications grow. MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2 have demonstrated impressive performance in understanding and generating content that spans multiple modalities, especially in vision-language contexts. However, this progress has also opened up vulnerabilities to backdoor attacks, traditionally executed by poisoning the training data. Unlike these conventional approaches, our work introduces "AnyDoor," a novel framework for injecting backdoors into MLLMs entirely at test time, without any access to or manipulation of the training data.
Test-Time Backdoor Attacks
The concept behind test-time backdoor attacks diverges from standard backdoor methodologies by injecting the backdoor directly at test time through adversarial perturbations. Crucially, this attack framework capitalizes on the inherent multimodal capabilities of MLLMs: it assigns the attack's setup to visual adversarial perturbations and its activation to textual trigger prompts, exploiting the distinct strengths of the visual and textual modalities.
"AnyDoor" exemplifies this approach by demonstrating how universal adversarial perturbations can be leveraged to manipulate multimodal LLM responses. Notably, these perturbations can be dynamically adapted to modify trigger prompts or harmful effects, posing a significant challenge to existing defensive mechanisms for MLLMs.
Experimental Validation
Our experiments encompass several popular MLLMs and utilize datasets such as VQAv2, SVIT, and DALLE-3 to validate "AnyDoor." The findings confirm that test-time backdoor attacks are effective across different MLLM architectures and datasets. In particular, the attacks achieve high success rates when employing border attacks, which confine the perturbation to the image edges and keep it barely visible.
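The border-attack variant can be approximated by restricting the perturbation to a thin frame around the image. The sketch below builds such a mask; the frame width and tensor layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def border_mask(height, width, border=6):
    """Binary mask that is 1 only on a `border`-pixel frame around the image.

    Multiplying the universal perturbation by this mask leaves the image
    interior untouched, which helps keep the attack visually inconspicuous.
    """
    mask = torch.zeros(1, height, width)
    mask[:, :border, :] = 1.0   # top rows
    mask[:, -border:, :] = 1.0  # bottom rows
    mask[:, :, :border] = 1.0   # left columns
    mask[:, :, -border:] = 1.0  # right columns
    return mask

# Inside the optimization loop, the adversarial image would then be
# x_adv = (img + border_mask(*img.shape[-2:]) * delta).clamp(0, 1)
```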
Further analyses highlight how attack strategies, perturbation budgets, and ensemble sample sizes influence the attack's efficacy. These results underline the careful balance required between maintaining benign accuracy and achieving the desired attack outcomes.
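As a rough illustration of that balance, the following sketch tallies benign accuracy (trigger absent) against attack success rate (trigger present) over a set of evaluation samples. Here `model_answer` is a hypothetical stand-in for the MLLM's generation API, and the substring-matching criteria are deliberately simplified.

```python
def evaluate(model_answer, samples, trigger, target):
    """Bookkeeping for the benign-accuracy vs. attack-success trade-off.

    model_answer(image, question) -> generated text (stand-in for the MLLM API).
    samples: iterable of (adv_image, question, reference_answer) tuples.
    """
    benign_hits, attack_hits = 0, 0
    for adv_image, question, reference in samples:
        # Benign accuracy: without the trigger, the answer should stay correct.
        if reference.lower() in model_answer(adv_image, question).lower():
            benign_hits += 1
        # Attack success: with the trigger appended, the target string should appear.
        if target.lower() in model_answer(adv_image, question + " " + trigger).lower():
            attack_hits += 1
    n = len(samples)
    return benign_hits / n, attack_hits / n
```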
Ablation Studies and Further Analyses
Ablation studies further dissect the impact of individual elements, including the attack strategy, perturbation budget, loss weights, and the choice of trigger and target phrases. These studies offer a deeper understanding of the attack mechanics and underscore the flexibility and robustness of our proposed "AnyDoor" approach under varying attack configurations.
Conclusions and Impact
Our work exposes a previously underexplored vulnerability in MLLMs to test-time backdoor attacks, compelling the research community to reconsider existing security paradigms in multimodal deep learning models. While the demonstrated attacks provide valuable insights into potential weaknesses, they also stress the urgent need for developing robust defenses that can mitigate such test-time adversarial manipulations.
Given the demonstrated efficacy of "AnyDoor" across various contexts and models, future efforts must emphasize enhancing MLLMs' resistance to backdoor attacks, especially those that can be executed without prior training data poisoning. Concurrently, exploring the implications of such vulnerabilities across broader multimodal and multidomain applications remains a pivotal area for ongoing research.