- The paper introduces 'AnyDoor', a novel framework that conducts test-time backdoor attacks on multimodal LLMs without needing training data manipulation.
- It leverages adversarial visual perturbations paired with textual triggers to exploit vulnerabilities in models like LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2.
- Experimental results on datasets like VQAv2 and SVIT underscore the method’s high success rates and emphasize the need for enhanced security defenses in MLLMs.
Test-Time Backdoor Attacks on Multimodal LLMs
Introduction
Security has become a paramount concern for multimodal LLMs (MLLMs) as their capabilities and applications grow. MLLMs such as LLaVA-1.5, MiniGPT-4, InstructBLIP, and BLIP-2 have demonstrated impressive performance in understanding and generating content that spans multiple modalities, especially in vision-language contexts. However, this progress has also opened up vulnerabilities to backdoor attacks, traditionally executed by poisoning the training data. Unlike these conventional approaches, our work introduces "AnyDoor," a novel framework for injecting backdoors into MLLMs entirely at test time, without any access to or manipulation of the training data.
Test-Time Backdoor Attacks
The concept behind test-time backdoor attacks diverges from standard backdoor methodologies by injecting the backdoor directly at test time through adversarial perturbations. Crucially, this attack framework capitalizes on the inherent multimodal capabilities of MLLMs: it assigns the attack's setup to visual adversarial perturbations and its activation to textual trigger prompts, exploiting the distinct strengths of the visual and textual modalities.
"AnyDoor" exemplifies this approach by demonstrating how universal adversarial perturbations can be leveraged to manipulate multimodal LLM responses. Notably, these perturbations can be dynamically adapted to modify trigger prompts or harmful effects, posing a significant challenge to existing defensive mechanisms for MLLMs.
Experimental Validation
Our experiments encompass several popular MLLMs and utilize datasets such as VQAv2, SVIT, and DALLE-3 to validate "AnyDoor." The findings confirm that test-time backdoor attacks are effective across different MLLM architectures and datasets. In particular, the attacks achieve high success rates when employing border attacks, which confine the perturbation to the image edges and keep it barely visible.
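The border-attack variant can be approximated by restricting the perturbation to a thin frame around the image. The sketch below builds such a mask; the frame width and tensor layout are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def border_mask(height, width, border=6):
    """Binary mask that is 1 only on a `border`-pixel frame around the image.

    Multiplying the universal perturbation by this mask leaves the image
    interior untouched, which helps keep the attack visually inconspicuous.
    """
    mask = torch.zeros(1, height, width)
    mask[:, :border, :] = 1.0   # top rows
    mask[:, -border:, :] = 1.0  # bottom rows
    mask[:, :, :border] = 1.0   # left columns
    mask[:, :, -border:] = 1.0  # right columns
    return mask

# Inside the optimization loop, the adversarial image would then be
# x_adv = (img + border_mask(*img.shape[-2:]) * delta).clamp(0, 1)
```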
Further analyses highlight how attack strategies, perturbation budgets, and ensemble sample sizes influence the attack's efficacy. These results underline the careful balance required between maintaining benign accuracy and achieving the desired attack outcomes.
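As a rough illustration of that balance, the following sketch tallies benign accuracy (trigger absent) against attack success rate (trigger present) over a set of evaluation samples. Here `model_answer` is a hypothetical stand-in for the MLLM's generation API, and the substring-matching criteria are deliberately simplified.

```python
def evaluate(model_answer, samples, trigger, target):
    """Bookkeeping for the benign-accuracy vs. attack-success trade-off.

    model_answer(image, question) -> generated text (stand-in for the MLLM API).
    samples: iterable of (adv_image, question, reference_answer) tuples.
    """
    benign_hits, attack_hits = 0, 0
    for adv_image, question, reference in samples:
        # Benign accuracy: without the trigger, the answer should stay correct.
        if reference.lower() in model_answer(adv_image, question).lower():
            benign_hits += 1
        # Attack success: with the trigger appended, the target string should appear.
        if target.lower() in model_answer(adv_image, question + " " + trigger).lower():
            attack_hits += 1
    n = len(samples)
    return benign_hits / n, attack_hits / n
```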
Ablation Studies and Further Analyses
Ablation studies further dissect the impact of individual elements, including the attack strategy, perturbation budget, loss weights, and the choice of trigger and target phrases. These studies offer a deeper understanding of the attack mechanics and underscore the flexibility and robustness of our proposed "AnyDoor" approach under varying attack configurations.
Conclusions and Impact
Our work exposes a previously underexplored vulnerability in MLLMs to test-time backdoor attacks, compelling the research community to reconsider existing security paradigms in multimodal deep learning models. While the demonstrated attacks provide valuable insights into potential weaknesses, they also stress the urgent need for developing robust defenses that can mitigate such test-time adversarial manipulations.
Given the demonstrated efficacy of "AnyDoor" across various contexts and models, future efforts must emphasize enhancing MLLMs' resistance to backdoor attacks, especially those that can be executed without prior training data poisoning. Concurrently, exploring the implications of such vulnerabilities across broader multimodal and multidomain applications remains a pivotal area for ongoing research.