Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models (2402.06659v2)
Abstract: Vision-LLMs (VLMs) excel in generating textual responses from visual inputs, but their versatility raises security concerns. This study takes the first step in exposing VLMs' susceptibility to data poisoning attacks that can manipulate responses to innocuous, everyday prompts. We introduce Shadowcast, a stealthy data poisoning attack where poison samples are visually indistinguishable from benign images with matching texts. Shadowcast demonstrates effectiveness in two attack types. The first is a traditional Label Attack, tricking VLMs into misidentifying class labels, such as confusing Donald Trump for Joe Biden. The second is a novel Persuasion Attack, leveraging VLMs' text generation capabilities to craft persuasive and seemingly rational narratives for misinformation, such as portraying junk food as healthy. We show that Shadowcast effectively achieves the attacker's intentions using as few as 50 poison samples. Crucially, the poisoned samples demonstrate transferability across different VLM architectures, posing a significant concern in black-box settings. Moreover, Shadowcast remains potent under realistic conditions involving various text prompts, training data augmentation, and image compression techniques. This work reveals how poisoned VLMs can disseminate convincing yet deceptive misinformation to everyday, benign users, emphasizing the importance of data integrity for responsible VLM deployments. Our code is available at: https://github.com/umd-huang-lab/VLM-Poisoning.
- OpenAI. Gpt-4v(ision) system card. 2023.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023a.
- Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- Visual adversarial examples jailbreak large language models. CoRR, abs/2306.13213, 2023.
- Are aligned neural networks adversarially aligned? NeurIPS, 2023a.
- Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. In NeurIPS, 2023.
- Trustllm: Trustworthiness in large language models. arXiv preprint arXiv: 2401.05561, 2024.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv: 2307.15043, 2023.
- Autodan: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv: 2310.15140, 2023b.
- Vlattack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. NeurIPS, 2023.
- Image hijacks: Adversarial images can control generative models at runtime. arXiv preprint arXiv: 2309.00236, 2023.
- On evaluating adversarial robustness of large vision-language models. NeurIPS, 2023.
- How robust is google’s bard to adversarial image attacks? arXiv preprint arXiv: 2309.11751, 2023.
- Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389, 2012.
- Just how toxic is data poisoning? a unified benchmark for backdoor and data poisoning attacks. In International Conference on Machine Learning, pages 9389–9398. PMLR, 2021.
- Poison frogs! targeted clean-label poisoning attacks on neural networks. Advances in neural information processing systems, 31, 2018.
- Data poisoning attacks against multimodal encoders. In International Conference on Machine Learning, pages 39299–39313. PMLR, 2023.
- Poisoning and backdooring contrastive learning. In International Conference on Learning Representations, 2022.
- Prompt-specific poisoning attacks on text-to-image generative models. arXiv preprint arXiv:2310.13828, 2023.
- On the proactive generation of unsafe images from text-to-image models using benign prompts. arXiv preprint arXiv:2310.16613, 2023.
- On the exploitability of instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Laion-5b: An open large-scale dataset for training next generation image-text models. Neural Information Processing Systems, 2022. doi:10.48550/arXiv.2210.08402.
- Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023c.
- Poisoning web-scale training datasets is practical. arXiv preprint arXiv:2302.10149, 2023b.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Witches’ brew: Industrial scale data poisoning via gradient matching. arXiv preprint arXiv:2009.02276, 2020.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
- Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.
- Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
- Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
- Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.
- Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519, 2017.
- Jpeg-resistant adversarial images. In NIPS 2017 Workshop on Machine Learning and Computer Security, volume 1, page 8, 2017.
- Not all poisons are created equal: Robust training against data poisoning. In International Conference on Machine Learning, pages 25154–25165. PMLR, 2022.
- What doesn’t kill you makes you robust (er): How to adversarially train against data poisoning. arXiv preprint arXiv:2102.13624, 2021.
- Yuancheng Xu (17 papers)
- Jiarui Yao (9 papers)
- Manli Shu (23 papers)
- Yanchao Sun (32 papers)
- Zichu Wu (3 papers)
- Ning Yu (78 papers)
- Tom Goldstein (226 papers)
- Furong Huang (150 papers)