
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors (2405.10529v2)

Published 17 May 2024 in cs.CV and cs.AI

Abstract: Large language models (LLMs) have become increasingly prominent, signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attacks are considered the most realistic threat model in physical vision applications, as demonstrated by much of the existing literature. In this paper, we address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts are sensitive to pixel-wise randomization, a trait that remains robust even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, specifically tailored to protect VLMs from the threat of patched visual prompt injectors. Our framework lowers the attack success rate to between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
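The abstract describes a smoothing-style defense: because patched adversarial prompts degrade under pixel-wise randomization, querying the VLM on several randomized copies of the input and aggregating the answers suppresses the attack. The sketch below illustrates that idea only; the `query_vlm` callable, the number of copies, the perturbation rate, and the naive exact-match vote are assumptions for illustration, not the authors' exact SmoothVLM configuration.

```python
from collections import Counter

import numpy as np


def randomize_pixels(image: np.ndarray, rate: float, rng: np.random.Generator) -> np.ndarray:
    """Replace a random fraction `rate` of pixels with uniform noise.

    Assumes an HxWxC uint8 image; both the perturbation type and rate are
    illustrative choices, not the paper's exact randomization scheme.
    """
    noisy = image.copy()
    mask = rng.random(image.shape[:2]) < rate  # per-pixel Bernoulli mask
    noisy[mask] = rng.integers(0, 256, size=(int(mask.sum()), image.shape[2]), dtype=image.dtype)
    return noisy


def smoothed_vlm_response(image: np.ndarray, prompt: str, query_vlm,
                          num_copies: int = 10, rate: float = 0.1, seed: int = 0) -> str:
    """Query the VLM on several pixel-randomized copies and return the consensus answer.

    Patched adversarial prompts tend to break under pixel-wise randomization,
    so the majority response over perturbed copies favors the benign output.
    `query_vlm(image, prompt) -> str` is a hypothetical black-box interface.
    """
    rng = np.random.default_rng(seed)
    responses = [query_vlm(randomize_pixels(image, rate, rng), prompt)
                 for _ in range(num_copies)]
    most_common_response, _ = Counter(responses).most_common(1)[0]
    return most_common_response
```

In practice, free-form VLM outputs rarely match exactly, so a real aggregation step would compare responses with a similarity or attack-detection check rather than the exact-match vote used in this sketch.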

Authors (5)
  1. Jiachen Sun (29 papers)
  2. Changsheng Wang (9 papers)
  3. Jiongxiao Wang (15 papers)
  4. Yiwei Zhang (84 papers)
  5. Chaowei Xiao (110 papers)