
PIP: Detecting Adversarial Examples in Large Vision-Language Models via Attention Patterns of Irrelevant Probe Questions (2409.05076v1)

Published 8 Sep 2024 in cs.CV and cs.AI

Abstract: Large Vision-Language Models (LVLMs) have demonstrated their powerful multimodal capabilities. However, they also face serious safety problems, as adversaries can induce robustness issues in LVLMs through the use of well-designed adversarial examples. Therefore, LVLMs are in urgent need of detection tools for adversarial examples to prevent incorrect responses. In this work, we first discover that LVLMs exhibit regular attention patterns for clean images when presented with probe questions. We propose an unconventional method named PIP, which utilizes the attention patterns of one randomly selected irrelevant probe question (e.g., "Is there a clock?") to distinguish adversarial examples from clean examples. Regardless of the image to be tested and its corresponding question, PIP only needs to perform one additional inference of the image to be tested and the probe question, and then achieves successful detection of adversarial examples. Even under black-box attacks and open dataset scenarios, our PIP, coupled with a simple SVM, still achieves more than 98% recall and a precision of over 90%. Our PIP is the first attempt to detect adversarial attacks on LVLMs via simple irrelevant probe questions, shedding light on deeper understanding and introspection within LVLMs. The code is available at https://github.com/btzyd/pip.
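
The abstract outlines the pipeline at a high level: one extra forward pass on (test image, fixed probe question), attention patterns as features, and a simple SVM as the classifier. The sketch below illustrates that structure only; it is not the authors' implementation. The helper get_probe_attention is hypothetical (how attention is extracted and which layers/heads are used depends on the specific LVLM), and the RBF kernel is an assumption. See https://github.com/btzyd/pip for the actual code.

```python
# Minimal sketch of a PIP-style detector, under the assumptions stated above.
import numpy as np
from sklearn.svm import SVC

PROBE_QUESTION = "Is there a clock?"  # one fixed, irrelevant probe question (example from the abstract)

def get_probe_attention(model, image, question):
    """Hypothetical helper: run one LVLM inference on (image, question) and
    return its attention pattern over image tokens as a flat 1-D vector.
    The real extraction is model-specific and not shown here."""
    raise NotImplementedError("depends on the chosen LVLM's internals")

def build_features(model, images):
    # One additional inference per image, all with the same probe question.
    return np.stack([get_probe_attention(model, img, PROBE_QUESTION) for img in images])

def train_detector(X_clean, X_adv):
    # Label clean examples 0 and adversarial examples 1, then fit an SVM
    # on the attention features ("coupled with a simple SVM" per the abstract).
    X = np.concatenate([X_clean, X_adv])
    y = np.concatenate([np.zeros(len(X_clean)), np.ones(len(X_adv))])
    clf = SVC(kernel="rbf")  # kernel choice is an assumption, not from the paper
    clf.fit(X, y)
    return clf

def is_adversarial(clf, model, image):
    # Detection costs exactly one extra forward pass for the test image.
    feat = get_probe_attention(model, image, PROBE_QUESTION).reshape(1, -1)
    return bool(clf.predict(feat)[0])
```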

Authors (5)
  1. Yudong Zhang
  2. Ruobing Xie
  3. Jiansheng Chen
  4. Xingwu Sun
  5. Yu Wang