Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective (2404.19287v3)

Published 30 Apr 2024 in cs.CV

Abstract: Pretrained vision-language models (VLMs) like CLIP exhibit exceptional generalization across diverse downstream tasks. While recent studies reveal their vulnerability to adversarial attacks, research to date has primarily focused on enhancing the robustness of image encoders against image-based attacks, with defenses against text-based and multimodal attacks remaining largely unexplored. To this end, this work presents the first comprehensive study on improving the adversarial robustness of VLMs against attacks targeting image, text, and multimodal inputs. This is achieved by proposing multimodal contrastive adversarial training (MMCoA). Such an approach strengthens the robustness of both image and text encoders by aligning the clean text embeddings with adversarial image embeddings, and adversarial text embeddings with clean image embeddings. The robustness of the proposed MMCoA is examined against existing defense methods over image, text, and multimodal attacks on the CLIP model. Extensive experiments on 15 datasets across two tasks reveal the characteristics of different adversarial defense methods under distinct distribution shifts and dataset complexities across the three attack types. This paves the way for a unified framework of adversarial robustness against different modality attacks, opening up new possibilities for securing VLMs against multimodal attacks. The code is available at https://github.com/ElleZWQ/MMCoA.git.
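To make the training objective concrete, below is a minimal, hypothetical PyTorch sketch of the cross-modal alignment described in the abstract. It is not the authors' released implementation (see the linked repository for that); the function names, the symmetric InfoNCE form of the contrastive loss, and the assumption that adversarial images and texts are generated upstream (e.g., by PGD on pixels and word substitution on tokens) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired image/text embeddings.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)            # image -> text direction
                  + F.cross_entropy(logits.t(), targets))     # text -> image direction

def mmcoa_style_loss(image_encoder, text_encoder,
                     images, adv_images, texts, adv_texts):
    # Hypothetical sketch of the MMCoA idea: pull adversarial image embeddings
    # toward their clean text embeddings, and adversarial text embeddings toward
    # their clean image embeddings, so both encoders are trained to be robust.
    img_clean = image_encoder(images)        # clean image embeddings
    img_adv   = image_encoder(adv_images)    # embeddings of attacked images
    txt_clean = text_encoder(texts)          # clean text embeddings
    txt_adv   = text_encoder(adv_texts)      # embeddings of attacked texts

    loss_image_branch = contrastive_loss(img_adv, txt_clean)  # robustness of image encoder
    loss_text_branch  = contrastive_loss(img_clean, txt_adv)  # robustness of text encoder
    return loss_image_branch + loss_text_branch
```

In use, `image_encoder` and `text_encoder` would be the CLIP encoders being fine-tuned, and `adv_images` / `adv_texts` would be regenerated from the current model at each step, as is standard in adversarial training; the exact loss weighting and attack settings follow the paper and are not reproduced here.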

Authors (5)
  1. Wanqi Zhou
  2. Shuanghao Bai
  3. Qibin Zhao
  4. Badong Chen
  5. Danilo P. Mandic