Revisiting the Robust Generalization of Adversarial Prompt Tuning (2405.11154v1)

Published 18 May 2024 in cs.CV and cs.AI

Abstract: Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP against adversarial attacks is key to ensuring zero-shot generalization capacity on various downstream tasks. State-of-the-art defense mechanisms generally adopt prompt learning strategies for adversarial fine-tuning to improve the adversarial robustness of the pre-trained model while keeping the efficiency of adapting to downstream tasks. Such a setup leads to the problem of overfitting, which impedes further improvement of the model's generalization capacity on both clean and adversarial examples. In this work, we propose an adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework that utilizes multi-modal prompt learning to enhance the alignment of image and text features for adversarial examples, and leverages the strong generalization of pre-trained CLIP to guide the model, enhancing its robust generalization on adversarial examples while maintaining its accuracy on clean ones. We also design a novel adaptive consistency objective function to balance the consistency of adversarial and clean inputs between the fine-tuned model and the pre-trained model. We conduct extensive experiments across 14 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show the superiority of CAPT over other state-of-the-art adaptation methods. CAPT demonstrates excellent in-distribution performance and generalization under input distribution shift and across datasets.
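
The adaptive consistency objective described in the abstract can be made concrete with a short sketch. The PyTorch code below is an illustrative assumption, not the authors' released implementation: the PGD attack settings, the use of KL divergence to measure consistency against the frozen pre-trained CLIP, and the adaptive weight `w` that re-balances the clean and adversarial consistency terms are all hypothetical choices consistent with the abstract's description.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=4/255, alpha=1/255, steps=10):
    """Standard L-inf PGD (Madry et al.), used here to craft the
    adversarial inputs that feed the consistency terms below.
    Assumes `model` maps images to classification logits."""
    adv = (images.clone().detach()
           + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()
        adv = (images + (adv - images).clamp(-eps, eps)).clamp(0, 1)
    return adv.detach()

def adaptive_consistency_loss(tuned_model, frozen_model,
                              x_clean, x_adv, labels, lam=1.0):
    """Hypothetical sketch of a consistency-guided objective: a task loss
    on adversarial examples, plus KL terms that pull the fine-tuned
    model's predictions (on both clean and adversarial inputs) toward
    the frozen pre-trained CLIP. The exact weighting rule in the paper
    may differ."""
    logits_adv = tuned_model(x_adv)
    task_loss = F.cross_entropy(logits_adv, labels)

    # Reference distributions from the frozen pre-trained model.
    with torch.no_grad():
        ref_clean = F.softmax(frozen_model(x_clean), dim=-1)
        ref_adv = F.softmax(frozen_model(x_adv), dim=-1)

    log_p_clean = F.log_softmax(tuned_model(x_clean), dim=-1)
    log_p_adv = F.log_softmax(logits_adv, dim=-1)

    # KL terms pulling the tuned model toward the frozen reference.
    kl_clean = F.kl_div(log_p_clean, ref_clean, reduction="batchmean")
    kl_adv = F.kl_div(log_p_adv, ref_adv, reduction="batchmean")

    # Adaptive balance (assumption): emphasize the adversarial
    # consistency term more when it drifts further from the reference.
    w = kl_adv.detach() / (kl_clean.detach() + kl_adv.detach() + 1e-8)
    return task_loss + lam * ((1 - w) * kl_clean + w * kl_adv)
```

Keeping the pre-trained model frozen and penalizing divergence from its predictions is what lets the tuned prompts gain adversarial robustness without drifting away from CLIP's zero-shot generalization, which is the trade-off the abstract attributes to the adaptive consistency objective.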

Authors (5)
  1. Fan Yang (878 papers)
  2. Mingxuan Xia (4 papers)
  3. Sangzhou Xia (1 paper)
  4. Chicheng Ma (4 papers)
  5. Hui Hui (7 papers)
