Revisiting the Robust Generalization of Adversarial Prompt Tuning (2405.11154v1)
Abstract: Understanding the vulnerability of large-scale pre-trained vision-language models like CLIP to adversarial attacks is key to ensuring zero-shot generalization on various downstream tasks. State-of-the-art defense mechanisms generally adopt prompt learning strategies for adversarial fine-tuning to improve the adversarial robustness of the pre-trained model while keeping the efficiency of adapting to downstream tasks. Such a setup leads to overfitting, which impedes further improvement of the model's generalization on both clean and adversarial examples. In this work, we propose an adaptive Consistency-guided Adversarial Prompt Tuning (CAPT) framework that utilizes multi-modal prompt learning to enhance the alignment of image and text features for adversarial examples, and leverages the strong generalization of the pre-trained CLIP to guide the model, enhancing its robust generalization on adversarial examples while maintaining its accuracy on clean ones. We also design a novel adaptive consistency objective function to balance the consistency of adversarial and clean inputs between the fine-tuned model and the pre-trained model. We conduct extensive experiments across 14 datasets and 4 data sparsity schemes (from 1-shot to full training-data settings) to show the superiority of CAPT over other state-of-the-art adaptation methods. CAPT demonstrates excellent in-distribution performance and generalization under input distribution shift and across datasets.
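The consistency-guided objective can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical rendering of the idea described in the abstract: a supervised loss on adversarial inputs plus KL-divergence consistency terms that pull the prompt-tuned model toward the frozen pre-trained CLIP on both adversarial and clean inputs. The function name, the scalar weight `alpha`, and the temperature `tau` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_guided_loss(logits_adv, logits_clean,
                            teacher_logits_adv, teacher_logits_clean,
                            labels, alpha=0.5, tau=1.0):
    """Hypothetical sketch of a consistency-guided adversarial tuning loss.

    logits_*          -- outputs of the prompt-tuned model
    teacher_logits_*  -- outputs of the frozen pre-trained CLIP
    alpha             -- stand-in weight balancing the two consistency terms
    """
    # Supervised adversarial-training loss on the perturbed inputs.
    task = F.cross_entropy(logits_adv, labels)

    # Consistency with the frozen teacher on adversarial inputs,
    # encouraging robust generalization.
    kl_adv = F.kl_div(F.log_softmax(logits_adv / tau, dim=-1),
                      F.softmax(teacher_logits_adv / tau, dim=-1),
                      reduction="batchmean")

    # Consistency with the frozen teacher on clean inputs,
    # helping preserve clean accuracy.
    kl_clean = F.kl_div(F.log_softmax(logits_clean / tau, dim=-1),
                        F.softmax(teacher_logits_clean / tau, dim=-1),
                        reduction="batchmean")

    return task + alpha * kl_adv + (1.0 - alpha) * kl_clean
```

In the paper's adaptive variant, the balance between the adversarial and clean consistency terms is adjusted rather than fixed; a constant `alpha` is used here only to keep the sketch self-contained.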
Authors: Fan Yang, Mingxuan Xia, Sangzhou Xia, Chicheng Ma, Hui Hui