
Few-Shot Adversarial Prompt Learning on Vision-Language Models

Published 21 Mar 2024 in cs.CV, cs.CL, cs.CR, and cs.LG | (2403.14774v2)

Abstract: The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial robustness by aligning adversarial visual features with text supervision. In practice, however, these methods remain unsatisfactory because of heavy adaptation cost, suboptimal text supervision, and uncontrolled natural generalization capacity. To address these issues, we propose a few-shot adversarial prompt framework in which adapting input sequences with limited data significantly improves adversarial robustness. Specifically, we achieve this by providing adversarially correlated text supervision that is learned end-to-end from adversarial examples. We also propose a novel training objective that enhances the consistency of multi-modal features while encouraging differentiated uni-modal features between natural and adversarial examples. The proposed framework makes it possible to learn adversarial text supervision, which provides superior cross-modal adversarial alignment and matches state-of-the-art zero-shot adversarial robustness with only 1% of the training data. Code is available at: https://github.com/lionel-w2/FAP.
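
The abstract gives no formulas, so the following is only a rough illustration of what "aligning visual features with text supervision" typically means in CLIP-style models, not the paper's actual objective. It assumes class scores are temperature-scaled cosine similarities between one image feature and one text feature per class, trained with cross-entropy. The function names, the temperature value, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(image_feat, text_feats, label, temperature=0.07):
    """Cross-entropy over temperature-scaled image-text cosine similarities.

    Assumption: this is the generic CLIP-style classification loss, used
    here as a stand-in; the paper's own objective is not specified in the
    abstract and may differ.
    """
    logits = [cosine(image_feat, t) / temperature for t in text_feats]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Toy 2-D features (hypothetical): one text feature per class.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
natural_feat = [0.9, 0.1]      # close to the class-0 text feature
adversarial_feat = [0.4, 0.6]  # pushed toward the class-1 text feature

loss_natural = alignment_loss(natural_feat, text_feats, label=0)
loss_adversarial = alignment_loss(adversarial_feat, text_feats, label=0)
print(loss_natural, loss_adversarial)
```

In this toy setup the perturbed feature sits closer to the wrong class's text feature, so its loss is larger. Under the stated assumptions, adversarial alignment amounts to reducing that loss on adversarial examples, whether by adapting the visual side, the text supervision, or both.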
