
ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models (2311.16494v2)

Published 27 Nov 2023 in cs.CV

Abstract: Although soft prompt tuning is effective in efficiently adapting Vision-Language (V&L) models for downstream tasks, it shows limitations in dealing with distribution shifts. We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by LLMs. We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, thus only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks.
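The abstract's core mechanics can be sketched in a few lines: classes are scored by the image's agreement with LLM-generated attribute embeddings, while negative prompting pushes the model toward a non-committal (near-uniform) distribution over class-agnostic attributes. This is a minimal illustration under assumed simplifications (unit-norm embeddings, mean-pooled attribute similarity, a KL-to-uniform penalty standing in for the paper's orthogonality objective); the function names are hypothetical, not the authors' API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attribute_class_scores(image_emb, attr_embs_per_class):
    """Score each class by the image's mean cosine similarity to that
    class's attribute embeddings (assumes all vectors are unit-norm)."""
    scores = [(attrs @ image_emb).mean() for attrs in attr_embs_per_class]
    return softmax(np.array(scores))

def negative_prompt_penalty(image_emb, neg_attr_embs):
    """Penalty that is zero when the image is maximally uncertain over
    class-agnostic 'negative' attributes: KL(uniform || p), a stand-in
    for the paper's orthogonality objective on spurious features."""
    probs = softmax(neg_attr_embs @ image_emb)
    uniform = np.full_like(probs, 1.0 / probs.size)
    return float(np.sum(uniform * np.log(uniform / probs)))
```

In a full pipeline these scores would be computed from a frozen CLIP-style encoder and the penalty added to the tuning loss; here plain numpy arrays stand in for the embeddings.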


