Learning to Prompt with Text Only Supervision for Vision-Language Models
Abstract: Foundational vision-language models such as CLIP are becoming a new paradigm in vision due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and they often struggle to generalize to new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from LLMs and perform prompt ensembling. However, these methods often produce class-specific prompts that cannot be transferred to other classes, and they incur higher costs by generating LLM descriptions for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. Since supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with the LLM contextual data mapped into the learned prompts, our method enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt-engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on four benchmarks, where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pre-trained models are available at https://github.com/muzairkhattak/ProText.
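The abstract describes the approach only at a high level. The sketch below illustrates one way text-only prompt learning can be set up, assuming a CoOp-style learnable context prepended to the class name inside CLIP's text encoder and an L1 loss that aligns the prompted class-name feature with the frozen CLIP feature of an LLM-generated class description. The class names, descriptions, context length, loss choice, and hyperparameters here are illustrative assumptions, not the paper's released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model.float()
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the prompt vectors train

# Hypothetical training data: class names paired with LLM-generated descriptions.
classnames = ["golden retriever", "sports car"]
llm_descriptions = {
    "golden retriever": "a photo of a golden retriever, a friendly dog with a dense golden coat.",
    "sports car": "a photo of a sports car, a low two-door vehicle built for speed.",
}

n_ctx = 4                                            # number of learnable context tokens (assumption)
ctx_dim = model.token_embedding.weight.shape[1]
ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim, device=device) * 0.02)

def encode_with_prompts(classname):
    """Encode '<ctx_1 ... ctx_n> {classname}.' with learnable context vectors."""
    tokens = clip.tokenize(f"{'X ' * n_ctx}{classname}.").to(device)
    embed = model.token_embedding(tokens)            # (1, 77, dim)
    # Swap the placeholder 'X' token embeddings for the learnable context.
    embed = torch.cat([embed[:, :1], ctx.unsqueeze(0), embed[:, 1 + n_ctx:]], dim=1)
    x = embed + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    # Feature at the EOT token position, projected to the joint embedding space.
    feat = x[torch.arange(x.shape[0]), tokens.argmax(dim=-1)] @ model.text_projection
    return F.normalize(feat, dim=-1)

@torch.no_grad()
def encode_frozen(text):
    feat = model.encode_text(clip.tokenize(text).to(device))
    return F.normalize(feat, dim=-1)

optimizer = torch.optim.AdamW([ctx], lr=2e-3)
for epoch in range(50):
    for name in classnames:
        target = encode_frozen(llm_descriptions[name])   # frozen LLM-description feature
        pred = encode_with_prompts(name)                 # prompted class-name feature
        loss = F.l1_loss(pred, target)                   # text-to-text alignment, no images needed
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because the supervision lives entirely in CLIP's text space, the learned `ctx` vectors are not tied to any particular class and can, in principle, be reused with unseen class names at inference time, which is the transfer property the abstract highlights.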