Concept-Guided Prompt Learning for Generalization in Vision-Language Models (2401.07457v1)

Published 15 Jan 2024 in cs.CV

Abstract: The Contrastive Language-Image Pretraining (CLIP) model has exhibited remarkable efficacy in establishing cross-modal connections between texts and images, yielding impressive performance across a broad spectrum of downstream applications through fine-tuning. However, for generalization tasks, current fine-tuning methods for CLIP, such as CoOp and CoCoOp, demonstrate relatively low performance on some fine-grained datasets. We recognize that the underlying reason is that these previous methods only projected global features into the prompt, neglecting the various visual concepts, such as colors, shapes, and sizes, which are naturally transferable across domains and play a crucial role in generalization tasks. To address this issue, in this work, we propose Concept-Guided Prompt Learning (CPL) for vision-language models. Specifically, we leverage the well-learned knowledge of CLIP to create a visual concept cache to enable concept-guided prompting. In order to refine the text features, we further develop a projector that transforms multi-level visual features into text features. We observe that this concept-guided prompt learning approach is able to achieve enhanced consistency between visual and linguistic modalities. Extensive experimental results demonstrate that our CPL method significantly improves generalization capabilities compared to the current state-of-the-art methods.
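The abstract describes two components: a visual concept cache that retrieves transferable concepts (colors, shapes, sizes) for an image, and a projector that maps multi-level visual features into the text-feature space. The sketch below is a minimal, hypothetical illustration of those two ideas only; the class and function names, feature dimensions, and the linear projector are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

class ConceptCache:
    """Toy concept cache: stores (assumed CLIP-like) visual feature
    vectors keyed by concept names and retrieves the most similar
    concepts for a query image feature via cosine similarity."""
    def __init__(self):
        self.keys = []    # L2-normalized concept feature vectors
        self.names = []   # concept labels, e.g. "red", "round"

    def add(self, name, feat):
        self.names.append(name)
        self.keys.append(feat / np.linalg.norm(feat))

    def query(self, feat, topk=2):
        # Cosine similarity of the query against all cached concepts
        q = feat / np.linalg.norm(feat)
        sims = np.stack(self.keys) @ q
        idx = np.argsort(-sims)[:topk]
        return [self.names[i] for i in idx]

def project(multi_level_feats, W):
    """Toy projector: fuses multi-level visual features by averaging,
    then maps the result into the text-feature space with a single
    linear map W (an assumed, simplified stand-in)."""
    fused = np.mean(multi_level_feats, axis=0)
    return W @ fused

# Illustrative usage with random 8-dim features
rng = np.random.default_rng(0)
cache = ConceptCache()
for name in ["red", "round", "striped"]:
    cache.add(name, rng.normal(size=8))

img_feat = rng.normal(size=8)
concepts = cache.query(img_feat, topk=2)        # concepts to guide the prompt
W = rng.normal(size=(8, 8))
text_refinement = project([img_feat, rng.normal(size=8)], W)
```

In this reading, the retrieved `concepts` would condition the learned prompt, while the projector output refines the text features toward the visual evidence; the paper's actual cache construction and projector architecture are more involved.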

References (50)
  1. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, volume 35, 23716–23736.
  2. Food-101–mining discriminative components with random forests. In European Conference on Computer Vision, 446–461.
  3. PLOT: Prompt learning with optimal transport for vision-language models. In International Conference on Learning Representations.
  4. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3606–3613.
  5. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 248–255.
  6. Multi-modal alignment using representation codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15651–15660.
  7. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 178.
  8. A bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, volume 2, 524–531.
  9. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 1–15.
  10. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7): 2217–2226.
  11. The many faces of robustness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8340–8349.
  12. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15262–15271.
  13. Learning to adapt CLIP for few-shot monocular depth estimation. arXiv preprint arXiv:2311.01034.
  14. Unsupervised learning of discriminative attributes and visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5175–5184.
  15. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, 4904–4916.
  16. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19113–19122.
  17. 3D object representations for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 554–561.
  18. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10965–10975.
  19. Feature pyramid networks for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2117–2125.
  20. Recognizing human actions by attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3337–3344.
  21. Simpler is better: Few-shot semantic segmentation with classifier weight transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8741–8750.
  22. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.
  23. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 722–729. IEEE.
  24. Cats and dogs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3498–3505.
  25. Sun attribute database: Discovering, annotating, and recognizing scene attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2751–2758.
  26. Coco attributes: Attributes for people, animals, and objects. In European Conference on Computer Vision, 85–100.
  27. Learning to predict visual attributes in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13018–13028.
  28. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763. PMLR.
  29. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18082–18091.
  30. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, 5389–5400. PMLR.
  31. Test-time prompt tuning for zero-shot generalization in vision-language models. In Advances in Neural Information Processing Systems, volume 35, 14274–14289.
  32. APPLeNet: Visual attention parameterized prompt learning for few-shot remote sensing image generalization using CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024–2034.
  33. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  34. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 6000–6010.
  35. Learning to decompose visual features with latent textual prompts. In International Conference on Learning Representations.
  36. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, volume 32, 10506–10518.
  37. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 568–578.
  38. Sun database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3485–3492.
  39. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6757–6767.
  40. CoCa: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research.
  41. Task residual for tuning vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10899–10909.
  42. Vt-clip: Enhancing vision-language models with visual-guided texts. arXiv preprint arXiv:2112.02399.
  43. Tip-adapter: Training-free adaption of clip for few-shot classification. In European Conference on Computer Vision, 493–510. Springer.
  44. BDC-Adapter: Brownian distance covariance for better vision-language reasoning. In British Machine Vision Conference.
  45. Cross-Modal Concept Learning and Inference for Vision-Language Models. arXiv preprint arXiv:2307.15460.
  46. A large-scale attribute dataset for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 398–407.
  47. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16816–16825.
  48. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
  49. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15659–15669.
  50. Not all features matter: Enhancing few-shot clip with adaptive prior refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2605–2615.
Authors (5)
  1. Yi Zhang
  2. Ce Zhang
  3. Ke Yu
  4. Yushun Tang
  5. Zhihai He
Citations (11)