Towards Generative Class Prompt Learning for Fine-grained Visual Recognition (2409.01835v2)
Abstract: Although foundational vision-language models (VLMs) have proven very successful for various semantic discrimination tasks, they still struggle to perform faithfully on fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these shortcomings to the limitations of the VLM's semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that such a generative class prompt learning approach substantially outperforms existing methods, offering a better alternative to few-shot image recognition challenges. The source code will be made available at: https://github.com/soumitri2001/GCPL.
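To make the abstract's idea concrete, the sketch below illustrates the general recipe it describes: each class gets a learnable prompt embedding, a diffusion-style denoising loss measures how well that prompt "explains" an image, a contrastive term penalizes other classes' prompts for explaining it too (the CoMPLe idea), and classification picks the class with the lowest denoising loss. Everything here is a toy assumption for illustration: the linear "denoiser" `W`, the loss form `l_true - lam * mean(l_others)`, and all names are hypothetical stand-ins, not the paper's actual architecture or objective.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8            # toy latent / prompt-embedding dimensionality (assumption)
NUM_CLASSES = 3
# Frozen toy "denoiser": predicts noise from (noisy latent, class prompt).
W = rng.normal(size=(D, 2 * D)) * 0.1

def denoise_loss(x0, eps, e_c):
    """Per-sample denoising loss ||eps - eps_hat||^2 under class prompt e_c."""
    x_noisy = x0 + eps
    eps_hat = W @ np.concatenate([x_noisy, e_c])
    return float(np.sum((eps - eps_hat) ** 2))

def comple_objective(x0, eps, prompts, y, lam=0.1):
    """Generative loss for the true class minus a contrastive term that
    discourages other classes' prompts from fitting the sample (an assumed
    form of the paper's contrastive multi-class learning)."""
    l_true = denoise_loss(x0, eps, prompts[y])
    l_others = [denoise_loss(x0, eps, prompts[c])
                for c in range(len(prompts)) if c != y]
    return l_true - lam * float(np.mean(l_others))

def classify(x0, eps, prompts):
    """Generative classification: the class whose learned prompt yields the
    lowest denoising loss is the prediction."""
    losses = [denoise_loss(x0, eps, e) for e in prompts]
    return int(np.argmin(losses))
```

In practice the prompts would be token embeddings optimized through a pretrained text-to-image diffusion model (as in textual inversion), with the contrastive term applied during that generative optimization; this toy version only shows the shape of the objective and the loss-based decision rule.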