Towards Generative Class Prompt Learning for Fine-grained Visual Recognition (2409.01835v2)

Published 3 Sep 2024 in cs.CV and cs.CL

Abstract: Although foundational vision-language models (VLMs) have proven very successful for various semantic discrimination tasks, they still struggle to perform faithfully on fine-grained categorization. Moreover, foundational models trained on one domain do not generalize well to a different domain without fine-tuning. We attribute these shortcomings to the limitations of the VLMs' semantic representations and attempt to improve their fine-grained visual awareness using generative modeling. Specifically, we propose two novel methods: Generative Class Prompt Learning (GCPL) and Contrastive Multi-class Prompt Learning (CoMPLe). Utilizing text-to-image diffusion models, GCPL significantly improves the visio-linguistic synergy in class embeddings by conditioning on few-shot exemplars with learnable class prompts. CoMPLe builds on this foundation by introducing a contrastive learning component that encourages inter-class separation during the generative optimization process. Our empirical results demonstrate that this generative class prompt learning approach substantially outperforms existing methods, offering a better alternative for few-shot image recognition. The source code will be made available at: https://github.com/soumitri2001/GCPL.
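
To make the abstract's idea concrete, below is a minimal PyTorch sketch of the kind of training loop it describes, under heavy simplifying assumptions: ToyDenoiser is a stand-in for a frozen, pretrained text-to-image diffusion backbone, the "images" are random tensors rather than real few-shot exemplars, and the noise schedule and contrastive weight (0.1) are illustrative choices, not the authors' implementation. Only the per-class prompt embeddings are optimized: a standard noise-prediction loss conditions them on class exemplars (the GCPL part), and a contrastive term favors the correct class prompt over rival prompts (the CoMPLe part).

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, IMG_DIM, NUM_CLASSES, SHOTS = 32, 64, 5, 4

class ToyDenoiser(nn.Module):
    # Stand-in for a frozen text-conditioned diffusion denoiser; a real
    # implementation would be a latent-diffusion U-Net, not an MLP.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + EMBED_DIM + 1, 128),
            nn.SiLU(),
            nn.Linear(128, IMG_DIM),
        )

    def forward(self, noisy_x, t, cond):
        # Predict the added noise from the noisy input, timestep, and prompt.
        return self.net(torch.cat([noisy_x, cond, t[:, None].float()], dim=-1))

denoiser = ToyDenoiser()
for p in denoiser.parameters():
    p.requires_grad_(False)  # the generative backbone stays frozen

# One learnable prompt embedding per class: the only trainable parameters.
class_prompts = nn.Parameter(torch.randn(NUM_CLASSES, EMBED_DIM) * 0.02)
opt = torch.optim.AdamW([class_prompts], lr=1e-2)

# Toy stand-ins for the few-shot exemplars of each class.
images = torch.randn(NUM_CLASSES, SHOTS, IMG_DIM)

for step in range(200):
    c = torch.randint(NUM_CLASSES, (1,)).item()        # sample a class
    x0 = images[c]
    t = torch.randint(1, 1000, (SHOTS,))
    noise = torch.randn_like(x0)
    alpha = (1.0 - t[:, None].float() / 1000).sqrt()   # toy noise schedule
    noisy = alpha * x0 + (1.0 - alpha ** 2).sqrt() * noise

    # GCPL-style objective: the class prompt must help the frozen denoiser
    # reconstruct the noise added to that class's exemplars.
    gen_loss = F.mse_loss(
        denoiser(noisy, t, class_prompts[c].expand(SHOTS, -1)), noise)

    # CoMPLe-style contrastive term (an assumption about the mechanism): the
    # correct prompt should denoise these images better than rival prompts.
    errors = torch.stack([
        F.mse_loss(denoiser(noisy, t, class_prompts[k].expand(SHOTS, -1)), noise)
        for k in range(NUM_CLASSES)
    ])
    con_loss = F.cross_entropy(-errors[None, :], torch.tensor([c]))

    loss = gen_loss + 0.1 * con_loss                   # illustrative weighting
    opt.zero_grad()
    loss.backward()
    opt.step()

At inference, a prompt set trained this way could classify a test image by picking the class whose prompt yields the lowest denoising error, in the spirit of diffusion-based classifiers; whether the paper adopts exactly this decision rule is not stated in the abstract.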

