AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (2404.16804v1)
Abstract: Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where the context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified an important issue in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism, called "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performance compared to existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
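The CoOp-style prompt learning that the abstract builds on can be illustrated with a minimal sketch: learnable context vectors replace a hand-written prompt prefix (e.g. "a photo of a") and are concatenated with frozen class-name token embeddings before text encoding. This is a toy stand-in, not the paper's implementation: the dimensions, random embeddings, and mean-pooling "encoder" are illustrative assumptions in place of CLIP's tokenizer and transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; CoOp typically uses 16 context tokens of dimension 512.
n_ctx, dim = 4, 8
classes = ["cat", "dog"]

# Learnable context vectors shared across classes (replaces "a photo of a").
ctx = rng.normal(size=(n_ctx, dim))

# Frozen class-name token embeddings (stand-ins for CLIP's token embeddings).
name_emb = {c: rng.normal(size=(1, dim)) for c in classes}

def encode_text(tokens):
    # Stand-in for CLIP's frozen text encoder: mean-pool then L2-normalize.
    feat = tokens.mean(axis=0)
    return feat / np.linalg.norm(feat)

def class_features(ctx):
    # Prompt for each class = [ctx_1, ..., ctx_M, class-name token].
    return {c: encode_text(np.concatenate([ctx, name_emb[c]]))
            for c in classes}

def predict(img_feat, ctx):
    # Zero-shot classification: cosine similarity between the image feature
    # and each class's encoded prompt feature.
    img_feat = img_feat / np.linalg.norm(img_feat)
    feats = class_features(ctx)
    scores = {c: float(img_feat @ feats[c]) for c in classes}
    return max(scores, key=scores.get)

img = rng.normal(size=dim)
print(predict(img, ctx))  # prints whichever class prompt is most similar
```

In actual prompt learning, `ctx` is the only trainable parameter and is updated by backpropagating a classification loss through the frozen text encoder; AAPL additionally regularizes what `ctx` absorbs from augmented images so it does not overfit to seen-class statistics.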
- Flamingo: a visual language model for few-shot learning. NeurIPS, 2022.
- Food-101 – Mining discriminative components with random forests. In ECCV, 2014.
- Modeling inter and intra-class relations in the triplet loss for zero-shot learning. In ICCV, 2019.
- Synthesized classifiers for zero-shot learning. In CVPR, 2016.
- An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
- A simple framework for contrastive learning of visual representations. In ICML, 2020.
- Describing textures in the wild. In CVPR, 2014.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- Towards understanding linear word analogies. In ACL, 2019.
- Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop, 2004.
- Deep residual learning for image recognition. In CVPR, 2016.
- EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021a.
- Natural adversarial examples. In CVPR, 2021b.
- Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition (SIMBAD), 2015.
- Stacked semantics-guided attention model for fine-grained zero-shot learning. NeurIPS, 2018.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Visual prompt tuning. In ECCV, 2022.
- Transferable contrastive network for generalized zero-shot learning. In ICCV, 2019.
- Learning attention propagation for compositional zero-shot learning. In WACV, 2023.
- MaPLe: Multi-modal prompt learning. In CVPR, 2023a.
- Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, 2023b.
- Co-domain embedding using deep quadruplet networks for unseen traffic sign recognition. In AAAI, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
- Learning active learning from data. NIPS, 2017.
- 3d object representations for fine-grained categorization. In ICCV Workshop, 2013.
- The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.
- Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021.
- P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In ACL, 2022.
- Hierarchical prompt learning for multi-task learning. In CVPR, 2023.
- Prompt distribution learning. In CVPR, 2022.
- Understanding and mitigating overfitting in prompt tuning for vision-language models. IEEE Transactions on Circuits and Systems for Video Technology, 2023.
- Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
- Open world compositional zero-shot learning. In CVPR, 2021.
- Linguistic regularities in continuous space word representations. In NAACL, 2013.
- Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
- Adversarial robustness of prompt-based few-shot learning for natural language understanding. In ACL, 2023.
- Cats and dogs. In CVPR, 2012.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Deep active learning for image classification. In ICIP. IEEE, 2017.
- Do ImageNet classifiers generalize to ImageNet? In ICML, 2019.
- An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
- FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
- Cluster quality analysis using silhouette score. In IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2020.
- Test-time prompt tuning for zero-shot generalization in vision-language models. NeurIPS, 35:14274–14289, 2022.
- FLAVA: A foundational language and vision alignment model. In CVPR, 2022.
- Improved deep metric learning with multi-class n-pair loss objective. NIPS, 2016.
- UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- Active learning helps pretrained models learn the intended task. NeurIPS, 2022.
- Attention is all you need. NIPS, 2017.
- Learning robust global representations by penalizing local predictive power. NeurIPS, 2019.
- Learning to prompt for continual learning. In CVPR, 2022.
- Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(2), 2009.
- Adversarial soft prompt tuning for cross-domain sentiment analysis. In ACL, 2022.
- Latent embeddings for zero-shot classification. In CVPR, 2016.
- Zero-shot learning – the good, the bad and the ugly. In CVPR, 2017.
- Zero-shot learning – a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, 2018.
- SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
- Attribute prototype network for zero-shot learning. NeurIPS, 2020.
- Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, 2023.
- FILIP: Fine-grained interactive language-image pre-training. In ICLR, 2022.
- TextManiA: Enriching visual feature by text-driven manifold augmentation. In ICCV, 2023.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- LiT: Zero-shot transfer with locked-image text tuning. In CVPR, 2022.
- Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
- Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, 2022.
- Conditional prompt learning for vision-language models. In CVPR, 2022a.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022b.
- Prompt-aligned gradient for prompt tuning. In CVPR, 2023.
- Gahyeon Kim
- Sohee Kim
- Seokju Lee