Conditional Prototype Rectification Prompt Learning (2404.09872v2)
Abstract: Pre-trained large-scale vision-language models (VLMs) have acquired a profound understanding of general visual concepts. Recent advances in efficient transfer learning (ETL) have shown remarkable success in fine-tuning VLMs with limited data, introducing only a few parameters to harness task-specific knowledge from VLMs. Despite significant progress, current leading ETL methods tend to overfit the narrow distributions of base classes seen during training and face two primary challenges: (i) they use only uni-modal information to model task-specific knowledge; and (ii) they rely on costly and time-consuming methods to supplement knowledge. To address these issues, we propose Conditional Prototype Rectification Prompt Learning (CPR), which corrects the bias of base examples and augments limited data in an effective way. Specifically, we alleviate overfitting on base classes from two aspects. First, each input image acquires knowledge from both textual and visual prototypes and then generates sample-conditional text tokens. Second, we extract usable knowledge from unlabeled data to further refine the prototypes. These two strategies mitigate biases stemming from base classes, yielding a more effective classifier. Extensive experiments on 11 benchmark datasets show that CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks. Our code is available at \url{https://github.com/chenhaoxing/CPR}.
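The two strategies in the abstract (fusing textual and visual prototypes, then refining them with unlabeled data) can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the fusion weight `alpha`, the confidence-based `top_k` pseudo-label selection, and the function names are all assumptions, and the sample-conditional token generation is reduced to plain cosine-similarity classification over the rectified prototypes.

```python
import numpy as np

def l2norm(x, axis=-1):
    """L2-normalize features along the last axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def rectified_prototypes(text_protos, image_feats, labels,
                         unlabeled_feats, alpha=0.5, top_k=4):
    """Hypothetical sketch of prototype rectification.

    1) Build visual prototypes as per-class means of the few-shot features.
    2) Fuse textual and visual prototypes (weight `alpha` is an assumption).
    3) Pseudo-label unlabeled features against the fused prototypes and fold
       the top-k most confident ones per class back into each prototype.
    """
    n_cls, _ = text_protos.shape
    # Step 1: visual prototypes from the labeled few-shot examples.
    vis = np.stack([image_feats[labels == c].mean(0) for c in range(n_cls)])
    # Step 2: convex fusion of normalized textual and visual prototypes.
    protos = l2norm(alpha * l2norm(text_protos) + (1 - alpha) * l2norm(vis))
    # Step 3: refine with confidently pseudo-labeled unlabeled features.
    sims = l2norm(unlabeled_feats) @ protos.T   # (n_unlabeled, n_cls)
    pseudo = sims.argmax(1)
    for c in range(n_cls):
        idx = np.where(pseudo == c)[0]
        if idx.size:
            top = idx[np.argsort(-sims[idx, c])[:top_k]]
            protos[c] = l2norm(protos[c] + unlabeled_feats[top].mean(0))
    return protos

def classify(query_feats, protos):
    """Nearest rectified prototype under cosine similarity."""
    return (l2norm(query_feats) @ protos.T).argmax(-1)
```

In the actual method, the features would come from CLIP's frozen encoders and the fused prototypes would condition learnable text tokens per sample; the sketch only conveys how bi-modal fusion and unlabeled-data refinement reduce the bias of a few-shot classifier.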