Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting (2410.19294v1)
Abstract: Vision-language models, such as CLIP, have shown impressive generalization capabilities when paired with appropriate text descriptions. While optimizing prompts on downstream labeled data has proven effective in improving performance, these methods entail labor costs for annotations and are limited by the quality of those annotations. Additionally, since CLIP is pre-trained on highly imbalanced Web-scale data, it suffers from inherent label bias that leads to suboptimal performance. To tackle the above challenges, we propose a label-Free prompt distribution learning and bias correction framework, dubbed Frolic, which boosts zero-shot performance without the need for labeled data. Specifically, Frolic learns distributions over prompt prototypes to capture diverse visual representations and adaptively fuses these with the original CLIP through confidence matching. The fused model is further enhanced by correcting label bias via a label-free logit adjustment. Notably, our method is not only training-free but also free of hyper-parameter tuning. Extensive experiments across 16 datasets demonstrate the efficacy of our approach, which outperforms the state-of-the-art by an average of $2.6\%$ on 10 datasets and by an average margin of $1.5\%$ on ImageNet and its five distribution shifts, both with CLIP ViT-B/16. Code is available at https://github.com/zhuhsingyuu/Frolic.
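To make the abstract's three ingredients concrete (distributions over prompt prototypes, fusion via confidence matching, and label-free logit adjustment), here is a minimal Python sketch of how such a pipeline could be wired together. All names are hypothetical, and several details are assumptions rather than the paper's exact method: the prompt distributions are modeled as per-class Gaussians with a shared covariance, confidence matching is read as a temperature search that aligns mean top-1 confidence, and the label prior is estimated from the model's own average predictions on unlabeled test images.

```python
# Sketch only: names and estimators are assumptions inferred from the
# abstract, not Frolic's actual implementation. Features are assumed
# L2-normalized CLIP embeddings.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gaussian_prototype_logits(x, means, cov_inv):
    """Score images against per-class Gaussians fit to prompt embeddings.

    x:       (n, d) image features
    means:   (c, d) per-class means over prompt-template embeddings
    cov_inv: (d, d) inverse of a shared covariance (linear-discriminant form)
    """
    w = means @ cov_inv                              # (c, d)
    b = -0.5 * np.einsum('cd,cd->c', w, means)       # -0.5 * mu^T Sigma^-1 mu
    return x @ w.T + b                               # (n, c)

def match_confidence(logits_a, logits_b):
    """Rescale logits_a so its mean top-1 confidence matches logits_b.

    A simple temperature grid search; one plausible reading of the
    abstract's "confidence matching", not necessarily the paper's rule.
    """
    target = softmax(logits_b).max(axis=1).mean()
    temps = np.linspace(0.01, 10.0, 500)
    confs = np.array([softmax(logits_a / t).max(axis=1).mean() for t in temps])
    return logits_a / temps[np.argmin(np.abs(confs - target))]

def label_free_logit_adjustment(logits):
    """Debias by subtracting the log of a class prior estimated from the
    model's own average predictions on unlabeled data (no labels used)."""
    prior = softmax(logits).mean(axis=0)
    return logits - np.log(prior + 1e-12)

def frolic_predict(img_feats, clip_text_feats, proto_means, cov_inv):
    zero_shot = 100.0 * img_feats @ clip_text_feats.T  # standard CLIP logits
    proto = gaussian_prototype_logits(img_feats, proto_means, cov_inv)
    fused = zero_shot + match_confidence(proto, zero_shot)
    return label_free_logit_adjustment(fused).argmax(axis=1)
```

The summing of the two logit streams after confidence matching is one simple fusion choice; the paper's adaptive fusion may weight the two models differently.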