Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models (2305.18010v2)
Abstract: One fascinating aspect of pre-trained vision-language models (VLMs) learned under language supervision is their impressive zero-shot generalization capability. However, this ability is hindered by distribution shifts between the training and testing data. Previous test-time adaptation (TTA) methods for VLMs in zero-shot classification rely on minimizing the entropy of model outputs, which tends to get stuck in incorrect model predictions. In this work, we propose TTA with feedback to rectify the model output and prevent the model from becoming blindly confident. Specifically, a CLIP model is adopted as the reward model during TTA and provides feedback to the VLM. Given a single test sample, the VLM is forced to maximize the CLIP reward between the input and results sampled from the VLM's output distribution. The proposed *reinforcement learning with CLIP feedback (RLCF)* framework is highly flexible and universal. Beyond classification, with task-specific sampling strategies and a proper choice of reward baseline, RLCF extends readily not only to discrimination tasks such as retrieval but also to generation tasks such as image captioning. According to the characteristics of these vision-language tasks, we build different fully-TTA pipelines with RLCF to improve the zero-shot generalization ability of various VLMs. Extensive experiments with promising empirical results demonstrate the effectiveness of RLCF. The code is available at https://github.com/mzhaoshuai/RLCF.
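The core loop described in the abstract is small enough to sketch. Below is a minimal, hypothetical PyTorch illustration of the RLCF idea for single-image zero-shot classification, not the authors' actual implementation: `policy` is the VLM being adapted and `reward_clip` is a frozen CLIP reward model, both assumed to expose `encode_image`/`encode_text` methods; `tokenize`, the sample count `k`, the temperature `tau`, and the learning rate are all illustrative choices.

```python
import copy
import torch
import torch.nn.functional as F

def clip_scores(model, image, text_feats):
    """Cosine similarity between one image and a bank of text features."""
    img = F.normalize(model.encode_image(image), dim=-1)   # (1, D)
    return img @ text_feats.T                              # (1, C)

@torch.no_grad()
def encode_prompts(model, prompts, tokenize):
    """Encode a list of prompt strings; treated as constants (no grad)."""
    return F.normalize(model.encode_text(tokenize(prompts)), dim=-1)

def rlcf_step(policy, reward_clip, image, prompts, tokenize,
              optimizer, k=3, tau=0.01):
    """One REINFORCE update on a single test image (classification)."""
    # Policy distribution over class prompts. Text features carry no
    # gradient here, so adaptation flows through the image encoder only.
    text_feats = encode_prompts(policy, prompts, tokenize)
    logits = clip_scores(policy, image, text_feats) / tau  # (1, C)
    log_probs = logits.log_softmax(dim=-1)

    # Sample k candidate labels from the policy's output distribution.
    idx = torch.multinomial(log_probs.exp().squeeze(0), k)

    # Frozen CLIP reward for each sample; the mean reward over the k
    # samples serves as the reward baseline, as in self-critical training.
    with torch.no_grad():
        cand = [prompts[i] for i in idx.tolist()]
        r_feats = encode_prompts(reward_clip, cand, tokenize)
        rewards = clip_scores(reward_clip, image, r_feats).squeeze(0)
        advantage = rewards - rewards.mean()

    # REINFORCE: raise the log-probability of above-baseline samples.
    loss = -(advantage * log_probs.squeeze(0)[idx]).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.argmax(dim=-1)  # prediction before this update

# Episodic ("fully TTA") usage: restore the original weights before each
# test image so adaptation never leaks across the test stream.
# init_state = copy.deepcopy(policy.state_dict())
# for image in test_stream:
#     policy.load_state_dict(init_state)
#     opt = torch.optim.AdamW(policy.parameters(), lr=1e-3)
#     pred = rlcf_step(policy, reward_clip, image, class_prompts,
#                      tokenize, opt)
```

The two ingredients the abstract highlights are both visible here: candidate labels are sampled from the policy's own output distribution, and the mean CLIP reward over the samples acts as the baseline, so the update only reinforces labels the reward model scores above average. Resetting the weights before each test image keeps adaptation fully episodic.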
- nocaps: Novel object captioning at scale. In ICCV, 2019.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- A simple framework for contrastive learning of visual representations. In ICML, 2020.
- Fine-grained image captioning with CLIP reward. In Findings of NAACL, 2022.
- ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
- RLPrompt: Optimizing discrete text prompts with reinforcement learning. In EMNLP, 2022. URL https://aclanthology.org/2022.emnlp-main.222.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. URL https://openreview.net/forum?id=YicbFdNTTy.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
- On calibration of modern neural networks. In ICML, 2017.
- Reducing the teacher-student gap via spherical knowledge distillation. arXiv preprint arXiv:2010.07485, 2020.
- Deep residual learning for image recognition. In CVPR, 2016.
- The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021a.
- Natural adversarial examples. In CVPR, 2021b.
- CLIPScore: a reference-free evaluation metric for image captioning. In EMNLP, 2021.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. ACM Trans. Graph., 2022.
- Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
- Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.
- Deep reinforcement learning in computer vision: a comprehensive survey. Artificial Intelligence Review, 2022.
- Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, 2013.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback, 2023.
- DeCap: Decoding CLIP latents for zero-shot captioning via text-only training. In ICLR, 2023. URL https://openreview.net/forum?id=Lt8bMlhiwx2.
- Microsoft COCO: Common objects in context. In ECCV, 2014.
- Video test-time adaptation for action recognition. In CVPR, 2023.
- TTT++: When does self-supervised test-time training fail or thrive? In NeurIPS, 2021.
- Decoupled weight decay regularization. In ICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
- Test-time prompt tuning for zero-shot generalization in vision-language models. In NeurIPS, 2022.
- Revisiting the calibration of modern neural networks. In NeurIPS, 2021.
- ClipCap: CLIP prefix for image captioning. arXiv preprint arXiv:2111.09734, 2021.
- Efficient test-time model adaptation without forgetting. In ICML, 2022.
- Towards stable test-time adaptation in dynamic wild world. In ICLR, 2023.
- Text-only training for image captioning using noise-injected CLIP. In Findings of EMNLP, 2022.
- OpenAI. GPT-4 technical report. arXiv, 2023.
- Training language models to follow instructions with human feedback. In NeurIPS, 2022.
- Tuning computer vision models with task rewards. arXiv preprint arXiv:2302.08242, 2023.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Do ImageNet classifiers generalize to ImageNet? In ICML, 2019.
- Self-critical sequence training for image captioning. In CVPR, 2017.
- CLIP for all things zero-shot sketch-based image retrieval, fine-grained or not. In CVPR, 2023.
- Improving robustness against common corruptions by covariate shift adaptation. In NeurIPS, 2020.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Learning to summarize with human feedback. In NeurIPS, 2020.
- Language models can see: Plugging visual controls in text generation. arXiv preprint arXiv:2205.02655, 2022.
- Test-time training with self-supervision for generalization under distribution shifts. In ICML, 2020.
- Attention is all you need. In NeurIPS, 2017.
- CIDEr: Consensus-based image description evaluation. In CVPR, 2015.
- Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2021a.
- Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
- Accelerate cnns from three dimensions: A comprehensive pruning framework. In ICML, 2021b.
- Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In ICML, 2022.
- ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- Train/test-time adaptation with retrieval. In CVPR, 2023.
- MEMO: Test time robustness via adaptation and augmentation. In NeurIPS, 2022a.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022b.
- TEMPERA: Test-time prompt editing via reinforcement learning. In ICLR, 2023.
- Decoupled knowledge distillation. In CVPR, 2022.
- Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
- Conditional prompt learning for vision-language models. In CVPR, 2022.
- Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
Authors: Shuai Zhao, Xiaohan Wang, Linchao Zhu, Yi Yang