Efficient Test-Time Adaptation of Vision-Language Models (2403.18293v1)

Published 27 Mar 2024 in cs.CV

Abstract: Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts at test time. Although prior studies have achieved very promising performance, they involve intensive computation, which is severely misaligned with the goal of test-time adaptation. We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue, with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache, TDA adapts to test data gradually via progressive pseudo-label refinement, which is highly efficient and incurs no backpropagation. In addition, we introduce negative pseudo labeling, which alleviates the adverse impact of pseudo-label noise by assigning pseudo labels to certain negative classes when the model is uncertain about its predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency compared with the state of the art. The code has been released at \url{https://kdiaaa.github.io/tda/}.
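The cache mechanism the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the paper's exact formulation: the class name, the per-class queue capacity, the entropy-based eviction rule, and the exponential affinity used to aggregate cache logits are all assumptions chosen to mirror the described behavior (features as keys, pseudo labels as values, progressive refinement with no backpropagation).

```python
import numpy as np

class DynamicAdapterCache:
    """Sketch of a TDA-style training-free key-value cache.

    Keys are test-sample features, values are pseudo labels. Each class
    keeps a small queue of its most confident (lowest-entropy) entries,
    so pseudo labels are progressively refined as test samples stream in.
    All parameter names and the affinity function are illustrative
    assumptions, not the paper's exact design.
    """

    def __init__(self, num_classes, capacity_per_class=3, beta=5.0):
        self.num_classes = num_classes
        self.capacity = capacity_per_class
        self.beta = beta  # sharpness of the cache affinity
        # per-class list of (entropy, feature) pairs
        self.queues = {c: [] for c in range(num_classes)}

    def update(self, feature, probs):
        """Insert a test feature under its pseudo label; when the class
        queue overflows, evict the least confident (highest-entropy) entry."""
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        label = int(np.argmax(probs))
        q = self.queues[label]
        q.append((entropy, feature))
        q.sort(key=lambda e: e[0])  # most confident first
        del q[self.capacity:]       # drop overflow, i.e. least confident

    def logits(self, feature):
        """Cache logits for a query: similarity-weighted votes from the
        stored keys of each class (features assumed L2-normalized)."""
        out = np.zeros(self.num_classes)
        for c, q in self.queues.items():
            for _, key in q:
                sim = float(feature @ key)
                out[c] += np.exp(-self.beta * (1.0 - sim))
        return out
```

At inference, these cache logits would be combined with the frozen model's zero-shot logits; since updates only append to and sort small queues, the whole procedure stays training-free.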
