Training-Free Unsupervised Prompt for Vision-Language Models (2404.16339v1)

Published 25 Apr 2024 in cs.CV and cs.AI

Abstract: Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples as the weight of the corresponding cached label to generate similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Based on our TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging DomainNet dataset.
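
The sketch below is a rough illustration of the training-free recipe the abstract outlines: build a feature cache from confidently pseudo-labeled samples, score a test image against the cache, and fuse the cache-based logits with the zero-shot logits through a residual connection. It is not the authors' exact formulation: the selection rule here uses instance confidence only (the paper also uses prototype scores), only feature-level similarity is computed (the paper's MSM additionally uses semantic-level similarity), the marginal-entropy term is one plausible form of the loss the abstract names, and all hyperparameter names (top_k, beta, alpha) are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def build_feature_cache(features, zeroshot_logits, top_k=8):
    """Fill a feature cache with the most confident samples per pseudo-class.

    features:        (N, D) image embeddings from the frozen VLM image encoder.
    zeroshot_logits: (N, C) image-text similarity logits used as pseudo supervision.
    Returns cache keys (M, D) and one-hot cache values (M, C).
    """
    probs = softmax(zeroshot_logits)
    pseudo = probs.argmax(axis=1)      # pseudo-labels from the zero-shot prediction
    conf = probs.max(axis=1)           # instance confidence
    num_classes = zeroshot_logits.shape[1]
    keys, values = [], []
    for c in range(num_classes):
        idx = np.where(pseudo == c)[0]
        if idx.size == 0:
            continue
        keep = idx[np.argsort(conf[idx])[::-1][:top_k]]   # top-k most confident samples
        keys.append(features[keep])
        onehot = np.zeros((keep.size, num_classes))
        onehot[:, c] = 1.0
        values.append(onehot)
    return np.concatenate(keys), np.concatenate(values)

def training_free_predict(test_feat, zeroshot_logits, cache_keys, cache_values,
                          beta=5.0, alpha=1.0):
    """Cache-based logits fused with zero-shot logits via a residual connection.

    test_feat: (D,) embedding of one test image; zeroshot_logits: (C,) its zero-shot logits.
    """
    test_feat = l2_normalize(test_feat)
    cache_keys = l2_normalize(cache_keys)
    sim = cache_keys @ test_feat                   # (M,) feature-level cosine similarities
    weights = np.exp(-beta * (1.0 - sim))          # sharpened affinities to cached samples
    cache_logits = weights @ cache_values          # (C,) weighted vote of cached labels
    return cache_logits + alpha * zeroshot_logits  # residual to the zero-shot prediction

def marginal_entropy_loss(batch_probs, eps=1e-8):
    """Negative entropy of the batch-averaged prediction (a global, class-balance term).

    Minimizing this loss maximizes the entropy of the marginal distribution,
    discouraging the adapted model from collapsing onto a few classes.
    """
    marginal = batch_probs.mean(axis=0)
    return float(np.sum(marginal * np.log(marginal + eps)))
```

In the training-free setting this fusion involves no gradient updates at all; a TFUP-T-style variant would additionally fine-tune the prompts with cross-entropy on pseudo-labels plus a global term such as the marginal-entropy loss sketched above.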

References (46)
  1. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks. IEEE, 2020.
  2. ReZero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
  3. UNITER: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, pages 104–120. Springer, 2020.
  4. Pseudo-labeling curriculum for unsupervised domain adaptation. arXiv preprint arXiv:1908.00262, 2019.
  5. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  6. An empirical study of training end-to-end vision-and-language transformers. In CVPR, pages 18166–18176, 2022.
  7. CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  8. SoftCLIP: Softer cross-modal alignment makes CLIP stronger. arXiv preprint arXiv:2303.17561, 2023.
  9. Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788, 2019.
  10. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  11. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  12. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  13. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.
  14. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  15. Self-training for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020.
  16. Discriminative clustering by regularized information maximization. Advances in neural information processing systems, 23, 2010.
  17. Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
  18. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
  19. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
  20. Scaling language-image pre-training via masking. In CVPR, pages 23390–23400, 2023.
  21. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
  22. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
  23. GraphPrompt: Unifying pre-training and downstream tasks for graph neural networks. arXiv preprint arXiv:2302.08043, 2023.
  24. Effective self-training for parsing. In Proceedings of the human language technology conference of the NAACL, main conference, pages 152–159, 2006.
  25. Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
  26. Sree Hari Krishnan Parthasarathi and Nikko Strom. Lessons from building acoustic models with a million hours of speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6670–6674. IEEE, 2019.
  27. Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
  28. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
  29. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  30. Semi-supervised self-training of object detection models. Workshops on Applications of Computer Vision, 2005.
  31. Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
  32. Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. arXiv preprint arXiv:1206.6438, 2012.
  33. FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  34. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
  35. POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models. In International Conference on Machine Learning, pages 33816–33832. PMLR, 2023.
  36. Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
  37. Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
  38. End-to-end semi-supervised object detection with soft teacher. In ICCV, pages 3060–3069, 2021.
  39. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.
  40. Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
  41. Instance-specific and model-adaptive supervision for semi-supervised semantic segmentation. In CVPR, pages 23705–23714, 2023.
  42. Augmentation matters: A simple-yet-effective approach to semi-supervised semantic segmentation. In CVPR, pages 11350–11359, 2023.
  43. Entropy-based optimization on individual and global predictions for semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8346–8355, 2023.
  44. LaSSL: Label-guided self-training for semi-supervised learning. In AAAI, volume 36, pages 9208–9216, 2022.
  45. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
  46. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
Authors (7)
  1. Sifan Long (11 papers)
  2. Linbin Wang (3 papers)
  3. Zhen Zhao (85 papers)
  4. Zichang Tan (25 papers)
  5. Yiming Wu (31 papers)
  6. Shengsheng Wang (14 papers)
  7. Jingdong Wang (236 papers)

