Training-Free Unsupervised Prompt for Vision-Language Models (2404.16339v1)
Abstract: Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods such as UPL and POUF directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples; these distances weight the corresponding cached labels to generate similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Based on TFUP, we propose a training-based approach (TFUP-T) to further boost adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging DomainNet dataset.
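To make the pipeline described in the abstract concrete, below is a minimal, illustrative Python/PyTorch sketch of the training-free inference (cache-based, similarity-weighted prediction added residually to the zero-shot prediction) and of a marginal-entropy regularizer of the kind TFUP-T adds on top of cross-entropy. This is not the authors' code: the exponential affinity weighting, the equal mixing of feature-level and semantic-level similarities, the hyperparameters `beta` and `alpha`, and all function names are assumptions chosen for readability.

```python
import torch
import torch.nn.functional as F


def tfup_inference(image_feat, text_feats, cache_keys, cache_text_feats,
                   cache_values, beta=5.0, alpha=1.0):
    """Training-free prediction for a batch of test images.

    image_feat:       (B, D) L2-normalized image features of the test batch
    text_feats:       (C, D) L2-normalized text features, one per class
    cache_keys:       (N, D) L2-normalized image features of cached samples
    cache_text_feats: (N, D) text features of each cached sample's pseudo-label
    cache_values:     (N, C) one-hot pseudo-labels of the cached samples
    """
    # Zero-shot prediction from the frozen VLM (inherent representation preserved).
    zero_shot_logits = 100.0 * image_feat @ text_feats.t()          # (B, C)

    # Feature-level similarity: test image vs. cached image features.
    feat_sim = image_feat @ cache_keys.t()                          # (B, N)
    # Semantic-level similarity: test image vs. the text feature of each cached pseudo-label.
    sem_sim = image_feat @ cache_text_feats.t()                     # (B, N)

    # Multi-level similarity (equal mixing here), sharpened into cache weights.
    affinity = 0.5 * (feat_sim + sem_sim)
    weights = torch.exp(-beta * (1.0 - affinity))                   # (B, N)

    # Similarity-based prediction: weighted vote over cached pseudo-labels.
    cache_logits = weights @ cache_values                           # (B, C)

    # Residual connection: cache prediction added on top of the zero-shot prediction.
    return zero_shot_logits + alpha * cache_logits


def marginal_entropy_loss(logits):
    """Negative entropy of the batch-averaged class distribution.

    Minimizing this term maximizes the marginal entropy, i.e. it discourages
    the model from collapsing all predictions onto a few classes.
    """
    marginal = logits.softmax(dim=-1).mean(dim=0)                   # (C,)
    return (marginal * torch.log(marginal + 1e-8)).sum()


if __name__ == "__main__":
    torch.manual_seed(0)
    B, N, C, D = 4, 16, 10, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    txt = F.normalize(torch.randn(C, D), dim=-1)
    keys = F.normalize(torch.randn(N, D), dim=-1)
    pseudo = torch.randint(0, C, (N,))
    values = F.one_hot(pseudo, C).float()
    cache_txt = txt[pseudo]                  # text feature of each cached pseudo-label
    logits = tfup_inference(img, txt, keys, cache_txt, values)
    print(logits.shape, marginal_entropy_loss(logits).item())
```

Note that with `alpha = 0` (or an empty cache) the prediction reduces to plain zero-shot CLIP, which is how the residual formulation avoids degrading the frozen model's representation; the marginal-entropy term is written so that minimizing it spreads the batch-averaged prediction across classes, constraining the model from a global perspective.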
- Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks. IEEE, 2020.
- ReZero is all you need: Fast convergence at large depth. arXiv preprint arXiv:2003.04887, 2020.
- UNITER: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, pages 104–120. Springer, 2020.
- Pseudo-labeling curriculum for unsupervised domain adaptation. arXiv preprint arXiv:1908.00262, 2019.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An empirical study of training end-to-end vision-and-language transformers. In CVPR, pages 18166–18176, 2022.
- CLIP-Adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
- SoftCLIP: Softer cross-modal alignment makes CLIP stronger. arXiv preprint arXiv:2303.17561, 2023.
- Revisiting self-training for neural sequence generation. arXiv preprint arXiv:1909.13788, 2019.
- Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
- LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649, 2022.
- Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- Self-training for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7084–7088. IEEE, 2020.
- Discriminative clustering by regularized information maximization. Advances in neural information processing systems, 23, 2010.
- Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896. Atlanta, 2013.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
- Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
- Scaling language-image pre-training via masking. In CVPR, pages 23390–23400, 2023.
- Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
- Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pages 6028–6039. PMLR, 2020.
- GraphPrompt: Unifying pre-training and downstream tasks for graph neural networks. arXiv preprint arXiv:2302.08043, 2023.
- Effective self-training for parsing. In Proceedings of the human language technology conference of the NAACL, main conference, pages 152–159, 2006.
- Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
- Sree Hari Krishnan Parthasarathi and Nikko Strom. Lessons from building acoustic models with a million hours of speech. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6670–6674. IEEE, 2019.
- Moment matching for multi-source domain adaptation. In ICCV, pages 1406–1415, 2019.
- VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Semi-supervised self-training of object detection models. Workshops on Applications of Computer Vision, 2005.
- Adapting visual category models to new domains. In ECCV, pages 213–226. Springer, 2010.
- Information-theoretical learning of discriminative clusters for unsupervised domain adaptation. arXiv preprint arXiv:1206.6438, 2012.
- FixMatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
- A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
- POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models. In International Conference on Machine Learning, pages 33816–33832. PMLR, 2023.
- Deep hashing network for unsupervised domain adaptation. In CVPR, pages 5018–5027, 2017.
- Tent: Fully test-time adaptation by entropy minimization. arXiv preprint arXiv:2006.10726, 2020.
- End-to-end semi-supervised object detection with soft teacher. In ICCV, pages 3060–3069, 2021.
- Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.
- Tip-Adapter: Training-free CLIP-Adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930, 2021.
- Instance-specific and model-adaptive supervision for semi-supervised semantic segmentation. In CVPR, pages 23705–23714, 2023.
- Augmentation matters: A simple-yet-effective approach to semi-supervised semantic segmentation. In CVPR, pages 11350–11359, 2023.
- Entropy-based optimization on individual and global predictions for semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8346–8355, 2023.
- LaSSL: Label-guided self-training for semi-supervised learning. In AAAI, volume 36, pages 9208–9216, 2022.
- Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.
Authors: Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, Jingdong Wang