C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion (2403.14119v3)

Published 21 Mar 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: In deep learning, test-time adaptation has gained attention as a method for model fine-tuning without the need for labeled data. A prime exemplification is the recently proposed test-time prompt tuning for large-scale vision-language models such as CLIP. Unfortunately, these prompts have been mainly developed to improve accuracy, overlooking the importance of calibration, which is a crucial aspect for quantifying prediction uncertainty. However, traditional calibration methods rely on substantial amounts of labeled data, making them impractical for test-time scenarios. To this end, this paper explores calibration during test-time prompt tuning by leveraging the inherent properties of CLIP. Through a series of observations, we find that the prompt choice significantly affects the calibration in CLIP, where the prompts leading to higher text feature dispersion result in better-calibrated predictions. Introducing the Average Text Feature Dispersion (ATFD), we establish its relationship with calibration error and present a novel method, Calibrated Test-time Prompt Tuning (C-TPT), for optimizing prompts during test-time with enhanced calibration. Through extensive experiments on different CLIP architectures and datasets, we show that C-TPT can effectively improve the calibration of test-time prompt tuning without needing labeled data. The code is publicly accessible at https://github.com/hee-suk-yoon/C-TPT.
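
The abstract describes ATFD and the C-TPT objective only at a high level. The sketch below is a rough illustration, not the authors' exact implementation: it assumes ATFD is the mean L2 distance of the per-class text embeddings from their centroid, and that C-TPT subtracts a λ-weighted ATFD term from a TPT-style entropy objective computed over augmented views. The function names, the default λ value, and the simplified averaging over all views (rather than a confidence-filtered subset) are assumptions made for illustration.

```python
import torch

def average_text_feature_dispersion(text_features: torch.Tensor) -> torch.Tensor:
    """ATFD sketch: mean L2 distance of each class text embedding from the centroid.

    text_features: (num_classes, dim) CLIP text embeddings produced from the
    tuned prompt combined with each class name.
    """
    centroid = text_features.mean(dim=0, keepdim=True)        # (1, dim)
    return (text_features - centroid).norm(dim=-1).mean()     # scalar dispersion

def c_tpt_loss(image_features: torch.Tensor,
               text_features: torch.Tensor,
               lambda_dispersion: float = 50.0) -> torch.Tensor:
    """Illustrative test-time objective: entropy of the averaged prediction
    (as in test-time prompt tuning) minus a dispersion reward, so that
    minimizing the loss also spreads the text features apart.

    image_features: (num_views, dim) embeddings of augmented views of one image.
    lambda_dispersion: illustrative weight, not the paper's tuned value.
    """
    logits = 100.0 * image_features @ text_features.t()       # CLIP-style scaled logits
    probs = logits.softmax(dim=-1).mean(dim=0)                 # average over views (simplified)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    atfd = average_text_feature_dispersion(text_features)
    return entropy - lambda_dispersion * atfd
```

In use, only the prompt token embeddings would be optimized against this loss for a single test sample, so higher-ATFD (and hence, per the paper's observation, better-calibrated) prompts are favored without any labeled data.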

Authors (6)
  1. Hee Suk Yoon (15 papers)
  2. Eunseop Yoon (14 papers)
  3. Joshua Tian Jin Tee (6 papers)
  4. Mark Hasegawa-Johnson (62 papers)
  5. Yingzhen Li (60 papers)
  6. Chang D. Yoo (78 papers)
Citations (8)