Few-shot Adaptation of Multi-modal Foundation Models: A Survey (2401.01736v2)

Published 3 Jan 2024 in cs.CV

Abstract: Multi-modal (vision-language) models, such as CLIP, are replacing traditional supervised pre-training models (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models, with robust and aligned semantic representations learned from billions of internet image-text pairs, can be applied to various downstream tasks in a zero-shot manner. However, in some fine-grained domains like medical imaging and remote sensing, the performance of multi-modal foundation models often leaves much to be desired. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, gradually deriving three main technical approaches: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced numerous results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey, we introduce and analyze the research advancements in few-shot adaptation methods for multi-modal models, summarize commonly used datasets and experimental setups, and compare the results of different methods. In addition, due to the lack of reliable theoretical support for existing methods, we derive the few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions from the following aspects: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.
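As a concrete illustration of the second of these families (adapter-based methods), the sketch below trains a small residual feature adapter on top of frozen CLIP-style features, in the spirit of CLIP-Adapter. It is a minimal sketch under stated assumptions, not an implementation from the survey: the feature dimension, class count, shot count, blending ratio, and the randomly generated stand-in features are placeholders for the outputs of a real frozen encoder.

```python
# Minimal adapter-based few-shot adaptation sketch (CLIP-Adapter style).
# Assumes image/text features were already extracted by a frozen CLIP-like
# encoder; all sizes and the random features below are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, num_classes, shots = 512, 10, 16              # assumed K-shot setting

# Stand-ins for frozen, pre-extracted CLIP features.
text_weights = F.normalize(torch.randn(num_classes, dim), dim=-1)       # class prototypes
image_feats  = F.normalize(torch.randn(num_classes * shots, dim), dim=-1)
labels       = torch.arange(num_classes).repeat_interleave(shots)       # class i -> its K shots

class Adapter(nn.Module):
    """Bottleneck MLP blended with the original feature via a residual ratio alpha."""
    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual blend keeps the adapted feature close to the zero-shot one.
        return F.normalize(self.alpha * self.net(x) + (1 - self.alpha) * x, dim=-1)

adapter = Adapter(dim)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)
logit_scale = 100.0                                 # CLIP-style temperature

for step in range(100):                             # few-shot training loop
    logits = logit_scale * adapter(image_feats) @ text_weights.t()
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Prompt-based methods differ mainly in where the trainable parameters sit: approaches in the CoOp line learn continuous context vectors fed to the frozen text encoder rather than an adapter on the image features, while external knowledge-based methods enrich the class prototypes with generated descriptions or knowledge bases before the same cosine-similarity classification step.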

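The theorem itself is only summarized in the abstract, so the display below is a schematic of the family such bounds typically belong to (PAC-Bayes / domain-adaptation style), not the paper's exact statement; the symbols \(\mathcal{D}_S\), \(\mathcal{D}_T\), \(\mathcal{H}\), \(n\), and \(\delta\) are assumed notation for the source and target distributions, hypothesis class, sample size, and confidence level. It merely makes the three factors named above visible as separate terms.

```latex
% Schematic only -- not the paper's exact theorem.
% Target risk is bounded by the few-shot empirical risk plus a
% source/target divergence (domain gap) and a capacity term that
% shrinks with the number of labeled adaptation samples n.
\[
  \mathbb{E}_{(x,y)\sim\mathcal{D}_T}\bigl[\ell(h(x),y)\bigr]
  \;\lesssim\;
  \underbrace{\hat{\mathbb{E}}_{S}\bigl[\ell(h(x),y)\bigr]}_{\text{few-shot empirical risk}}
  \;+\;
  \underbrace{d(\mathcal{D}_S,\mathcal{D}_T)}_{\text{domain gap}}
  \;+\;
  \underbrace{\sqrt{\frac{\mathcal{C}(\mathcal{H})+\log(1/\delta)}{n}}}_{\text{capacity vs. sample size}}
\]
```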
Authors (6)
  1. Fan Liu (244 papers)
  2. Tianshu Zhang (12 papers)
  3. Wenwen Dai (1 paper)
  4. Delong Chen (24 papers)
  5. Wenwen Cai (2 papers)
  6. Xiaocong Zhou (4 papers)
Citations (14)