Few-shot Adaptation of Multi-modal Foundation Models: A Survey (2401.01736v2)
Abstract: Multi-modal (vision-language) models such as CLIP are replacing traditional supervised pre-training (e.g., ImageNet-based pre-training) as the new generation of visual foundation models. These models learn robust, aligned semantic representations from billions of internet image-text pairs and can be applied to a variety of downstream tasks in a zero-shot manner. However, in fine-grained domains such as medical imaging and remote sensing, their performance often falls short. Consequently, many researchers have begun to explore few-shot adaptation methods for these models, and three main technical approaches have gradually emerged: 1) prompt-based methods, 2) adapter-based methods, and 3) external knowledge-based methods. Nevertheless, this rapidly developing field has produced a large body of results without a comprehensive survey to systematically organize the research progress. Therefore, in this survey we introduce and analyze research advances in few-shot adaptation methods for multi-modal models, summarize commonly used datasets and experimental setups, and compare the results of different methods. In addition, because existing methods lack reliable theoretical support, we derive a few-shot adaptation generalization error bound for multi-modal models. The theorem reveals that the generalization error of multi-modal foundation models is constrained by three factors: domain gap, model capacity, and sample size. Based on this, we propose three possible solutions: 1) adaptive domain generalization, 2) adaptive model selection, and 3) adaptive knowledge utilization.
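To make the adapter-based family of methods concrete, the sketch below illustrates the general idea behind CLIP-Adapter-style few-shot adaptation: a small residual MLP is trained on top of frozen image features while the backbone and the class text embeddings stay fixed, so only a handful of parameters are updated from the few labeled shots. This is a minimal sketch, not the authors' implementation; the feature dimension, the blending ratio `alpha`, and the use of random tensors in place of real CLIP encoder outputs are illustrative assumptions made so the example stays self-contained.

```python
# Minimal sketch of adapter-based few-shot adaptation (CLIP-Adapter style).
# Assumption: image/text features would come from a frozen CLIP encoder;
# random tensors stand in for them here so the example runs on its own.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Small bottleneck MLP trained on top of frozen image features."""
    def __init__(self, dim: int = 512, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
        # Residual blend: keep most of the pre-trained feature, adapt a little.
        return alpha * self.fc(x) + (1.0 - alpha) * x

num_classes, dim, shots = 10, 512, 16
# Frozen class text embeddings and few-shot support image features (placeholders).
text_feats = F.normalize(torch.randn(num_classes, dim), dim=-1)
image_feats = F.normalize(torch.randn(shots * num_classes, dim), dim=-1)
labels = torch.arange(num_classes).repeat_interleave(shots)

adapter = FeatureAdapter(dim)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

for _ in range(100):  # only the adapter's parameters receive gradients
    adapted = F.normalize(adapter(image_feats), dim=-1)
    logits = 100.0 * adapted @ text_feats.t()  # cosine similarity with temperature
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Prompt-based methods follow the same parameter-efficient spirit but instead optimize a small set of learnable context tokens on the text side, leaving both encoders frozen.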
Authors: Fan Liu, Tianshu Zhang, Wenwen Dai, Delong Chen, Wenwen Cai, Xiaocong Zhou