AdaViPro: Region-based Adaptive Visual Prompt for Large-Scale Models Adapting (2403.13282v2)
Abstract: Recently, prompt-based methods have emerged as a new 'parameter-efficient fine-tuning' paradigm, which fine-tunes only a small number of additional parameters while keeping the original model frozen. However, despite achieving notable results, existing prompt methods mainly focus on 'what to add', while overlooking the equally important aspect of 'where to add', typically relying on manually crafted placement. To this end, we propose a region-based Adaptive Visual Prompt, named AdaViPro, which integrates the 'where to add' optimization of the prompt into the learning process. Specifically, we reconceptualize the 'where to add' optimization as a problem of regional decision-making. During inference, AdaViPro generates a regionalized mask map for the whole image, composed of 0s and 1s, that designates whether to apply or discard the prompt in each specific area. Since this per-region decision is discrete, we employ Gumbel-Softmax sampling to enable AdaViPro's end-to-end learning through standard back-propagation. Extensive experiments demonstrate that AdaViPro yields new efficiency and accuracy trade-offs for adapting pre-trained models.
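To make the region-gating idea concrete, below is a minimal sketch of region-wise prompt gating with Gumbel-Softmax, written against standard PyTorch. All class and parameter names (e.g. `RegionPromptGate`, `grid_size`, `tau`) are hypothetical illustrations, not the authors' released implementation; the sketch only assumes the abstract's description of a 0/1 regional mask sampled differentiably and a learnable prompt applied where the mask is 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch (not the authors' code): a lightweight policy network scores
# each image region; Gumbel-Softmax yields a near-binary apply/discard decision per
# region that remains differentiable, so the mask generator can be trained end-to-end
# while the pre-trained backbone stays frozen.

class RegionPromptGate(nn.Module):
    def __init__(self, grid_size=14, patch=16, tau=1.0):
        super().__init__()
        self.grid_size = grid_size  # regions per side (e.g. 14x14 for a 224px image)
        self.patch = patch          # pixels per region side
        self.tau = tau              # Gumbel-Softmax temperature
        # Toy policy network: one feature vector per region, then 2 logits (discard / apply).
        self.policy = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=patch, stride=patch),
            nn.ReLU(),
            nn.Conv2d(32, 2, kernel_size=1),
        )
        # Learnable visual prompt, added only inside the selected regions.
        self.prompt = nn.Parameter(torch.zeros(1, 3, grid_size * patch, grid_size * patch))

    def forward(self, x):
        logits = self.policy(x)                  # (B, 2, G, G)
        logits = logits.permute(0, 2, 3, 1)      # (B, G, G, 2)
        # Differentiable ~binary decision per region; hard=True gives 0/1 in the forward pass
        # while gradients flow through the soft sample (straight-through estimator).
        decision = F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1]  # (B, G, G)
        # Upsample the region-level mask to pixel resolution and apply the prompt where it is 1.
        mask = decision.unsqueeze(1)             # (B, 1, G, G)
        mask = F.interpolate(mask, scale_factor=self.patch, mode="nearest")
        return x + mask * self.prompt

# Usage: prompt the input, then feed it to a frozen pre-trained ViT classifier.
gate = RegionPromptGate()
images = torch.randn(4, 3, 224, 224)
prompted = gate(images)                          # same shape as the input images
```

In this kind of setup, only the gate's policy network and the prompt parameters would receive gradients; the backbone's weights remain untouched, which is what makes the approach parameter-efficient.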