iVPT: Improving Task-relevant Information Sharing in Visual Prompt Tuning by Cross-layer Dynamic Connection (2404.05207v1)

Published 8 Apr 2024 in cs.CV

Abstract: Recent progress has shown the great potential of visual prompt tuning (VPT) for adapting pre-trained vision transformers to various downstream tasks. However, most existing solutions optimize prompts at each layer independently, thereby neglecting task-relevant information encoded in prompt tokens across layers. Additionally, existing prompt structures are prone to interference from task-irrelevant noise in input images, which can harm the sharing of task-relevant information. In this paper, we propose a novel VPT approach, iVPT. It incorporates a cross-layer dynamic connection (CDC) between input prompt tokens of adjacent layers, enabling effective sharing of task-relevant information. Furthermore, we design a dynamic aggregation (DA) module that facilitates selective information sharing between layers. The combination of CDC and DA enhances the flexibility of the attention process within the VPT framework. Building on these foundations, iVPT introduces an attentive reinforcement (AR) mechanism that automatically identifies salient image tokens and enhances them with prompt tokens in an additive manner. Extensive experiments on 24 image classification and semantic segmentation benchmarks clearly demonstrate the advantage of the proposed iVPT over state-of-the-art counterparts.
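To make the abstract's components concrete, below is a minimal PyTorch sketch of the CDC/DA and AR ideas. All names, shapes, the scalar gating form, and the top-k salience rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class CrossLayerDynamicPrompt(nn.Module):
    """Sketch of the cross-layer dynamic connection (CDC) idea:
    instead of optimizing each layer's prompts independently, the
    input prompts of layer l mix in the prompt outputs of layer l-1
    through a learned, input-dependent gate (dynamic aggregation, DA).
    The gating form here is a plausible stand-in, not the paper's own."""

    def __init__(self, num_prompts: int, dim: int):
        super().__init__()
        # Fresh learnable prompts for this layer, as in standard VPT.
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        # DA: a per-token scalar gate deciding how much of the previous
        # layer's prompt output to carry forward.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, prev_prompt_out: torch.Tensor) -> torch.Tensor:
        # prev_prompt_out: (B, P, D) prompt tokens output by layer l-1.
        g = self.gate(prev_prompt_out)                         # (B, P, 1)
        return g * prev_prompt_out + (1.0 - g) * self.prompts  # (B, P, D)


def attentive_reinforcement(image_tokens: torch.Tensor,
                            prompt_tokens: torch.Tensor,
                            attn: torch.Tensor,
                            top_k: int = 8) -> torch.Tensor:
    """Sketch of the attentive reinforcement (AR) idea: identify the
    image tokens that receive the most attention and enhance them with
    prompt information additively.

    image_tokens:  (B, N, D)
    prompt_tokens: (B, P, D)
    attn:          (B, heads, N, N) self-attention weights over image tokens
    """
    # Salience of a token = attention mass it receives, averaged over
    # heads and query positions.
    salience = attn.mean(dim=1).mean(dim=1)                    # (B, N)
    idx = salience.topk(top_k, dim=-1).indices                 # (B, k)
    # Additive enhancement with the mean prompt token (an assumption;
    # the paper may use a different prompt-to-token mapping).
    boost = prompt_tokens.mean(dim=1, keepdim=True)            # (B, 1, D)
    out = image_tokens.clone()
    out.scatter_add_(
        1,
        idx.unsqueeze(-1).expand(-1, -1, image_tokens.size(-1)),
        boost.expand(-1, top_k, -1).contiguous(),
    )
    return out
```

In a full model, one such connection module would sit in front of each transformer layer; the backbone stays frozen and only the prompts and the small gating nets are trained, which is where the parameter efficiency of VPT-style methods comes from.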

Authors (3)
  1. Nan Zhou
  2. Jiaxin Chen
  3. Di Huang
