BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning (2303.14773v2)

Published 26 Mar 2023 in cs.CV, cs.AI, and cs.LG

Abstract: With the surge of large-scale pre-trained models (PTMs), fine-tuning these models for numerous downstream tasks becomes a crucial problem. Consequently, parameter-efficient transfer learning (PETL) of large models has attracted considerable attention. While recent PETL methods showcase impressive performance, they rely on optimistic assumptions: 1) the entire parameter set of a PTM is available, and 2) sufficient memory is available for fine-tuning. However, in most real-world applications, PTMs are served as black-box APIs or proprietary software without explicit parameter accessibility. Moreover, it is hard to meet the large memory requirements of modern PTMs. In this work, we propose black-box visual prompting (BlackVIP), which efficiently adapts PTMs without knowledge of their architectures or parameters. BlackVIP has two components: 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). The Coordinator designs input-dependent, image-shaped visual prompts, which improve few-shot adaptation and robustness to distribution/location shift. SPSA-GC efficiently estimates the gradient of a target model to update the Coordinator. Extensive experiments on 16 datasets demonstrate that BlackVIP enables robust adaptation to diverse domains without access to PTM parameters and with minimal memory requirements. Code: https://github.com/changdaeoh/BlackVIP

BlackVIP: An Innovation in Parameter-Efficient Transfer Learning

The proliferation of large-scale pre-trained models (PTMs) across domains necessitates efficient mechanisms for adapting them to diverse downstream tasks. Recent parameter-efficient transfer learning (PETL) methods update only a small fraction of a model's parameters, yet they still assume white-box access to the model's weights and gradients. The paper "BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning" introduces BlackVIP, a methodology designed to address these constraints, particularly in scenarios where PTMs are accessible only as black-box APIs.

Core Contributions

BlackVIP departs from traditional fine-tuning and visual prompting by showing that knowledge of the model's architecture and parameters is not required for efficient adaptation. The paper introduces two key components:

  1. Coordinator: This module generates an input-dependent, image-shaped visual prompt for each image, improving robustness under distribution and object-location shifts. It pairs a frozen, pre-trained self-supervised encoder with a lightweight, learnable decoder, so the prompt adapts to each input image, unlike prior approaches that rely on a single fixed, universal prompt (a minimal sketch follows this list).
  2. SPSA-GC (Simultaneous Perturbation Stochastic Approximation with Gradient Correction): A gradient estimation technique that relies only on forward evaluations of the target model, avoiding backpropagation entirely. This makes prompt optimization feasible when parameters are inaccessible, while substantially reducing memory requirements.
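
A minimal PyTorch sketch of such an input-dependent prompter is given below. The encoder interface, decoder shape, and the clamped additive combination of prompt and image are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class Coordinator(nn.Module):
    """Illustrative input-dependent visual prompt generator.

    A frozen self-supervised encoder maps each image to a feature
    vector; a small learnable decoder maps that vector back to an
    image-shaped prompt. Only the decoder (and a prompt-strength
    scalar) would be trained.
    """

    def __init__(self, encoder: nn.Module, feat_dim: int, img_size: int = 224):
        super().__init__()
        self.encoder = encoder.eval()               # frozen SSL backbone
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.decoder = nn.Sequential(               # lightweight decoder
            nn.Linear(feat_dim, 3 * 16 * 16),
            nn.Unflatten(1, (3, 16, 16)),
            nn.Upsample(size=(img_size, img_size), mode="bilinear",
                        align_corners=False),
            nn.Tanh(),                              # bounded prompt values
        )
        self.eps = nn.Parameter(torch.tensor(0.1))  # prompt strength

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(x)                     # (B, feat_dim) features
        prompt = self.decoder(z)                    # per-input, image-shaped
        # Additive prompt, clipped back to the valid pixel range
        return torch.clamp(x + self.eps * prompt, 0.0, 1.0)
```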

Methodological Advancements

BlackVIP's distinguishing feature is that it optimizes the prompt entirely through black-box queries via SPSA-GC, avoiding the substantial memory footprint of backpropagation. This is particularly advantageous when model parameters are inaccessible or memory is limited. The paper contrasts BlackVIP with existing strategies by emphasizing its suitability for a wide range of real-world scenarios, including those with limited computational resources; a sketch of one SPSA-GC step follows.
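
For intuition, here is a minimal NumPy sketch of one SPSA step with a Nesterov-style look-ahead standing in for the gradient correction; the constants, schedules, and exact correction rule are simplified assumptions rather than the paper's implementation:

```python
import numpy as np

def spsa_gc_step(loss_fn, phi, m, a=0.01, c=0.01, beta=0.9, rng=None):
    """One zeroth-order update of the prompt parameters (sketch).

    loss_fn : black-box objective; only two function evaluations are
              used per step, so the target model's weights and
              gradients never need to be exposed.
    phi     : current Coordinator parameters, flattened to 1-D.
    m       : momentum buffer carried across steps.
    """
    rng = rng or np.random.default_rng()
    look_ahead = phi + beta * m                      # NAG-style look-ahead
    delta = rng.choice([-1.0, 1.0], size=phi.shape)  # Rademacher perturbation
    # Two-sided simultaneous-perturbation gradient estimate
    g_hat = (loss_fn(look_ahead + c * delta)
             - loss_fn(look_ahead - c * delta)) / (2.0 * c) * delta
    m = beta * m - a * g_hat                         # corrected momentum
    return phi + m, m                                # updated parameters
```

In practice, SPSA theory calls for decaying schedules on the step size a and the perturbation scale c, which this sketch omits for brevity.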

Empirical Validation

Through empirical analysis across 16 diverse datasets, BlackVIP demonstrates improved adaptability and robustness over state-of-the-art baselines. Its performance on distribution-shift and few-shot tasks in particular underscores its generality and suitability for broad application. The results show that BlackVIP achieves competitive, and sometimes superior, performance even when the pre-trained model is concealed behind a black-box interface. With roughly 9K learnable parameters, far fewer than competing methods, BlackVIP is also markedly parameter-efficient.

Theoretical and Practical Implications

From a theoretical standpoint, BlackVIP contributes to the discussion of how input-space manipulation alone can drive generalization to unseen data distributions. Practically, the approach aligns with the needs of commercial and proprietary deployments, where access to model internals is often restricted for intellectual-property or accessibility reasons.

Future Directions

BlackVIP opens avenues for further reducing the computational overhead of model adaptation. Integrating more expressive neural architectures into the Coordinator could improve the quality of the generated visual prompts, and extending BlackVIP to other modalities, such as text or multimodal settings, is a natural next step.

In summary, BlackVIP marks a significant step toward robust and efficient PTM adaptation without parameter-intrusive access. As PTMs continue to evolve, approaches like BlackVIP will likely become essential for reconciling the adaptation capabilities of these models with the practical constraints of commercial and real-world deployments.

Authors (8)
  1. Changdae Oh
  2. Hyeji Hwang
  3. Hee-young Lee
  4. YongTaek Lim
  5. Geunyoung Jung
  6. Jiyoung Jung
  7. Hosik Choi
  8. Kyungwoo Song