Black-Box Tuning of Vision-Language Models with Effective Gradient Approximation (2312.15901v1)

Published 26 Dec 2023 in cs.CV

Abstract: Parameter-efficient fine-tuning (PEFT) methods provide an effective way to adapt large vision-language models to specific tasks or scenarios. Typically, they learn a very small number of parameters for pre-trained models in a white-box formulation, which assumes model architectures to be known and parameters to be accessible. However, large models are often not open-source, whether to prevent abuse or for commercial reasons, which poses a barrier to deploying white-box PEFT methods. To alleviate the dependence on model accessibility, we introduce collaborative black-box tuning (CBBT), which performs both textual prompt optimization and output feature adaptation for black-box models. Specifically, since backpropagation gradients are blocked, we approximate the gradients of the textual prompts by analyzing the model's predictions under perturbed prompts. In addition, a lightweight adapter is deployed over the output features of the inaccessible model, further facilitating adaptation. With these designs, CBBT is extensively evaluated on eleven downstream benchmarks and achieves remarkable improvements over existing black-box vision-language adaptation methods. Code is released at https://github.com/guozix/cbbt.
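The abstract describes two mechanisms: the gradients of the textual prompt are approximated from forward queries alone, by comparing the model's predictions under perturbed prompts, and a lightweight adapter is trained on the output features the black-box model returns. The sketch below illustrates both ideas in a deliberately minimal form; query_model, the synthetic quadratic loss, the embedding sizes, and all hyperparameters are placeholder assumptions rather than the authors' implementation (see the linked repository for the actual CBBT code).

```python
# Illustrative sketch only (assumed interfaces, toy objectives); not the CBBT implementation.
import numpy as np

rng = np.random.default_rng(0)

def query_model(prompt_emb: np.ndarray) -> float:
    """Stand-in for the inaccessible vision-language model: maps a prompt
    embedding to a scalar task loss. Real queries would score few-shot
    training images against the prompted text features."""
    target = np.linspace(-1.0, 1.0, prompt_emb.size)  # toy optimum
    return float(np.mean((prompt_emb - target) ** 2))

def perturbation_gradient(prompt_emb: np.ndarray, c: float = 0.01,
                          n_samples: int = 8) -> np.ndarray:
    """Estimate the gradient w.r.t. the prompt embedding from forward
    queries alone, by comparing losses under +/- random perturbations."""
    grad = np.zeros_like(prompt_emb)
    for _ in range(n_samples):
        delta = rng.choice([-1.0, 1.0], size=prompt_emb.shape)  # random direction
        loss_plus = query_model(prompt_emb + c * delta)
        loss_minus = query_model(prompt_emb - c * delta)
        grad += (loss_plus - loss_minus) / (2.0 * c) * delta
    return grad / n_samples

# 1) Prompt tuning with approximated gradients: only forward queries are used.
prompt = rng.normal(size=16)
for _ in range(200):
    prompt -= 0.1 * perturbation_gradient(prompt)
print("prompt-tuning loss:", query_model(prompt))

# 2) Lightweight adapter on the returned output features: since it sits
#    outside the black-box model, it can be trained with ordinary gradients.
feats = rng.normal(size=(32, 8))     # output features returned by the model
targets = rng.normal(size=(32, 8))   # toy adaptation targets
W = np.zeros((8, 8))
for _ in range(100):
    adapted = feats + feats @ W      # residual linear adapter
    W -= 0.05 * (2.0 / adapted.size) * feats.T @ (adapted - targets)
print("adapter loss:", float(np.mean((feats + feats @ W - targets) ** 2)))
```

In the paper's setting, queries would return class predictions for few-shot training images under a prompted CLIP-style model, and the prompt and adapter would be optimized jointly; both are reduced to toy objectives here so the sketch runs standalone.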

References (53)
Authors (7)
  1. Zixian Guo (5 papers)
  2. Yuxiang Wei (40 papers)
  3. Ming Liu (421 papers)
  4. Zhilong Ji (31 papers)
  5. Jinfeng Bai (31 papers)
  6. Yiwen Guo (58 papers)
  7. Wangmeng Zuo (279 papers)
Citations (8)
