AutoVP: An Automated Visual Prompting Framework and Benchmark (2310.08381v2)

Published 12 Oct 2023 in cs.CV and cs.LG

Abstract: Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach for adapting pre-trained vision models to various downstream image-classification tasks. However, there has hitherto been little systematic study of VP's design space and no clear benchmark for evaluating its performance. To bridge this gap, we propose AutoVP, an end-to-end expandable framework for automating VP design choices, along with 12 downstream image-classification tasks that can serve as a holistic VP-performance benchmark. Our design space covers 1) the joint optimization of the prompts; 2) the selection of pre-trained models, including image classifiers and text-image encoders; and 3) model output mapping strategies, including nonparametric and trainable label mapping. Our extensive experimental results show that AutoVP outperforms the best-known current VP methods by a substantial margin, achieving up to a 6.7% improvement in accuracy, and attains a maximum performance increase of 27.5% compared to the linear-probing (LP) baseline. AutoVP thus makes a two-fold contribution: it serves both as an efficient tool for hyperparameter tuning of VP design choices and as a comprehensive benchmark that can reasonably be expected to accelerate VP's development. The source code is available at https://github.com/IBM/AutoVP.
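
For intuition, the sketch below illustrates the kind of VP pipeline the abstract describes: a learnable pixel prompt padded around a downscaled input image, a frozen pre-trained classifier, and a trainable fully-connected label-mapping head. This is a minimal illustration assuming a generic PyTorch setup; class names such as PaddedVisualPrompt and VisualPromptClassifier are made up here and are not AutoVP's actual API.

```python
# Minimal sketch of a visual-prompting pipeline (illustrative, not AutoVP's code):
# a learnable pixel frame around the resized input, a frozen pre-trained model,
# and a trainable fully-connected label mapping from source to target classes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PaddedVisualPrompt(nn.Module):
    """Learnable frame of pixels padded around a downscaled input image."""

    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        self.image_size = image_size
        self.pad = pad
        # One prompt shared across all inputs; its center is overwritten by the
        # resized input image, so only the border pixels are effectively learned.
        self.prompt = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inner = self.image_size - 2 * self.pad
        x = F.interpolate(x, size=(inner, inner), mode="bilinear", align_corners=False)
        # Paste the image into the center of the learnable prompt canvas.
        canvas = self.prompt.expand(x.size(0), -1, -1, -1).clone()
        canvas[:, :, self.pad:-self.pad, self.pad:-self.pad] = x
        return canvas


class VisualPromptClassifier(nn.Module):
    """Frozen pre-trained backbone + learnable prompt + trainable label mapping."""

    def __init__(self, backbone: nn.Module, num_source_classes: int, num_target_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep the pre-trained model frozen
            p.requires_grad_(False)
        self.prompt = PaddedVisualPrompt()
        # Fully-connected label mapping: one of the trainable output-mapping
        # strategies; a fixed frequency-based mapping would be the nonparametric option.
        self.label_map = nn.Linear(num_source_classes, num_target_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        source_logits = self.backbone(self.prompt(x))
        return self.label_map(source_logits)
```

In this sketch only the prompt (roughly 3 x 224 x 224 ≈ 150K pixels) and the label-mapping layer would be trained, which is what makes the approach parameter-efficient; swapping the linear head for a fixed frequency-based mapping corresponds to the nonparametric output-mapping choice in the design space described above.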
