Exploring the Transferability of Visual Prompting for Multimodal Large Language Models (2404.11207v1)
Abstract: Although Multimodal LLMs (MLLMs) have demonstrated promising versatile capabilities, their performance still lags behind that of specialized models on downstream tasks, making adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to substantial computation and memory overhead. In this paper, we propose a novel setting in which we aim to improve the performance of diverse MLLMs with a single group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach that generates visual prompts which, after being trained on only one model, transfer to different models and improve their performance on downstream tasks. We introduce two strategies to address the cross-model feature corruption of existing visual prompting methods and to enhance the transferability of the learned prompts: 1) Feature Consistency Alignment, which constrains the changes in prompted features so as to preserve task-agnostic knowledge; and 2) Task Semantics Enrichment, which uses language guidance to encourage the prompted images to carry richer task-specific semantics. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks, ranging from object recognition and counting to multimodal reasoning and hallucination correction.
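The abstract names two training objectives but gives no formal definition, so the following is only a minimal sketch of how a TVP-style update step could combine them, written in PyTorch. Everything here is an illustrative assumption rather than the paper's actual implementation: the callables `vision_encoder` and `mllm_task_loss`, the precomputed `label_text_feats`, the loss weights `alpha` and `beta`, and the CLIP-style temperature of 0.07.

```python
import torch
import torch.nn.functional as F

def tvp_train_step(prompt, images, labels, label_text_feats,
                   vision_encoder, mllm_task_loss,
                   alpha=1.0, beta=1.0):
    """One sketched TVP update: task loss plus the two transferability terms.

    prompt:           learnable pixel-space visual prompt (hypothetical shape
                      broadcastable to `images`)
    images, labels:   a downstream-task batch
    label_text_feats: (K, D) text embeddings of the K class labels
    vision_encoder:   frozen image encoder returning (B, D) features
    mllm_task_loss:   surrogate MLLM's downstream loss on prompted images
    """
    # Apply the additive visual prompt and keep pixels in a valid range.
    prompted = torch.clamp(images + prompt, 0.0, 1.0)

    # Base objective: downstream task loss on the single surrogate MLLM.
    loss_task = mllm_task_loss(prompted, labels)

    # 1) Feature Consistency Alignment (assumed form): keep prompted
    #    features close to the clean ones to preserve task-agnostic knowledge.
    with torch.no_grad():
        feats_clean = vision_encoder(images)
    feats_prompted = vision_encoder(prompted)
    loss_fca = F.mse_loss(feats_prompted, feats_clean)

    # 2) Task Semantics Enrichment (assumed form): pull prompted image
    #    features toward the text embedding of the ground-truth label.
    img = F.normalize(feats_prompted, dim=-1)
    txt = F.normalize(label_text_feats, dim=-1)
    logits = img @ txt.t()                             # (B, K) cosine scores
    loss_tse = F.cross_entropy(logits / 0.07, labels)  # CLIP-style temperature

    return loss_task + alpha * loss_fca + beta * loss_tse
```

Under these assumptions, only `prompt` receives gradients; detaching the clean-image features keeps the consistency term from pulling the encoder itself, which matches the stated motivation of preserving task-agnostic knowledge while the language-guided term injects task-specific semantics.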