A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models (2110.08484v2)

Published 16 Oct 2021 in cs.CV and cs.CL

Abstract: Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy for real-world applications due to their impractically huge sizes and slow inference speed. To solve this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts for few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points and achieves comparable results to a 246x larger model, PICa. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM

Authors (5)
  1. Woojeong Jin (17 papers)
  2. Yu Cheng (354 papers)
  3. Yelong Shen (83 papers)
  4. Weizhu Chen (128 papers)
  5. Xiang Ren (194 papers)
Citations (114)

Summary

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

The paper presents an approach to the challenges of deploying large pre-trained vision-language (VL) models, namely their enormous size and slow inference speed, which complicate practical applications. The authors propose FewVLM, a method that leverages prompt-based low-resource learning to enable efficient task learning in VL models without the need for extensive datasets. The approach serves as an efficient alternative to traditional fine-tuning by pairing a smaller model architecture with strategically crafted prompts to optimize performance, even with limited data.

FewVLM is pre-trained as a sequence-to-sequence transformer with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM) objectives. Through experimental evaluation, the authors demonstrate that FewVLM, despite being significantly smaller, outperforms larger models like Frozen and achieves performance comparable to substantially larger models such as PICa. Specifically, FewVLM outperforms Frozen by 18.2 percentage points on zero-shot visual question answering (VQA) and delivers results comparable to PICa, which is 246 times larger.
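
To make the two objectives concrete, the following is a minimal Python sketch of how PrefixLM and MaskedLM (input, target) pairs could be constructed from an image caption. The T5-style sentinel tokens, the split ratio, and the word-level masking are illustrative assumptions rather than the authors' exact recipe; in FewVLM the encoder additionally receives visual features alongside the text.

```python
# Illustrative sketch (not the authors' exact implementation) of how the two
# pre-training objectives could form (input, target) pairs for a
# sequence-to-sequence model; sentinel tokens follow T5 conventions.
import random

def prefix_lm_example(caption: str, split_ratio: float = 0.5):
    """PrefixLM: the encoder sees a caption prefix, the decoder generates the rest."""
    words = caption.split()
    cut = max(1, int(len(words) * split_ratio))
    source = " ".join(words[:cut])   # visible prefix (plus image features in practice)
    target = " ".join(words[cut:])   # continuation to be generated
    return source, target

def masked_lm_example(caption: str, mask_prob: float = 0.15):
    """MaskedLM: random tokens are replaced by sentinels; the decoder predicts them."""
    words = caption.split()
    source, target, sentinel = [], [], 0
    for w in words:
        if random.random() < mask_prob:
            source.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}> {w}")
            sentinel += 1
        else:
            source.append(w)
    return " ".join(source), " ".join(target)

src, tgt = prefix_lm_example("a dog catches a frisbee in the park")
print(src, "->", tgt)
```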

The paper explores the impact of prompt design on few-shot and zero-shot performance across various datasets, including VQAv2, OK-VQA, and GQA, as well as captioning datasets such as NoCaps and Flickr30k. The findings reveal that prompts have a significant influence on zero-shot performance, particularly in VQA tasks. While substantial differences arise between prompt designs in zero-shot settings, the influence of prompts diminishes as the size of the training data increases. This convergence toward optimal performance with more data indicates FewVLM's robustness to noisy prompts in data-rich settings.
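
As an illustration of what "hand-crafted" versus "noisy" prompts mean in practice, the sketch below builds VQA inputs from a few hypothetical templates; the template strings are examples for exposition and are not necessarily the exact prompts evaluated in the paper.

```python
# Hypothetical prompt templates for VQA-style inputs.
HAND_CRAFTED = "question: {question} answer:"
NOISY = "what is the answer to this question: {question}"  # arbitrary rewording
IRRELEVANT = "picture {question}"                           # barely related wording

def build_input(template: str, question: str) -> str:
    # In practice, visual features are fed to the encoder alongside this text.
    return template.format(question=question)

print(build_input(HAND_CRAFTED, "what color is the frisbee?"))
```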

Investigations into the pre-training objectives show that different tasks benefit differently from MaskedLM and PrefixLM. MaskedLM tends to favor VQA tasks due to its span prediction capabilities, analogous to answering questions, whereas PrefixLM benefits captioning tasks by generating subsequent text from a given prefix, aligning well with caption generation objectives.
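
A rough way to see this alignment is to compare the downstream input/target formats: VQA can be cast as filling a masked span with the answer, while captioning continues text from a short prefix. The formats below are illustrative only, again assuming T5-style sentinel tokens as in the earlier sketch.

```python
# VQA resembles MaskedLM: the answer fills a masked span.
vqa_source = "question: what color is the frisbee? answer: <extra_id_0>"
vqa_target = "<extra_id_0> red"

# Captioning resembles PrefixLM: the model continues from a given prefix.
caption_source = "an image of"
caption_target = "a dog catching a red frisbee in the park"
```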

The implications of this research are valuable for both practical and theoretical advances in VL models. Practically, FewVLM offers a viable option for applications constrained by computational resources, enabling efficient deployment of VL models with competitive performance when only a few training examples are available. Theoretically, the findings underscore the efficacy of prompt-based input modification as a mechanism for enhancing model adaptability and generalization across diverse contexts without excessive resource demands.

Future VL models stand to gain from continued exploration of prompt design, including automated prompt generation methods. The prompt-based learning paradigm behind FewVLM could extend beyond the explored datasets, offering insights into the adaptability and scalability of VL models across domains and tasks. This paper paves the way for refined approaches to leveraging minimal data for maximal performance in resource-constrained settings.