A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
The paper presents an approach to the challenges of deploying large pre-trained vision-language (VL) models, such as their enormous size and slow inference, which complicate practical applications. The authors propose FewVLM, a method that leverages prompt-based low-resource learning to enable efficient task learning in VL models without extensive datasets. The approach serves as an alternative to traditional fine-tuning by using a smaller model that relies on carefully crafted prompts to achieve strong performance, even with limited data.
FewVLM is pre-trained as a sequence-to-sequence transformer with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM) objectives. Through experimental evaluation, the authors demonstrate that FewVLM, despite being significantly smaller, outperforms larger models such as Frozen and matches much larger models such as PICa. Specifically, FewVLM outperforms Frozen by 18.2 percentage points on zero-shot visual question answering (VQA) and delivers results comparable to PICa, which is 246 times larger.
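To make the two objectives concrete, the snippet below is a minimal sketch of how such training pairs could be built from a caption, assuming a T5-style sentinel-token scheme. The helper names, masking rate, and span selection are illustrative assumptions, not the paper's actual implementation (which also conditions on visual features).

```python
import random

def masked_lm_pair(caption: str, mask_rate: float = 0.15):
    """MaskedLM: hide a random span of the text; the target reconstructs it."""
    tokens = caption.split()
    span_len = max(1, int(len(tokens) * mask_rate))
    start = random.randrange(0, len(tokens) - span_len + 1)
    source = tokens[:start] + ["<extra_id_0>"] + tokens[start + span_len:]
    target = ["<extra_id_0>"] + tokens[start:start + span_len]
    return " ".join(source), " ".join(target)

def prefix_lm_pair(caption: str, prefix_ratio: float = 0.5):
    """PrefixLM: keep a prefix as input; the target is the remaining suffix."""
    tokens = caption.split()
    cut = max(1, int(len(tokens) * prefix_ratio))
    return " ".join(tokens[:cut]), " ".join(tokens[cut:])

caption = "a brown dog catches a frisbee in the park"
print(masked_lm_pair(caption))  # e.g. ('a brown dog catches a <extra_id_0> in the park', '<extra_id_0> frisbee')
print(prefix_lm_pair(caption))  # ('a brown dog catches', 'a frisbee in the park')
```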
The paper explores the impact of prompt design on few-shot and zero-shot performance across several datasets, including VQAv2, OK-VQA, and GQA, as well as captioning datasets such as NoCaps and Flickr30k. The findings reveal that prompts strongly influence zero-shot performance, particularly on VQA tasks. While different prompt designs produce substantial performance differences in zero-shot settings, the influence of prompts diminishes as the amount of training data grows. This convergence toward strong performance with more data indicates that FewVLM becomes robust to noisy prompts in data-richer regimes, as illustrated by the prompt variants sketched below.
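The variants below are hypothetical VQA prompt templates in the spirit of the hand-crafted prompts the paper compares; the exact wording and the <text_1> sentinel placeholder are assumptions, not quotations from the paper.

```python
# Hypothetical VQA prompt templates; wording and sentinel token are assumptions.
VQA_PROMPTS = {
    "no_prompt": "{question}",
    "plain":     "question: {question} answer:",
    "sentinel":  "question: {question} answer: <text_1>",  # answer fills the sentinel
}

def build_vqa_input(question: str, template: str = "sentinel") -> str:
    """Apply one template to a question before it is fed to the text encoder."""
    return VQA_PROMPTS[template].format(question=question)

print(build_vqa_input("what color is the dog?"))
# -> question: what color is the dog? answer: <text_1>
```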
Investigations into the pre-training objectives show that different tasks benefit differently from MaskedLM and PrefixLM. MaskedLM tends to favor VQA because its span prediction is analogous to answering a question, whereas PrefixLM benefits captioning by generating subsequent text from a given prefix, which aligns well with caption generation; the sketch below makes this correspondence explicit.
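The alignment shows up in the input/target formats each downstream task uses. The following is a hypothetical sketch of those formats, reusing the same assumed sentinel and prompt wording as above.

```python
# Hypothetical input/target formats showing why each objective matches a task.

def vqa_pair(question: str, answer: str):
    """VQA mirrors MaskedLM: the answer is predicted as the span behind the sentinel."""
    return f"question: {question} answer: <text_1>", f"<text_1> {answer}"

def caption_pair(prompt: str, caption: str):
    """Captioning mirrors PrefixLM: the caption continues a short textual prefix."""
    return prompt, caption

print(vqa_pair("what color is the dog?", "brown"))
# ('question: what color is the dog? answer: <text_1>', '<text_1> brown')
print(caption_pair("an image of", "a brown dog catching a frisbee"))
# ('an image of', 'a brown dog catching a frisbee')
```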
The implications of this research are valuable for both practical and theoretical advancements in VL models. Practically, FewVLM offers a viable option for applications constrained by computational resources, enabling efficient deployment of VL models with competitive few-shot performance. Theoretically, the findings underscore the efficacy of prompt-based input modification as a mechanism for improving model adaptability and generalization across diverse contexts without excessive resource demands.
Future VL models stand to benefit from continued exploration of prompt design, including automated prompt generation. The broader application of FewVLM and the prompt-based learning paradigm could extend beyond the datasets explored here, offering insight into the adaptability and scalability of VL models across domains and tasks. The paper paves the way for refined approaches that leverage minimal data for maximal performance in resource-constrained settings.