VPGTrans: Transfer Visual Prompt Generator Across LLMs
The paper "VPGTrans: Transfer Visual Prompt Generator Across LLMs" addresses the computationally intensive task of developing new multimodal LLMs (MLLMs). The authors propose a method for transferring visual prompt generators (VPGs) between different LLMs to reduce training costs significantly. This paper is particularly relevant in the context of combining multiple modalities, such as vision and language, which traditionally demands substantial computational resources.
Research Motivation and Problem Statement
Creating a multimodal LLM from scratch involves pre-training on vast volumes of image-text data, which is costly in both time and compute. Coupling an existing LLM with a relatively lightweight visual prompt generator (VPG) is a more practical alternative, but tuning the VPG still incurs substantial cost. The paper introduces VPGTrans, a framework for efficiently transferring a trained VPG across LLMs, covering transfers both across LLM sizes and across LLM types.
Methodology
The authors propose a two-stage transfer framework, VPGTrans, designed to maximize VPG transfer efficiency. The framework consists of:
- Projector Warm-up: The VPG is inherited directly from the source model, while the projector is initialized by combining the source projector with a word embedding converter that maps the source LLM's embedding space to the target LLM's, so the projector starts from well-aligned parameters. The projector is then warmed up with an enlarged learning rate while the VPG and the target LLM remain frozen.
- Vanilla Fine-tuning: Both the VPG and the projector are fine-tuned jointly on the target LLM with a standard learning rate, so that the transferred VPG aligns with the target LLM's representation space. A minimal sketch of this two-stage schedule follows the list.
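Below is a minimal PyTorch-style sketch of the two-stage schedule, for illustration only. The module names (`vpg`, `src_projector`, `word_converter`, `target_llm`), the dataloader, and the LLM forward signature are assumed placeholders rather than the authors' released code, and details such as how visual prompts are interleaved with text tokens are glossed over.

```python
# Illustrative sketch of the two-stage VPGTrans schedule (not the official code).
import torch

def vpg_trans(vpg, src_projector, word_converter, target_llm, loader,
              warmup_epochs=1, ft_epochs=10, base_lr=1e-4):
    # Initialization: compose the inherited source projector with a linear
    # word-embedding converter so its outputs land in the target LLM's
    # embedding space (this also absorbs any hidden-size mismatch).
    projector = torch.nn.Sequential(src_projector, word_converter)

    # Stage 1: warm up only the projector with an enlarged learning rate;
    # the inherited VPG and the target LLM stay frozen.
    for p in vpg.parameters():
        p.requires_grad_(False)
    for p in target_llm.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(projector.parameters(), lr=5 * base_lr)
    for _ in range(warmup_epochs):
        for images, text in loader:
            prompts = projector(vpg(images))          # soft visual prompts
            # Placeholder forward: assumes an LLM that accepts visual prompt
            # embeddings and caption labels and returns a language-modeling loss.
            loss = target_llm(inputs_embeds=prompts, labels=text).loss
            loss.backward(); opt.step(); opt.zero_grad()

    # Stage 2: vanilla fine-tuning of VPG + projector with the normal lr.
    for p in vpg.parameters():
        p.requires_grad_(True)
    opt = torch.optim.AdamW(
        list(vpg.parameters()) + list(projector.parameters()), lr=base_lr)
    for _ in range(ft_epochs):
        for images, text in loader:
            prompts = projector(vpg(images))
            loss = target_llm(inputs_embeds=prompts, labels=text).loss
            loss.backward(); opt.step(); opt.zero_grad()
    return vpg, projector
```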
Experimental Results
The framework was evaluated in two transfer settings: transfer across LLM sizes (TaS) and transfer across LLM types (TaT). The experiments show significant reductions in training time and data requirements:
- TaS: Transferring a VPG from a smaller to a larger LLM (e.g., from BLIP-2 OPT 2.7B to OPT 6.7B) with VPGTrans yielded more than a 10x speed-up while using only about 10.7% of the original training data.
- VPGTrans vs. training from scratch: VPGTrans consistently matched or exceeded the performance of a VPG trained from scratch at a fraction of the computational cost.
Insights and Discussion
Three major findings emerged from this research:
- A VPG trained on a smaller source LLM often transfers more efficiently, which is practical when scaling up to a larger LLM is planned.
- VPGs trained on larger LLMs generally transfer better across LLM types, likely because larger models induce more generalized and robust feature representations.
- The projector warm-up stage, and in particular the word embedding converter used for initialization, plays a critical role in bridging dimensional mismatches between source and target LLMs, enabling a smooth transfer (see the sketch after this list).
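As a concrete illustration of the last point, the sketch below fits a linear word embedding converter that maps the source LLM's embedding space (dimension d_src) to the target LLM's (d_tgt), which is one way such a dimensional mismatch can be absorbed. The closed-form least-squares fit over tokens shared by both vocabularies is an illustrative assumption, not necessarily the paper's exact recipe.

```python
# Sketch: fit a linear converter from the source LLM's word embedding space
# to the target LLM's, using tokens shared by both vocabularies.
# Assumed inputs and the least-squares fit are illustrative, not the paper's exact method.
import torch

def fit_word_converter(src_emb, tgt_emb, shared_src_ids, shared_tgt_ids):
    """src_emb: (V_src, d_src), tgt_emb: (V_tgt, d_tgt);
    shared_*_ids index tokens present in both vocabularies."""
    X = src_emb[shared_src_ids]            # (n, d_src)
    Y = tgt_emb[shared_tgt_ids]            # (n, d_tgt)
    # Least-squares fit of W with X @ W ~= Y, i.e. a d_src -> d_tgt linear map.
    W = torch.linalg.lstsq(X, Y).solution  # (d_src, d_tgt)
    converter = torch.nn.Linear(X.shape[1], Y.shape[1], bias=False)
    with torch.no_grad():
        converter.weight.copy_(W.T)        # nn.Linear stores (out, in)
    return converter
```

The resulting converter can then be composed with the inherited source projector to initialize the target projector, as in the earlier two-stage sketch.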
Practical Implications and Future Directions
VPGTrans presents an effective solution for deploying MLLMs with minimal retraining effort, making it accessible for researchers and developers aiming to leverage multimodal capabilities. This advancement paves the way for more agile experimentation and deployment of LLMs in various applications, from autonomous vehicles to conversational AI agents.
Speculating on future developments, enhancing cross-modality transfer techniques or exploring additional domain adaptations could further optimize the process. Moreover, investigating the framework's efficacy in contexts requiring real-time adaptability, such as online learning scenarios, constitutes an exciting avenue for future research.
In conclusion, VPGTrans represents a significant step towards efficient model development combining vision and language processing, proposing an innovative method for reducing the computational overhead inherent in state-of-the-art multimodal systems.