Dynamic Visual Prompting: Efficient Transfer Learning for Vision-Language Tasks
Recent advances in integrating pre-trained language models (PLMs) into vision-language (VL) tasks have demonstrated significant potential but also face challenges of computational overhead and parameter redundancy. In this paper, the authors present "Dynamic Visual Prompting" (DVP), a novel approach that aims to overcome these limitations while preserving the representational power of PLMs.
Methodological Contributions
The dynamic nature of the proposed visual prompting approach is central to the paper's contribution. Traditional VL models typically fuse modalities through extensive, computation-heavy fusion branches. In contrast, DVP leverages cross-attention to dynamically generate text-related visual tokens, bypassing the exhaustive use of all visual features. By injecting only this compact set of prompts, DVP shortens the input sequence fed to the PLM and thereby reduces computational complexity.
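To make the idea concrete, the following is a minimal sketch of cross-attention-based visual prompt generation: text embeddings act as queries over a larger set of image features, and the attended output forms a compact set of visual prompts prepended to the PLM input. The module, dimensions, and parameter names here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class VisualPromptGenerator(nn.Module):
    def __init__(self, text_dim=768, visual_dim=1024, num_heads=8):
        super().__init__()
        # Project image features into the PLM's embedding space.
        self.visual_proj = nn.Linear(visual_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, text_embeds, visual_feats):
        # text_embeds:  (B, T, text_dim)   token embeddings from the PLM
        # visual_feats: (B, V, visual_dim) patch/grid features from a frozen image encoder
        v = self.visual_proj(visual_feats)
        # Text tokens act as queries, so the output has length T rather than V:
        # each text token attends only to the visual content relevant to it.
        prompts, _ = self.cross_attn(query=text_embeds, key=v, value=v)
        return prompts  # dynamic visual prompts, concatenated with text tokens downstream

# Usage (hypothetical shapes): the prompts replace the full visual sequence,
# keeping the PLM input short.
# prompts = VisualPromptGenerator()(text_embeds, visual_feats)
# plm_inputs = torch.cat([prompts, text_embeds], dim=1)
```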
A pivotal capability of DVP is its reinforcement-learning-based search algorithm, termed k-armed bandit-based Automatic Prompt Placement (KAB-APP). This algorithm efficiently determines the optimal layer at which to insert visual prompts into the PLM, improving adaptation across a variety of tasks. With this approach, DVP achieves a significant decrease in computational cost while also delivering performance gains, as evidenced by accuracy increases of up to 2.28% and FLOP reductions of roughly 80% on the VQA2.0 benchmark.
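The sketch below illustrates the bandit-search idea in its simplest form: each arm corresponds to a candidate insertion layer, and a proxy reward (such as short-run validation accuracy) drives an incremental value update. This is a simplified epsilon-greedy formulation under assumed details (reward design, update rule, and the `evaluate_with_prompt_at` callback), not the paper's exact KAB-APP procedure.

```python
import random

def bandit_layer_search(num_layers, evaluate_with_prompt_at, rounds=100, epsilon=0.2):
    """Search for the best prompt-insertion layer with a k-armed bandit.

    `evaluate_with_prompt_at(layer)` is a hypothetical callback supplied by the
    training loop that inserts the visual prompts at `layer`, runs a short
    tuning/evaluation cycle, and returns a scalar reward (e.g., val accuracy).
    """
    q = [0.0] * num_layers      # running value estimate per candidate layer
    counts = [0] * num_layers   # number of pulls per arm

    for _ in range(rounds):
        if random.random() < epsilon:
            arm = random.randrange(num_layers)                 # explore a random layer
        else:
            arm = max(range(num_layers), key=lambda i: q[i])   # exploit the best so far
        reward = evaluate_with_prompt_at(arm)
        counts[arm] += 1
        q[arm] += (reward - q[arm]) / counts[arm]              # incremental mean update

    return max(range(num_layers), key=lambda i: q[i])          # chosen insertion layer
```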
Empirical Evidence
The authors validate their method on several representative datasets, including VQA2.0, GQA, SNLI-VE, and ScienceQA. The empirical evaluation demonstrates DVP's efficiency, notably when paired with Adapter techniques that allow fine-tuning with minimal parameter updates. Extensive experiments indicate that DVP achieves accuracy competitive with current state-of-the-art VLP models while dramatically reducing the fraction of parameters updated (approximately 5-6% of the model) and the computational workload. This holds when DVP is applied to BERT, T5, and LLaMA, showcasing its versatility across architectures.
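For readers unfamiliar with the Adapter technique referenced above, the following is a minimal sketch of a bottleneck Adapter in the usual parameter-efficient-tuning style: small down/up projections with a residual connection are inserted into otherwise frozen PLM layers, so only a few percent of parameters are trained. The dimensions and placement are illustrative assumptions rather than the paper's configuration.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)  # project to a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)    # project back to the PLM width

    def forward(self, x):
        # Residual connection keeps the frozen backbone's representation intact.
        return x + self.up(self.act(self.down(x)))

# During tuning, only adapter (and prompt-generator) parameters are optimized;
# the PLM backbone stays frozen, e.g.:
# for p in plm.parameters():
#     p.requires_grad = False
```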
Implications and Future Directions
The successful demonstration of DVP lays the groundwork for more efficient and scalable adaptation of VL models. By reducing the need for extensive VL-specific pre-training and lowering computational demands, the approach could help democratize the deployment of advanced PLMs in resource-constrained environments. Moreover, the use of reinforcement learning for prompt placement optimization opens new avenues for automatic adaptation mechanisms in AI.
The authors' insights have clear implications for both theory and practice. On the one hand, they encourage a renewed focus on efficient model adaptation strategies that can further bridge the gap between vision and language processing. On the other hand, the success of cross-attention for token condensation may inform further architectural innovations as PLMs continue to evolve.
Future research may proceed on several fronts, including extending dynamic prompting techniques to other multi-modal applications and investigating adaptive methods that can autonomously tailor prompts to diverse task requirements and data attributes. Furthermore, the integration of dynamic visual prompting into even larger LLMs may be explored to understand the scaling effects of the proposed method.
In summary, the introduction of Dynamic Visual Prompting presents a significant step forward in the pursuit of efficient, effective, and generalized PLM adaptation solutions for vision-language reasoning. Such research plays a crucial role in enhancing our understanding and capability within a rapidly advancing field of artificial intelligence.