An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (2403.06764v3)

Published 11 Mar 2024 in cs.CV, cs.AI, and cs.CL

Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.

Plug-and-Play Inference Acceleration for Large Vision-Language Models: Introducing FastV

Efficient Processing of Visual Tokens in Large Vision-Language Models

The paper addresses a key inefficiency in how Large Vision-Language Models (LVLMs) handle visual information, focusing on widely used models such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. Extensive analysis reveals that these models exhibit a markedly inefficient attention pattern toward visual tokens in their deeper layers, with visual tokens receiving disproportionately lower attention scores than their textual counterparts. This inefficiency signals a need to optimize how LVLMs process visual data, motivating a shift toward a sparser, more efficient approach.
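
A concrete way to see this pattern is sketched below, under stated assumptions: given the per-layer attention maps from a forward pass (for example, the tuple a HuggingFace-style model returns when called with output_attentions=True) and a boolean mask marking which sequence positions are image tokens, the function reports the average attention mass that text tokens allocate to visual tokens at each layer. The tensor shapes and the visual mask are illustrative assumptions, not the paper's own analysis code.

```python
# Minimal sketch (not the paper's analysis code): measure how much attention
# text tokens allocate to visual tokens at each decoder layer.
# `attentions` is assumed to be the per-layer tuple a HuggingFace-style model
# returns with output_attentions=True (each tensor: [batch, heads, seq, seq]);
# `visual_mask` is a hypothetical boolean mask over sequence positions.
import torch

def attention_to_visual_per_layer(attentions, visual_mask):
    """For each layer, average attention mass text queries assign to visual keys."""
    text_mask = ~visual_mask
    fractions = []
    for layer_attn in attentions:                    # [B, H, S, S]
        attn = layer_attn.mean(dim=1)                # average over heads -> [B, S, S]
        to_visual = attn[:, text_mask][:, :, visual_mask].sum(dim=-1)  # [B, n_text]
        fractions.append(to_visual.mean().item())
    return fractions

# Shape-only demo with random attention maps (4 layers, 16 tokens, 6 visual).
if __name__ == "__main__":
    B, H, S, L = 1, 8, 16, 4
    attns = tuple(torch.softmax(torch.randn(B, H, S, S), dim=-1) for _ in range(L))
    vmask = torch.zeros(S, dtype=torch.bool)
    vmask[2:8] = True
    print(attention_to_visual_per_layer(attns, vmask))
```

Run on an actual LVLM, the per-layer fractions would be expected to drop sharply after the first few layers, which is precisely the inefficiency the paper documents.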

Introducing FastV: A Plug-and-Play Solution

The proposed FastV is a plug-and-play solution aimed at enhancing the computational efficiency of LVLMs. By dynamically learning adaptive attention patterns in early layers and then selectively pruning visual tokens in subsequent layers, FastV significantly lowers computational costs. The method achieves a 45% reduction in floating-point operations (FLOPs) for the LLaVA-1.5-13B model without compromising task performance across a broad spectrum of image and video understanding tasks. This balance between computational efficiency and performance makes FastV an invaluable tool, especially for deploying LVLMs in resource-constrained environments like edge devices.
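
The core pruning step can be sketched as follows. This is a simplified illustration under assumed tensor shapes, not the authors' released implementation (see the linked repository for that): at a chosen early layer K, each visual token is scored by the average attention it receives, and only the top fraction of visual tokens is carried into the remaining layers.

```python
# Simplified sketch of a FastV-style pruning step, not the authors' released
# implementation (https://github.com/pkunlp-icler/FastV). At a chosen early
# layer K, visual tokens are ranked by the attention they receive and only the
# top fraction survives into the remaining layers. Shapes are assumptions.
import torch

def prune_visual_tokens(hidden_states, layer_attn, visual_idx, keep_ratio=0.5):
    """Drop the least-attended visual tokens before the next decoder layer.

    hidden_states: [batch, seq, dim] activations entering layer K+1
    layer_attn:    [batch, heads, seq, seq] attention weights from layer K
    visual_idx:    1-D LongTensor of image-token positions
    keep_ratio:    fraction of visual tokens to retain (0.5 ~ "1/2 tokens")
    """
    device = hidden_states.device
    visual_idx = visual_idx.to(device)

    # Average attention each visual token receives, over heads and query positions.
    scores = layer_attn.mean(dim=1).mean(dim=1)[:, visual_idx]      # [B, n_visual]
    n_keep = max(1, int(keep_ratio * visual_idx.numel()))
    top = scores.topk(n_keep, dim=-1).indices                       # [B, n_keep]

    batch_size, seq_len, _ = hidden_states.shape
    keep_mask = torch.ones(batch_size, seq_len, dtype=torch.bool, device=device)
    keep_mask[:, visual_idx] = False                                # drop all visual tokens...
    rows = torch.arange(batch_size, device=device).unsqueeze(1).expand_as(top)
    keep_mask[rows, visual_idx[top]] = True                         # ...then re-admit the top-k

    # Surviving tokens keep their original order; later layers see a shorter sequence.
    return hidden_states[keep_mask].view(batch_size, -1, hidden_states.size(-1))
```

Because the surviving hidden states keep their original order, the later layers run unchanged on a shorter sequence, which is where the FLOPs savings come from.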

Theoretical and Practical Implications

From a practical standpoint, FastV opens up new avenues for deploying state-of-the-art LVLMs in scenarios where computational resources are limited. The solution’s scalability and flexibility, demonstrated by its capacity to adjust the trade-off between efficiency and performance based on specific needs, present a significant step forward in making advanced vision-language understanding models more accessible.
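
To see how that trade-off translates into compute, the sketch below estimates the FLOPs saved as a function of the pruning layer and the fraction of visual tokens kept. Both the per-layer cost model and the LLaVA-1.5-13B-like dimensions are illustrative assumptions rather than the paper's exact accounting.

```python
# Back-of-the-envelope estimate of the compute saved by pruning visual tokens
# after an early layer. The per-layer cost model (4*n*d^2 + 2*n^2*d + 2*n*d*m
# for attention + feed-forward) is a common transformer approximation, and the
# LLaVA-1.5-13B-like dimensions below are illustrative assumptions.

def layer_flops(n, d=5120, m=13824):
    """Approximate FLOPs of one decoder layer for a sequence of n tokens."""
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def flops_reduction(n_text=64, n_visual=576, n_layers=40, k=2, keep_ratio=0.5):
    """Fraction of total prefill FLOPs saved when visual tokens are pruned after layer k."""
    full = n_text + n_visual
    pruned = n_text + int(keep_ratio * n_visual)
    baseline = n_layers * layer_flops(full)
    fastv = k * layer_flops(full) + (n_layers - k) * layer_flops(pruned)
    return 1 - fastv / baseline

print(f"Estimated FLOPs reduction: {flops_reduction():.1%}")
```

With these assumed numbers (576 visual tokens, half of them pruned after layer 2 of 40), the estimate lands in the same ballpark as the roughly 45% reduction reported for LLaVA-1.5-13B; the exact figure depends on prompt length and the chosen pruning parameters.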

Theoretically, FastV contributes to the ongoing discourse on how LVLMs process multimodal information. By uncovering the inefficiencies in attention mechanisms of LVLMs and addressing them through token pruning, FastV sheds light on the underlying dynamics of visual data processing within these models. This insight is not only crucial for improving model efficiency but also for enhancing our understanding of the cognitive processes LVLMs employ when integrating visual and textual information.

A Look into the Future

As the field of artificial intelligence continues to evolve towards more integrated multimodal systems, FastV positions itself as a pivotal contribution that aligns with the trajectory towards more efficient and scalable vision-language models. Future developments could explore the extension of FastV’s principles to other types of multimodal data beyond visual tokens, potentially opening new frontiers in the quest for computationally efficient AI models that do not sacrifice performance. Moreover, the adaptability of FastV suggests exciting possibilities for customizing models to specific operational constraints, heralding a new era of personalized AI systems that can deliver top-tier performance tailored to individual needs.

In conclusion, FastV marks a significant advancement in the optimization of LVLMs, offering a promising path towards overcoming the computational bottlenecks that have hindered the wider deployment of these models. By striking a delicate balance between efficiency and performance, FastV not only enhances the practical applicability of LVLMs but also provides a novel perspective on their operational dynamics, laying the groundwork for future innovations in the field of artificial intelligence.

Authors (7)
  1. Liang Chen
  2. Haozhe Zhao
  3. Tianyu Liu
  4. Shuai Bai
  5. Junyang Lin
  6. Chang Zhou
  7. Baobao Chang