Analysis of "Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster"
The paper addresses a significant challenge in the efficient inference of large vision-language models (VLMs). Such models, which typically pair a CLIP-style visual encoder with a large language model (as in LLaVA), encounter computational bottlenecks because each image is converted into a long sequence of visual tokens. This work introduces FasterVLM, a method that prunes visual tokens more faithfully to their actual importance in order to accelerate VLM inference.
Central Contribution and Findings
The primary contention of the paper is that existing methods, which rely on text-visual cross-attention inside the LLM to decide which tokens to prune, are misaligned with the actual importance of visual tokens. This misalignment manifests as two phenomena: attention shift and attention dispersion. Attention shift describes a bias of textual attention toward visual tokens that appear later in the sequence, which can overlook critical information carried by earlier tokens. Attention dispersion describes a lack of concentration in LLM attention, with scores spread thinly across many visual tokens rather than focused on the most informative ones. FasterVLM counteracts both problems by instead using the cross-attention between the [CLS] token and the image patch tokens inside the visual encoder.
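To make these two phenomena concrete, the following is a minimal diagnostic sketch, not taken from the paper, for quantifying them on a single attention distribution over visual tokens: shift is approximated by the mean attended position, and dispersion by the normalized entropy of the distribution. The function names and the 576-token count (a 24x24 patch grid, as in a CLIP ViT-L/14 encoder at 336px resolution) are illustrative assumptions.

```python
import torch


def attention_shift(attn: torch.Tensor) -> torch.Tensor:
    """Mean attended position over visual tokens, normalized to [0, 1].

    attn: (num_visual_tokens,) attention weights summing to 1.
    Values near 1 indicate attention skewed toward later tokens.
    """
    n = attn.shape[-1]
    positions = torch.arange(n, dtype=attn.dtype) / max(n - 1, 1)
    return (attn * positions).sum()


def attention_dispersion(attn: torch.Tensor) -> torch.Tensor:
    """Normalized entropy of the attention distribution.

    1.0 means attention is spread uniformly over all visual tokens
    (maximal dispersion); values near 0 mean it concentrates on few tokens.
    """
    eps = 1e-12
    entropy = -(attn * (attn + eps).log()).sum()
    return entropy / torch.log(torch.tensor(float(attn.shape[-1])))


# Example: diffuse attention (high dispersion) vs. peaked attention.
diffuse = torch.softmax(torch.randn(576) * 0.1, dim=-1)
peaked = torch.softmax(torch.randn(576) * 5.0, dim=-1)
print(attention_dispersion(diffuse), attention_dispersion(peaked))
```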
The approach is training-free and applies pruning before the visual tokens ever reach the LLM, yielding substantial reductions in processing time while maintaining strong performance. Notably, the authors highlight the ability to prune approximately 95% of visual tokens while retaining about 90% of the performance of LLaVA-1.5-7B, underscoring the efficacy of the [CLS]-based attention signal.
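A minimal sketch of the core idea follows, assuming access to the visual encoder's last-layer [CLS]-to-patch attention (e.g., averaged over heads) and to the projected visual tokens; the function name, tensor shapes, and keep ratio are illustrative assumptions rather than the paper's implementation.

```python
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        cls_attn: torch.Tensor,
                        keep_ratio: float = 0.05) -> torch.Tensor:
    """Keep only the visual tokens the [CLS] token attends to most.

    visual_tokens: (B, N, D) patch embeddings from the visual encoder
                   (after projection, before the LLM).
    cls_attn:      (B, N) [CLS]-to-patch attention weights.
    keep_ratio:    fraction of tokens to keep (0.05 ~= 95% pruning).
    """
    B, N, D = visual_tokens.shape
    k = max(1, int(N * keep_ratio))
    # Indices of the top-k most-attended patches, re-sorted so the
    # kept tokens stay in their original spatial order.
    topk = cls_attn.topk(k, dim=-1).indices.sort(dim=-1).values  # (B, k)
    return visual_tokens.gather(1, topk.unsqueeze(-1).expand(B, k, D))


# Toy usage with random tensors standing in for real encoder outputs.
tokens = torch.randn(2, 576, 4096)
cls_attn = torch.softmax(torch.randn(2, 576), dim=-1)
kept = prune_visual_tokens(tokens, cls_attn, keep_ratio=0.05)
print(kept.shape)  # torch.Size([2, 28, 4096])
```

Sorting the kept indices preserves the original ordering of the patches, so the LLM still sees the surviving tokens in their natural spatial sequence.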
Experimental Validation
Empirical evidence is provided through experiments across several VLM architectures, including LLaVA, LLaVA-NeXT, and Video-LLaVA. The results show that FasterVLM consistently outperforms text-visual attention-based approaches, especially at high reduction ratios, and remains robust across a range of multi-modal benchmarks while delivering considerable reductions in FLOPs and overall computational overhead. For instance, at a 95% token pruning ratio, the reported performance retention is 89.41%.
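As a back-of-the-envelope illustration, not the paper's accounting, of why removing visual tokens cuts compute so sharply, the sketch below estimates prefill FLOPs for a LLaMA-7B-sized decoder as a function of sequence length; the layer count, hidden sizes, and the 64-token text prompt are assumed values, so treat the result as an order-of-magnitude figure.

```python
def prefill_flops(seq_len: int,
                  n_layers: int = 32,
                  d_model: int = 4096,
                  d_ff: int = 11008) -> float:
    """Rough FLOPs for one transformer prefill pass (LLaMA-7B-like sizes).

    Counts only the dominant matmuls: QKV/output projections and the MLP
    (linear in seq_len), plus the attention score/value matmuls
    (quadratic in seq_len).
    """
    proj = 2 * seq_len * (4 * d_model * d_model)   # Q, K, V, O projections
    mlp = 2 * seq_len * (3 * d_model * d_ff)       # gate/up/down projections
    attn = 2 * (2 * seq_len * seq_len * d_model)   # QK^T and attn @ V
    return n_layers * (proj + mlp + attn)


text_tokens = 64
full = prefill_flops(text_tokens + 576)   # all 576 visual tokens kept
pruned = prefill_flops(text_tokens + 28)  # ~95% of visual tokens removed
print(f"estimated prefill FLOPs drop to {pruned / full:.1%} of the original")
```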
Broader Implications
On a practical level, FasterVLM advances the ability of VLMs to operate efficiently under constrained resources, making them more viable for real-world deployments that require rapid, on-the-fly inference. From a theoretical standpoint, the work challenges the prevailing reliance on text-visual cross-attention, advocating instead for a visually grounded importance signal derived from [CLS] attention. This could stimulate new directions in the optimization of VLM architectures and influence how multi-modal integration is approached in both academic research and industry applications.
Future Directions
The research opens avenues for further investigation into attention mechanisms in VLMs. Plausible future directions include studying how attention shift and dispersion evolve across layers, integrating more sophisticated alignment methods between the visual and language modalities, and extending token pruning strategies to input types beyond static images, such as dynamic video frames.
In conclusion, this paper offers a substantive contribution to efficient multi-modal inference, advocating a shift in how the importance of visual information is estimated in language-vision interaction. FasterVLM's ability to streamline the inference process with little sacrifice in performance paves the way for more agile and effective VLM applications.