HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments (2408.10945v2)

Published 20 Aug 2024 in cs.CV and cs.AI

Abstract: High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the LLM stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference. The code is available at https://github.com/hasanar1f/HiRED.

HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models

The paper, "HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-LLMs in Resource-Constrained Environments," presents a novel approach aimed at optimizing the inference efficiency of high-resolution Vision-LLMs (VLMs). Recognizing the computational challenges posed by processing numerous visual tokens, especially in environments with limited computational resources, the authors propose a token-dropping scheme named High-Resolution Early Dropping (HiRED). This system strategically reduces the number of visual tokens prior to the LLM stage, thus enhancing resource efficiency without compromising model accuracy.

Key Motivations and Approach

High-resolution VLMs have become a cornerstone in handling multimodal tasks because of their ability to process detailed visual data. However, the conventional encoding process, which partitions the high-resolution input into multiple sub-images and encodes each one, generates many redundant visual tokens and substantially increases computational demands. HiRED addresses this bottleneck by dropping less relevant tokens early in the processing pipeline, using the attention mechanisms inherent in the vision encoder to allocate a fixed token budget efficiently across image partitions.

HiRED operates through a two-phased approach:

  1. Token Budget Distribution: Using the attention maps from the vision encoder's initial layers, HiRED assesses the visual content of each image partition and allocates each partition a share of the fixed token budget in proportion to that content.
  2. Token Dropping: Using the final layer's attention, HiRED selects the most salient visual tokens within each partition's budget and drops the rest, so that only the most informative tokens are forwarded to the LLM stage. This selection is guided by importance scores computed from the CLS-token attention in the vision encoder, as sketched in the code below.
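
To make the two phases concrete, below is a minimal PyTorch sketch of the selection logic. The tensor shapes and the function interface are illustrative assumptions rather than the authors' API; the official implementation lives in the linked repository.

```python
import torch

def hired_token_selection(cls_attn_initial, cls_attn_final, budget_frac=0.2):
    """Minimal sketch of HiRED's two-phase token dropping (hypothetical
    interface; see https://github.com/hasanar1f/HiRED for the real one).

    cls_attn_initial: [num_partitions, num_tokens] CLS-to-patch attention
        from an early vision-encoder layer (scores partition content).
    cls_attn_final:   [num_partitions, num_tokens] CLS-to-patch attention
        from the final layer (ranks tokens within each partition).
    """
    num_partitions, num_tokens = cls_attn_final.shape
    total_budget = int(budget_frac * num_partitions * num_tokens)

    # Phase 1: distribute the fixed budget across partitions in
    # proportion to how much early-layer CLS attention each receives.
    partition_scores = cls_attn_initial.sum(dim=1)
    partition_budgets = (total_budget * partition_scores /
                         partition_scores.sum()).long()

    # Phase 2: within each partition, keep the top-k tokens ranked by
    # final-layer CLS attention; drop the rest.
    keep_indices = []
    for p in range(num_partitions):
        k = min(int(partition_budgets[p].item()), num_tokens)
        top = torch.topk(cls_attn_final[p], k=k).indices
        keep_indices.append(top.sort().values)  # restore spatial order
    return keep_indices
```

Note that the kept indices are re-sorted into their original spatial order: the LLM consumes visual tokens as an ordered sequence, so shuffling them would scramble the positional structure.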

Evaluation and Results

The empirical results, as detailed in the paper, show that HiRED significantly enhances token generation throughput (by approximately 4.7×) while reducing first-token generation latency by 15 seconds and saving 2.3 GB of GPU memory on a platform such as the NVIDIA Tesla P40 GPU. These improvements are achieved while maintaining competitive accuracy across various benchmarks. The authors tested their approach on high-resolution VLMs such as LLaVA-Next-7B, demonstrating that HiRED's accuracy was on par with the full-token-budget baselines and, on some tasks, even exceeded them.
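
For context on how such numbers are typically obtained, here is a hypothetical profiling helper. It assumes a Hugging Face-style generate() API and is not taken from the paper's evaluation harness.

```python
import time
import torch

def profile_generation(model, inputs, max_new_tokens=64):
    """Hypothetical helper (not from the paper) illustrating how
    first-token latency, decode throughput, and peak GPU memory of a
    VLM can be measured, assuming a Hugging Face-style generate() API."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()

    # Time-to-first-token: generate a single token.
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)
    torch.cuda.synchronize()
    ttft = time.perf_counter() - start

    # Decode throughput over a longer run (assumes no early EOS stop).
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()
    throughput = max_new_tokens / (time.perf_counter() - start)

    peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
    return ttft, throughput, peak_mem_gb
```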

The paper also benchmarks HiRED against existing methods such as FastV, FlexAttention, and PruMerge. HiRED shows particularly strong gains on text-recognition tasks, which are often sensitive to image resolution.

Practical and Theoretical Implications

Practically, the introduction of HiRED offers a compelling solution for deploying high-resolution VLMs in real-world scenarios where computational resources are constrained. The plug-and-play nature of HiRED facilitates its integration into existing systems without necessitating model retraining, thereby broadening its applicability.

Theoretically, the use of attention maps for early token filtering introduces new avenues for research into resource-efficient neural network processing. The insights into ViT's layer-specific attention characteristics further our understanding of how information is processed and represented across different layers of deep models.

Future Directions

The development of HiRED offers exciting possibilities for future exploration. Further investigation could include adapting HiRED to more varied multimodal platforms beyond those initially tested, enhancing its adaptability and efficiency across diverse datasets. Additionally, there is potential to explore adaptive token budgets that dynamically adjust based on image complexity, further optimizing the balance between computational efficiency and model performance.
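
As one illustration of that last direction, the budget could be tied to how dispersed the early-layer attention is. The following is purely a speculative sketch, not a mechanism proposed in the paper.

```python
import torch

def adaptive_budget(cls_attn_initial, min_frac=0.1, max_frac=0.4):
    """Speculative sketch (not from the paper): images whose early-layer
    CLS attention is spread out (high entropy) get a larger token budget
    than images where attention concentrates on a few patches."""
    probs = cls_attn_initial.flatten()
    probs = probs / probs.sum()
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    max_entropy = torch.log(torch.tensor(float(probs.numel())))
    frac = min_frac + (max_frac - min_frac) * (entropy / max_entropy)
    return frac.item()
```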

Overall, this paper makes a significant contribution to the ongoing effort to optimize VLMs for computational efficiency, making high-resolution image processing feasible in resource-limited environments. The strategic dropping of visual tokens without losing critical information sets a precedent for efficient model deployment and should inspire further advances in this domain.

Authors (6)
  1. Kazi Hasan Ibn Arif
  2. JinYi Yoon
  3. Dimitrios S. Nikolopoulos
  4. Hans Vandierendonck
  5. Deepu John
  6. Bo Ji