- The paper introduces DyCoke, which dynamically compresses visual tokens to reduce spatial and temporal redundancy in video LLM inference.
- It uses a two-stage process: merging temporally similar tokens via cosine similarity, then dynamically pruning the KV cache during decoding for improved efficiency.
- Experiments demonstrate up to 1.5x speedup and 1.4x memory reduction, outperforming state-of-the-art methods while maintaining model performance.
Dynamic Compression of Tokens for Accelerating Video LLMs
The paper introduces DyCoke, a method for dynamic token compression in Video LLMs (VLLMs) that improves inference efficiency by addressing both spatial and temporal redundancy among visual tokens. Such efficiency gains matter because VLLMs must process large volumes of visual tokens derived from video inputs. DyCoke is training-free and plug-and-play: it requires no additional parameters or fine-tuning, which makes it straightforward to integrate into existing VLLMs.
The core contribution of DyCoke lies in its two-stage compression strategy. The first stage, Token Temporal Merging (TTM), reduces redundancy across video frames by merging temporally similar tokens, which are common because consecutive frames change little. TTM groups frames with a sliding window and uses cosine similarity to identify and merge redundant tokens, substantially cutting the number of tokens passed to subsequent processing stages; a sketch of the idea follows below.
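The snippet below is a minimal PyTorch sketch of this temporal-merging idea, not DyCoke's exact implementation: it assumes visual tokens arrive as a `[num_frames, tokens_per_frame, dim]` tensor, and the window size and similarity threshold are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def temporal_token_merge(tokens: torch.Tensor, window: int = 4, threshold: float = 0.9):
    """Merge temporally redundant visual tokens (illustrative sketch).

    tokens: [num_frames, tokens_per_frame, dim] frame-level visual tokens.
    Within each window of consecutive frames, a token whose cosine similarity
    to the spatially aligned token in the window's first frame exceeds
    `threshold` is treated as redundant and dropped.
    Returns a list of kept-token tensors, one per frame (later frames in a
    window may keep fewer tokens than the first).
    """
    num_frames = tokens.shape[0]
    kept = []
    for start in range(0, num_frames, window):
        frames = tokens[start:start + window]                 # [w, N, D]
        anchor = frames[0]                                     # first frame kept in full
        kept.append(anchor)
        for frame in frames[1:]:
            sim = F.cosine_similarity(frame, anchor, dim=-1)   # [N]
            kept.append(frame[sim < threshold])                # keep only dissimilar tokens
    return kept

# Example: 16 frames of 196 tokens each, 1024-dim features.
merged = temporal_token_merge(torch.randn(16, 196, 1024))
```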
The second stage introduces dynamic pruning of the Key-Value (KV) cache during decoding. At each decoding step, visual tokens are retained or pruned based on their attention scores with respect to the newly predicted tokens. This dynamic behavior addresses a weakness of static, one-shot pruning, which can discard tokens that only become relevant for understanding later parts of the video. By re-evaluating relevance at every step, DyCoke keeps the tokens most pertinent to the current prediction while maintaining model performance; a sketch of this selection logic appears below.
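As an illustration of attention-guided KV-cache pruning for a single layer and head, the sketch below scores cached visual entries by their attention logits under the latest query and keeps only the top fraction. The function name, `keep_ratio`, and the assumption that visual tokens occupy one contiguous slice of the cache are all hypothetical, not DyCoke's actual interface.

```python
import torch

def prune_visual_kv(keys, values, query, visual_slice, keep_ratio=0.3):
    """Attention-guided pruning of visual entries in a per-head KV cache (sketch).

    keys, values : [seq_len, dim] cached keys/values for one layer and head.
    query        : [dim] query vector of the most recently predicted token.
    visual_slice : Python slice covering the visual-token positions in the cache.
    Visual entries are scored by their attention logits under `query`; only the
    top `keep_ratio` fraction is retained, and all text tokens are kept.
    """
    dim = keys.shape[-1]
    vis_keys = keys[visual_slice]                        # [V, D]
    scores = vis_keys @ query / dim ** 0.5               # attention logits, [V]
    k = max(1, int(keep_ratio * vis_keys.shape[0]))
    top = scores.topk(k).indices.sort().values           # keep original token order

    vis_start = visual_slice.start or 0
    keep_idx = torch.cat([
        torch.arange(vis_start),                         # tokens before the visual span
        top + vis_start,                                  # retained visual tokens
        torch.arange(visual_slice.stop, keys.shape[0]),   # tokens after the visual span
    ])
    return keys[keep_idx], values[keep_idx]
```

In a full decoder the retained indices would be applied per layer (and per head) and the cache trimmed in place at every decoding step; the sketch only shows the selection logic for one head.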
Extensive experiments show that DyCoke consistently outperforms state-of-the-art counterparts such as FastV and PruMerge, achieving up to 1.5x inference speedup and 1.4x memory reduction while matching or improving accuracy across several video QA and reasoning benchmarks. In particular, its handling of temporal redundancy exploits video structure more effectively, allowing longer video sequences to be processed within the same computational budget.
The implications of DyCoke are twofold. Practically, it enables more efficient deployment of VLLMs, widening their applicability in resource-constrained environments. Theoretically, it underscores the value of dynamic redundancy handling, inviting further work on adaptability and efficiency in multimodal deep learning. Future work could refine the dynamic pruning stage, for example with adaptive mechanisms that learn pruning policies over time or across diverse video contexts.
In sum, DyCoke presents a significant improvement for VLLMs by optimizing token processing through innovative temporal-spatial strategies, setting a precedent for future research in efficient multimodal AI systems. The method’s ability to enhance performance without additional training marks a practical advancement in the pursuit of scalable and adaptable AI models capable of handling complex video inputs.