- The paper introduces DyCoke, which dynamically compresses visual tokens to reduce spatial and temporal redundancy in video LLM inference.
- It uses a two-stage process: merging temporally similar tokens via cosine similarity, then dynamically pruning the KV cache during decoding for improved efficiency.
- Experiments demonstrate up to 1.5x speedup and 1.4x memory reduction, outperforming state-of-the-art methods while maintaining model performance.
Dynamic Compression of Tokens for Accelerating Video LLMs
The paper introduces DyCoke, a method for dynamic token compression in Video LLMs (VLLMs) that improves inference efficiency by addressing both spatial and temporal redundancy among visual tokens. Such efficiency gains matter because VLLMs must process large volumes of visual tokens derived from video inputs. DyCoke is training-free and plug-and-play: it requires no additional parameters or fine-tuning, which makes it straightforward to integrate into existing VLLMs.
The core contribution of DyCoke lies in its two-stage compression strategy. The first stage, Token Temporal Merging (TTM), reduces redundancy across video frames by merging temporally similar tokens, which are common because consecutive frames change little. TTM groups frames with a sliding window and uses cosine similarity to identify and merge redundant tokens, substantially cutting the number of tokens passed to subsequent processing stages; a sketch of the idea follows below.
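The snippet below is a minimal PyTorch sketch of this temporal-merging idea, not DyCoke's exact implementation: it assumes visual tokens arrive as a `[num_frames, tokens_per_frame, dim]` tensor, and the window size and similarity threshold are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def temporal_token_merge(tokens: torch.Tensor, window: int = 4, threshold: float = 0.9):
    """Merge temporally redundant visual tokens (illustrative sketch).

    tokens: [num_frames, tokens_per_frame, dim] frame-level visual tokens.
    Within each window of consecutive frames, a token whose cosine similarity
    to the spatially aligned token in the window's first frame exceeds
    `threshold` is treated as redundant and dropped.
    Returns a list of kept-token tensors, one per frame (later frames in a
    window may keep fewer tokens than the first).
    """
    num_frames = tokens.shape[0]
    kept = []
    for start in range(0, num_frames, window):
        frames = tokens[start:start + window]                 # [w, N, D]
        anchor = frames[0]                                     # first frame kept in full
        kept.append(anchor)
        for frame in frames[1:]:
            sim = F.cosine_similarity(frame, anchor, dim=-1)   # [N]
            kept.append(frame[sim < threshold])                # keep only dissimilar tokens
    return kept

# Example: 16 frames of 196 tokens each, 1024-dim features.
merged = temporal_token_merge(torch.randn(16, 196, 1024))
```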
The second stage introduces dynamic pruning of the Key-Value (KV) cache during decoding. At each decoding step, visual tokens are retained or pruned based on their attention scores with respect to the newly predicted tokens. This dynamic behavior addresses a weakness of static, one-shot pruning, which can discard tokens that only become relevant for understanding later parts of the video. By re-evaluating relevance at every step, DyCoke keeps the tokens most pertinent to the current prediction while maintaining model performance; a sketch of this selection logic appears below.
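As an illustration of attention-guided KV-cache pruning for a single layer and head, the sketch below scores cached visual entries by their attention logits under the latest query and keeps only the top fraction. The function name, `keep_ratio`, and the assumption that visual tokens occupy one contiguous slice of the cache are all hypothetical, not DyCoke's actual interface.

```python
import torch

def prune_visual_kv(keys, values, query, visual_slice, keep_ratio=0.3):
    """Attention-guided pruning of visual entries in a per-head KV cache (sketch).

    keys, values : [seq_len, dim] cached keys/values for one layer and head.
    query        : [dim] query vector of the most recently predicted token.
    visual_slice : Python slice covering the visual-token positions in the cache.
    Visual entries are scored by their attention logits under `query`; only the
    top `keep_ratio` fraction is retained, and all text tokens are kept.
    """
    dim = keys.shape[-1]
    vis_keys = keys[visual_slice]                        # [V, D]
    scores = vis_keys @ query / dim ** 0.5               # attention logits, [V]
    k = max(1, int(keep_ratio * vis_keys.shape[0]))
    top = scores.topk(k).indices.sort().values           # keep original token order

    vis_start = visual_slice.start or 0
    keep_idx = torch.cat([
        torch.arange(vis_start),                         # tokens before the visual span
        top + vis_start,                                  # retained visual tokens
        torch.arange(visual_slice.stop, keys.shape[0]),   # tokens after the visual span
    ])
    return keys[keep_idx], values[keep_idx]
```

In a full decoder the retained indices would be applied per layer (and per head) and the cache trimmed in place at every decoding step; the sketch only shows the selection logic for one head.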
Extensive experiments show that DyCoke consistently outperforms state-of-the-art counterparts such as FastV and PruMerge, achieving up to 1.5x inference speedup and 1.4x memory reduction while matching or improving accuracy across several video QA and reasoning benchmarks. In particular, its handling of temporal redundancy exploits video structure more effectively, allowing longer video sequences to be processed within the same computational budget.
The implications of DyCoke are twofold. Practically, it enables more efficient deployment of VLLMs, widening their applicability in resource-constrained environments. Theoretically, it underscores the value of dynamic redundancy handling, inviting further work on adaptability and efficiency in multimodal deep learning. Future work could refine the dynamic pruning stage, for example with adaptive mechanisms that learn pruning policies over time or across diverse video contexts.
In sum, DyCoke presents a significant improvement for VLLMs by optimizing token processing through innovative temporal-spatial strategies, setting a precedent for future research in efficient multimodal AI systems. The method’s ability to enhance performance without additional training marks a practical advancement in the pursuit of scalable and adaptable AI models capable of handling complex video inputs.