
ThinK: Thinner Key Cache by Query-Driven Pruning (2407.21018v2)

Published 30 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also achieves a reduction in KV cache memory costs by over 20% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve a 2.8x reduction in peak memory usage while maintaining nearly the same quality, enabling up to a 5x increase in batch size when using a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK, establishing a new baseline algorithm for efficient LLM deployment without compromising performance.

ThinK: Thinner Key Cache by Query-Driven Pruning

The paper "ThinK: \underline{Thin}ner \underline{K}ey Cache by Query-Driven Pruning" addresses a significant challenge in managing the extensive memory and computational costs associated with LLMs during inference, particularly when handling long sequences. By proposing ThinK, a query-dependent key-value (KV) cache pruning method, the authors provide a novel approach to optimize memory usage while maintaining or enhancing model performance.

Motivation and Key Insights

LLMs have demonstrated impressive capabilities in natural language processing, achieving state-of-the-art performance in various applications such as document summarization, code generation, and conversational AI. However, the computational and memory overheads, especially with longer context sequences, impose substantial burdens due to the quadratic complexity of the transformer attention mechanism. This challenge calls for effective strategies to manage the KV cache, which grows linearly with batch size, sequence length, number of layers, heads, and channel size.
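
To make this growth concrete, here is a rough footprint estimate (a sketch only; the helper name is illustrative and fp16/bf16 storage is assumed):

    def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                       bytes_per_elem=2):
        # Keys and values are both cached, hence the leading factor of 2.
        # bytes_per_elem=2 assumes fp16/bf16 storage.
        per_token = num_layers * num_kv_heads * head_dim * bytes_per_elem
        return 2 * batch_size * seq_len * per_token

    # LLaMA-3-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
    # Prints ~16.8, i.e. roughly 16.8 GB of KV cache for a single 128K-token sequence.
    print(kv_cache_bytes(batch_size=1, seq_len=128_000, num_layers=32,
                         num_kv_heads=8, head_dim=128) / 1e9)

Because the whole cache must stay resident for every sequence in a batch during decoding, shrinking any factor in this product yields a direct memory saving; ThinK targets the channel (head_dim) dimension of the keys.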

Previous methods primarily focused on quantization or on pruning driven by token sparsity and inter-layer redundancy. The authors observe, however, that the channel dimension of the KV cache remains largely underexplored despite exhibiting notable redundancy, characterized by an unbalanced magnitude distribution and a low-rank structure in the attention weights.
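
Both observations can be checked directly on a captured key cache. The sketch below uses random tensors as stand-ins for keys and queries taken from a real forward pass (random data will not reproduce the skew reported in the paper; the point is only to illustrate the diagnostics):

    import torch

    seq_len, head_dim = 1024, 128
    K = torch.randn(seq_len, head_dim)   # stand-in for one head's cached keys
    Q = torch.randn(16, head_dim)        # stand-in for recent query vectors

    # Diagnostic 1: per-channel magnitude of the key cache.
    channel_norms = K.norm(dim=0)                      # shape: (head_dim,)
    print("channel norm range:", channel_norms.min().item(), "-", channel_norms.max().item())

    # Diagnostic 2: singular-value spectrum of the attention scores.
    scores = (Q @ K.T) / head_dim ** 0.5               # shape: (16, seq_len)
    print("top singular values:", torch.linalg.svdvals(scores)[:5])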

Methodology: The ThinK Approach

Based on the identified channel redundancy, the authors propose ThinK, a query-driven KV cache pruning technique. Their approach involves the following key steps:

  1. Magnitude-Based Observation: They illustrate that certain channels exhibit considerable magnitudes, suggesting the potential for pruning less significant channels.
  2. Singular Value Analysis: Singular value decomposition (SVD) of attention scores reveals a low-rank structure, reinforcing the potential for effective channel pruning.
  3. Optimization Problem Formulation: The pruning task is framed as an optimization problem, aiming to minimize the attention weight loss due to pruning.
  4. Query-Dependent Pruning Criterion: The authors introduce a novel query-dependent criterion to evaluate the importance of each channel. Channels are selected using a greedy algorithm based on their contributions to the attention weights (a simplified sketch follows this list).
  5. Implementation Considerations: ThinK integrates seamlessly with existing optimization techniques like FlashAttention and incorporates strategies to minimize computational costs.
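
The listing below is a simplified sketch of the query-driven selection in step 4, assuming a per-channel score given by the product of query and key channel norms and a single top-k selection; the paper's exact scoring and greedy procedure may differ, and the function name is illustrative:

    import torch

    def prune_key_channels(K, Q, keep_ratio=0.6):
        # K: (seq_len, head_dim) cached keys for one head.
        # Q: (n_queries, head_dim) recent query vectors used as the observation window.
        # Score each channel by an estimate of its contribution to Q @ K^T.
        score = Q.norm(dim=0) * K.norm(dim=0)            # shape: (head_dim,)
        k = max(1, int(keep_ratio * K.shape[1]))
        keep_idx = torch.topk(score, k).indices
        return K[:, keep_idx], keep_idx

    K = torch.randn(2048, 128)
    Q = torch.randn(32, 128)
    K_pruned, keep_idx = prune_key_channels(K, Q, keep_ratio=0.6)   # prune 40% of channels
    print(K.shape, "->", K_pruned.shape)                            # (2048, 128) -> (2048, 76)

At attention time the incoming query must be sliced to the same retained channels (via keep_idx) before the dot product with the pruned keys; the value cache is untouched here, mirroring the paper's focus on the key cache.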

Experimental Evaluation

The authors conducted extensive evaluations using LLaMA3 and Mistral models, testing ThinK on various long-sequence datasets from the LongBench benchmark. The results are compelling:

  • Memory Reduction: ThinK achieves over 20% reduction in KV cache memory costs compared to baseline methods like Heavy Hitter Oracle (H2O) and SnapKV.
  • Performance: The approach not only maintains but in several cases enhances model accuracy.
  • Robustness: ThinK demonstrates robust performance across different KV cache sizes and pruning ratios, retaining the ability to handle "Needle-in-a-Haystack" scenarios effectively.

Strong Numerical Results

Integrating ThinK with H2O and SnapKV, two state-of-the-art KV cache compression methods, shows that a 40% key-channel pruning ratio can outperform the same methods without channel pruning. For instance, in the LongBench evaluation with a KV cache size of 2048, ThinK reached or surpassed the performance of models with full-sized KV caches.
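
A back-of-the-envelope check connects that ratio to the "over 20%" saving quoted in the abstract, assuming keys and values occupy equal shares of the cache and only keys are pruned: 0.5 (key share of the KV cache) × 0.4 (channels pruned) = 0.2, i.e. roughly a 20% reduction in total KV cache memory on top of whatever token-level compression H2O or SnapKV already provides.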

Implications and Future Directions

The practical implications of this research are profound. By significantly reducing memory and computational overheads, ThinK facilitates more efficient deployment of LLMs in resource-constrained environments, broadening access for applications that require long sequences or real-time processing.

Theoretically, the paper pushes the boundaries of current understanding of channel redundancy in transformer models. It offers a fresh perspective on how query-specific evaluations can be leveraged for efficient model optimization.

Future Work: Future research could focus on enhancing the pruning ratio without performance degradation, further exploring value cache pruning, and evaluating the efficacy of more sophisticated compositional methods that combine both token-level and channel-level pruning criteria.

Conclusion

ThinK offers a compelling and efficient solution for managing the memory and computational demands of LLMs during inference. Its query-driven pruning technique sets a new precedent in the field by addressing the underexplored dimension of channel redundancy in KV caches. The method not only highlights significant memory savings but also maintains, if not enhances, model accuracy, thereby advancing both practical deployment and theoretical understanding of LLM optimization.

References (56)
  1. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
  2. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  3. Eigen analysis of self-attention and its reconstruction from partial computation. arXiv preprint arXiv:2106.08823.
  4. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146.
  5. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
  6. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  8. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426.
  9. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  10. Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  11. Demmel, J. W. (1997). Applied numerical linear algebra. SIAM.
  12. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
  13. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
  14. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.
  15. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398.
  16. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
  17. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
  18. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.
  19. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv.
  20. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  21. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079.
  22. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  23. Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826.
  24. Mistral 7B. arXiv preprint arXiv:2310.06825.
  25. Kamradt, G. (2023). Needle In A Haystack - pressure testing LLMs. Github.
  26. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  27. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.
  28. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.
  29. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
  30. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100.
  31. Minicache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366.
  32. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
  33. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
  34. Meta, A. (2024). Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI.
  35. OpenAI (2022). OpenAI: Introducing ChatGPT.
  36. OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  37. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  38. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  39. Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542.
  40. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13693–13696.
  41. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  42. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  43. Attention is all you need. Advances in neural information processing systems, 30.
  44. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1.
  45. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  46. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  47. Layer-condensed kv cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637.
  48. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
  49. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  50. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
  51. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717.
  52. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. arXiv preprint arXiv:2405.12532.
  53. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57.
  54. Hibert: Document level pre-training of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566.
  55. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
  56. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36.
Authors (9)
  1. Yuhui Xu (28 papers)
  2. Zhanming Jie (13 papers)
  3. Hanze Dong (43 papers)
  4. Lei Wang (975 papers)
  5. Xudong Lu (17 papers)
  6. Aojun Zhou (45 papers)
  7. Amrita Saha (23 papers)
  8. Caiming Xiong (337 papers)
  9. Doyen Sahoo (47 papers)