
ThinK: Thinner Key Cache by Query-Driven Pruning (2407.21018v2)

Published 30 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. However, their increased computational and memory demands present significant challenges, especially when handling long sequences. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. Unlike existing approaches that optimize the memory based on the sequence length, we identify substantial redundancy in the channel dimension of the KV cache, as indicated by an uneven magnitude distribution and a low-rank structure in the attention weights. In response, we propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels. Our approach not only maintains or enhances model accuracy but also achieves a reduction in KV cache memory costs by over 20% compared with vanilla KV cache eviction and quantization methods. For instance, ThinK integrated with KIVI can achieve a 2.8x reduction in peak memory usage while maintaining nearly the same quality, enabling up to a 5x increase in batch size when using a single GPU. Extensive evaluations on the LLaMA and Mistral models across various long-sequence datasets verified the efficiency of ThinK, establishing a new baseline algorithm for efficient LLM deployment without compromising performance.

ThinK: Thinner Key Cache by Query-Driven Pruning

The paper "ThinK: \underline{Thin}ner \underline{K}ey Cache by Query-Driven Pruning" addresses a significant challenge in managing the extensive memory and computational costs associated with LLMs during inference, particularly when handling long sequences. By proposing ThinK, a query-dependent key-value (KV) cache pruning method, the authors provide a novel approach to optimize memory usage while maintaining or enhancing model performance.

Motivation and Key Insights

LLMs have demonstrated impressive capabilities in natural language processing, achieving state-of-the-art performance in various applications such as document summarization, code generation, and conversational AI. However, the computational and memory overheads, especially with longer context sequences, impose substantial burdens due to the quadratic complexity of the transformer attention mechanism. This challenge calls for effective strategies to manage the KV cache, which grows linearly with batch size, sequence length, number of layers, heads, and channel size.
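
To make this growth concrete, here is a rough footprint estimate (a sketch only; the helper name is illustrative and fp16/bf16 storage is assumed):

    def kv_cache_bytes(batch_size, seq_len, num_layers, num_kv_heads, head_dim,
                       bytes_per_elem=2):
        # Keys and values are both cached, hence the leading factor of 2.
        # bytes_per_elem=2 assumes fp16/bf16 storage.
        per_token = num_layers * num_kv_heads * head_dim * bytes_per_elem
        return 2 * batch_size * seq_len * per_token

    # LLaMA-3-8B-like shapes: 32 layers, 8 KV heads (GQA), head_dim 128.
    # Prints ~16.8, i.e. roughly 16.8 GB of KV cache for a single 128K-token sequence.
    print(kv_cache_bytes(batch_size=1, seq_len=128_000, num_layers=32,
                         num_kv_heads=8, head_dim=128) / 1e9)

Because the whole cache must stay resident for every sequence in a batch during decoding, shrinking any factor in this product yields a direct memory saving; ThinK targets the channel (head_dim) dimension of the keys.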

Previous methods primarily focused on quantization or on pruning driven by token sparsity and inter-layer redundancy. The authors observe, however, that the channel dimension of the KV cache remains largely underexplored despite exhibiting notable redundancy, characterized by an unbalanced magnitude distribution and a low-rank structure in the attention weights.
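
Both observations can be checked directly on a captured key cache. The sketch below uses random tensors as stand-ins for keys and queries taken from a real forward pass (random data will not reproduce the skew reported in the paper; the point is only to illustrate the diagnostics):

    import torch

    seq_len, head_dim = 1024, 128
    K = torch.randn(seq_len, head_dim)   # stand-in for one head's cached keys
    Q = torch.randn(16, head_dim)        # stand-in for recent query vectors

    # Diagnostic 1: per-channel magnitude of the key cache.
    channel_norms = K.norm(dim=0)                      # shape: (head_dim,)
    print("channel norm range:", channel_norms.min().item(), "-", channel_norms.max().item())

    # Diagnostic 2: singular-value spectrum of the attention scores.
    scores = (Q @ K.T) / head_dim ** 0.5               # shape: (16, seq_len)
    print("top singular values:", torch.linalg.svdvals(scores)[:5])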

Methodology: The ThinK Approach

Based on the identified channel redundancy, the authors propose ThinK, a query-driven KV cache pruning technique. Their approach involves the following key steps:

  1. Magnitude-Based Observation: They illustrate that certain channels exhibit considerable magnitudes, suggesting the potential for pruning less significant channels.
  2. Singular Value Analysis: Singular value decomposition (SVD) of attention scores reveals a low-rank structure, reinforcing the potential for effective channel pruning.
  3. Optimization Problem Formulation: The pruning task is framed as an optimization problem, aiming to minimize the attention weight loss due to pruning.
  4. Query-Dependent Pruning Criterion: The authors introduce a novel query-dependent criterion to evaluate the importance of each channel. Channels are selected using a greedy algorithm based on their contributions to the attention weights (a simplified sketch follows this list).
  5. Implementation Considerations: ThinK integrates seamlessly with existing optimization techniques like FlashAttention and incorporates strategies to minimize computational costs.
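
The listing below is a simplified sketch of the query-driven selection in step 4, assuming a per-channel score given by the product of query and key channel norms and a single top-k selection; the paper's exact scoring and greedy procedure may differ, and the function name is illustrative:

    import torch

    def prune_key_channels(K, Q, keep_ratio=0.6):
        # K: (seq_len, head_dim) cached keys for one head.
        # Q: (n_queries, head_dim) recent query vectors used as the observation window.
        # Score each channel by an estimate of its contribution to Q @ K^T.
        score = Q.norm(dim=0) * K.norm(dim=0)            # shape: (head_dim,)
        k = max(1, int(keep_ratio * K.shape[1]))
        keep_idx = torch.topk(score, k).indices
        return K[:, keep_idx], keep_idx

    K = torch.randn(2048, 128)
    Q = torch.randn(32, 128)
    K_pruned, keep_idx = prune_key_channels(K, Q, keep_ratio=0.6)   # prune 40% of channels
    print(K.shape, "->", K_pruned.shape)                            # (2048, 128) -> (2048, 76)

At attention time the incoming query must be sliced to the same retained channels (via keep_idx) before the dot product with the pruned keys; the value cache is untouched here, mirroring the paper's focus on the key cache.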

Experimental Evaluation

The authors conducted extensive evaluations using LLaMA3 and Mistral models, testing ThinK on various long-sequence datasets from the LongBench benchmark. The results are compelling:

  • Memory Reduction: ThinK achieves over 20% reduction in KV cache memory costs compared to baseline methods like Heavy Hitter Oracle (H2O) and SnapKV.
  • Performance: The approach not only maintains but in several cases enhances model accuracy.
  • Robustness: ThinK demonstrates robust performance across different KV cache sizes and pruning ratios, retaining the ability to handle "Needle-in-a-Haystack" scenarios effectively.

Strong Numerical Results

Integrating ThinK with H2O and SnapKV, two state-of-the-art KV cache compression methods, shows that a 40% key-channel pruning ratio can outperform the same methods without channel pruning. For instance, in the LongBench evaluation with a KV cache size of 2048, ThinK reached or surpassed the performance of models with full-sized KV caches.
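
A back-of-the-envelope check connects that ratio to the "over 20%" saving quoted in the abstract, assuming keys and values occupy equal shares of the cache and only keys are pruned: 0.5 (key share of the KV cache) × 0.4 (channels pruned) = 0.2, i.e. roughly a 20% reduction in total KV cache memory on top of whatever token-level compression H2O or SnapKV already provides.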

Implications and Future Directions

The practical implications of this research are profound. By significantly reducing memory and computational overheads, ThinK facilitates more efficient deployment of LLMs in resource-constrained environments, broadening access for applications that require long sequences or real-time processing.

Theoretically, the paper pushes the boundaries of current understanding of channel redundancy in transformer models. It offers a fresh perspective on how query-specific evaluations can be leveraged for efficient model optimization.

Future Work: Future research could focus on enhancing the pruning ratio without performance degradation, further exploring value cache pruning, and evaluating the efficacy of more sophisticated compositional methods that combine both token-level and channel-level pruning criteria.

Conclusion

ThinK offers a compelling and efficient solution for managing the memory and computational demands of LLMs during inference. Its query-driven pruning technique sets a new precedent in the field by addressing the underexplored dimension of channel redundancy in KV caches. The method not only highlights significant memory savings but also maintains, if not enhances, model accuracy, thereby advancing both practical deployment and theoretical understanding of LLM optimization.

References (56)
  1. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
  2. LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  3. Eigen analysis of self-attention and its reconstruction from partial computation. arXiv preprint arXiv:2106.08823.
  4. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146.
  5. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.
  6. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  8. Scatterbrain: Unifying sparse and low-rank attention. Advances in Neural Information Processing Systems, 34:17413–17426.
  9. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  10. Dao, T. (2023). FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  11. Demmel, J. W. (1997). Applied numerical linear algebra. SIAM.
  12. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
  13. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233.
  14. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863.
  15. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398.
  16. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
  17. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
  18. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.
  19. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. TechRxiv.
  20. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  21. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079.
  22. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  23. Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826.
  24. Mistral 7B. arXiv preprint arXiv:2310.06825.
  25. Kamradt, G. (2023). Needle In A Haystack - pressure testing LLMs. Github.
  26. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  27. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.
  28. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR.
  29. SnapKV: LLM knows what you are looking for before generation. arXiv preprint arXiv:2404.14469.
  30. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100.
  31. Minicache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366.
  32. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. arXiv preprint arXiv:2402.02750.
  33. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720.
  34. Meta, A. (2024). Introducing Meta Llama 3: The most capable openly available LLM to date. Meta AI.
  35. OpenAI (2022). OpenAI: Introducing ChatGPT.
  36. OpenAI (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  37. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530.
  38. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  39. Loki: Low-rank keys for efficient sparse attention. arXiv preprint arXiv:2406.02542.
  40. Energy and policy considerations for modern deep learning research. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13693–13696.
  41. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  42. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  43. Attention is all you need. Advances in neural information processing systems, 30.
  44. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1.
  45. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  46. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, pages 38–45.
  47. Layer-condensed kv cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637.
  48. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR.
  49. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
  50. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
  51. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717.
  52. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. arXiv preprint arXiv:2405.12532.
  53. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57.
  54. Hibert: Document level pre-training of hierarchical bidirectional transformers for document summarization. arXiv preprint arXiv:1905.06566.
  55. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069.
  56. H2O: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36.
Authors (9)
  1. Yuhui Xu (28 papers)
  2. Zhanming Jie (13 papers)
  3. Hanze Dong (43 papers)
  4. Lei Wang (975 papers)
  5. Xudong Lu (17 papers)
  6. Aojun Zhou (45 papers)
  7. Amrita Saha (23 papers)
  8. Caiming Xiong (337 papers)
  9. Doyen Sahoo (47 papers)