
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs (2310.01801v4)

Published 3 Oct 2023 in cs.CL

Abstract: In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for LLMs. Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various tasks, FastGen demonstrates substantial reduction in GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.

Overview of Adaptive KV Cache Compression for LLMs

The paper introduces an innovative method, "Adaptive KV Cache Compression for LLMs," which addresses the growing computational and memory demands associated with generative inference in LLMs. Traditional KV cache mechanisms in LLMs store key and value vectors for all tokens in the input context, which results in significant memory consumption, especially as the model size and generation length increase. This paper proposes a solution that dynamically compresses the KV cache by profiling attention structures and employing selective compression strategies tailored to the behavior of different attention heads within the model.
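
To make the memory pressure concrete, the back-of-the-envelope sketch below estimates the size of a full KV cache for a single sequence. The configuration values (80 layers, 64 heads, head dimension 128, fp16 storage, roughly a LLaMA-65B-scale model) and the 4k-token context are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope estimate of the full KV cache size for one sequence.
# All configuration values are illustrative (roughly LLaMA-65B in fp16).
def kv_cache_bytes(num_layers=80, num_heads=64, head_dim=128,
                   seq_len=4096, bytes_per_elem=2):
    # Factor of 2: both a key and a value vector are stored per token,
    # per head, per layer.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem

print(f"{kv_cache_bytes() / 1024**3:.1f} GiB")  # ~10.0 GiB for one 4k-token sequence
```

The cache grows linearly with sequence length and batch size, which is why eviction policies that keep only a subset of positions can translate directly into memory savings.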

Methodology

  1. FastGen Framework: The proposed method, FastGen, follows a two-phase approach: model profiling followed by adaptive KV cache construction. During the profiling phase, structural patterns of the attention modules are detected. These insights then guide KV cache construction by dynamically adjusting each attention head's compression policy throughout token generation (see the profiling sketch after this list).
  2. Compression Strategies (combined in the hybrid keep-mask sketch after this list):
    • Special Tokens are retained for attention heads that focus on special tokens.
    • Punctuation-based retention is applied to heads primarily attending to punctuation.
    • Local Context retention evicts long-range context for heads with a local focus.
    • Heavy Hitters (Frequency) are preserved based on accumulated attention scores.
    • Hybrid policies combine these strategies, allowing flexible adaptation to each head's structural behavior.
  3. Implementation: The approach integrates seamlessly with existing LLMs as a plug-and-play solution, negating the need for retraining or fine-tuning. This is accomplished through efficient attention profiling algorithms, enabling the method's deployment with minimal overhead.
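
A minimal sketch of the profiling idea in item 1, under the assumption that candidate policies are scored by how much of a head's attention mass their retained positions recover on the prompt, and the cheapest sufficient policy is kept. The helper names and the 0.95 threshold are illustrative, not the released FastGen code.

```python
import torch

def recovered_attention(attn, keep_mask):
    """Fraction of the head's attention mass that falls on retained positions.

    attn:      (num_queries, num_keys) softmax attention map from the prompt.
    keep_mask: (num_keys,) boolean mask of positions the policy would keep.
    """
    return (attn * keep_mask).sum() / attn.sum()

def choose_policy(attn, candidate_policies, threshold=0.95):
    """Pick the cheapest policy whose retained positions recover at least
    `threshold` of the attention mass; otherwise fall back to the full cache.

    candidate_policies: list of (name, keep_mask) pairs ordered from
    cheapest (fewest retained tokens) to most expensive.
    """
    for name, keep_mask in candidate_policies:
        if recovered_attention(attn, keep_mask) >= threshold:
            return name
    return "full"
```

Ordering the candidates from cheapest to most expensive means a head only pays for as much cache as its attention pattern actually needs; heads that attend broadly fall through to the standard full cache.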
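
And a sketch of how the strategies in item 2 could be combined into a single hybrid keep-mask over cached positions for one head. The function name, the window size, the heavy-hitter count, and the token-id sets are placeholder assumptions for illustration, not settings from the paper.

```python
import torch

def hybrid_keep_mask(token_ids, attn_scores, special_ids, punct_ids,
                     local_window=64, num_heavy_hitters=32):
    """Boolean mask over cached positions for one attention head, keeping
    special tokens, punctuation, a recent local window, and the tokens with
    the largest accumulated attention ("heavy hitters")."""
    seq_len = token_ids.shape[0]
    keep = torch.zeros(seq_len, dtype=torch.bool)

    # Special tokens and punctuation are always retained.
    keep |= torch.isin(token_ids, special_ids)
    keep |= torch.isin(token_ids, punct_ids)

    # Local context: keep the most recent tokens.
    keep[-local_window:] = True

    # Heavy hitters: keep tokens with the highest accumulated attention.
    topk = torch.topk(attn_scores, k=min(num_heavy_hitters, seq_len)).indices
    keep[topk] = True
    return keep
```

In FastGen, the policy for each head is chosen once during prompt encoding and then applied throughout generation, so the profiling cost is paid only once per request.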

Experimental Results

The effectiveness of FastGen is validated across several tasks spanning math, coding, and instruction following (e.g., GSM8K, HumanEval, and AlpacaEval). Noteworthy findings include:

  • Memory Reduction: FastGen compresses the KV cache significantly, achieving up to 56.7% memory reduction in larger models like the 65B parameter Llama 1, while maintaining 95% of attention score fidelity.
  • Performance: The trade-off analysis demonstrates that FastGen maintains a competitive win rate against full-cache models, sustaining generation quality with compressed memory footprints.
  • Latency Improvements: FastGen contributes to substantial end-to-end latency reductions of up to 55%, illustrating its practical value in real-world deployment scenarios.

Implications and Future Work

The adaptive approach to KV cache compression advances the field of efficient LLM deployment, particularly in resource-constrained environments. The ability to tailor caching behavior to each attention head's function presents an avenue for reducing both computational load and energy consumption. Future research may focus on integrating these compression techniques with other model efficiency strategies, such as quantization or pruning. Additionally, adapting FastGen to newer models or architectures that employ different attention mechanisms could further extend its applicability.

In conclusion, the adaptive KV cache compression framework provides a robust, adaptable solution for enhancing the efficiency of LLMs, offering a practical pathway to scale large models sustainably without compromising performance.

Authors (6)
  1. Suyu Ge (12 papers)
  2. Yunan Zhang (13 papers)
  3. Liyuan Liu (49 papers)
  4. Minjia Zhang (54 papers)
  5. Jiawei Han (263 papers)
  6. Jianfeng Gao (344 papers)
Citations (133)