
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression (2406.11430v4)

Published 17 Jun 2024 in cs.CL and cs.AI

Abstract: The deployment of LLMs is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformer-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ norm and the attention scores over cached KV pairs, where a low $L_2$ norm of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ norm of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy. Moreover, without relying on the attention scores, this approach remains compatible with FlashAttention, enabling broader applicability.

A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression

The paper "A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression" presents a novel approach to addressing the extensive memory requirements associated with the Key-Value (KV) cache in LLMs, particularly as context lengths increase. The authors, Devoto et al., propose a heuristic that leverages the $L_2$ norm of key embeddings to compress the KV cache without sacrificing model accuracy.

Context and Motivation

In LLMs, particularly those built using decoder-only Transformer architectures, the KV cache plays a crucial role in storing keys and values derived from past tokens to avoid recomputation during generation. This cache enables efficient handling of long context dependencies, but as context length increases, so does the size of the KV cache, leading to higher memory usage and decoding latency. The authors' primary motivation is to mitigate this issue without requiring complex algorithms or significant modifications to the model architecture.
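To make the memory pressure concrete, here is a minimal sketch (in PyTorch, with illustrative tensor shapes that are assumptions rather than values from the paper) of how a per-layer KV cache grows by one entry for every generated token, which is what drives memory usage at long context lengths:

```python
import torch

# Illustrative sizes (assumptions, not values from the paper).
batch, n_heads, head_dim = 1, 32, 128

# Per-layer caches of keys and values for all past tokens.
k_cache = torch.empty(batch, n_heads, 0, head_dim)
v_cache = torch.empty(batch, n_heads, 0, head_dim)

def decode_step(k_new, v_new, k_cache, v_cache):
    # Append the current token's key/value so they need not be recomputed
    # at later steps; the cache grows linearly with the number of tokens.
    k_cache = torch.cat([k_cache, k_new], dim=2)
    v_cache = torch.cat([v_cache, v_new], dim=2)
    return k_cache, v_cache

for _ in range(4):  # four decoding steps
    k_new = torch.randn(batch, n_heads, 1, head_dim)
    v_new = torch.randn(batch, n_heads, 1, head_dim)
    k_cache, v_cache = decode_step(k_new, v_new, k_cache, v_cache)

print(k_cache.shape)  # torch.Size([1, 32, 4, 128]): one cached entry per generated token
```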

Key Observations and Methodology

The core observation underpinning this work is the high correlation between the $L_2$ norm of key embeddings ($\|k\|_2$) and their attention scores during decoding. Analysis of attention distributions across multiple layers of popular LLMs like Llama-2 reveals that key embeddings with lower $L_2$ norms are generally associated with higher attention scores. Based on this, the authors hypothesize that retaining only the keys with the lowest $L_2$ norms could effectively compress the KV cache without substantial loss of critical information.
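A rough sketch of this kind of diagnostic is shown below (PyTorch, on synthetic tensors; variable names and shapes are assumptions, and real model activations rather than random data are needed to reproduce the reported trend). It computes, for one head, each cached key's $L_2$ norm and the attention it receives from a query, then a rank correlation between the two:

```python
import torch

torch.manual_seed(0)
seq_len, head_dim = 512, 128

keys = torch.randn(seq_len, head_dim)   # cached key embeddings for one head
query = torch.randn(head_dim)           # current decoding query

# Attention the query assigns to each cached position, and each key's L2 norm.
attn = torch.softmax(keys @ query / head_dim ** 0.5, dim=-1)
key_norms = keys.norm(dim=-1)

def spearman(a, b):
    # Spearman rank correlation via Pearson correlation on ranks.
    ra = a.argsort().argsort().float()
    rb = b.argsort().argsort().float()
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return (ra * rb).sum() / (ra.norm() * rb.norm())

# On real model activations the paper reports a clear negative relationship:
# low-norm keys tend to receive high attention.
print(spearman(key_norms, attn).item())
```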

The proposed strategy compresses the KV cache by retaining only the keys with the lowest $L_2$ norms and their corresponding values, thereby reducing the cache size significantly. This approach stands out because it requires neither retraining the model nor architectural modifications, making it applicable off the shelf to any decoder-only Transformer-based LLM.
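A minimal sketch of such an eviction rule is given below (PyTorch; tensor shapes and the keep_ratio value are illustrative assumptions, not settings from the paper). It keeps, per batch element and head, the fraction of cached positions whose key embeddings have the smallest $L_2$ norms and drops the rest:

```python
import torch

def compress_kv(k_cache, v_cache, keep_ratio=0.5):
    # k_cache, v_cache: [batch, heads, seq_len, head_dim].
    # Keep the keep_ratio fraction of positions with the smallest key L2 norms.
    seq_len = k_cache.shape[2]
    n_keep = max(1, int(seq_len * keep_ratio))
    key_norms = k_cache.norm(dim=-1)                               # [batch, heads, seq_len]
    keep_idx = key_norms.topk(n_keep, dim=-1, largest=False).indices
    keep_idx = keep_idx.sort(dim=-1).values                        # preserve original token order
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, k_cache.shape[-1])
    return k_cache.gather(2, gather_idx), v_cache.gather(2, gather_idx)

k = torch.randn(1, 32, 1024, 128)
v = torch.randn(1, 32, 1024, 128)
k_small, v_small = compress_kv(k, v, keep_ratio=0.5)
print(k_small.shape)  # torch.Size([1, 32, 512, 128])
```

Because selection of this kind depends only on the cached keys themselves and never inspects attention weights, it remains usable with kernels such as FlashAttention that do not materialize attention scores, which is the compatibility benefit noted in the abstract.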

Experimental Results

The authors evaluate their proposed method on several tasks:

  1. Language Modelling:
    • The experiments demonstrate that removing up to 50% of the KV cache based on the $L_2$ norm does not degrade perplexity. Results shown in Figure 3 indicate that perplexity remains stable until a large compression ratio is applied, which confirms the practicality of the method.
    • Additional evaluations using other strategies (keeping high $L_2$ norm keys and random compression) clearly indicate that the proposed approach of retaining low $L_2$ norm keys performs significantly better.
  2. Long-Context Modelling Tasks:
    • Needle-in-a-Haystack and Passkey Retrieval Tasks:
      • The method achieves impressive results, maintaining 99% accuracy on the needle-in-a-haystack task even with a 50% KV cache reduction (Figure 8).
      • For passkey retrieval, the method maintains 100% accuracy even when compressing 90% of the KV cache (Figure 9).
      • These results starkly contrast with other compression strategies, further validating the effectiveness of the $L_2$ norm-based approach.

Theoretical Analysis and Implications

The paper introduces the concept of Attention Loss ($AL_r$) to measure the efficacy of the compression methodology. By defining the attention loss as the sum of the attention scores of dropped KV pairs, the authors quantitatively showcase the correlation between attention scores and $L_2$ norms (Equation 3).
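A hedged reconstruction of that definition (the exact notation in the paper may differ): writing $D_r$ for the set of cache positions dropped at compression rate $r$ and $\alpha_i$ for the attention score assigned to position $i$, the attention loss is $AL_r = \sum_{i \in D_r} \alpha_i$, so a low value means the evicted KV pairs would have received little attention anyway.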

The implications of these findings are substantial. The proposed method not only provides a simple, computationally efficient way to manage KV cache sizes but also paves the way for more accessible deployment of LLMs in hardware-constrained environments. Its practical benefits are most relevant to applications where long-context handling is critical, such as document summarization, translation, and large-scale question-answering systems.

Future Directions

Future research could involve extending this heuristic to different model architectures and context lengths. Additionally, further theoretical exploration into why the $L_2$ norm correlates so strongly with attention scores could provide deeper insights, potentially uncovering more sophisticated strategies for memory management. Expanding the evaluation to larger models like Llama2-13b and Llama2-70b, as well as upcoming model architectures, would be valuable to verify the generalizability of these findings.

Conclusion

The paper by Devoto et al. contributes a straightforward yet effective strategy for KV cache compression in LLMs by leveraging the $L_2$ norm of key embeddings. Their approach, validated through rigorous experimentation on various tasks, demonstrates that significant memory footprint reductions are achievable without impacting model performance. This work has clear practical implications and opens up avenues for further research and optimization in efficient AI deployment.

Authors (4)
  1. Alessio Devoto (14 papers)
  2. Yu Zhao (207 papers)
  3. Simone Scardapane (79 papers)
  4. Pasquale Minervini (88 papers)
Citations (9)