A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression
The paper "A Simple and Effective Norm-Based Strategy for KV Cache Compression" presents a novel approach to addressing the extensive memory requirements associated with the Key-Value (KV) cache in LLMs, particularly as context lengths increase. The authors, Devoto et al., propose a heuristic that leverages the norm of key embeddings to compress the KV cache without sacrificing model accuracy.
Context and Motivation
In LLMs, particularly those built on decoder-only Transformer architectures, the KV cache stores the keys and values computed for past tokens so they do not have to be recomputed at each generation step. This cache enables efficient handling of long-context dependencies, but its size grows with context length, driving up memory usage and decoding latency. The authors' primary motivation is to mitigate this cost without complex algorithms or significant modifications to the model architecture.
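As a quick illustration of the role the cache plays (a generic sketch, not code from the paper; the function name, tensor shapes, and single-head simplification are my own), a decoding step appends the new token's key and value to the cache and attends over everything stored so far:

```python
import torch

def decode_step(q_t, k_t, v_t, cache):
    """One decoding step with a KV cache (single head, for illustration).

    q_t, k_t, v_t: (batch, 1, d) projections of the newly generated token.
    cache: dict of past keys/values, each of shape (batch, seq_len, d).
    """
    cache["k"] = torch.cat([cache["k"], k_t], dim=1)  # store the new key
    cache["v"] = torch.cat([cache["v"], v_t], dim=1)  # store the new value
    d = cache["k"].shape[-1]
    scores = q_t @ cache["k"].transpose(-2, -1) / d ** 0.5  # (batch, 1, seq_len)
    attn = torch.softmax(scores, dim=-1)                    # attend over all past tokens
    return attn @ cache["v"], cache                         # context vector, updated cache

# The cache starts empty and grows by one entry per generated token,
# which is exactly the memory cost the paper targets.
cache = {"k": torch.zeros(1, 0, 64), "v": torch.zeros(1, 0, 64)}
```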
Key Observations and Methodology
The core observation underpinning this work is a strong relationship between the L2 norm of key embeddings and the attention scores they receive during decoding. An analysis of attention distributions across multiple layers of popular LLMs such as Llama-2 reveals that keys with a lower L2 norm are generally associated with higher attention scores. Based on this, the authors hypothesize that retaining only the keys with the lowest norms can effectively compress the KV cache without substantial loss of critical information.
The proposed strategy therefore compresses the KV cache by keeping the keys with the lowest L2 norms, together with their corresponding values, and dropping the rest, which reduces the cache size significantly. The approach stands out because it requires neither retraining nor architectural modifications, so it can be applied off-the-shelf to any decoder-only, Transformer-based LLM.
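To make the heuristic concrete, here is a minimal sketch of norm-based cache pruning, assuming a standard cache layout of shape (batch, num_heads, seq_len, head_dim); the function name and the keep_ratio parameter are illustrative, not the authors' implementation.

```python
import torch

def compress_kv_cache(keys, values, keep_ratio=0.5):
    """Keep the keep_ratio fraction of cached positions whose key embeddings
    have the *lowest* L2 norm; drop the rest (keys and their values together).

    keys, values: (batch, num_heads, seq_len, head_dim)
    """
    seq_len = keys.shape[2]
    n_keep = max(1, int(seq_len * keep_ratio))
    key_norms = keys.norm(dim=-1)                            # (batch, heads, seq_len)
    # Indices of the n_keep smallest-norm positions, per batch element and head.
    keep_idx = key_norms.topk(n_keep, dim=-1, largest=False).indices
    keep_idx = keep_idx.sort(dim=-1).values                  # preserve original token order
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)
```

Because the only extra work is an L2 norm and a top-k over the cached keys, the selection step is negligible next to the attention computation itself, and a sketch like this can be applied independently per layer and per head.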
Experimental Results
The authors evaluate their proposed method on several tasks:
- Language Modelling:
  - Removing up to 50% of the KV cache based on the L2 norm of the keys does not degrade perplexity. Results shown in Figure 3 indicate that perplexity remains stable until a large compression ratio is applied, confirming the practicality of the method.
  - Comparisons against alternative strategies (keeping the highest-norm keys, or dropping entries at random) show that retaining the low-norm keys performs significantly better.
- Long-Context Modelling Tasks (Needle-in-a-Haystack and Passkey Retrieval):
  - The method maintains 99% accuracy on the needle-in-a-haystack task even with a 50% KV cache reduction (Figure 8).
  - On passkey retrieval, it maintains 100% accuracy even when compressing 90% of the KV cache (Figure 9).
  - These results contrast starkly with the alternative compression strategies, further validating the effectiveness of the L2-norm-based approach.
Theoretical Analysis and Implications
The paper introduces the notion of Attention Loss (AL_r, for a given compression rate r) to measure the efficacy of the compression. Defining the attention loss as the sum of the attention scores of the dropped KV pairs, the authors quantitatively demonstrate the link between attention scores and key norms (Equation 3).
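In notation reconstructed from this description (the symbols below are shorthand and may not match the paper's exactly), the attention loss at compression rate r is the total attention mass assigned to the dropped positions:

```latex
\mathrm{AL}_r = \sum_{i \in \mathcal{D}_r} a_i,
\qquad
a_i = \operatorname{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right)_i,
```

where D_r denotes the set of KV pairs dropped at compression rate r. A small AL_r means the discarded entries would have received little attention anyway, which is what the low-norm selection achieves empirically.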
The implications of these findings are substantial. The proposed method not only provides a simple, computationally efficient way to manage KV cache size but also paves the way for more accessible deployment of LLMs in hardware-constrained environments. Its practical benefits are most notable in applications where long-context handling is critical, such as document summarization, translation, and large-scale question answering.
Future Directions
Future research could involve extending this heuristic to different model architectures and context lengths. Additionally, further theoretical exploration into why the norm correlates so strongly with attention scores could provide deeper insights, potentially uncovering more sophisticated strategies for memory management. Expanding the evaluation to larger models like Llama2-13b and Llama2-70b, as well as upcoming model architectures, would be valuable to verify the generalizability of these findings.
Conclusion
The paper by Devoto et al. contributes a straightforward yet effective strategy for KV cache compression in LLMs by leveraging the norm of key embeddings. Their approach, validated through rigorous experimentation on various tasks, demonstrates that significant memory footprint reductions are achievable without impacting model performance. This work has clear practical implications and opens up avenues for further research and optimization in efficient AI deployment.