Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression (2503.02812v1)

Published 4 Mar 2025 in cs.CL and cs.AI

Abstract: Autoregressive LLMs rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.

Summary

  • The paper introduces Q-Filters, a novel training-free method leveraging QK vector geometry to efficiently compress KV cache in autoregressive LLMs during inference.
  • Q-Filters identifies less crucial KV pairs using projections onto a context-agnostic principal direction derived from Q-K vector anisotropy.
  • Evaluations show Q-Filters outperforms StreamingLLM and is competitive with attention-based methods like SnapKV, achieving state-of-the-art performance with significant compression.

The paper introduces Q-Filters, a novel training-free Key-Value (KV) cache compression method for autoregressive LLMs. The method leverages geometric properties of Query (Q) and Key (K) vectors to approximate attention scores, allowing for efficient filtering of less crucial KV pairs based on a context-agnostic projection.

The primary challenge addressed is the growing memory bottleneck caused by the KV cache in long-context LLMs. As context lengths increase, storing past Key and Value states consumes increasingly large amounts of memory. Q-Filters aims to alleviate this by reducing the KV cache size during inference, without requiring model fine-tuning or direct access to attention weights, thus maintaining compatibility with FlashAttention.

The authors begin by observing the anisotropic nature of Query-Key representations. They posit two key observations:

  • There exists a direction $u^h \in \mathbb{S}^{d_H - 1}$ and $\epsilon = \pm 1$ such that $\mathbb{E}\left(\langle Q^h_i, u^h \rangle\right) > 0$ and $\mathbb{E}\left(\langle K^h_j, \epsilon u^h \rangle\right) > 0$, where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\mathbb{S}^{d_H-1}$ is the unit hypersphere in $\mathbb{R}^{d_H}$. This suggests a shared favored direction for the $Q^h$ and $K^h$ distributions.
  • Given $u^h = \argmax_{u \in \mathbb{S}^{d_H - 1}} \mathbb{E}\left(\langle Q^h_i, u \rangle\right)$ and an orthonormal basis $B = (u^h, u_2, ..., u_{d_H})$ of $\mathbb{R}^{d_H}$, for all attention inputs $X$: $\forall m \in [2, d_H],\ \mathbb{E}\left(\langle Q^h_i, u_m \rangle\right) \approx 0$. This implies that the anisotropy is largely uni-directional (a numerical sanity check is sketched below).
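
These two observations can be probed numerically. The sketch below uses random stand-in tensors whose mean offset merely mimics the shared bias of real query/key activations; the shapes and values are illustrative, not taken from the paper.

```python
import torch

# Minimal numerical check of the two observations for a single head.
# Q and K are random stand-ins; in practice they would be query/key
# activations gathered from a calibration forward pass of the model.
d_h = 128
Q = torch.randn(4096, d_h) + 0.5   # queries with a shared mean offset
K = torch.randn(4096, d_h) + 0.5   # keys with (roughly) the same offset

# Candidate u^h: first right singular vector of the gathered queries.
_, _, Vt = torch.linalg.svd(Q, full_matrices=False)
u = Vt[0]

# Observation 1: mean projections of Q and eps*K onto u^h are positive.
eps = torch.sign((K @ u).mean())
print("E<Q_i, u>     :", (Q @ u).mean().item())
print("E<K_j, eps*u> :", (K @ (eps * u)).mean().item())

# Observation 2: mean projections onto the remaining, orthogonal directions are ~0.
print("max_m |E<Q_i, u_m>|, m >= 2:", (Vt[1:] @ Q.mean(0)).abs().max().item())
```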

Based on these observations, the authors derive the following theorem:

Under the two previous assumptions, we have:

$$\mathbb{E}_{Q^h_i}\left(\langle Q^h_i, K^h_j \rangle\right) \approx \kappa^h \langle K^h_j, u^h \rangle$$

where $\kappa^h$ is a positive constant.

  • $Q^h_i$: Query vector for head $h$ at position $i$
  • $K^h_j$: Key vector for head $h$ at position $j$
  • $\mathbb{E}_{Q^h_i}$: Expectation over $Q^h_i$
  • $\langle Q^h_i, K^h_j \rangle$: Dot product of $Q^h_i$ and $K^h_j$
  • $\kappa^h$: Positive constant for head $h$
  • $u^h$: Principal direction of the $Q^h$ distribution, obtained in practice as its first right singular vector

This theorem suggests that the average unnormalized attention logits can be approximated by projecting $K^h_j$ onto the direction $u^h$.
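
The approximation follows almost directly from the two observations above (a reasoning sketch, not the paper's formal proof): by linearity of expectation, the average logit depends only on the mean query, and the second observation says this mean is essentially aligned with $u^h$:

$$\mathbb{E}_{Q^h_i}\left(\langle Q^h_i, K^h_j \rangle\right) = \left\langle \mathbb{E}\left(Q^h_i\right), K^h_j \right\rangle \approx \left\langle \kappa^h u^h, K^h_j \right\rangle = \kappa^h \langle K^h_j, u^h \rangle, \qquad \kappa^h = \mathbb{E}\left(\langle Q^h_i, u^h \rangle\right) > 0.$$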

The Q-Filters method consists of:

  1. Calculating Q-Filters:
    • Gather $Q^h$ representations from the model on a calibration dataset.
    • Compute the Singular Value Decomposition (SVD) of the gathered representations at each layer and head, $\mathcal{Q}^h = U \Sigma V^\top$, with $V = (v_1, v_2, ..., v_{d_H})$.
    • Obtain the positive right vector (or Q-Filter) for each head: $v_1^+ = \operatorname{sgn}\left(\mathbf{1}^\top u_1\right) v_1$, where $u_1$ is the first left singular vector; the sign ensures that queries project positively onto $v_1^+$ on average.
  2. Inference:
    • For each head, discard the Key-Value pairs whose keys $K^h_t$ have the lowest $\langle K^h_t, v_1^+ \rangle$ values (a code sketch of both steps follows this list).
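
A minimal end-to-end sketch of both steps for a single attention head is given below. It assumes PyTorch; the function names, tensor shapes, and random stand-in data are illustrative and not the authors' implementation.

```python
import torch

def compute_q_filter(q_samples: torch.Tensor) -> torch.Tensor:
    """Calibration step: estimate the Q-Filter direction for one head.

    q_samples: (num_samples, d_h) query vectors gathered from calibration data.
    Returns a unit vector v1+ onto which queries project positively on average.
    """
    _, _, Vt = torch.linalg.svd(q_samples, full_matrices=False)
    v1 = Vt[0]                                   # first right singular vector
    sign = torch.sign((q_samples @ v1).mean())   # flip so <Q_i, v1+> > 0 on average
    return sign * v1

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                q_filter: torch.Tensor, budget: int):
    """Inference step: keep the `budget` KV pairs with the highest <K_t, v1+>.

    keys, values: (seq_len, d_h) cached Keys/Values for one head.
    """
    scores = keys @ q_filter                     # proxy for average attention logits
    keep = torch.topk(scores, k=min(budget, keys.shape[0])).indices
    keep = keep.sort().values                    # preserve positional order
    return keys[keep], values[keep]

# Usage sketch with stand-in tensors (real Q/K come from the model's attention heads).
q_samples = torch.randn(4096, 128) + 0.5
q_filter = compute_q_filter(q_samples)
keys, values = torch.randn(2048, 128) + 0.5, torch.randn(2048, 128)
keys_c, values_c = compress_kv(keys, values, q_filter, budget=512)
```

Because the scoring uses only a dot product with a precomputed vector, it never touches attention weights, which is what keeps the method compatible with FlashAttention.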

The authors validate Q-Filters on language modeling, needle-in-a-haystack, and Ruler tasks, using models such as Llama-3.1-8B, Llama-3.1-70B, and Qwen-2.5-7B. The method is compared against StreamingLLM, SnapKV, K-Norm, and ExpectedAttention. Results show that Q-Filters is competitive with attention-based compression methods like SnapKV in retrieval tasks and outperforms StreamingLLM in generation tasks. Key findings include:

  • Q-Filters consistently achieves the lowest perplexity in language modeling with the KV cache size limited to 512 pairs.
  • In the needle-in-a-haystack task, Q-Filters achieves significantly higher accuracy compared to K-Norm with 64x compression.
  • On the Ruler dataset, Q-Filters achieves the highest score with a 32x compression factor.

The authors also address the robustness of the calibration dataset used to compute Q-Filters. They find that increasing the number of samples in the calibration dataset improves performance, with diminishing returns beyond 1k samples. They also demonstrate that Q-Filter vectors exhibit stability across different calibration datasets.

Furthermore, the paper analyzes the time and memory overhead induced by Q-Filters. The storage overhead is negligible compared to the total parameter count of the models, and the method remains compatible with FlashAttention. Time to First Token (TTFT) measurements show that Q-Filters maintains a performance advantage even as the sequence length increases.

Finally, the authors acknowledge a limitation of the method: Q-Filters may not be as effective for models with significantly different attention mechanisms, such as those using QK-normalization.