Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression (2503.02812v1)

Published 4 Mar 2025 in cs.CL and cs.AI

Abstract: Autoregressive LLMs rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.

Summary

  • The paper introduces Q-Filters, a novel training-free method leveraging QK vector geometry to efficiently compress KV cache in autoregressive LLMs during inference.
  • Q-Filters identifies less crucial KV pairs using projections onto a context-agnostic principal direction derived from Q-K vector anisotropy.
  • Evaluations show Q-Filters outperforms StreamingLLM and is competitive with attention-based methods like SnapKV, achieving state-of-the-art performance with significant compression.

The paper introduces Q-Filters, a novel training-free Key-Value (KV) cache compression method for autoregressive LLMs. The method leverages geometric properties of Query (Q) and Key (K) vectors to approximate attention scores, allowing for efficient filtering of less crucial KV pairs based on a context-agnostic projection.

The primary challenge addressed is the growing memory bottleneck caused by the KV cache in long-context LLMs. As context lengths increase, storing past Key and Value states consumes increasingly large amounts of memory. Q-Filters aims to alleviate this by reducing the KV cache size during inference, without requiring model fine-tuning or direct access to attention weights, thus maintaining compatibility with FlashAttention.

The authors begin by observing the anisotropic nature of Query-Key representations. They posit two key observations:

  • There exists a direction $u^h \in \mathbb{S}^{d_H - 1}$ and $\epsilon = \pm 1$ such that $\mathbb{E}\left(\langle Q^h_i, u^h \rangle\right) > 0$ and $\mathbb{E}\left(\langle K^h_j, \epsilon u^h \rangle\right) > 0$, where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\mathbb{S}^{d_H-1}$ is the unit hypersphere in $\mathbb{R}^{d_H}$. This suggests a shared favored direction for the $Q^h$ and $K^h$ distributions.
  • Given $u^h = \argmax_{u \in \mathbb{S}^{d_H - 1}} \mathbb{E}\left(\langle Q^h_i, u \rangle\right)$ and an orthonormal basis $B = (u^h, u_2, ..., u_{d_H})$ of $\mathbb{R}^{d_H}$, for all attention inputs $X$: $\forall m \in [2, d_H],\ \mathbb{E}\left(\langle Q^h_i, u_m \rangle\right) \approx 0$. This implies that the anisotropy is largely uni-directional (a numerical sanity check is sketched below).
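
These two observations can be probed numerically. The sketch below uses random stand-in tensors whose mean offset merely mimics the shared bias of real query/key activations; the shapes and values are illustrative, not taken from the paper.

```python
import torch

# Minimal numerical check of the two observations for a single head.
# Q and K are random stand-ins; in practice they would be query/key
# activations gathered from a calibration forward pass of the model.
d_h = 128
Q = torch.randn(4096, d_h) + 0.5   # queries with a shared mean offset
K = torch.randn(4096, d_h) + 0.5   # keys with (roughly) the same offset

# Candidate u^h: first right singular vector of the gathered queries.
_, _, Vt = torch.linalg.svd(Q, full_matrices=False)
u = Vt[0]

# Observation 1: mean projections of Q and eps*K onto u^h are positive.
eps = torch.sign((K @ u).mean())
print("E<Q_i, u>     :", (Q @ u).mean().item())
print("E<K_j, eps*u> :", (K @ (eps * u)).mean().item())

# Observation 2: mean projections onto the remaining, orthogonal directions are ~0.
print("max_m |E<Q_i, u_m>|, m >= 2:", (Vt[1:] @ Q.mean(0)).abs().max().item())
```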

Based on these observations, the authors derive the following theorem:

Under the two previous assumptions, we have:

$$\mathbb{E}_{Q^h_i}\left(\langle Q^h_i, K^h_j \rangle\right) \approx \kappa^h \langle K^h_j, u^h \rangle$$

where $\kappa^h$ is a positive constant.

  • $Q^h_i$: Query vector for head $h$ at position $i$
  • $K^h_j$: Key vector for head $h$ at position $j$
  • $\mathbb{E}_{Q^h_i}$: Expectation over $Q^h_i$
  • $\langle Q^h_i, K^h_j \rangle$: Dot product of $Q^h_i$ and $K^h_j$
  • $\kappa^h$: Positive constant for head $h$
  • $u^h$: Principal direction of the $Q^h$ distribution, obtained in practice as its first right singular vector

This theorem suggests that the average unnormalized attention logits can be approximated by projecting $K^h_j$ onto the direction $u^h$.
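
The approximation follows almost directly from the two observations above (a reasoning sketch, not the paper's formal proof): by linearity of expectation, the average logit depends only on the mean query, and the second observation says this mean is essentially aligned with $u^h$:

$$\mathbb{E}_{Q^h_i}\left(\langle Q^h_i, K^h_j \rangle\right) = \left\langle \mathbb{E}\left(Q^h_i\right), K^h_j \right\rangle \approx \left\langle \kappa^h u^h, K^h_j \right\rangle = \kappa^h \langle K^h_j, u^h \rangle, \qquad \kappa^h = \mathbb{E}\left(\langle Q^h_i, u^h \rangle\right) > 0.$$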

The Q-Filters method consists of:

  1. Calculating Q-Filters:
    • Gather $Q^h$ representations from the model on a calibration dataset.
    • Compute the Singular Value Decomposition (SVD) of the gathered representations at each layer and head, $\mathcal{Q}^h = U \Sigma V^\top$, with $V = (v_1, v_2, ..., v_{d_H})$.
    • Obtain the positive right vector (or Q-Filter) for each head: $v_1^+ = \operatorname{sgn}\left(\mathbf{1}^\top u_1\right) v_1$, where $u_1$ is the first left singular vector; the sign ensures that queries project positively onto $v_1^+$ on average.
  2. Inference:
    • For each head, discard the Key-Value pairs whose keys $K^h_t$ have the lowest $\langle K^h_t, v_1^+ \rangle$ values (a code sketch of both steps follows this list).
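
A minimal end-to-end sketch of both steps for a single attention head is given below. It assumes PyTorch; the function names, tensor shapes, and random stand-in data are illustrative and not the authors' implementation.

```python
import torch

def compute_q_filter(q_samples: torch.Tensor) -> torch.Tensor:
    """Calibration step: estimate the Q-Filter direction for one head.

    q_samples: (num_samples, d_h) query vectors gathered from calibration data.
    Returns a unit vector v1+ onto which queries project positively on average.
    """
    _, _, Vt = torch.linalg.svd(q_samples, full_matrices=False)
    v1 = Vt[0]                                   # first right singular vector
    sign = torch.sign((q_samples @ v1).mean())   # flip so <Q_i, v1+> > 0 on average
    return sign * v1

def compress_kv(keys: torch.Tensor, values: torch.Tensor,
                q_filter: torch.Tensor, budget: int):
    """Inference step: keep the `budget` KV pairs with the highest <K_t, v1+>.

    keys, values: (seq_len, d_h) cached Keys/Values for one head.
    """
    scores = keys @ q_filter                     # proxy for average attention logits
    keep = torch.topk(scores, k=min(budget, keys.shape[0])).indices
    keep = keep.sort().values                    # preserve positional order
    return keys[keep], values[keep]

# Usage sketch with stand-in tensors (real Q/K come from the model's attention heads).
q_samples = torch.randn(4096, 128) + 0.5
q_filter = compute_q_filter(q_samples)
keys, values = torch.randn(2048, 128) + 0.5, torch.randn(2048, 128)
keys_c, values_c = compress_kv(keys, values, q_filter, budget=512)
```

Because the scoring uses only a dot product with a precomputed vector, it never touches attention weights, which is what keeps the method compatible with FlashAttention.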

The authors validate Q-Filters on language modeling, needle-in-a-haystack, and Ruler tasks, using models such as Llama-3.1-8B, Llama-3.1-70B, and Qwen-2.5-7B. The method is compared against StreamingLLM, SnapKV, K-Norm, and ExpectedAttention. Results show that Q-Filters is competitive with attention-based compression methods like SnapKV in retrieval tasks and outperforms StreamingLLM in generation tasks. Key findings include:

  • Q-Filters consistently achieves the lowest perplexity in language modeling with the KV cache size limited to 512 pairs.
  • In the needle-in-a-haystack task, Q-Filters achieves significantly higher accuracy compared to K-Norm with 64x compression.
  • On the Ruler dataset, Q-Filters achieves the highest score with a 32x compression factor.

The authors also address the robustness of the calibration dataset used to compute Q-Filters. They find that increasing the number of samples in the calibration dataset improves performance, with diminishing returns beyond 1k samples. They also demonstrate that Q-Filter vectors exhibit stability across different calibration datasets.

Furthermore, the paper analyzes the time and memory overhead induced by Q-Filters. The storage overhead is negligible compared to the total parameter count of the models, and the method remains compatible with FlashAttention. Time to First Token (TTFT) measurements show that Q-Filters maintains a performance advantage even as the sequence length increases.

Finally, the authors acknowledge a limitation of the method: Q-Filters may not be as effective for models with significantly different attention mechanisms, such as those using QK-normalization.