- The paper introduces Q-Filters, a novel training-free method leveraging QK vector geometry to efficiently compress KV cache in autoregressive LLMs during inference.
- Q-Filters identifies less crucial KV pairs using projections onto a context-agnostic principal direction derived from Q-K vector anisotropy.
- Evaluations show Q-Filters outperforms StreamingLLM on generation tasks and is competitive with attention-based methods like SnapKV on retrieval tasks, sustaining strong performance at compression factors as high as 32x.
The paper introduces Q-Filters, a novel training-free Key-Value (KV) cache compression method for autoregressive LLMs. The method leverages geometric properties of Query (Q) and Key (K) vectors to approximate attention scores, allowing for efficient filtering of less crucial KV pairs based on a context-agnostic projection.
The primary challenge addressed is the growing memory bottleneck caused by the KV cache in long-context LLMs. As context lengths increase, storing the past Key and Value states consumes an increasingly large amount of GPU memory. Q-Filters aims to alleviate this by reducing the KV cache size during inference, without requiring model fine-tuning or access to attention weights, thus maintaining compatibility with FlashAttention.
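To get a sense of the scale involved, consider a rough back-of-the-envelope estimate (an illustration, not a figure from the paper, assuming Llama-3.1-8B-like dimensions: 32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16 storage):

```python
# Back-of-the-envelope KV cache size for a GQA model.
# Assumed dimensions (illustrative, not from the paper): 32 layers, 8 KV heads,
# head_dim 128, fp16 (2 bytes per element), 128k-token context.
def kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                   seq_len=128_000, bytes_per_elem=2):
    # Factor of 2: both Keys and Values are stored at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

print(f"{kv_cache_bytes() / 2**30:.1f} GiB per sequence")  # ~15.6 GiB
```

Even a single long sequence can thus rival the memory footprint of the model weights themselves, which is the bottleneck Q-Filters targets.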
The authors begin by observing the anisotropic nature of Query-Key representations. They posit two key observations:
- There exists a direction $u_h \in \mathcal{S}^{d_h-1}$ and $\epsilon = \pm 1$ such that $\mathbb{E}\left(\langle Q_i^h, u_h \rangle\right) > 0$ and $\mathbb{E}\left(\langle K_j^h, \epsilon u_h \rangle\right) > 0$, where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\mathcal{S}^{d_h-1}$ is the unit hypersphere in $\mathbb{R}^{d_h}$. This suggests a shared favored direction for the $Q^h$ and $K^h$ distributions.
- Given $u_h = \underset{u \in \mathcal{S}^{d_h-1}}{\arg\max}\, \mathbb{E}\left(\langle Q_i^h, u \rangle\right)$ and an orthonormal basis $B = (u_h, u_2, \dots, u_{d_h})$ of $\mathbb{R}^{d_h}$, for all attention inputs $X$: $\forall m \in [2, d_h],\ \mathbb{E}\left(\langle Q_i^h, u_m \rangle\right) \approx 0$. This implies that the anisotropy is largely uni-directional.
Based on these observations, the authors derive the following theorem:
Under the two previous assumptions, we have:

$$\mathbb{E}_{Q_i^h}\left(\langle Q_i^h, K_j^h \rangle\right) \approx \kappa_h\, \langle K_j^h, u_h \rangle$$

where $\kappa_h$ is a positive constant.
- $Q_i^h$: query vector for head $h$ at position $i$
- $K_j^h$: key vector for head $h$ at position $j$
- $\mathbb{E}_{Q_i^h}$: expectation over $Q_i^h$
- $\langle Q_i^h, K_j^h \rangle$: dot product of $Q_i^h$ and $K_j^h$
- $\kappa_h$: positive constant for head $h$
- $u_h$: the shared favored direction of the $Q^h$ distribution, estimated in practice as the principal right singular vector of gathered $Q^h$ representations
This theorem suggests that the average unnormalized attention logit for a key can be approximated, up to the positive factor $\kappa_h$, by projecting $K_j^h$ onto the direction $u_h$; since $\kappa_h > 0$, dropping it leaves the relative ranking of keys unchanged.
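A minimal numerical sketch of this approximation (synthetic anisotropic data for illustration; none of the variable names or shapes come from the paper): averaging the logits over queries gives $\langle \bar{Q}^h, K_j^h \rangle$ exactly, and this tracks $\kappa_h \langle K_j^h, u_h \rangle$ whenever the mean query is well aligned with the principal direction.

```python
import numpy as np

# Illustrative check of E_Q[<Q_i, K_j>] ≈ kappa_h * <K_j, u_h> on synthetic data.
rng = np.random.default_rng(0)
d_h, n_q, n_k = 128, 4096, 256

u_true = rng.normal(size=d_h)
u_true /= np.linalg.norm(u_true)
# Mimic the observed anisotropy: queries share a positive mean component along u_true.
Q = 2.0 * u_true + rng.normal(scale=0.5, size=(n_q, d_h))
K = rng.normal(size=(n_k, d_h))

# Estimate the favored direction as the first right singular vector of Q.
_, _, Vt = np.linalg.svd(Q, full_matrices=False)
u_h = np.sign(Q.mean(axis=0) @ Vt[0]) * Vt[0]   # orient it positively
kappa_h = (Q @ u_h).mean()                      # average query projection onto u_h

lhs = K @ Q.mean(axis=0)    # exact E_Q[<Q_i, K_j>] for each key
rhs = kappa_h * (K @ u_h)   # the approximation from the theorem
print(np.corrcoef(lhs, rhs)[0, 1])  # close to 1 when the assumptions hold
```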
The Q-Filters method consists of two stages (a code sketch follows this list):
- Calculating Q-Filters:
- Gather $Q^h$ representations from the model on a calibration dataset.
- Compute the Singular Value Decomposition (SVD) of the gathered representations at each layer and head, $Q^h = U \Sigma V^\top$, with $V = (v_1, v_2, \dots, v_{d_h})$.
- Obtain the positive right vector (or Q-Filter) for each head: $v_1^+ = \mathrm{sgn}\!\left(\mathbf{1}^\top u_1\right) v_1$, where $u_1$ is the first left singular vector (first column of $U$).
- Inference:
- For each head, discard the KV pairs whose keys $K_t^h$ have the lowest $\langle K_t^h, v_1^+ \rangle$ values.
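Below is a minimal sketch of both stages in NumPy; the function names and shapes are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def compute_q_filter(Q_h: np.ndarray) -> np.ndarray:
    """Q_h: (n_samples, d_h) query vectors gathered for one head on calibration prompts."""
    U, _, Vt = np.linalg.svd(Q_h, full_matrices=False)
    v1 = Vt[0]                          # principal right singular vector
    return np.sign(U[:, 0].sum()) * v1  # positive orientation: sgn(1^T u_1) * v_1

def kv_keep_indices(K_h: np.ndarray, q_filter: np.ndarray, keep: int) -> np.ndarray:
    """K_h: (seq_len, d_h) cached keys for one head; returns positions of KV pairs to retain."""
    scores = K_h @ q_filter              # <K_t^h, v_1^+> for every cached position
    kept = np.argsort(scores)[-keep:]    # drop the lowest-scoring pairs
    return np.sort(kept)                 # preserve temporal order of the survivors
```

The retained indices are applied identically to the Key and Value tensors of each head. Because the scores depend only on cached keys and a fixed, context-agnostic vector, no attention weights ever need to be materialized, which is what keeps the method compatible with FlashAttention.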
The authors validate Q-Filters on language modeling, needle-in-a-haystack, and Ruler tasks, using models such as Llama-3.1-8B, Llama-3.1-70B, and Qwen-2.5-7B. The method is compared against StreamingLLM, SnapKV, K-Norm, and ExpectedAttention. Results show that Q-Filters is competitive with attention-based compression methods like SnapKV on retrieval tasks and outperforms StreamingLLM on generation tasks. Key findings include:
- Q-Filters consistently achieves the lowest perplexity in language modeling with the KV cache capped at 512 pairs.
- In the needle-in-a-haystack task, Q-Filters achieves significantly higher accuracy compared to K-Norm with 64x compression.
- On the Ruler dataset, Q-Filters achieves the highest score with a 32x compression factor.
The authors also address the robustness of the calibration dataset used to compute Q-Filters. They find that increasing the number of samples in the calibration dataset improves performance, with diminishing returns beyond 1k samples. They also demonstrate that Q-Filter vectors exhibit stability across different calibration datasets.
Furthermore, the paper analyzes the time and memory overhead induced by Q-Filters. The storage overhead is deemed negligible compared to the total parameter count of the models. The method is also compatible with FlashAttention. Time to First Token (TTFT) measurements show that Q-Filters maintain a performance advantage even as the sequence length increases.
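For intuition on why the storage overhead is negligible, here is a rough estimate (illustrative assumptions, not figures from the paper: one fp32 filter vector per attention head per layer, with Llama-3.1-8B-like dimensions of 32 layers, 32 heads, head dimension 128):

```python
# Storage cost of the Q-Filters themselves under the assumed dimensions above.
n_layers, n_heads, d_h = 32, 32, 128
overhead_bytes = n_layers * n_heads * d_h * 4   # fp32 filters
print(f"{overhead_bytes / 2**20:.2f} MiB")      # 0.50 MiB, tiny next to ~16 GB of fp16 weights
```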
Finally, the authors acknowledge a limitation of the method: Q-Filters may not be as effective for models with significantly different attention mechanisms, such as those using QK-normalization.