RazorAttention: Efficient KV-Cache Compression
- RazorAttention is a KV-cache compression algorithm that preserves semantic content by distinguishing between retrieval and non-retrieval attention heads.
- It retains full caches for globally attentive retrieval heads while compressing non-retrieval heads using local buffers and a compensation token mechanism.
- Empirical evaluations show over 70% reduction in KV-cache with negligible accuracy or latency impact, enabling efficient long-context inference on standard GPUs.
RazorAttention is a KV-cache compression algorithm for Transformer-based LLMs designed to substantially reduce memory and compute overhead during long-context inference. Unlike prior “token-dropping” or sliding-window schemes, RazorAttention preserves all semantic content by exploiting the empirical distinction between globally attending retrieval heads and locally attending non-retrieval heads. RazorAttention achieves KV-cache reductions exceeding 70% on a range of LLMs with negligible impact on accuracy or latency, is training-free, and is directly compatible with FlashAttention (Tang et al., 22 Jul 2024).
1. Motivation: KV-Cache Bottlenecks in Long-Context Language Modeling
With context lengths increasing and inference workloads demanding prolonged generation windows, standard Transformer inference stores, for every layer $l \in \{1,\dots,L\}$ and attention head $h \in \{1,\dots,H\}$, two $d$-dimensional vectors (Key and Value) per token. The total cache memory therefore scales as

$$\mathrm{Mem}_{\mathrm{KV}} \propto 2 \cdot L \cdot H \cdot d \cdot N,$$

where $N$ is the number of cached (prompt plus generated) tokens.
When $N$ is large, cache requirements routinely exceed the capacity of commonly available hardware (e.g., 24 GB consumer GPUs), and moving KV pairs becomes a major inference bottleneck. Existing mitigation strategies such as token-dropping irrevocably delete information and risk catastrophic failure if later tokens depend on earlier context. RazorAttention addresses the fundamental question of whether it is possible to shrink cache size by 70% while retaining all semantic information (Tang et al., 22 Jul 2024).
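To make the scaling concrete, the following back-of-the-envelope sketch uses the publicly documented Llama-2-7B configuration (32 layers, 32 KV heads, head dimension 128) with fp16 caches and an 80K-token context; the dtype and the exact arithmetic are illustrative assumptions, not figures taken from the paper:

```python
# Back-of-the-envelope KV-cache sizing for a Llama-2-7B-like model.
# Model dimensions are the public Llama-2-7B configuration; the fp16 dtype
# and 80K context are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """2 vectors (K and V) of size head_dim per layer, per head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=80_000)
print(f"Full KV-cache @ 80K tokens: {full / 2**30:.1f} GiB")       # ~39.1 GiB, over a 24 GB GPU
print(f"After ~70% compression:    {0.3 * full / 2**30:.1f} GiB")  # ~11.7 GiB, fits comfortably
```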
2. Empirical Head Typology: Retrieval Heads Versus Non-Retrieval Heads
Empirical analysis of Transformer attention distributions reveals that a small subset of attention heads, termed retrieval heads, maintain global attention and are responsible for recovering information from the entire context. The majority—non-retrieval heads—exhibit strong locality, attending primarily to a short window or to static “attention sinks”.
Identification Procedure in RoPE Models
- Echo Heads: Attend to the most recent previous occurrence of the current token.
- Induction Heads: Attend to the token immediately following an earlier occurrence of the current token, the classic copy/induction pattern.
The identification protocol is as follows (a minimal scoring sketch follows this list):
- Sample random tokens, replicate them 4× to suppress semantic structure.
- Execute a forward pass, record average attention weights to echo and induction positions.
- Mark as retrieval heads the top 1% by echo score and top 14% by induction score (see Table 3 in (Tang et al., 22 Jul 2024)).
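A minimal NumPy sketch of the scoring step, assuming access to the per-head attention matrices from the forward pass over the repeated random sequence; treating the retrieval set as the union of the top echo and top induction heads is an assumption of this sketch, not a detail confirmed by the paper:

```python
import numpy as np

def echo_induction_scores(attn, seg_len):
    """attn: (num_heads, T, T) attention weights from one forward pass over a
    random token segment of length seg_len repeated several times (T = k * seg_len).
    Echo score: mass on the same token one copy back (offset -seg_len).
    Induction score: mass on the token right after that occurrence (offset -seg_len + 1)."""
    num_heads, T, _ = attn.shape
    qs = np.arange(seg_len, T)                     # queries that have a previous copy
    echo = attn[:, qs, qs - seg_len].mean(axis=1)
    induction = attn[:, qs, qs - seg_len + 1].mean(axis=1)
    return echo, induction

def select_retrieval_heads(echo, induction, echo_frac=0.01, induction_frac=0.14):
    """Top 1% of heads by echo score plus top 14% by induction score (union)."""
    n = len(echo)
    k_echo = max(1, int(round(echo_frac * n)))
    k_ind = max(1, int(round(induction_frac * n)))
    top_echo = set(np.argsort(echo)[-k_echo:])
    top_ind = set(np.argsort(induction)[-k_ind:])
    return sorted(top_echo | top_ind)
```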
For models with ALiBi positional encoding, Theorem 1 (Eq. 2) allows analytic calculation of each head's attention scope, and the heads with the largest scope are chosen as retrieval heads.
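For intuition only (this is a heuristic bound, not the paper's Theorem 1 / Eq. 2): ALiBi adds a bias $-m_h \cdot \Delta$ for a head with slope $m_h$ at distance $\Delta$, which scales the unnormalized attention weight by $e^{-m_h \Delta}$, so positions beyond roughly $\ln(1/\epsilon)/m_h$ contribute at most a factor $\epsilon$:

```python
import math

def alibi_head_scopes(slopes, eps=1e-4):
    """Heuristic attention scope per ALiBi head: the distance beyond which the
    ALiBi bias alone scales the unnormalized attention weight below eps.
    Illustrative bound only, not the paper's Theorem 1 / Eq. (2)."""
    return [math.log(1.0 / eps) / m for m in slopes]

# Standard ALiBi slopes for 8 heads: 1/2, 1/4, ..., 1/256.
slopes = [2.0 ** (-(i + 1)) for i in range(8)]
for m, s in zip(slopes, alibi_head_scopes(slopes)):
    # Heads with the smallest slopes have the widest scope and would be kept as retrieval heads.
    print(f"slope={m:.4f}  scope≈{s:.0f} tokens")
```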
3. RazorAttention Algorithmic Structure
RazorAttention orchestrates per-head KV-cache compression as follows:
- Retrieval heads: Maintain the full $N$-token KV-cache.
- Non-retrieval heads: Retain a local buffer of length $B_h$ plus a fixed number ($N_0$) of sink tokens; all other (remote) tokens are dropped.
- Compensation token mechanism: Dropped KV pairs for each non-retrieval head are averaged to yield a single pair $(\hat{k}_h, \hat{v}_h)$, appended as a compensation token.
Compensation Token Formulation
For the dropped index set $D_h$:

$$\hat{k}_h = \frac{1}{|D_h|} \sum_{m \in D_h} k_{h,m}, \qquad \hat{v}_h = \frac{1}{|D_h|} \sum_{m \in D_h} v_{h,m}.$$

Upon a query $q_t$, the attention computation for a non-retrieval head becomes a softmax over the retained (sink and local) positions $\mathcal{K}_h$ together with the compensation pair:

$$\mathrm{Attn}_h(q_t) = \sum_{m \in \mathcal{K}_h} \frac{\exp\!\left(q_t^\top k_{h,m}/\sqrt{d}\right)}{Z_t} v_{h,m} + \frac{\exp\!\left(q_t^\top \hat{k}_h/\sqrt{d}\right)}{Z_t} \hat{v}_h, \qquad Z_t = \sum_{m \in \mathcal{K}_h} \exp\!\left(q_t^\top k_{h,m}/\sqrt{d}\right) + \exp\!\left(q_t^\top \hat{k}_h/\sqrt{d}\right).$$
This mechanism ensures the mean semantic content of remote tokens is accessible at all query timesteps.
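A single-head PyTorch sketch of this computation follows; it ignores how positional encoding is applied to the compensation token (which the RoPE variant must additionally handle) and is illustrative rather than the authors' implementation:

```python
import torch

def attention_with_compensation(q, K_keep, V_keep, k_hat, v_hat, scale=None):
    """Single-head attention over retained KV pairs plus one compensation pair.
    q: (d,), K_keep/V_keep: (M, d), k_hat/v_hat: (d,)."""
    d = q.shape[-1]
    scale = scale or d ** -0.5
    K = torch.cat([K_keep, k_hat.unsqueeze(0)], dim=0)   # (M+1, d)
    V = torch.cat([V_keep, v_hat.unsqueeze(0)], dim=0)   # (M+1, d)
    w = torch.softmax((K @ q) * scale, dim=0)            # (M+1,) softmax over retained + compensation
    return w @ V                                         # (d,)

# Dropped remote tokens are summarized by their mean key/value:
K_drop, V_drop = torch.randn(500, 64), torch.randn(500, 64)
k_hat, v_hat = K_drop.mean(dim=0), V_drop.mean(dim=0)
out = attention_with_compensation(torch.randn(64), torch.randn(100, 64), torch.randn(100, 64), k_hat, v_hat)
```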
Full Algorithm (RoPE Case)
Input:
- Original per-head KV cache {K_h, V_h} of length N
- Retrieval-head set R ⊆ {1…H}
- Compression ratio C (e.g., C = 5 ⇒ keep 1/C of the tokens)
- Minimum local window S₀ (e.g., 4000)
- Number of sink tokens N₀ (e.g., 4)

for each head h ∉ R:
    B_h ← max(S₀, N/C)                  (local buffer length)
    I_h ← {1…N₀} ∪ {(N−B_h+1)…N}        (retained indices: sink + local tokens)
    D_h ← {1…N} \ I_h                   (dropped remote indices)
    \hat k_h ← mean_{m∈D_h}(k_{h,m})
    \hat v_h ← mean_{m∈D_h}(v_{h,m})
    new K_h ← [K_h[I_h] ; \hat k_h]
    new V_h ← [V_h[I_h] ; \hat v_h]
for each head h ∈ R:
    keep the full original K_h, V_h

Perform attention as usual, now on the compressed caches.
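A minimal PyTorch rendering of the loop above, offered as an illustrative sketch rather than the authors' implementation; it assumes a per-layer cache tensor of shape (H, N, d) and, as noted earlier, ignores RoPE handling of the compensation token:

```python
import torch

def razor_compress(K, V, retrieval_heads, C=5, S0=4000, N0=4):
    """K, V: (H, N, d) per-layer caches. Returns a list of per-head (K_h, V_h)
    pairs, since compressed heads end up with different lengths. Sketch only."""
    H, N, d = K.shape
    out = []
    for h in range(H):
        if h in retrieval_heads:
            out.append((K[h], V[h]))                       # retrieval head: keep full cache
            continue
        B = max(S0, N // C)                                # local buffer length
        if B + N0 >= N:                                    # nothing to drop for short contexts
            out.append((K[h], V[h]))
            continue
        keep = torch.cat([torch.arange(N0), torch.arange(N - B, N)])
        drop_mask = torch.ones(N, dtype=torch.bool)
        drop_mask[keep] = False
        k_hat = K[h, drop_mask].mean(dim=0, keepdim=True)  # compensation key
        v_hat = V[h, drop_mask].mean(dim=0, keepdim=True)  # compensation value
        out.append((torch.cat([K[h, keep], k_hat]), torch.cat([V[h, keep], v_hat])))
    return out
```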
4. Performance Metrics and Empirical Evaluation
RazorAttention delivers substantial reductions in KV-cache memory footprint, maintaining competitive accuracy and inference latency:
| Setting | Metric | Result |
|---|---|---|
| Llama-2-7B-80K | KV-cache reduction | >70% |
| LongBench (16 tasks) | Accuracy drop vs. full cache | 0.2 pp |
| Needle-in-a-Haystack | Accuracy relative to full KV-cache | Within 1–2% |
| Hardware (24 GB GPU) | Latency/memory | Near baseline |
End-to-end benchmarks demonstrate that competing methods (e.g., StreamingLLM, H2O) either exceed memory budgets or incur severe accuracy loss at scale, while RazorAttention retains near-baseline latency and fits within commodity GPU constraints (Tang et al., 22 Jul 2024).
5. Compatibility with FlashAttention
The head-specific compression and compensation-token operations in RazorAttention are performed entirely at the cache management layer. The underlying FlashAttention kernel remains unchanged and interacts with a shorter KV-cache plus one additional token per non-retrieval head. This enables a plug-and-play deployment workflow in any inference pipeline already using FlashAttention, with no retraining or kernel modification necessary.
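The decoding step can therefore be sketched as below, using PyTorch's scaled_dot_product_attention as a stand-in for a FlashAttention-backed kernel; the per-head loop and shapes are simplifying assumptions of this sketch (in practice heads with equal cache lengths would be batched):

```python
import torch
import torch.nn.functional as F

def decode_step(q, compressed_cache):
    """q: (H, 1, d) query for the newly generated token; compressed_cache: list of
    (K_h, V_h) per head, possibly of different lengths after compression.
    The attention kernel is unchanged; only the cache contents were rewritten."""
    outs = []
    for h, (K_h, V_h) in enumerate(compressed_cache):
        o = F.scaled_dot_product_attention(
            q[h].unsqueeze(0),    # (1, 1, d)
            K_h.unsqueeze(0),     # (1, M_h, d)
            V_h.unsqueeze(0),     # (1, M_h, d)
        )
        outs.append(o.squeeze(0))
    return torch.stack(outs)      # (H, 1, d)
```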
6. Limitations and Prospects for Future Research
RazorAttention employs fixed thresholds (1% echo, 14% induction) for retrieval-head identification, which may require per-model retuning. While the current 70% reduction demonstrates significant potential, alternative compensation mechanisms utilizing higher-order moments or learnable aggregation tokens may enable further efficiency gains. The underlying reasons why only a handful of heads become retrieval heads remain analytically unexplained. Extremely long context lengths or highly repetitive data distributions may necessitate dynamic adaptation of the buffer sizes ($S_0$, $N_0$) for optimal fidelity.
A plausible implication is that further research into head specialization could yield deeper insights into transformer dynamics and unlock more aggressive cache reduction strategies without loss in semantic recovery. RazorAttention thus leverages the transformer’s retrieve-then-process attention head specialization to achieve substantial resource savings alongside robust task performance in long-context LLM inference (Tang et al., 22 Jul 2024).