RazorAttention: Efficient KV-Cache Compression
- RazorAttention is a KV-cache compression algorithm that preserves semantic content by distinguishing between retrieval and non-retrieval attention heads.
- It retains full caches for globally attentive retrieval heads while compressing non-retrieval heads using local buffers and a compensation token mechanism.
- Empirical evaluations show over 70% reduction in KV-cache with negligible accuracy or latency impact, enabling efficient long-context inference on standard GPUs.
RazorAttention is a KV-cache compression algorithm for Transformer-based LLMs designed to substantially reduce memory and compute overhead during long-context inference. Unlike prior “token-dropping” or sliding-window schemes, RazorAttention preserves all semantic content by exploiting the empirical distinction between globally attending retrieval heads and locally attending non-retrieval heads. RazorAttention achieves KV-cache reductions exceeding 70% on a range of LLMs with negligible impact on accuracy or latency, is training-free, and is directly compatible with FlashAttention (Tang et al., 22 Jul 2024).
1. Motivation: KV-Cache Bottlenecks in Long-Context Language Modeling
With context lengths increasing and inference workloads demanding prolonged generation windows, standard Transformer inference stores, for every layer $l \in \{1,\dots,L\}$ and attention head $h \in \{1,\dots,H\}$, two $d$-dimensional vectors (Key and Value) per token. The total cache memory therefore scales as

$$\mathrm{Mem}_{\mathrm{KV}} \propto 2 \cdot L \cdot H \cdot d \cdot N,$$

where $N$ is the number of cached (prompt plus generated) tokens.
When $N$ is large, cache requirements routinely exceed the capacity of commonly available hardware (e.g., 24 GB consumer GPUs), and moving KV pairs becomes a major inference bottleneck. Existing mitigation strategies such as token-dropping irrevocably delete information and risk catastrophic failure if later tokens depend on earlier context. RazorAttention addresses the fundamental question of whether it is possible to shrink cache size by 70% while retaining all semantic information (Tang et al., 22 Jul 2024).
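To make the scaling concrete, the following back-of-the-envelope sketch uses the publicly documented Llama-2-7B configuration (32 layers, 32 KV heads, head dimension 128) with fp16 caches and an 80K-token context; the dtype and the exact arithmetic are illustrative assumptions, not figures taken from the paper:

```python
# Back-of-the-envelope KV-cache sizing for a Llama-2-7B-like model.
# Model dimensions are the public Llama-2-7B configuration; the fp16 dtype
# and 80K context are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem=2):
    """2 vectors (K and V) of size head_dim per layer, per head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem

full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=80_000)
print(f"Full KV-cache @ 80K tokens: {full / 2**30:.1f} GiB")       # ~39.1 GiB, over a 24 GB GPU
print(f"After ~70% compression:    {0.3 * full / 2**30:.1f} GiB")  # ~11.7 GiB, fits comfortably
```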
2. Empirical Head Typology: Retrieval Heads Versus Non-Retrieval Heads
Empirical analysis of Transformer attention distributions reveals that a small subset of attention heads, termed retrieval heads, maintain global attention and are responsible for recovering information from the entire context. The majority—non-retrieval heads—exhibit strong locality, attending primarily to a short window or to static “attention sinks”.
Identification Procedure in RoPE Models
- Echo Heads: Attend to the most recent previous occurrence of the current token.
- Induction Heads: Attend to the token immediately following an earlier occurrence of the current token, the classic copy/induction pattern.
The identification protocol is as follows (a minimal scoring sketch follows this list):
- Sample random tokens, replicate them 4× to suppress semantic structure.
- Execute a forward pass, record average attention weights to echo and induction positions.
- Mark as retrieval heads the top 1% by echo score and top 14% by induction score (see Table 3 in (Tang et al., 22 Jul 2024)).
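A minimal NumPy sketch of the scoring step, assuming access to the per-head attention matrices from the forward pass over the repeated random sequence; treating the retrieval set as the union of the top echo and top induction heads is an assumption of this sketch, not a detail confirmed by the paper:

```python
import numpy as np

def echo_induction_scores(attn, seg_len):
    """attn: (num_heads, T, T) attention weights from one forward pass over a
    random token segment of length seg_len repeated several times (T = k * seg_len).
    Echo score: mass on the same token one copy back (offset -seg_len).
    Induction score: mass on the token right after that occurrence (offset -seg_len + 1)."""
    num_heads, T, _ = attn.shape
    qs = np.arange(seg_len, T)                     # queries that have a previous copy
    echo = attn[:, qs, qs - seg_len].mean(axis=1)
    induction = attn[:, qs, qs - seg_len + 1].mean(axis=1)
    return echo, induction

def select_retrieval_heads(echo, induction, echo_frac=0.01, induction_frac=0.14):
    """Top 1% of heads by echo score plus top 14% by induction score (union)."""
    n = len(echo)
    k_echo = max(1, int(round(echo_frac * n)))
    k_ind = max(1, int(round(induction_frac * n)))
    top_echo = set(np.argsort(echo)[-k_echo:])
    top_ind = set(np.argsort(induction)[-k_ind:])
    return sorted(top_echo | top_ind)
```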
For models with ALiBi positional encoding, Theorem 1 (Eq. 2) allows analytic calculation of each head's attention scope, and the heads with the largest scope are chosen as retrieval heads.
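For intuition only (this is a heuristic bound, not the paper's Theorem 1 / Eq. 2): ALiBi adds a bias $-m_h \cdot \Delta$ for a head with slope $m_h$ at distance $\Delta$, which scales the unnormalized attention weight by $e^{-m_h \Delta}$, so positions beyond roughly $\ln(1/\epsilon)/m_h$ contribute at most a factor $\epsilon$:

```python
import math

def alibi_head_scopes(slopes, eps=1e-4):
    """Heuristic attention scope per ALiBi head: the distance beyond which the
    ALiBi bias alone scales the unnormalized attention weight below eps.
    Illustrative bound only, not the paper's Theorem 1 / Eq. (2)."""
    return [math.log(1.0 / eps) / m for m in slopes]

# Standard ALiBi slopes for 8 heads: 1/2, 1/4, ..., 1/256.
slopes = [2.0 ** (-(i + 1)) for i in range(8)]
for m, s in zip(slopes, alibi_head_scopes(slopes)):
    # Heads with the smallest slopes have the widest scope and would be kept as retrieval heads.
    print(f"slope={m:.4f}  scope≈{s:.0f} tokens")
```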
3. RazorAttention Algorithmic Structure
RazorAttention orchestrates per-head KV-cache compression as follows:
- Retrieval heads: Maintain the full $N$-token KV-cache.
- Non-retrieval heads: Retain a local buffer of length $B_h$ plus a fixed number ($N_0$) of sink tokens; all other (remote) tokens are dropped.
- Compensation token mechanism: Dropped KV pairs for each non-retrieval head are averaged to yield a single pair $(\hat{k}_h, \hat{v}_h)$, appended as a compensation token.
Compensation Token Formulation
For the dropped index set $D_h$:

$$\hat{k}_h = \frac{1}{|D_h|} \sum_{m \in D_h} k_{h,m}, \qquad \hat{v}_h = \frac{1}{|D_h|} \sum_{m \in D_h} v_{h,m}.$$

Upon a query $q_t$, the attention computation for a non-retrieval head becomes a softmax over the retained (sink and local) positions $\mathcal{K}_h$ together with the compensation pair:

$$\mathrm{Attn}_h(q_t) = \sum_{m \in \mathcal{K}_h} \frac{\exp\!\left(q_t^\top k_{h,m}/\sqrt{d}\right)}{Z_t} v_{h,m} + \frac{\exp\!\left(q_t^\top \hat{k}_h/\sqrt{d}\right)}{Z_t} \hat{v}_h, \qquad Z_t = \sum_{m \in \mathcal{K}_h} \exp\!\left(q_t^\top k_{h,m}/\sqrt{d}\right) + \exp\!\left(q_t^\top \hat{k}_h/\sqrt{d}\right).$$
This mechanism ensures the mean semantic content of remote tokens is accessible at all query timesteps.
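A single-head PyTorch sketch of this computation follows; it ignores how positional encoding is applied to the compensation token (which the RoPE variant must additionally handle) and is illustrative rather than the authors' implementation:

```python
import torch

def attention_with_compensation(q, K_keep, V_keep, k_hat, v_hat, scale=None):
    """Single-head attention over retained KV pairs plus one compensation pair.
    q: (d,), K_keep/V_keep: (M, d), k_hat/v_hat: (d,)."""
    d = q.shape[-1]
    scale = scale or d ** -0.5
    K = torch.cat([K_keep, k_hat.unsqueeze(0)], dim=0)   # (M+1, d)
    V = torch.cat([V_keep, v_hat.unsqueeze(0)], dim=0)   # (M+1, d)
    w = torch.softmax((K @ q) * scale, dim=0)            # (M+1,) softmax over retained + compensation
    return w @ V                                         # (d,)

# Dropped remote tokens are summarized by their mean key/value:
K_drop, V_drop = torch.randn(500, 64), torch.randn(500, 64)
k_hat, v_hat = K_drop.mean(dim=0), V_drop.mean(dim=0)
out = attention_with_compensation(torch.randn(64), torch.randn(100, 64), torch.randn(100, 64), k_hat, v_hat)
```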
Full Algorithm (RoPE Case)
Input:
- Original per-head KV cache {K_h, V_h} of length N
- Retrieval-head set R ⊆ {1…H}
- Compression ratio C (e.g., C = 5 ⇒ keep 1/C of the tokens)
- Minimum local window S₀ (e.g., 4000)
- Number of sink tokens N₀ (e.g., 4)

for each head h ∉ R:
    B_h ← max(S₀, N/C)                  (local buffer length)
    I_h ← {1…N₀} ∪ {(N−B_h+1)…N}        (retained indices: sink + local tokens)
    D_h ← {1…N} \ I_h                   (dropped remote indices)
    \hat k_h ← mean_{m∈D_h}(k_{h,m})
    \hat v_h ← mean_{m∈D_h}(v_{h,m})
    new K_h ← [K_h[I_h] ; \hat k_h]
    new V_h ← [V_h[I_h] ; \hat v_h]
for each head h ∈ R:
    keep the full original K_h, V_h

Perform attention as usual, now on the compressed caches.
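A minimal PyTorch rendering of the loop above, offered as an illustrative sketch rather than the authors' implementation; it assumes a per-layer cache tensor of shape (H, N, d) and, as noted earlier, ignores RoPE handling of the compensation token:

```python
import torch

def razor_compress(K, V, retrieval_heads, C=5, S0=4000, N0=4):
    """K, V: (H, N, d) per-layer caches. Returns a list of per-head (K_h, V_h)
    pairs, since compressed heads end up with different lengths. Sketch only."""
    H, N, d = K.shape
    out = []
    for h in range(H):
        if h in retrieval_heads:
            out.append((K[h], V[h]))                       # retrieval head: keep full cache
            continue
        B = max(S0, N // C)                                # local buffer length
        if B + N0 >= N:                                    # nothing to drop for short contexts
            out.append((K[h], V[h]))
            continue
        keep = torch.cat([torch.arange(N0), torch.arange(N - B, N)])
        drop_mask = torch.ones(N, dtype=torch.bool)
        drop_mask[keep] = False
        k_hat = K[h, drop_mask].mean(dim=0, keepdim=True)  # compensation key
        v_hat = V[h, drop_mask].mean(dim=0, keepdim=True)  # compensation value
        out.append((torch.cat([K[h, keep], k_hat]), torch.cat([V[h, keep], v_hat])))
    return out
```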
4. Performance Metrics and Empirical Evaluation
RazorAttention delivers substantial reductions in KV-cache memory footprint, maintaining competitive accuracy and inference latency:
| Setting | Metric | Result |
|---|---|---|
| Llama-2-7B-80K | KV-cache reduction | >70% |
| LongBench (16 tasks) | Accuracy drop vs. full cache | 0.2 pp |
| Needle-in-a-Haystack | Accuracy relative to full KV-cache | Within 1–2% |
| Hardware (24 GB GPU) | Latency/memory | Near baseline |
End-to-end benchmarks demonstrate that competing methods (e.g., StreamingLLM, H2O) either exceed memory budgets or incur severe accuracy loss at scale, while RazorAttention retains near-baseline latency and fits within commodity GPU constraints (Tang et al., 22 Jul 2024).
5. Compatibility with FlashAttention
The head-specific compression and compensation-token operations in RazorAttention are performed entirely at the cache management layer. The underlying FlashAttention kernel remains unchanged and interacts with a shorter KV-cache plus one additional token per non-retrieval head. This enables a plug-and-play deployment workflow in any inference pipeline already using FlashAttention, with no retraining or kernel modification necessary.
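The decoding step can therefore be sketched as below, using PyTorch's scaled_dot_product_attention as a stand-in for a FlashAttention-backed kernel; the per-head loop and shapes are simplifying assumptions of this sketch (in practice heads with equal cache lengths would be batched):

```python
import torch
import torch.nn.functional as F

def decode_step(q, compressed_cache):
    """q: (H, 1, d) query for the newly generated token; compressed_cache: list of
    (K_h, V_h) per head, possibly of different lengths after compression.
    The attention kernel is unchanged; only the cache contents were rewritten."""
    outs = []
    for h, (K_h, V_h) in enumerate(compressed_cache):
        o = F.scaled_dot_product_attention(
            q[h].unsqueeze(0),    # (1, 1, d)
            K_h.unsqueeze(0),     # (1, M_h, d)
            V_h.unsqueeze(0),     # (1, M_h, d)
        )
        outs.append(o.squeeze(0))
    return torch.stack(outs)      # (H, 1, d)
```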
6. Limitations and Prospects for Future Research
RazorAttention employs fixed thresholds (1% echo, 14% induction) for retrieval-head identification, which may require per-model retuning. While the current 70% reduction demonstrates significant potential, alternative compensation mechanisms utilizing higher-order moments or learnable aggregation tokens may enable further efficiency gains. The underlying reasons why only a handful of heads become retrieval heads remain analytically unexplained. Extremely long context lengths or highly repetitive data distributions may necessitate dynamic adaptation of the buffer sizes ($S_0$, $N_0$) for optimal fidelity.
A plausible implication is that further research into head specialization could yield deeper insights into transformer dynamics and unlock more aggressive cache reduction strategies without loss in semantic recovery. RazorAttention thus leverages the transformer’s retrieve-then-process attention head specialization to achieve substantial resource savings alongside robust task performance in long-context LLM inference (Tang et al., 22 Jul 2024).