KVzip: Query-Agnostic KV Cache Compression

Updated 29 December 2025
  • KVzip is a query-agnostic key-value (KV) cache compression method that alleviates the memory bottleneck of long-context inference in transformer LLMs.
  • It leverages context reconstruction and model-driven importance scoring to aggressively shrink cache size by up to 4× while preserving performance.
  • KVzip operates in a training-free, black-box manner and outperforms query-aware baselines across retrieval, reasoning, and code comprehension tasks.

KVzip is a query-agnostic key-value (KV) cache compression technique for transformer-based LLMs, designed to alleviate memory and computational overhead caused by burgeoning context lengths during inference. By leveraging context reconstruction and model-driven importance scoring, KVzip enables aggressive cache reduction (typically by 3–4×) while maintaining high downstream performance across retrieval, reasoning, and code comprehension tasks in both single- and multi-query scenarios. It outperforms query-aware baselines by generalizing cache retention to unseen future queries without training or intrusive model modifications (Kim et al., 29 May 2025).

1. Rationale and Problem Formulation

Transformer LLMs cache per-token KV projections at each layer and head to support efficient autoregressive inference. As the context length $n_c$ grows, storing $KV = \{(K_{l,h,i}, V_{l,h,i})\}$, with each $K_{l,h,i}, V_{l,h,i} \in \mathbb{R}^d$ for layers $l = 1, \ldots, L$, heads $h = 1, \ldots, H$, and tokens $i = 1, \ldots, n_c$, incurs $O(LHn_c d)$ memory. For context lengths on the order of 100K–170K tokens, the KV cache (e.g., 33 GB for Qwen2.5-14B at 120K tokens in FP16) becomes the dominant memory consumer, often exceeding the model parameters themselves and inducing a quadratic latency bottleneck from the widening softmax attention window (Kim et al., 29 May 2025).

The challenge is to compress or evict KV pairs to enable scalable inference with minimal performance loss, while supporting arbitrary future queries over the context.
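
As a rough illustration of the $O(LHn_c d)$ bound, the sketch below computes an FP16 cache footprint for an assumed model configuration; the layer and head counts are hypothetical values chosen only for illustration, and exact figures depend on each model's grouped-query layout.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V vector of dimension `head_dim`
    per layer, per KV head, per cached token, i.e. O(L * H * n_c * d)."""
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical configuration for illustration (FP16 storage, grouped-query KV heads).
size = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, context_len=170_000)
print(f"{size / 1e9:.1f} GB")  # ~33 GB for this illustrative configuration
```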

2. Query-Agnostic Importance Scoring via Context Reconstruction

KVzip fundamentally departs from query-aware eviction by computing per-KV-pair importance based on how much a given key-value pair supports reconstruction of the original context, directly leveraging the LLM's own cross-attention distributions. The scoring procedure is as follows:

  1. Context Setup: Construct a prompt consisting of “Repeat the previous context:” followed by the original context $c$.
  2. Teacher-Forced Pass: Run the LLM over this prompt and context, reusing the full prefilled $KV$ cache. For head $h$ of layer $l$, let $Q_{l,h} \in \mathbb{R}^{G \times n_\text{in} \times d}$ (where $G$ is the grouped-query size and $n_\text{in}$ is the prompt-plus-context length) and $K_{l,h} \in \mathbb{R}^{(n_c + n_\text{in}) \times d}$.
  3. Cross-Attention Extraction: Compute $A_{l,h} = \mathsf{Softmax}(Q_{l,h} K_{l,h}^\top)$, a tensor in $\mathbb{R}_+^{G \times n_\text{in} \times (n_c + n_\text{in})}$. Restrict to the columns that correspond to original-context KV positions.
  4. Importance Definition: For KV pair $(l,h,i)$, assign

$$S_{l,h,i} = \max_{g = 1, \ldots, G;\ t = 1, \ldots, n_\text{in}} [\bar{A}_{l,h}]_{g,t,i}$$

where $\bar{A}_{l,h}$ denotes the cross-attention restricted to the original-context columns, indexed at token position $i$.

Empirically, this scoring emphasizes KV pairs that are most “looked at” during self-reconstruction, and correlates with those needed for a wide range of downstream queries (Kim et al., 29 May 2025).
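
A minimal sketch of this scoring rule for a single (layer, head) follows, assuming the cross-attention weights from the teacher-forced reconstruction pass are available as a tensor; the tensor layout is an assumption for illustration, not the paper's reference implementation.

```python
import torch

def kvzip_importance(attn: torch.Tensor, n_ctx: int) -> torch.Tensor:
    """Per-KV-pair importance S_{l,h,i} for one (layer, head).

    attn: cross-attention weights of shape (G, n_in, n_ctx + n_in) from the
          teacher-forced "Repeat the previous context:" pass, where the first
          n_ctx key columns correspond to the original-context KV pairs.
    Returns a tensor of shape (n_ctx,): the max over the grouped-query and
    query-position axes, as in the definition above.
    """
    ctx_cols = attn[..., :n_ctx]        # restrict to original-context KV positions
    return ctx_cols.amax(dim=(0, 1))    # max over g = 1..G and t = 1..n_in
```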

3. Efficient Chunked Context Processing

Computational challenges arise from the $O(n_c^2)$ attention complexity of full-context importance scoring. KVzip addresses this by:

  • Splitting the context $c$ into $T = \lceil n_c / m \rceil$ contiguous chunks of length $m$ (e.g., $m = 2\text{K}$).
  • For each chunk, assembling and processing the context-reconstruction prompt, extracting per-chunk importance scores for the relevant KV pairs.
  • This chunked approach reduces scoring complexity to $O(m\,n_c)$ and peak memory to $O(m^2)$ per chunk, similar to FlashAttention's streaming paradigm (Kim et al., 29 May 2025).

A Softmax-free variant fuses the max-over-queries reduction with the $Q K^\top$ computation inside a custom kernel, trading ∼10% accuracy for slightly improved speed.
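
Putting chunking and scoring together, a chunked scoring loop might look like the sketch below; `score_chunk` is a hypothetical helper standing in for one teacher-forced reconstruction pass over a single chunk (it reuses the prefilled cache and returns per-position importances), and is not an API of the released code.

```python
import torch

def chunked_importance(score_chunk, context_ids: torch.Tensor,
                       chunk_len: int = 2048) -> torch.Tensor:
    """Score the full context in T = ceil(n_c / m) chunks of length m.

    score_chunk(chunk_ids) -> tensor of per-position importances for that chunk
    (hypothetical interface; per-layer/head handling is omitted for brevity).
    Each call runs one reconstruction pass that reuses the prefilled KV cache,
    keeping peak attention memory bounded by the chunk length.
    """
    n_ctx = context_ids.size(0)
    scores = []
    for start in range(0, n_ctx, chunk_len):
        chunk = context_ids[start:start + chunk_len]
        scores.append(score_chunk(chunk))
    return torch.cat(scores)            # one importance score per context position
```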

4. Eviction and Compression Policy

After scoring:

  1. All $S_{l,h,i}$ scores are flattened and a cache budget ratio $r$ is selected.
  2. The $(1 - r)$ quantile $\tau$ of all scores is computed.
  3. KV pairs $(l,h,i)$ with $S_{l,h,i} < \tau$ are evicted; the remainder constitute the compressed cache $KV' \subset KV$.
  4. Both uniform (per-head, per-layer) and nonuniform (global) budgets are supported.

The eviction can be implemented by a simple sort–threshold mechanism, compatible with head-level or layer-level aggregation (Kim et al., 29 May 2025).
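
The global thresholding step can be expressed as a quantile over all heads and layers; the sketch below assumes scores are held in a dict keyed by (layer, head) and returns boolean keep-masks, which is an illustrative data layout rather than the paper's implementation.

```python
import torch

def evict_by_quantile(scores: dict, keep_ratio: float) -> dict:
    """Global (nonuniform-budget) eviction: keep the top `keep_ratio` fraction
    of KV pairs across all layers and heads.

    scores: {(layer, head): tensor of per-position importances S_{l,h,i}}
    Returns {(layer, head): boolean mask}, True where the KV pair is retained.
    """
    flat = torch.cat([s.flatten() for s in scores.values()])
    tau = torch.quantile(flat, 1.0 - keep_ratio)          # (1 - r) quantile threshold τ
    return {key: s >= tau for key, s in scores.items()}   # evict pairs with S < τ
```

A uniform per-head budget would instead compute $\tau$ separately within each (layer, head) group.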

5. Empirical Evaluation and Benchmarking

Extensive experiments span LLaMA3.1-8B, Qwen2.5-7B/14B, and Gemma3-12B on an NVIDIA A100 80GB with context lengths up to 170K. Benchmarks include Needle-in-a-Haystack, SQuAD, GSM8K, code retrieval, and mix-task multi-query settings. Quantitative outcomes include:

| KV Ratio | KVzip Accuracy (Qwen2.5-7B-1M) | H₂O  | SnapKV | PyramidKV |
|----------|--------------------------------|------|--------|-----------|
| 0.1      | 92%                            | 40%  | 35%    | 30%       |
| 0.3      | 98%                            | 70%  | 60%    | 55%       |
| 0.5      | 99%                            | 85%  | 80%    | 75%       |
| 1.0      | 100%                           | 100% | 100%   | 100%      |
  • Up to 4× cache reduction (e.g., 16.3GB→4.1GB at 25% cache in LLaMA3.1-8B).
  • ∼2× decoding speed-up (e.g., FlashAttention per-layer 0.39ms→0.17ms).
  • Near-perfect task accuracy even at a 30% cache budget (≥95% relative), whereas query-aware baselines degrade even at a 90% budget.
  • Context-independent, head-level scoring enables rapid pre-computation and efficient multi-query generalization (Kim et al., 29 May 2025).

6. Advantages over Query-Aware and Prior Compression Methods

KVzip's context reconstruction-based importance identifies KV pairs critical for arbitrary future queries within a context. As a result:

  • Query-aware eviction (e.g., SnapKV, PyramidKV, H₂O) optimizes for initial queries and exhibits rapid performance decay in multi-query or unseen-query scenarios.
  • KVzip's scores correlate strongly with the model's attention patterns during actual downstream tasks, as demonstrated in joint importance-overlap histograms (Fig. 4 of (Kim et al., 29 May 2025)).
  • KVzip operates in a training-free, black-box setting; it requires no model retraining, auxiliary parameters, or intrusive model modifications.

A plausible implication is that this method inherently supports persistent long-context caches suitable for dynamic query workloads.

7. Limitations, Extensions, and Current Research Directions

Key limitations include:

  • Context-dependent scoring incurs approximately 2× prefilling cost compared to standard inference, though this can be amortized in multi-query or persistent deployments.
  • No theoretical upper bound on worst-case performance degradation; guarantees are empirical.
  • Chunked scoring and Softmax-free variants partially mitigate computational cost, at a minor accuracy trade-off.
  • The method sometimes "erases" cache regions containing personal data, leading to a form of privacy alignment and refusal for associated queries, as observed empirically (Table 5 of (Kim et al., 29 May 2025)).

Extensions under consideration involve context-independent compression and further hardware integration. KVzip forms part of a broader family of KV cache optimizers, including adaptive quantization methods (e.g., ZipCache (He et al., 23 May 2024)), residual dynamic approaches (e.g., ZSMerge (Liu et al., 13 Mar 2025)), and commutative quantization with vector codecs (e.g., CommVQ (Li et al., 23 Jun 2025)).


For comprehensive details, consult "KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction" (Kim et al., 29 May 2025), as well as foundational works on ZipCache (He et al., 23 May 2024), ZSMerge (Liu et al., 13 Mar 2025), and CommVQ (Li et al., 23 Jun 2025).
