KVzip+: Query-Agnostic Cache Compression for LLMs

Updated 14 January 2026

KVzip+ is a query-agnostic key-value cache compression method that leverages information-theoretic context reconstruction in Transformer-based LLMs.
It employs LLM-guided importance scoring and a chunked framework to reduce the cache size by up to 4× with less than 1% accuracy loss on benchmarks like SQuAD and GSM8K.
KVzip+ significantly improves latency and memory efficiency in multi-query inference, outperforming traditional query-aware eviction strategies across various LLM architectures.

KVzip+ is an enhanced, query-agnostic key-value (KV) cache compression and eviction method designed for Transformer-based LLMs operating at long context lengths. Unlike prior query-aware approaches that optimize the cache for a specific current query, KVzip+ employs an information-theoretic, context-reconstruction–based framework to select the most utility-retentive KV pairs for preservation, supporting efficient and robust downstream multi-query LLM inference with minimal loss in accuracy and substantial gains in memory and latency efficiency (Kim et al., 29 May 2025).

1. Motivation and Context

LLMs cache each layer’s key and value projections for every token to enable efficient autoregressive decoding. For a context of length $n_c$ with $L$ layers and $H$ KV heads, the storage requirement scales as $\Theta(L\,H\,n_c\,d)$, where $d$ is the per-head dimension. As $n_c$ reaches tens to hundreds of thousands of tokens, the resulting cache can exhaust GPU memory and slow attention kernels (e.g., FlashAttention), making cache size a key bottleneck in long-context inference.
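To make the scaling concrete, the following is a small arithmetic sketch. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example chosen for illustration, not a figure taken from the evaluation, and the helper name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, n_ctx, head_dim, bytes_per_value=2):
    """Bytes needed to hold keys and values for n_ctx tokens."""
    # Factor 2: one key vector and one value vector per (layer, head, token).
    return 2 * n_layers * n_kv_heads * n_ctx * head_dim * bytes_per_value


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128, fp16.
for n_ctx in (8_000, 128_000):
    gib = kv_cache_bytes(32, 8, n_ctx, 128) / 2**30
    print(f"{n_ctx:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache, so the footprint grows linearly with $n_c$ and reaches double-digit GiB at around a hundred thousand tokens.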

Existing online query-aware eviction schemes (e.g., SnapKV, PyramidKV, H₂O) estimate KV slot importance per immediate query, evicting least-used entries. However, such approaches must be re-applied for each distinct query, precluding robust cache sharing across multiple queries or conversational turns. In contrast, applications such as personalized chatbots, retrieval-augmented generation, and large static document prefill necessitate a one-time, query-agnostic cache reduction that maintains downstream utility for arbitrary possible queries.

2. Algorithmic Foundations: Context Reconstruction and Importance Scoring

Central to KVzip+ is the use of the LLM itself as an information assessor. The method seeks a minimal subset of KV pairs, $KV'\subseteq KV$, such that, for all possible queries $q$, the model output from $KV'$ closely matches that from the full $KV$ cache:

$$f_{\mathrm{LM}}(\,\cdot \mid q, KV') \approx f_{\mathrm{LM}}(\,\cdot \mid q, KV) \quad \text{for all } q$$

KVzip+ operationalizes this via self-supervised context reconstruction. Specifically, the LLM is prompted (e.g., with “Repeat the previous context:”) and tasked to generate the original context using only the available KV cache. At each layer $l$ and head $h$, grouped queries $Q_{l,h}$ attend to keys $K_{l,h}$. The cross-attention matrices $A_{l,h}$ are computed:

$$A_{l,h} = \mathrm{softmax}\!\left(\frac{Q_{l,h} K_{l,h}^\top}{\sqrt{d}}\right)$$

The importance of each cached position $j$ is quantified as the maximum attention it receives across all groups $g$ and prompt positions $i$:

$$S_{l,h}[j] = \max_{g,\,i}\, A_{l,h}[g, i, j]$$

These scores, $S_{l,h}$, serve as proxy measures for information salience during context reproduction.
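The max-attention scoring described above can be sketched for a single layer and KV head as follows. This is a simplified illustration, not the reference implementation: it uses plain Python lists, normalizes the softmax over context keys only (ignoring causal masking and the prompt's own keys), and the names `softmax` and `importance_scores` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def importance_scores(queries, keys):
    """Max attention each cached key receives, for one layer and one KV head.

    queries: [G][n_in][d] grouped queries from the reconstruction prompt.
    keys:    [n_c][d] cached context keys.
    Returns a list of n_c scores in [0, 1].
    """
    d = len(keys[0])
    scores = [0.0] * len(keys)
    for group in queries:
        for q in group:
            logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
            for j, attn in enumerate(softmax(logits)):
                # Keep the maximum over groups and prompt positions.
                scores[j] = max(scores[j], attn)
    return scores


# Toy data (assumed): one group, two prompt positions, three cached keys.
queries = [[[1.0, 0.0], [0.0, 1.0]]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
scores = importance_scores(queries, keys)
print([round(s, 3) for s in scores])
```

In this toy case the third key is never strongly attended by any query, so it receives the lowest score and would be the first candidate for eviction.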

To maintain scalability for very large $n_c$, KVzip+ introduces a chunked scoring framework. The context is partitioned into non-overlapping chunks (typically $m = 2\text{K}$ tokens), each processed separately with a short token overlap for continuity, reducing both peak memory and time complexity for the scoring pass from $O(n_c^2)$ to $O(m\,n_c)$.
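A minimal sketch of the chunk partitioning, and of why it lowers the scoring cost, is given below. It assumes that each chunk's score matrix covers only that chunk's own queries and keys, and that the overlap consists of a few tokens from the end of the previous chunk used purely as lead-in; the function names and the exact overlap handling are illustrative assumptions.

```python
def chunk_spans(n_ctx, chunk_size, overlap=0):
    """Return (start, end, score_from) triples covering range(n_ctx).

    Tokens in [score_from, end) are scored for this chunk; tokens in
    [start, score_from) are the preceding overlap, fed in only for continuity.
    """
    spans = []
    for score_from in range(0, n_ctx, chunk_size):
        start = max(0, score_from - overlap)
        end = min(n_ctx, score_from + chunk_size)
        spans.append((start, end, score_from))
    return spans


def scored_pairs(n_ctx, chunk_size):
    """Query-key pairs scored when each chunk is scored against itself."""
    return sum((end - score_from) ** 2
               for _, end, score_from in chunk_spans(n_ctx, chunk_size))


print(chunk_spans(10, 4, overlap=1))   # [(0, 4, 0), (3, 8, 4), (7, 10, 8)]

n_ctx, chunk = 8_000, 2_000
print(n_ctx * n_ctx)                   # unchunked: 64000000 pairs
print(scored_pairs(n_ctx, chunk))      # chunked:   16000000 pairs
```

With four chunks of 2,000 tokens, the chunked count equals the chunk size times the context length, which is the linear-in-context behavior the text describes.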

3. Eviction Protocol and Computational Properties

The core eviction scheme is as follows:

Prefill: Compute the full KV cache for the initial context.
Chunked Scoring: For each chunk, run the context-reconstruction forward pass and compute $S_{l,h}$ for its positions.
Sorting: Aggregate all slotwise scores and sort.
Retention: Retain the top fraction $r$ of slots with the highest $S_{l,h}$, where $r \in (0, 1]$ is the prescribed retention ratio.
Optional Head-level Eviction: Aggregate $S_{l,h}$ along position index and evict entire heads, if desired.

Post-eviction, only the selected $KV'$ entries are stored. Further compression is enabled via quantization (e.g., 4-bit using QServe or KIVI). At inference time, downstream prompts attend to the compacted KV cache with no modification to typical attention operators, aside from the reduced memory and indices.
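The sorting and retention steps above can be sketched as a global top-fraction selection over all (layer, head, position) slots. This is an illustrative sketch under stated assumptions: the dictionary layout, the function names, and the use of the mean as the head-level aggregate are all hypothetical choices, not details confirmed by the source.

```python
def select_retained(scores, retention):
    """Keep the top `retention` fraction of slots by score, across all heads.

    scores: dict mapping (layer, head) -> list of per-position scores.
    Returns a dict mapping (layer, head) -> sorted list of kept positions.
    """
    flat = [(s, key, pos) for key, row in scores.items() for pos, s in enumerate(row)]
    n_keep = max(1, round(retention * len(flat)))
    flat.sort(key=lambda item: item[0], reverse=True)
    kept = {key: [] for key in scores}
    for _, key, pos in flat[:n_keep]:
        kept[key].append(pos)
    return {key: sorted(positions) for key, positions in kept.items()}


def head_level_ranking(scores):
    """Rank heads by their score aggregated over positions (assumed: mean)."""
    means = {key: sum(row) / len(row) for key, row in scores.items()}
    return sorted(means, key=means.get, reverse=True)


# Toy scores (assumed): one layer, two heads, three cached positions each.
toy = {(0, 0): [0.9, 0.1, 0.5], (0, 1): [0.2, 0.8, 0.3]}
print(select_retained(toy, 0.5))   # {(0, 0): [0, 2], (0, 1): [1]}
print(head_level_ranking(toy))     # [(0, 0), (0, 1)]
```

Because the sort is global rather than per head, heads can end up with different numbers of retained slots, as the toy output shows.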

A softmax-free variant is also described: one may directly use the QK product and max-across-queries for scoring without softmax normalization. This enables a marginal gain in scoring throughput with a modest reduction in compression fidelity.
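A minimal sketch of the softmax-free score follows, again for one layer and one KV head and with hypothetical names and toy data. It takes the maximum raw query-key dot product per cached key, so the resulting scores are unnormalized logits rather than attention probabilities.

```python
def softmax_free_scores(queries, keys):
    """Max raw QK logit each cached key receives; no softmax normalization.

    queries: [G][n_in][d] grouped queries; keys: [n_c][d] cached keys.
    Returns a list of n_c unnormalized scores.
    """
    best = [float("-inf")] * len(keys)
    for group in queries:
        for q in group:
            for j, k in enumerate(keys):
                logit = sum(a * b for a, b in zip(q, k))
                best[j] = max(best[j], logit)
    return best


# Toy data (assumed): one group, two prompt positions, three cached keys.
queries = [[[1.0, 0.0], [0.0, 1.0]]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
print(softmax_free_scores(queries, keys))   # [4.0, 4.0, 0.0]
```

Skipping the exponentiation and the row-wise normalization is what saves work in the scoring pass; the trade-off is that logits from different query rows are compared on their raw scale.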

4. Experimental Setup and Benchmark Performance

KVzip+ has been evaluated across LLM architectures (Qwen2.5, LLaMA3.1, Gemma3), with context lengths up to approximately 170K tokens and a variety of cache quantization strategies (e.g., 4-bit for LLaMA3-8B). Benchmark tasks span information retrieval (Needle-in-a-Haystack, synthetic code), QA and reasoning (SQuAD, GSM8K, SCBench), summarization, multiple-choice, in-context learning, and code comprehension (RepoQA).

KVzip+ achieves up to $4\times$ KV cache size reduction (25% retention) with less than 1% absolute accuracy drop on SQuAD and GSM8K, and approximately 94% performance retention on SCBench with only 30% of KV slots. FlashAttention decoding latency is halved as the cache shrinks, and scoring overhead is approximately $2\times$ that of the initial prefill, a cost that can be amortized over multiple downstream queries or minimized via static head-level scoring. Critically, KVzip+ outperforms query-aware baselines: at a retention ratio of 0.3, KVzip+ stays close to full-cache performance, while SnapKV and PyramidKV drop to approximately 70–80%. At 90% retention, the accuracy of these baselines typically declines by 5–10% in multi-query settings, a regime where KVzip+ remains robust (Kim et al., 29 May 2025).

5. Implementation Details and Hardware Considerations

KVzip+ is designed for practical integration with modern inference pipelines. Chunked scoring enables linear scaling for context windows up to 170K tokens, with working memory requirements dominated by the chunk size ($m$). The score buffer incurs an $O(L\,H\,n_c)$ cost, but decode-time storage is strictly proportional to retained slots. After scoring and eviction, quantization can be applied for compact cache storage (e.g., 4 GB → 0.4 GB at 30% retention with 4-bit quantization). The process is embarrassingly parallel over layers and chunks.
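As a rough check on how eviction and quantization compose, the following sketch multiplies the two reduction factors. It assumes the uncompressed cache stores 16-bit values and ignores quantization metadata (scales, zero-points) and slot indices, so real footprints would be somewhat larger than this idealized product; the function name is hypothetical.

```python
def compressed_cache_gb(full_gb, retention, bits, full_bits=16):
    """Idealized cache size after eviction and quantization, in GB.

    Assumes the full cache uses `full_bits` per value and that quantization
    adds no metadata overhead.
    """
    return full_gb * retention * bits / full_bits


size = compressed_cache_gb(4.0, retention=0.30, bits=4)
print(f"{size:.2f} GB")
```

For a 4 GB cache at 30% retention and 4-bit values, the idealized figure is about 0.3 GB, the same order of magnitude as the 0.4 GB quoted above.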

The method supports two principal deployment modes:

One-time offline: For static contexts or long-running conversations, precompute importance scores and evict post-preamble. The resulting compressed cache can be rapidly reused for diverse prompts.
Real-time: For highly dynamic or short-lived contexts, static head-level scoring (precomputed on representative data) amortizes the scoring cost across session instantiations.

Integration with hardware-aware attention kernels (e.g., custom FlashAttention with fused QK-max scoring, omitting softmax) is supported, trading off a modest (~10%) reduction in compression quality for improved throughput.

6. Limitations, Extensions, and Prospective Directions

Empirical evaluation demonstrates strong multi-query robustness, but there is no formal guarantee of information preservation: highly adversarial or out-of-domain queries may elicit degraded model output if crucial KV pairs have been discarded. Highly compressed caches may, in some cases, suppress personal or rare information (e.g., phone numbers), potentially affecting privacy and shallow alignment behaviors. Computationally, scoring is approximately twice as expensive as the initial prefill, though this expense is amortized for static or infrequently updated contexts.

KVzip+ suggests several avenues for further research, including leveraging learned or multi-task prompts for context reconstruction, adaptive chunk sizing, hierarchical and hybrid (e.g., clustering-based) post-scoring cache reductions, and fusing importance emergence directly into model weights via end-to-end training. A plausible implication is that integrating the scoring pipeline into LLM optimization may yield models with intrinsic KV slot saliency measures for efficient cache utilization.

7. Comparative Perspective

Relative to query-aware and vector-quantization approaches, KVzip+ uniquely provides a query-agnostic, multi-query–robust method for cache reduction. Query-aware methods must be rerun for each new prompt and do not generalize to multi-turn settings. Scalar quantization and per-token vector quantization (e.g., VQLLM, KIVI) are orthogonal and can be composed with KVzip+ post-eviction for further size reductions. Compared to CommVQ (Li et al., 23 Jun 2025), KVzip+ achieves robust multi-query generalization via reconstruction-backed importance, whereas CommVQ offers maximal fixed-ratio per-token quantization with hardware-aware design; the methods can target complementary operational regimes.

In summary, KVzip+ is a scalable and practical solution for robust, query-agnostic KV cache compression. By leveraging the LLM’s own cross-attention scores during context reconstruction, it achieves up to fourfold reductions in cache footprint and halved attention decoding time with negligible impact on downstream accuracy, outperforming prior query-aware eviction strategies, and is broadly applicable across LLM architectures and deployment scenarios (Kim et al., 29 May 2025).


References

Kim et al. (2025). KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction.

Li et al. (2025). CommVQ: Commutative Vector Quantization for KV Cache Compression.
