Query-Agnostic KV Cache Compression

Updated 29 December 2025
  • Query-agnostic KV cache compression is a method that preselects and compresses LLM key-value caches independent of future query tokens.
  • It leverages techniques such as SVD, frequency transforms, and mixed-precision quantization to balance strong memory reduction with minimal accuracy loss.
  • Empirical studies reveal significant cache reductions, throughput gains, and near-full accuracy in long-context LLM inference applications.

Query-agnostic key-value (KV) cache compression defines a class of methods for reducing the memory footprint of LLM/Transformer KV caches independently of any future, unknown query tokens. Unlike query-aware methods, which rely on attention patterns from specific prompts, query-agnostic compression must preselect and retain a subset or compressed transformation of the original KV cache such that any downstream prompt, question, or decoding step can be accurately serviced—ideally with minimal memory and compute overhead. Approaches span geometric scoring, value-centric matrix decompositions, frequency transforms, mixed-precision and entropy-coded quantization, cross-layer sharing, adaptive chunking, and architectural innovations. The sections below enumerate core algorithmic principles, summarize major methodological families, present relevant quantitative results, and position the field in the broader spectrum of long-context LLM inference research.

1. Definition, Scope, and Theoretical Foundations

Query-agnostic KV cache compression eliminates the need to recompute or inspect attention matrices derived from future queries during the compression stage. The core challenge is to decide, using only information available before any new token is generated, which KV representations to retain, merge, discard, or compress, while maximizing the ability to reconstruct attention outputs for arbitrary subsequent queries.

Theoretically, query-agnosticity imposes a constraint: the compression process cannot access the per-query $\operatorname{softmax}(QK^\top)$ attention structure. Consequently, effective schemes must exploit (a) statistical geometry of the keys and/or values, (b) attention patterns induced from the cached context itself, (c) dimensionality or data redundancy within the cache, or (d) task/semantic proxies for information diversity. Guaranteeing that for all queries $q$ the approximation error in $o_t = \operatorname{softmax}(qK^\top)V$ remains bounded typically involves spectral preservation (CUR, leverage scores), information-theoretic inequalities (see Compactor), or explicit unbiasedness principles (e.g., FAEDKV's IWDFT for equal positional representation) (Chari et al., 10 Jul 2025, Sengupta et al., 18 Sep 2025, Li et al., 26 Jul 2025).
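To make the constraint concrete, the toy sketch below (not drawn from any cited paper) compresses a synthetic cache using a cache-side score (key norms, standing in for the leverage and attention proxies discussed in the next section) and then measures attention-output error on queries sampled only after compression:

```python
# Toy illustration: compress a synthetic cache with a purely cache-side
# score, then check attention-output error on queries the compression
# step never saw.
import numpy as np

def attn_out(q, K, V):
    # o = softmax(q K^T / sqrt(d)) V for a single query vector q
    s = q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
T, d, m = 512, 64, 128                      # context length, head dim, budget
K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d))

# Query-agnostic selection: key norms serve here as a placeholder for the
# leverage/attention proxies surveyed in Section 2.
keep = np.argsort(-np.linalg.norm(K, axis=1))[:m]
Kc, Vc = K[keep], V[keep]

errs = [np.linalg.norm(attn_out(q, K, V) - attn_out(q, Kc, Vc))
        for q in rng.normal(size=(100, d))]
print(f"mean output error over unseen queries: {np.mean(errs):.4f}")
```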

2. Methodological Classes

The landscape of query-agnostic KV compression can be decomposed into the following principal mechanisms:

2.1 Token and Chunk Selection Schemes

  • Semantic retrieval and head-of-interest scoring: CompressKV first identifies, per layer, a handful of Semantic Retrieval Heads (SRHs) most responsible for retrieving semantically critical tokens, via aggregate attention across ground-truth spans. Token importance is then scored solely via these heads, and a layer-adaptive error-aware budget is assigned using normalized reconstruction losses. This avoids the known pitfall of streaming heads overfitting to prefix/suffix tokens (Lin et al., 4 Aug 2025).
  • Statistical leverage and attention blending: Compactor combines statistical-leverage "outlierness" (from the SVD of K or a random projection) with non-causal self-attention scores, producing a blended, parameter-free importance ranking for retention; a minimal leverage-score sketch appears after this list. It further supports context-calibrated dynamic compression to maximize memory savings at a fixed quality guarantee (Chari et al., 10 Jul 2025).
  • Value-guided decomposition: CurDKV uses CUR decomposition of V to compute token leverage scores, ensuring that retained KV pairs best preserve the attention output's dominant subspace. This value-centric approach directly targets reconstruction error in attention, unlike previous schemes based solely on Q–K affinities (Sengupta et al., 18 Sep 2025).
  • Semantic chunking: ChunkKV aggregates tokens into contiguous "semantic chunks" and computes representative chunk-level keys/values. Chunk selection is top-k based on context-relevant criteria, and layer-wise index reuse amortizes overhead and improves throughput (Liu et al., 1 Feb 2025).
  • Context reconstruction for universal reuse: KVzip scores the cache via self-supervised context reconstruction, measuring, for every cache element, the maximal attention it receives when the model is tasked with reconstructing its own context. The resulting context-agnostic scores generalize across arbitrary downstream queries (Kim et al., 29 May 2025).
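As a concrete illustration of the statistical-leverage idea behind Compactor, the minimal NumPy sketch below computes SVD-based row leverage scores for a key matrix and retains the highest-leverage tokens; the blending with non-causal attention and the context-calibrated compression described in the paper are omitted:

```python
import numpy as np

def leverage_scores(K, r=16):
    # Row leverage of K w.r.t. its top-r singular subspace:
    # l_i = ||U_r[i, :]||^2 where K = U S V^T. High-leverage rows are
    # geometric outliers that a rank-r view of the cache cannot explain.
    U, _, _ = np.linalg.svd(K, full_matrices=False)
    return np.sum(U[:, :r] ** 2, axis=1)

rng = np.random.default_rng(1)
K = rng.normal(size=(512, 64))                # (tokens, head dim)
keep = np.argsort(-leverage_scores(K))[:128]  # retain 128 high-leverage tokens
```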

2.2 Dimensionality and Structural Compression

  • Frequency-domain transform: FAEDKV compresses the KV cache in the Fourier domain, using an Infinite-Window DFT (IWDFT) that ensures every token contributes equally to each frequency coefficient. Offline frequency ablation selects the frequency bands critical for each layer, and only these are retained, ensuring unbiased positional representation (Li et al., 26 Jul 2025); a simplified transform sketch appears after this list.
  • Autoencoder and latent vector reuse: KV-CAR trains compact autoencoder pairs for each layer to minimize MSE in reconstructing both K and V, substantially reducing storage. Additionally, heads with high similarity across layers are reused, collapsing redundancy without architecture changes (Roy et al., 7 Dec 2025).
  • Cross-layer sharing: CommonKV uses SVD to extract a shared latent projection for adjacent layers, merging parameters and compressing the cache via latent vector pooling. A cosine-based similarity score dictates which layer groups are merged, with Fisher information weighting controlling merge precision (Wang et al., 22 Aug 2025).
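The following sketch conveys the frequency-domain idea in simplified form: a plain real DFT along the token axis stands in for FAEDKV's IWDFT, and the retained band is a hypothetical fixed choice rather than the paper's per-layer offline ablation:

```python
import numpy as np

def compress_freq(X, keep_bins):
    # Real DFT along the token axis; retain only the selected frequency bins.
    F = np.fft.rfft(X, axis=0)                  # (T//2 + 1, d) complex
    mask = np.zeros(F.shape[0], dtype=bool)
    mask[keep_bins] = True
    return F[mask], mask, X.shape[0]

def decompress_freq(Fk, mask, T):
    # Zero-fill the dropped bins and invert the transform.
    F = np.zeros((mask.size, Fk.shape[1]), dtype=complex)
    F[mask] = Fk
    return np.fft.irfft(F, n=T, axis=0)

rng = np.random.default_rng(2)
K = rng.normal(size=(512, 64))
# Hypothetical band choice (lowest 64 bins); FAEDKV instead selects bands
# per layer via offline frequency ablation.
Fk, mask, T = compress_freq(K, np.arange(64))
K_hat = decompress_freq(Fk, mask, T)
print("relative error:", np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```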

2.3 Precision Quantization

  • Mixed-precision partitioning: MiKV divides the cache into a small, high-precision importance set and a much larger, aggressively quantized retained set, using per-row and outlier-aware quantization with per-channel balancing to mitigate the error from extreme precision loss (Yang et al., 28 Feb 2024); a minimal mixed-precision sketch appears after this list.
  • Block/channelwise quantization and entropy coding: KVComp applies block-wise, channel-wise quantization to K and token-wise to V, followed by highly efficient GPU-parallel Huffman encoding. The process is fully query-agnostic; decompression and attention are fused into the same kernel for optimal performance (Jiang et al., 30 Aug 2025).
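Below is a minimal sketch of mixed-precision partitioning. The importance proxy (value norms) and the symmetric per-channel quantizer are illustrative stand-ins: MiKV derives importance from cached-context attention and adds outlier-aware balancing that this sketch omits.

```python
import numpy as np

def quantize_per_channel(X, bits):
    # Symmetric per-channel (per-column) quantization of the retained set.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(X).max(axis=0) / qmax + 1e-8
    q = np.clip(np.round(X / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(3)
V = rng.normal(size=(512, 64)).astype(np.float32)

# Hypothetical importance proxy (value norms); MiKV derives importance from
# cached-context attention instead. The 20%/80% split mirrors Section 4.
imp = np.linalg.norm(V, axis=1)
hi = np.argsort(-imp)[: int(0.2 * len(V))]   # kept at full precision
lo = np.setdiff1d(np.arange(len(V)), hi)     # aggressively quantized (INT2)

q, scale = quantize_per_channel(V[lo], bits=2)
V_hat = V.copy()
V_hat[lo] = dequantize(q, scale)
print("relative error:", np.linalg.norm(V - V_hat) / np.linalg.norm(V))
```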

2.4 System and Hardware-Centric Methods

  • Zero-delay architectural integration: FDC compresses Q, K, V natively within the projection matrices via an offline-baked SVD, producing all compact representations at runtime without any decompression step. Kernel grouping and workload balancing for varied head sizes further optimize compute efficiency (Zhang et al., 7 Aug 2024).
  • Merging-based cache reduction: KeepKV systematically merges near-duplicate KV pairs using an Electoral Votes bookkeeping scheme, extending standard merge approaches to maintain strict attention-distribution invariance (via closed-form "ZIP" merging) with provably bounded downstream perturbation, even after repeated merges (Tian et al., 14 Apr 2025); a simplified merging sketch appears after this list.
  • Two-stage hybrid top-k pruning: RocketKV combines a coarse, prompt-wide SnapKV++ eviction with a fast hybrid sparse top-k stage using page-based head/sequence min-max proxies for attention score approximation, yielding compression ratios up to 400× and decode speedups of up to 3.7× (Behnam et al., 19 Feb 2025).
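To illustrate merging-based reduction, the sketch below greedily folds near-duplicate keys into count-weighted centroids. It is a simplified stand-in for KeepKV, whose closed-form ZIP merge preserves the attention distribution exactly, which a plain weighted mean does not:

```python
import numpy as np

def merge_near_duplicates(K, V, counts, thresh=0.98):
    # Greedy single pass: fold token j into an earlier kept slot i when
    # their keys are nearly parallel. `counts` records how many original
    # tokens a slot represents (a simplified stand-in for KeepKV's
    # electoral-vote bookkeeping).
    kept_K, kept_V, kept_c = [], [], []
    for k, v, c in zip(K, V, counts):
        for i, kk in enumerate(kept_K):
            cos = k @ kk / (np.linalg.norm(k) * np.linalg.norm(kk) + 1e-8)
            if cos > thresh:
                w = kept_c[i] / (kept_c[i] + c)
                kept_K[i] = w * kk + (1 - w) * k        # count-weighted mean
                kept_V[i] = w * kept_V[i] + (1 - w) * v
                kept_c[i] += c
                break
        else:
            kept_K.append(k); kept_V.append(v); kept_c.append(c)
    return np.array(kept_K), np.array(kept_V), np.array(kept_c)

rng = np.random.default_rng(4)
base = rng.normal(size=(128, 64))
K = np.repeat(base, 2, axis=0) + 0.01 * rng.normal(size=(256, 64))
V = rng.normal(size=(256, 64))
Km, Vm, cm = merge_near_duplicates(K, V, np.ones(256))
print(len(K), "->", len(Km), "cache slots")   # duplicate pairs collapse
```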

2.5 Streaming and Vision-LLM Contexts

  • Streaming video with query proxies: StreamMem processes video streams in an online manner, using generic query proxies (e.g., chat template tokens) to compute token saliency, pruning the cache according to these query-agnostic attention scores and merging per-frame prototypes for hierarchical summarization (Yang et al., 21 Aug 2025).
  • Fixed memory cap for continuous streams: InfiniPot-V enforces a strict cap by mixing Temporal-axis Redundancy (TaR) and Value-Norm (VaN) ranking to select and fuse distinctive tokens, supporting arbitrary-length continuous streams (Kim et al., 18 Jun 2025).
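A minimal sketch of proxy-query eviction under a fixed cap, in the spirit of StreamMem's generic query proxies (per-frame prototype merging is omitted):

```python
import numpy as np

def streaming_evict(K, V, q_proxy, cap):
    # Score cached tokens by the attention they receive from *generic*
    # proxy queries (StreamMem uses chat-template tokens); keep the top cap.
    s = q_proxy @ K.T / np.sqrt(K.shape[1])       # (proxies, tokens)
    w = np.exp(s - s.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    saliency = w.mean(axis=0)                     # average over proxies
    keep = np.sort(np.argsort(-saliency)[:cap])   # preserve token order
    return K[keep], V[keep]

rng = np.random.default_rng(5)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
q_proxy = rng.normal(size=(4, 64))   # stand-ins for chat-template queries
Kc, Vc = streaming_evict(K, V, q_proxy, cap=256)
```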

3. Algorithmic Summaries and Core Formulas

| Method | Core Scoring/Compression Objective | Query-Agnostic? |
|---|---|---|
| CompressKV | Token importance via Semantic Retrieval Heads; error-aware layer budget | Yes |
| Compactor | SVD-based leverage blended with non-causal attention (no query info) | Yes |
| CurDKV | Value-centric leverage scores via CUR decomposition; top-k V row selection | Yes |
| ChunkKV | Chunk-level representation; top-k chunk retention; index reuse | Yes |
| FAEDKV | Fourier transform (IWDFT) of the KV sequence; keep high-impact frequencies | Yes |
| MiKV | Mixed high/low-precision partitioning with per-channel quantization | Yes |
| KVzip | Max attention received during a context-reconstruction task (repeat prompt) | Yes |
| KeepKV | Electoral-vote / ZIP merging with attention-preserving bookkeeping | Yes |
| FDC | SVD projections baked into QKV/linear weights; compressed at the source | Yes |
| KVComp | Quantization plus entropy coding, tightly co-designed for hardware | Yes |

Each method achieves query-agnosticity by not requiring access to future queries and instead depending solely on (i) context-side keys/values/hidden states/statistical properties, (ii) calibration on held-out corpora, or (iii) proxies derivable from stored representations.

4. Empirical Findings and Benchmark Comparisons

Key empirical observations:

  • Accuracy retention: CompressKV achieves >97% QA accuracy on LongBench with only 3% of the full KV cache (Lin et al., 4 Aug 2025). Compactor maintains full-KV-level accuracy (averaging 95–100% of full-KV score) at 38% token retention, while SnapKV/Pyramid require 72–75% retention for the same performance (Chari et al., 10 Jul 2025).
  • Throughput/memory efficiency: ChunkKV achieves up to 90% cache reduction with ≤2% performance loss and a 26.5% throughput gain on LLaMA-3-8B (Liu et al., 1 Feb 2025). RocketKV reaches a 400× compression with only 1.5% accuracy loss, delivering up to 3.7× decode speedup (Behnam et al., 19 Feb 2025).
  • Unbiased compression: FAEDKV’s frequency-based approach ensures no positional bias, outperforms SnapKV and eviction models by up to 22% on tight budgets, and delivers position-invariant retrieval accuracy in NIAH benchmarks (Li et al., 26 Jul 2025).
  • Streaming/multimodal context: InfiniPot-V reduces GPU memory by up to 94% within a fixed budget, with ≤1% loss in video question-answering accuracy (Kim et al., 18 Jun 2025). StreamMem demonstrates competitive performance to query-aware methods at a fraction of the memory/latency (Yang et al., 21 Aug 2025).
  • Quantization and mixed-precision: MiKV with a 20% high-precision importance partition and an 80% low-precision (INT2) retained set recovers 92–100% accuracy on retrieval tasks, whereas naive eviction or uniform quantization achieves only 4–43% (Yang et al., 28 Feb 2024). KVComp achieves up to 83% cache reduction without incurring more than a 3% drop in downstream metrics (Jiang et al., 30 Aug 2025).
  • CUR decomposition: CurDKV exceeds SnapKV by up to 9.6% accuracy at 30–50% retention on LongBench and by 18% in NIAH retrieval, while reducing generation latency by 40% (Sengupta et al., 18 Sep 2025).
  • Layer/cross-head sharing: CommonKV realizes ≈95% accuracy at 50% cache compression, and is highly synergistic with quantization and eviction, collectively achieving up to 98% total reduction (Wang et al., 22 Aug 2025).

5. Limitations, Trade-offs, and Best Practices

  • Fixed vs. adaptive budgets: Fixed-percentage or per-layer budgets can cause performance cliffs in “hard” contexts. Adaptive allocation using measured layer-wise sensitivity (CompressKV), dynamic singular-value drop (CurDKV), and context-calibrated quality fits (Compactor) reduces this risk (Lin et al., 4 Aug 2025, Sengupta et al., 18 Sep 2025, Chari et al., 10 Jul 2025); a budget-allocation sketch appears after this list.
  • Semantic vs. statistical selection: Purely geometric or matrix-decomposition-based methods may miss semantic importance latent in the context; semantic chunking (ChunkKV), context reconstruction (KVzip), and semantic head selection (CompressKV) mitigate this (Liu et al., 1 Feb 2025, Kim et al., 29 May 2025, Lin et al., 4 Aug 2025).
  • Precision–speed tradeoff: Mixed-precision and lossy quantization approaches introduce vanishing but nonzero errors; carefully designed outlier balancing (MiKV), channelwise scaling (KVComp), and high-precision “importance” partitions are required for robust generation (Yang et al., 28 Feb 2024, Jiang et al., 30 Aug 2025).
  • Layer-sharing and merge perturbations: Aggressive cross-layer merging (CommonKV) or pairwise ZIP merging (KeepKV) require error control to maintain output fidelity. Fine-grained similarity checking and closed-form perturbation bounds are essential (Wang et al., 22 Aug 2025, Tian et al., 14 Apr 2025).
  • Streaming settings: In video/multimodal LLMs, streaming query-agnostic compression must operate online under strict memory caps. Heuristics such as TaR/VaN (InfiniPot-V) and generic-proxy attention (StreamMem) are effective without query knowledge but may slightly underperform on highly query-specific tasks (Kim et al., 18 Jun 2025, Yang et al., 21 Aug 2025).
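As a sketch of the adaptive-budget idea noted above, the function below distributes a global token budget across layers in proportion to measured per-layer reconstruction losses; CompressKV's exact normalization differs:

```python
import numpy as np

def layer_budgets(recon_losses, total_budget, floor=0.02):
    # Allocate the global token budget across layers in proportion to each
    # layer's measured reconstruction loss, with a per-layer floor so no
    # layer is starved. (Generic sketch; CompressKV's normalization differs.)
    w = np.asarray(recon_losses, dtype=float)
    w /= w.sum()
    w = np.maximum(w, floor)
    w /= w.sum()
    return np.round(w * total_budget).astype(int)

print(layer_budgets([0.9, 0.2, 0.05, 0.4], total_budget=1024))
# -> larger budgets for layers whose cache is harder to reconstruct
```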

6. Synthesis and Outlook

Query-agnostic KV cache compression has matured into a field with a taxonomy of efficient, theoretically grounded schemes. These enable long-context LLM inference—often surpassing heuristic or entropy-based baselines in both memory reduction and downstream task preservation. The architecture-agnostic nature of most schemes (statistical leverage, frequency transforms, mixed-precision, SVD-baked projections) allows broad deployment across decoder-only LLMs, grouped query attention variants, and multimodal MLLMs.

Open research directions include: (1) integrating value-centric and geometric signals with lightweight learning or dynamic adaptation at runtime; (2) extending frequency-domain transforms to hierarchical or wavelet bases; (3) further reducing per-query runtime costs by fusing compression stages within hardware kernels (as in FDC and KVComp); (4) developing richer adaptive budget allocation using prefill statistics or incremental update policies; and (5) benchmarking methods under adversarial, multi-detail, and extreme long-context workloads.

The current literature demonstrates that, with careful token/scoring and compression methodology, query-agnostic schemes can approach or even match full-KV accuracy in diverse long-context and multi-query applications, while delivering substantial practical gains in memory and inference throughput (Lin et al., 4 Aug 2025, Godey et al., 4 Mar 2025, Chari et al., 10 Jul 2025, Sengupta et al., 18 Sep 2025, Kim et al., 18 Jun 2025, Wang et al., 22 Aug 2025, Yang et al., 28 Feb 2024, Liu et al., 1 Feb 2025, Kim et al., 29 May 2025, Li et al., 26 Jul 2025, Roy et al., 7 Dec 2025, Jiang et al., 30 Aug 2025, Zhang et al., 7 Aug 2024, Behnam et al., 19 Feb 2025, Tian et al., 14 Apr 2025, Yang et al., 21 Aug 2025).
