Query-Aware KV Cache Selection

Updated 14 May 2026

Query-aware KV cache selection is a set of techniques that dynamically match cached key-value pairs with the specific semantic intent of the user query.
Methods like ProphetKV, CapKV, and DapQ employ query-to-token scoring and mutual information metrics to selectively recompute and retain critical information.
Empirical results demonstrate improvements in accuracy (up to 50.9 percentage points) and memory efficiency by overcoming the limitations of query-agnostic heuristics.

Query-aware KV cache selection refers to the set of algorithmic and system-level techniques that explicitly incorporate the semantic or structural relationship between the current user query and the cached key-value (KV) pairs in large language model (LLM) architectures. The aim is to optimize which KV pairs are retained, recomputed, allocated higher precision, or dynamically recalled, based on their direct or inferred relevance to the explicit decoding query or the anticipated output sequence. Recent advances address both the inefficiencies and the correctness gaps that arise when KV cache management compresses, evicts, or prunes tokens without accounting for actual query intent or downstream context. This article surveys the main theoretical frameworks, algorithmic strategies, and empirical outcomes in the area, drawing from a cross-section of recent research.

1. Motivation and Semantic Limitations of Query-Agnostic Selection

Early KV cache management in long-context LLM inference typically relied on query-agnostic heuristics, such as retaining the most recent tokens, evicting by least-recently-used policy, or applying static sparsity patterns. These approaches either assume uniform token importance, ignore downstream decoding, or base retention on prompt-side attention scores only. As context windows scale into tens or hundreds of thousands of tokens and are used in retrieval-augmented generation (RAG), such heuristics incur significant accuracy penalties—including the well-documented "crowding-out effect," in which tokens globally important but irrelevant to the specific user query saturate the limited recomputation or retention budget, displacing genuinely query-critical evidence (Wang et al., 31 Jan 2026).

The semantic and structural gap arises because relevance, as needed for minimal-loss generation, is fundamentally query-conditioned—only a subset of tokens or features contribute to the model's predictive distribution for a specific query. Empirically, strictly query-agnostic selection can drop retrieval QA or reasoning performance by up to 86% relative to full recomputation at strict compression budgets (Wang et al., 31 Jan 2026), driving research toward explicitly query-aware methods.

2. User-Query-Aligned Token Selection: Direct and Surrogate Approaches

Query-aware selection encompasses several algorithmic paradigms. A recurring design is direct ranking or scoring of tokens (or KV features) by their relevance to the current or anticipated query, often in the form of:

Query-to-token attention metrics (e.g., summing attention weights from the current query to each candidate token (Wang et al., 31 Jan 2026, Wang et al., 11 Mar 2025)),
Mutual information or information-theoretic proxies for the predictive contribution of retained KV pairs on future queries (Yang et al., 28 Apr 2026),
Statistical leverage scores, which quantify the incremental predictive capacity of a token's value vector under expected query distributions (Yang et al., 28 Apr 2026).

ProphetKV: User-Query-Driven Recomputation in RAG

ProphetKV exemplifies an explicit, user-query-driven approach for RAG, where only tokens receiving significant attention from actual query positions are prioritized for selective recomputation. The procedure:

Runs a lightweight forward pass using only query tokens to compute average query-to-token attention scores per layer.
Fuses these scores across layers.
Selects the top fraction (e.g., 20%) of tokens by aggregate attention.
Restricts recomputation to these indices, bridging the cross-chunk attention gap with a minimal budget.

This pipeline is designed to minimize an upper bound on the semantic loss between full-prefill attention and reused-cache attention. With a recomputation budget as low as 20%, it retains 96–101% of full-prefill accuracy and improves on query-agnostic baselines by up to 50.9 percentage points (Wang et al., 31 Jan 2026). Its effectiveness stems from using the real query to identify which cross-attention information must be recovered, directly mitigating the crowding-out caused by query-irrelevant but globally salient tokens.
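The scoring-and-selection steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the ProphetKV implementation: the function name `select_for_recompute`, the single-head layout, and the use of a plain mean to fuse scores across layers are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def select_for_recompute(query_states, key_states, budget=0.2):
    """Rank context tokens by fused query-to-token attention.

    query_states: list (one entry per layer) of [n_query, d] arrays.
    key_states:   list (one entry per layer) of [n_ctx, d] cached key arrays.
    budget:       fraction of context tokens to mark for recomputation.
    Returns (sorted selected indices, fused score per context token).
    """
    n_ctx = key_states[0].shape[0]
    fused = np.zeros(n_ctx)
    for q, k in zip(query_states, key_states):
        # Attention from query positions only onto the cached context keys.
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # [n_query, n_ctx]
        fused += attn.mean(axis=0)  # average over query positions
    fused /= len(key_states)  # mean fusion across layers (an assumption)
    n_keep = max(1, int(round(budget * n_ctx)))
    top = np.argpartition(-fused, n_keep - 1)[:n_keep]
    return np.sort(top), fused
```

In a real model the per-layer query and key states would come from the lightweight query-only forward pass; here they are simply arrays supplied by the caller, and multi-head attention would add one more averaging axis.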

CapKV: Mutual-Information–Maximizing Eviction

CapKV formalizes cache retention as maximizing the conditional mutual information between future query vectors and retained KV pairs using a linear-Gaussian surrogate of attention. The resultant selection objective is to maximize

$I(q;Y|Z_C) = \frac{1}{2} \log\det\left[I+\Sigma_\text{noise}^{-1} A \Lambda_Q A^\top\right]$

where $A$ encodes the directions from retained value vectors, $\Sigma_\text{noise}$ is the noise covariance of the surrogate, and the weighting $\Lambda_Q$ is query-informed (e.g., exponentially increasing with key-query alignment). Practical implementations score tokens with statistical leverage scores weighted by running means or covariances of recent query statistics. Conditioning on instantaneous or historical queries allows the algorithm to preserve the maximum predictive information for actual, rather than hypothetical, queries (Yang et al., 28 Apr 2026).
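To make the log-determinant objective concrete, the sketch below greedily picks value vectors that maximize $\frac{1}{2}\log\det(I + \sigma^{-2} A_S \Lambda_Q A_S^\top)$ for an isotropic noise covariance $\sigma^2 I$. It is a generic greedy log-det maximizer written for illustration, not CapKV's published algorithm; `greedy_logdet_select`, `query_cov` (standing in for $\Lambda_Q$), and `noise_var` are hypothetical names. The per-candidate quantity `lev` plays the role of a ridge-style leverage score, and the inverse is maintained with a Sherman-Morrison rank-one update.

```python
import numpy as np

def greedy_logdet_select(values, query_cov, budget, noise_var=1.0):
    """Greedily select rows of `values` to maximize a log-det objective.

    values:    [n, d] candidate value vectors (rows of A).
    query_cov: [d, d] symmetric PSD query weighting (stand-in for Lambda_Q).
    budget:    number of rows to retain.
    noise_var: scalar variance of the assumed isotropic noise.
    Returns (selected indices in pick order, objective value in nats).
    """
    n, d = values.shape
    # Factor Lambda_Q = L L^T so that u_i . u_j = a_i^T Lambda_Q a_j / noise_var.
    L = np.linalg.cholesky(query_cov + 1e-9 * np.eye(d))
    U = (values @ L) / np.sqrt(noise_var)
    Minv = np.eye(d)  # inverse of I + sum of u u^T over selected rows
    available = np.ones(n, dtype=bool)
    selected, total = [], 0.0
    for _ in range(min(budget, n)):
        lev = np.einsum("ij,jk,ik->i", U, Minv, U)  # u_i^T Minv u_i
        lev[~available] = -np.inf
        i = int(np.argmax(lev))
        total += 0.5 * np.log1p(lev[i])  # marginal gain of adding row i
        w = Minv @ U[i]
        Minv -= np.outer(w, w) / (1.0 + lev[i])  # Sherman-Morrison update
        selected.append(i)
        available[i] = False
    return selected, total
```

A useful property of this formulation is that a near-duplicate of an already retained vector has a small marginal gain, so the selection naturally avoids spending budget on redundant tokens.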

DapQ: Decoding-Aligned Selection with Position-Aware Pseudo Queries

DapQ addresses the failures of methods that score tokens from prompt-side observations alone. It synthesizes a small set of pseudo queries, positioned as they would be during generation, and computes the aggregate attention from these pseudo queries onto all prompt tokens. Ablation studies show that the positional embedding predominates in determining which prompt tokens will be crucial for future decoding. Retaining the top-scoring tokens identified by summed pseudo-query attention achieves near-lossless performance under tight compression (99.5% on NIAH with only 3% of the KV cache), dramatically outperforming prompt-only scoring (Tian et al., 12 Mar 2026).
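The role of position can be illustrated with a small sketch in which pseudo queries are rotated to the positions that decoding steps would occupy, immediately after the prompt. This is an assumption-laden illustration rather than DapQ's procedure: it assumes a model with rotary position embeddings (rotate-half form), leaves the content of the pseudo queries to the caller, and uses the hypothetical names `rope` and `pseudo_query_keep`.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embedding (rotate-half form) to the rows of x.

    x must have an even feature dimension; positions has one entry per row.
    """
    half = x.shape[-1] // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.asarray(positions, dtype=float)[:, None] * inv_freq[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def pseudo_query_keep(pseudo_content, raw_keys, keep_ratio):
    """Score prompt tokens by attention from pseudo queries at future positions.

    pseudo_content: [m, d] un-rotated pseudo-query vectors (caller-supplied).
    raw_keys:       [n_ctx, d] un-rotated prompt keys.
    keep_ratio:     fraction of prompt tokens to retain.
    Returns (sorted indices to keep, summed attention score per prompt token).
    """
    n_ctx, d = raw_keys.shape
    m = pseudo_content.shape[0]
    keys = rope(raw_keys, np.arange(n_ctx))
    # Place pseudo queries where decoding would happen: right after the prompt.
    pseudo_q = rope(pseudo_content, n_ctx + np.arange(m))
    logits = pseudo_q @ keys.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    scores = attn.sum(axis=0)  # summed pseudo-query attention
    n_keep = max(1, int(round(keep_ratio * n_ctx)))
    return np.sort(np.argsort(-scores)[:n_keep]), scores
```

Because rotary embeddings make the query-key dot product depend only on relative offset, scoring from the wrong position (for example, from inside the prompt) changes which tokens look important, which is one way to read the ablation finding quoted above.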

3. Query-Aware Selection at Various Structural Granularities

Beyond token-level selection, query-aware approaches have been generalized to:

Page/block selection: Systems such as TinyServe compute per-page upper bound scores on query-key activations (bounding-box metadata) for dynamic, per-step query-aware top-K page selection and sparse loading (Liu et al., 28 Aug 2025).
Channel-level or feature-level selection: SPARK and MixKVQ perform query-conditioned channel pruning or mixed-precision quantization, selecting channels for retention or higher-bitwidth representation based on the product of quantization hardness and recent query activity (Liao et al., 21 Aug 2025, Zhang et al., 22 Dec 2025). This reflects the observation that only a sparse subset of channels exhibits high salience for any given query step.
Semantic clustering and k-NN retrieval: IceCache uses a DCI-tree to cluster semantically similar tokens, with query-time top-K nearest neighbor search to fetch pages most relevant to the instantaneous query (Mao et al., 12 Apr 2026).
Graph-based selection: GraphKV builds a semantic similarity graph over tokens and propagates decay signals to dynamically reduce redundancy and promote diversity among tokens with high query-aligned scores, ensuring that context both matches and complements the present query (Li et al., 30 Aug 2025).
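As an illustration of the page-level case, the sketch below stores a per-page bounding box (element-wise minimum and maximum of the keys) and uses it to compute an upper bound on the query-key dot product for every key in the page. It is a minimal sketch of the bounding-box idea only; the exact metadata and scoring used by TinyServe may differ, and the names `build_page_metadata`, `page_upper_bounds`, and `select_pages` are hypothetical.

```python
import numpy as np

def build_page_metadata(keys, page_size):
    """Compute per-page element-wise min and max of the cached keys.

    keys: [n_tokens, d]; the last page may be shorter than page_size.
    Returns (mins, maxs), each of shape [n_pages, d].
    """
    n, d = keys.shape
    n_pages = (n + page_size - 1) // page_size
    mins = np.empty((n_pages, d))
    maxs = np.empty((n_pages, d))
    for p in range(n_pages):
        block = keys[p * page_size:(p + 1) * page_size]
        mins[p] = block.min(axis=0)
        maxs[p] = block.max(axis=0)
    return mins, maxs

def page_upper_bounds(query, mins, maxs):
    """Upper bound on q . k for any key inside each page's bounding box.

    Per dimension, the largest possible product is attained at one of the
    two box corners, so we take the larger of q*min and q*max and sum.
    """
    return np.maximum(query * mins, query * maxs).sum(axis=1)

def select_pages(query, mins, maxs, top_k):
    """Return the (sorted) indices of the top_k pages by upper-bound score."""
    ub = page_upper_bounds(query, mins, maxs)
    k = min(top_k, len(ub))
    return np.sort(np.argsort(-ub)[:k])
```

The metadata costs two vectors per page regardless of page size, so the per-step selection touches far less memory than scoring every cached key, at the price of occasionally loading a page whose bound is loose.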

4. Application-Specific and Task-Aware Optimization

Several methods tailor query-aware selection for domain- or application-specific features:

RAG and prefill reuse: ProphetKV targets chunk-assembled RAG prompts by computing attention exclusively from query tokens, ensuring that recomputation budget is allocated where it will reconstruct missing information critical to the current user question (Wang et al., 31 Jan 2026).
Relational schema and structured queries: TableCache precomputes per-table KV blocks incorporating schema dependencies, employs trie-based lookup for mapping incoming queries to required tables, and reranks query execution to maximize table cache reuse across query batches (Su et al., 13 Jan 2026).
Task-adaptive retention: WindowKV employs a lightweight task classifier to adapt window selection granularity (localization vs. aggregation) according to the downstream prompt, e.g., QA or summarization, modulating query-to-window attention scoring accordingly (Zuo et al., 23 Mar 2025).

5. Empirical Benchmarks and Demonstrated Gains

Query-aware KV cache selection methods report consistent, often dramatic, improvements over query-agnostic or heuristic baselines:

ProphetKV achieves 8.8–50.9 percentage point absolute accuracy improvements on RULER and LongBench compared to state-of-the-art partial recomputation (Wang et al., 31 Jan 2026).
DapQ outperforms prior prompt-side-only selection algorithms by over 30 percentage points under extreme KV budgets (1–3% cache retained), retaining near-perfect accuracy on retrieval benchmarks (Tian et al., 12 Mar 2026).
CapKV and GraphKV improve over prior attention- and geometry-based heuristics, demonstrating both smoother accuracy decay under increasing compression and higher robustness to multi-key and long-context input regimes (Yang et al., 28 Apr 2026, Li et al., 30 Aug 2025).
SPARK and MixKVQ achieve more than 75% memory savings with negligible loss on reasoning benchmarks by focusing compression budgets on the small subset of tokens and channels that actively contribute to real-time attention, with reported throughput doubling when the memory budget is held fixed (Liao et al., 21 Aug 2025, Zhang et al., 22 Dec 2025).
WindowKV and TableCache preserve near-full-KV downstream performance while shrinking retained tokens to as low as 12–25% of full cache size (Zuo et al., 23 Mar 2025, Su et al., 13 Jan 2026).
System-level implementations such as FreeKV and TinyServe achieve up to 13× reduction in retrieval latency or 3.4× GPU efficiency speedup with robust query-aligned recall (Liu et al., 19 May 2025, Liu et al., 28 Aug 2025).

Table: Representative Empirical Results from Query-Aware KV Cache Selection Literature

Method	Benchmark	Reported accuracy (% or change, as stated per row)	Memory/Compute Budget
ProphetKV	RULER	96–101	20% recomputation
DapQ	NIAH	99.5	3% KV cache
CapKV	LongBench	+1–3 over next best	0.25–0.9 compression
IceCache	LongBench	99.0	25% KV cache on GPU
WindowKV	LongBench	–0.88 pt (trailing full)	12% per-layer cache
MixKVQ (C2.3–2.7)	Reasoning	<2 loss vs. BF16	17% of original KV size

All table values reflect results reported directly in the cited sources (Wang et al., 31 Jan 2026, Tian et al., 12 Mar 2026, Yang et al., 28 Apr 2026, Mao et al., 12 Apr 2026, Zuo et al., 23 Mar 2025, Zhang et al., 22 Dec 2025).

6. Practical Guidelines, Trade-Offs, and Open Directions

Implementing query-aware selection requires operational integration with existing inference pipelines, balancing algorithmic overhead with memory, computational, and downstream accuracy constraints. Key recommendations include:

Use forward attention from actual or synthesized queries to score candidates (not from prompt tokens alone) (Wang et al., 31 Jan 2026, Tian et al., 12 Mar 2026).
Maintain rolling or empirically updated statistics of recent queries to adapt scoring/eviction thresholds online (Yang et al., 28 Apr 2026, Zhang et al., 22 Dec 2025).
For maximal benefit at tight budgets, combine granularity: token, window, channel, and structural (table/page/block) levels as appropriate (Zuo et al., 23 Mar 2025, Liao et al., 21 Aug 2025, Su et al., 13 Jan 2026).
Employ system-level pipelining and out-of-path selection where possible to hide the selection and transmission overhead, as in FreeKV’s speculative recall (Liu et al., 19 May 2025).
Tune compression, quantization, or pruning ratios to the latency/accuracy budget and empirical task sensitivity.
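The recommendation to keep rolling query statistics can be sketched as an exponential moving average of the query second moment, which yields an expected squared alignment $k^\top \mathbb{E}[qq^\top] k$ for each cached key. This is one simple way to realize the idea under stated assumptions, not a method taken from the cited papers; `RollingQueryStats`, `evict_lowest`, and the decay value are hypothetical.

```python
import numpy as np

class RollingQueryStats:
    """Exponential moving average of the query second moment E[q q^T]."""

    def __init__(self, dim, decay=0.9):
        self.decay = decay
        self.second_moment = np.zeros((dim, dim))
        self.steps = 0

    def update(self, query):
        """Fold one decoding-step query vector into the running statistics."""
        q = np.asarray(query, dtype=float)
        self.second_moment = (self.decay * self.second_moment
                              + (1.0 - self.decay) * np.outer(q, q))
        self.steps += 1

    def expected_sq_alignment(self, keys):
        """Return k^T E[q q^T] k for each row of keys (bias-corrected EMA)."""
        correction = 1.0 - self.decay ** self.steps if self.steps else 1.0
        m = self.second_moment / correction
        return np.einsum("id,de,ie->i", keys, m, keys)

def evict_lowest(stats, keys, n_evict):
    """Indices of the n_evict keys least aligned with recent queries."""
    scores = stats.expected_sq_alignment(keys)
    return np.sort(np.argsort(scores)[:n_evict])
```

A smaller decay adapts faster to topic shifts but makes the scores noisier, which is the same latency/accuracy style of trade-off the tuning advice above refers to.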

Open challenges involve further aligning retention with non-local or multi-hop reasoning needs, robust adaptation to dynamically shifting long-context prompts, and coordinated multi-user or multi-query sharing in batch inference (Wu et al., 26 Jan 2026).

7. Theoretical Frameworks and Limitations

Information-theoretic perspectives (e.g., CapKV) unify many heuristics under a single capacity-maximization principle, admitting provable (1–1/e)-approximate guarantees for greedy selection. However, practical integration may still require approximations for high-speed batch inference at scale. Empirically, all such methods remain sensitive to strong topic non-stationarity and may require hybridization with geometric or attention-centric proxies when queries are not well-represented by recent statistics or when streaming/incremental centroids require approximation. Structural query invariances (e.g., join graphs in Text-to-SQL or table repeats across queries) can also be leveraged effectively for further speedups and reuse (Su et al., 13 Jan 2026, Wu et al., 26 Jan 2026).

In summary, query-aware KV cache selection systematically aligns memory and compute allocation with the true semantic and structural needs of the model’s next decoding step by explicit modeling of the query-KV relationship. This direction has enabled near-oracle accuracy under strict memory and recomputation budgets, with sharp improvements in efficiency and accuracy relative to query-agnostic selection. The field now covers token, channel, block, and even structural-contextual KV selection, underpinned by a mixture of empirical scoring, information-theoretic optimization, and real-world deployment in large-scale serving systems (Wang et al., 31 Jan 2026, Yang et al., 28 Apr 2026, Tian et al., 12 Mar 2026, Zhang et al., 22 Dec 2025, Liu et al., 28 Aug 2025, Zuo et al., 23 Mar 2025).