Surrogate Scoring for KV Eviction
- Surrogate scoring identifies less critical KV-cache entries for targeted eviction using attention-derived proxy metrics rather than direct loss feedback.
- The field spans diverse algorithmic methodologies, including attention-based, proxy-model, and hash-based strategies, to optimize cache retention with minimal performance loss.
- Empirical results demonstrate over 90% accuracy retention at aggressive cache budgets, with significant memory reduction and increased throughput across large language models and vision transformers.
Surrogate Scoring for Efficient KV Eviction
KV-cache eviction is a class of inference-time techniques for reducing the memory footprint of attention-based models, especially LLMs and vision transformers, by removing or compressing less critical key-value (KV) pairs from the cache. Surrogate scoring refers to the formalization and computation of token- or segment-level importance using proxy metrics or auxiliary models, in order to identify which cached entries can be pruned with minimal impact on model output quality. Recent work has established surrogate scoring as a central principle in efficient cache management, driving advances in throughput, scalability, and robustness across diverse architectures and long-context benchmarks (Zhao et al., 3 Aug 2025, Wang et al., 11 Mar 2025, Wang et al., 24 May 2025, Liao et al., 29 Nov 2025, Gu et al., 4 Jun 2025, Chen et al., 2024, Tian et al., 14 Apr 2025, Zeng et al., 2024).
1. Fundamental Concepts in Surrogate Scoring
Surrogate scoring functions quantify the estimated future relevance of cached tokens without direct, stepwise gradient or loss feedback. Most approaches derive the score from attention weights, features of the hidden states, or auxiliary models. Canonical surrogate scores include:
- Accumulated attention from queries to keys (classical, but suffers from positional bias) (Gu et al., 4 Jun 2025, Chen et al., 2024).
- Mean or max attention from a recent (or pseudo) window of queries (as in SAGE-KV, Lookahead Q-Cache) (Wang et al., 11 Mar 2025, Wang et al., 24 May 2025).
- Proxies derived from small LLMs or learnable tokens that approximate importance under global or future attention (SmallKV, Judge Q) (Zhao et al., 3 Aug 2025, Liu et al., 13 Sep 2025).
- Redundancy or diversity metrics—for example, cosine dissimilarity to mean anchor (KeyDiff), output error after token drop (CAOTE), or Hamming distance under locality-sensitive hashing (HashEvict) (Park et al., 21 Apr 2025, Goel et al., 18 Apr 2025, Liu et al., 2024).
- Value vector statistics to refine or correct attention-derived metrics (AhaKV, CAOTE, KeepKV) (Gu et al., 4 Jun 2025, Goel et al., 18 Apr 2025, Tian et al., 14 Apr 2025).
Surrogate scores are typically computed immediately before an eviction event, and are used to select the subset of cache entries to retain (e.g., top-k by score) or to merge, relocate, or compress them into different storage classes.
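The selection primitive that most of these methods instantiate is a straightforward retain-top-k over the surrogate scores. A minimal illustrative sketch follows, assuming per-head NumPy tensors; the names `evict_by_score` and `budget` are ours, not drawn from any cited paper.

```python
# Minimal sketch of the generic surrogate-scoring eviction step (illustrative
# only; not the implementation of any specific cited method).
import numpy as np

def evict_by_score(keys: np.ndarray, values: np.ndarray,
                   surrogate_scores: np.ndarray, budget: int):
    """Keep the `budget` cache entries with the highest surrogate scores.

    keys, values: (seq_len, head_dim) cached K/V for one head.
    surrogate_scores: (seq_len,) proxy importance per cached token.
    """
    keep = np.argsort(surrogate_scores)[-budget:]   # top-k by score
    keep.sort()                                     # restore positional order
    return keys[keep], values[keep], keep
```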
2. Algorithmic Methodologies
Approaches to surrogate scoring and KV eviction fall into several methodological families:
2.1. Attention-Based and Temporal Scoring
The classic family uses accumulated or rolling-window attention scores as surrogates for future importance, where for token $i$ the score is the attention mass it receives from a set $\mathcal{W}$ of query positions (the full prefix or a recent window):

$$s_i = \sum_{t \in \mathcal{W}} A_{t,i},$$

with $A_{t,i}$ the softmax attention weight from query position $t$ to cached key $i$.
This principle underlies H₂O, RoCo, and SAGE-KV, with extensions for blocked, grouped, or mean/max-reduced variants for efficiency and stability (Ren et al., 2024, Wang et al., 11 Mar 2025). To correct for bias, AhaKV introduces adaptive softmax scaling, where the softmax temperature is analytically set to maintain fixed support size independent of the cache length (Gu et al., 4 Jun 2025).
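For concreteness, the sketch below computes this accumulated-attention surrogate under plain softmax attention, with an optional rolling window over recent queries; causal masking and AhaKV's adaptive temperature are omitted, and all identifiers are ours.

```python
# Sketch of the accumulated-attention surrogate (H2O-style). Assumes standard
# softmax attention; causal masking is omitted for brevity.
import numpy as np

def accumulated_attention_scores(q: np.ndarray, k: np.ndarray,
                                 recent_window: int | None = None) -> np.ndarray:
    """q: (n_queries, d), k: (n_keys, d). Returns (n_keys,) scores s_i."""
    logits = q @ k.T / np.sqrt(k.shape[-1])               # (n_q, n_k)
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                   # row-wise softmax
    if recent_window is not None:
        attn = attn[-recent_window:]                      # rolling query window
    return attn.sum(0)                                    # s_i = sum_t A_{t,i}
```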
2.2. Proxy and Surrogate Model Approaches
Recent methods employ auxiliary models or proxy computations as surrogates:
- SmallKV utilizes a small-scale LLM surrogate, matching each head in the large model to a head in the small model via Jaccard similarity over top-K attended tokens. The small model’s full attention cache is then employed to proxy global token importance, mitigating both saliency shift and the over-compression of marginally important tokens (Zhao et al., 3 Aug 2025).
- Judge Q adds a set of trainable soft tokens to the input; their attention maps are aligned to those of actual decoded tokens during training, providing global context for importance estimation at prefill time (Liu et al., 13 Sep 2025).
- KVzap trains a lightweight layer-wise function to approximate per-token, per-head log attention/value scores, dramatically reducing inference overhead versus oracle two-pass methods (Jegou et al., 12 Jan 2026).
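To make the SmallKV pairing step concrete, here is a hedged sketch of Jaccard matching over top-K attended token sets as described above; the actual implementation may differ, and all identifiers are ours.

```python
# Sketch of Jaccard head matching in the spirit of SmallKV: each large-model
# head is paired with the small-model head whose top-K attended token set
# overlaps most (illustrative; not SmallKV's actual code).
import numpy as np

def topk_set(attn_row: np.ndarray, k: int) -> set:
    return set(np.argsort(attn_row)[-k:].tolist())

def match_heads(large_attn: np.ndarray, small_attn: np.ndarray, k: int = 32):
    """large_attn: (H_L, n_keys), small_attn: (H_S, n_keys), attention over the
    same token positions. Returns the best small-model head per large head."""
    matches = []
    for hl in range(large_attn.shape[0]):
        s_l = topk_set(large_attn[hl], k)
        overlaps = [len(s_l & topk_set(small_attn[hs], k)) /
                    len(s_l | topk_set(small_attn[hs], k))
                    for hs in range(small_attn.shape[0])]
        matches.append(int(np.argmax(overlaps)))
    return matches
```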
2.3. Lookahead and Query Augmentation
Lookahead Q-Cache generates pseudo-query vectors by partially decoding with a pruned cache, and uses these as surrogates for future inference-time queries in scoring retention candidates (Wang et al., 24 May 2025). This approach consistently outperforms pure prefill attention strategies under tight budgets.
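A toy, self-contained sketch of the lookahead principle: a handful of pseudo "future" queries score the cache instead of prefill queries. In Lookahead Q-Cache the pseudo-queries come from partially decoding with a pruned cache; the `transition` callable below is a hypothetical stand-in for that decoding step.

```python
# Toy illustration of lookahead scoring; `transition` is a placeholder for
# "decode one step with the pruned cache and return the next query vector".
import numpy as np

def lookahead_scores(keys: np.ndarray, last_query: np.ndarray,
                     transition, n_pseudo: int = 4) -> np.ndarray:
    """keys: (n_keys, d); last_query: (d,). Returns (n_keys,) scores."""
    q, pseudo = last_query, []
    for _ in range(n_pseudo):
        q = transition(q)                          # pseudo future query
        pseudo.append(q)
    logits = np.stack(pseudo) @ keys.T / np.sqrt(keys.shape[-1])
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)            # softmax over cached keys
    return attn.sum(0)                             # mass from pseudo-queries
```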
2.4. Redundancy, Diversity, and Value-Based Methods
KeyDiff selects tokens by geometric distinctiveness in key space, maximizing key diversity as a proxy for future importance (i.e., retaining keys with low cosine similarity to the mean anchor key) (Park et al., 21 Apr 2025). CAOTE and AhaKV integrate the value vector or attention output error directly into the scoring function, with CAOTE providing a closed form for the output perturbation after each potential eviction (Goel et al., 18 Apr 2025, Gu et al., 4 Jun 2025). KeepKV further eliminates first-step inference perturbation via a merging-with-votes mechanism and a zero-output-difference constraint (Tian et al., 14 Apr 2025).
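A minimal sketch of KeyDiff-style diversity scoring, assuming the anchor is the mean key as described above (per-head and batching details elided; names ours):

```python
# Sketch of KeyDiff-style scoring: keys least similar (in cosine) to the mean
# anchor are deemed most distinctive and retained. O(N) in cached keys.
import numpy as np

def keydiff_scores(keys: np.ndarray) -> np.ndarray:
    """keys: (n_keys, d). Higher score = more geometrically distinctive."""
    anchor = keys.mean(axis=0)
    anchor /= np.linalg.norm(anchor)
    unit_keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return -(unit_keys @ anchor)        # negative cosine similarity to anchor
```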
2.5. Hybrid and Randomized Strategies
NACL and related approaches combine attention-proxy selection (via a task-relevant subset such as final question tokens) with random sampling of the cache under a distribution derived from proxy scores, improving robustness and reducing attention bias (Chen et al., 2024). SkipKV leverages sentence-level redundancy scoring (pairwise similarity of pooled hidden activations) and an adaptive steering direction to enhance semantic coherence and brevity (Tian et al., 8 Dec 2025).
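The retain-then-sample pattern can be sketched as below; the greedy/sampled split ratio and the softmax-over-scores sampling distribution are our own illustrative choices, not NACL's exact configuration.

```python
# Hybrid retention sketch: part of the budget is spent greedily on top scores,
# the remainder is sampled from a score-derived distribution so that
# low-scoring tokens retain some probability of survival.
import numpy as np

def hybrid_retain(scores: np.ndarray, budget: int, greedy_frac: float = 0.5,
                  rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng()
    n_greedy = int(budget * greedy_frac)
    greedy = np.argsort(scores)[-n_greedy:]
    rest = np.setdiff1d(np.arange(len(scores)), greedy)
    p = np.exp(scores[rest] - scores[rest].max())
    p /= p.sum()                                    # sampling distribution
    sampled = rng.choice(rest, size=budget - n_greedy, replace=False, p=p)
    return np.sort(np.concatenate([greedy, sampled]))
```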
2.6. Pre-Attention and Hash-Based Surrogates
HashEvict applies a pre-attention, locality-sensitive hash scoring between the current query and each cached key. The Hamming distance between binary projections serves as a scalable proxy for expected cosine similarity and thus attention weight, permitting fast eviction on GPU with minimal overhead (Liu et al., 2024).
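A hedged sketch of the pre-attention LSH score: random sign projections shared between the query and the cached keys, with negative Hamming distance as the retention score. The projection length `m` and the dense boolean representation are ours; HashEvict's GPU kernel packs bits and differs in detail.

```python
# Pre-attention LSH scoring sketch (SimHash-style): small Hamming distance
# between binary codes proxies small angle, hence high expected attention.
import numpy as np

def simhash_codes(x: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """x: (n, d), proj: (d, m). Returns (n, m) boolean sign codes."""
    return x @ proj > 0

def hamming_scores(query: np.ndarray, keys: np.ndarray, m: int = 64,
                   rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    proj = rng.standard_normal((keys.shape[-1], m))  # shared random projection
    q_code = simhash_codes(query[None, :], proj)[0]
    k_codes = simhash_codes(keys, proj)
    hamming = (q_code != k_codes).sum(axis=1)        # distance per cached key
    return -hamming                                  # closer codes score higher
```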
3. Error Sources, Bias Mitigation, and Robustness
Several empirical and theoretical analyses have shown that naïve application of accumulated attention as a retention surrogate induces significant bias, most notably favoring either the oldest tokens (due to early attention concentration) or most recent tokens (due to autoregressive de-normalization) (Gu et al., 4 Jun 2025, Chen et al., 2024). AhaKV analytically proves and corrects for positional decay by coupling softmax scaling to attention entropy, and further leverages value prior statistics to diversify retention.
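This age bias is easy to reproduce numerically. The toy check below, assuming plain causal softmax attention over random vectors, shows accumulated scores inflating for early tokens even when keys carry no signal:

```python
# Demonstration of positional bias in naive accumulated attention: under a
# causal mask, early tokens are visible to every later query, so their
# accumulated scores grow with age even for random, uninformative keys.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 64
q, k = rng.standard_normal((n, d)), rng.standard_normal((n, d))
logits = q @ k.T / np.sqrt(d)
logits[np.triu_indices(n, 1)] = -np.inf           # causal mask
attn = np.exp(logits - logits.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
scores = attn.sum(0)                              # accumulated attention
print(scores[:8].mean(), scores[-8:].mean())      # early mean >> late mean
```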
Eviction methods that use only attention scores can suffer from saliency shift and ignore collectively important but marginal tokens. SmallKV addresses these by consulting an auxiliary small model’s full cache at each decode step (Zhao et al., 3 Aug 2025).
In scenarios where attention importance is difficult to predict (e.g., multi-batch reasoning, or under heavy padding), segment-level or hybrid methods such as SkipKV, which combine attention, self-similarity, and semantic redundancy, yield much higher robustness (Tian et al., 8 Dec 2025).
Randomized and hybrid retention mechanisms (e.g., as in NACL) introduce probabilistic coverage over the context and prevent catastrophic loss of critical but low-scoring tokens (Chen et al., 2024).
4. Empirical Performance and Throughput Trade-offs
State-of-the-art surrogate-scoring approaches demonstrate several key empirical results:
- At extreme cache budgets (e.g., 5–20% of the full cache), SmallKV and AhaKV maintain >90% of baseline accuracy on long-context QA, reasoning, and summarization, with throughput improvements of 1.75–2.56× over classical greedy attention-score baselines (Zhao et al., 3 Aug 2025, Gu et al., 4 Jun 2025).
- SAGE-KV achieves near full-attention accuracy (e.g., 52.49% vs. 53.01% on Llama3.1-8B, LongBench) at 4× improved memory efficiency, outperforming StreamingLLM and Quest (Wang et al., 11 Mar 2025).
- Lookahead Q-Cache and LAQ++ boost accuracy by ~3–4.6% at tight budgets, and reach 99.2% recall on Needle-in-a-Haystack retrieval, well above prior baselines (Wang et al., 24 May 2025).
- KeyDiff’s attention-free, O(N) scoring achieves a ≤ 0.04% gap in downstream performance (LongBench, 8K tokens, 23% memory reduction) (Park et al., 21 Apr 2025).
- CAOTE, when wrapping H₂O or TOVA, improves average LongBench accuracy by 10–15 percentage points under strong cache compression (e.g., from ∼26%→∼40% at 4k on Llama 8B) (Goel et al., 18 Apr 2025).
- KVzap attains 2–4× compression with <1% compute/memory overhead and matches or slightly exceeds the two-pass KVzip oracle, with R²≈0.7 for log-score prediction per head (Jegou et al., 12 Jan 2026).
- HashEvict compresses cache by 30–70% with <1–2% performance degradation while increasing inference speed by up to 17× during prefill (Liu et al., 2024).
The following table summarizes key efficiency-accuracy trade-offs for representative methods under tight cache budgets:
| Method | Peak Throughput (tok/s) | Memory Reduction | Acc. Gap to Full | Notes |
|---|---|---|---|---|
| SmallKV | 195–1203 | 80–90% | ≤ 6% (at 5–20% budget) | >2× baseline; small-LM surrogate required |
| SAGE-KV | ~2× baseline | 4× | 0.5–1% | One-pass, head-wise top-k |
| KeyDiff | Up to 30% latency ↓ | 23% | ≤ 0.04% | Attention free |
| AhaKV | N/A | 90%+ | ≤ 0.31% (top-4) | Adaptive scaling, value norm |
| HashEvict | 1.5–17× (prefill) | 30–70% | <2% | Pre-attention, GPU-kernel |
| CAOTE+H₂O | N/A | N/A | +10–15 pp over H₂O alone | Minimizes output error |
5. Theoretical Guarantees and Optimization Properties
Theoretical analyses have specified error bounds, bias properties, and surrogate validity:
- AhaKV proves the monotonic positional decay of accumulated attention scores and the unbiasedness of its corrected variant, and provides explicit temperature scheduling to maintain a fixed “focus” (effective support size) under entropy scaling (Gu et al., 4 Jun 2025).
- KeyDiff shows that geometric distinctiveness yields an O(N) proxy for maximally diverse key sets, with formal relation to maximal attention support (Park et al., 21 Apr 2025).
- CAOTE’s error-minimizing score tightly bounds perturbation in the downstream hidden state (Appendix proof), justifying its empirical improvements (Goel et al., 18 Apr 2025).
- KeepKV provides formal bounds on multi-step output perturbation under merging with EMA-predicted surrogate weights and merging votes, with zero first-step error at merge time (Tian et al., 14 Apr 2025).
- HashEvict’s SimHash construction yields unbiased angle estimation, with concentration improving at larger projection length m (Liu et al., 2024); see the estimator identity after this list.
- Multi-objective frameworks (EVICPRESS) optimize utility functions that balance quality and delay; solutions provably align with observed latency-accuracy Pareto frontiers (Feng et al., 16 Dec 2025).
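For the SimHash claim above, the relevant identity is the standard LSH guarantee (a well-known fact, restated here for intuition rather than quoted from the paper): for a random Gaussian direction $r$ and angle $\theta(q,k)$ between query and key,

$$\Pr\big[\operatorname{sign}(r^{\top}q) \neq \operatorname{sign}(r^{\top}k)\big] = \frac{\theta(q,k)}{\pi}, \qquad \mathbb{E}\!\left[\frac{d_H(h(q),h(k))}{m}\right] = \frac{\theta(q,k)}{\pi}, \qquad \operatorname{Var}\!\left[\frac{d_H}{m}\right] = \frac{1}{m}\,\frac{\theta}{\pi}\Big(1-\frac{\theta}{\pi}\Big),$$

so the normalized Hamming distance over $m$ independent projections is an unbiased estimator of the angle, with binomial variance shrinking as $1/m$, matching the stated concentration behavior.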
6. Practical Considerations and Implementation Guidance
Best practices in deploying surrogate scoring include:
- Percentile-based, windowed, or entropy-adaptive thresholding for robust token selection (AhaKV, SmallKV) (Zhao et al., 3 Aug 2025, Gu et al., 4 Jun 2025).
- Modular scoring wrappers: CAOTE can be used as a post-hoc loss correction on top of any nonnegative scoring scheme (Goel et al., 18 Apr 2025); SmallKV is plug-and-play for any FlashAttention-style implementation (Zhao et al., 3 Aug 2025).
- Surrogate model approaches (Judge Q, KVzap) require minimal fine-tuning (new token embeddings or light MLPs) and add negligible inference overhead (Jegou et al., 12 Jan 2026, Liu et al., 13 Sep 2025).
- Sliding-window or “most recent” pinning is critical to prevent degeneration at extreme compression (Jegou et al., 12 Jan 2026); the sketch after this list combines pinning with score-based retention.
- Multi-batch, multi-user, and cross-task settings benefit from marginal or semantic-aware scoring (SkipKV, SmallKV, NACL) (Tian et al., 8 Dec 2025, Zhao et al., 3 Aug 2025, Chen et al., 2024).
- Safeguarding coordinate-defining or initial tokens is necessary in structural tasks (Evict3R) (Mahdi et al., 22 Sep 2025).
- Randomized, proxy-merged, or hybrid schemes improve coverage and robustness in adversarial or variable-context workloads (Chen et al., 2024).
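Several of these practices compose naturally; the sketch below (parameter names ours, assuming `budget >= window`) pins a recent window unconditionally and spends the remaining budget on the highest-scoring older tokens.

```python
# Combined sketch: unconditional recent-window pinning plus score-based
# retention of older tokens (illustrative; real systems work per layer/head).
import numpy as np

def retain_with_pinned_window(scores: np.ndarray, budget: int,
                              window: int = 64) -> np.ndarray:
    """Assumes budget >= window. Returns sorted indices of retained tokens."""
    n = len(scores)
    pinned = np.arange(max(0, n - window), n)       # always keep most recent
    older = np.arange(0, max(0, n - window))
    k = max(0, budget - len(pinned))                # leftover budget
    kept_older = older[np.argsort(scores[older])[-k:]] if k else older[:0]
    return np.sort(np.concatenate([kept_older, pinned]))
```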
7. Open Problems and Directions
Despite rapid progress, surrogate scoring for KV eviction presents several open challenges:
- Dynamic adaptation to task-specific or unpredictable patterns in multi-turn or streaming contexts.
- Learning surrogates end-to-end without full-model retraining; model-agnostic and hardware-friendly surrogates (e.g., KeyDiff, HashEvict).
- Optimal layering or composition of different metrics (e.g., blending value/error, geometric, global and local attention).
- Cross-device, multi-tier storage, and joint eviction+compression optimization (as in EVICPRESS) (Feng et al., 16 Dec 2025).
- Efficient handling of segment-level or sentence-level coherence under aggressive compression (SkipKV) (Tian et al., 8 Dec 2025).
- Analytical characterization of surrogate bias and loss under shifted attention regimes and saliency transitions (SmallKV, G-KV) (Zhao et al., 3 Aug 2025, Liao et al., 29 Nov 2025).
Surrogate scoring underpins essentially all high-performance KV-eviction systems for modern LLM inference. Ongoing work at the intersection of attention theory, information geometry, and scalable engineering continues to shape the field’s trajectory.