
Full-Context Cache Selection

Updated 16 December 2025
  • Full-context cache selection is a set of algorithms and architectures that select, compress, and reuse key-value caches to manage long sequences in Transformer models.
  • The methodology spans window-based, token-level, and entropy-adaptive strategies that balance semantic coherence, task adaptivity, and computational efficiency.
  • Empirical evaluations show significant memory savings (up to 85%) and speed improvements, enabling robust LLM and MLLM deployments with minimal quality trade-offs.

Full-context cache selection refers to the class of algorithms and system architectures that explicitly select, compress, and/or reuse subsets of the key–value (KV) cache in Transformer-based models to enable efficient, accurate, and scalable inference across long contexts and large memory budgets. As context lengths and deployment-scale requirements have grown, full-context cache selection has become foundational for both LLM and MLLM deployments, governing inference speed, memory footprint, and attainable sequence lengths.

1. Motivation and Theoretical Underpinnings

Transformer inference with long sequences requires O(L·N·d) memory for key and value caches, where L is the number of layers, N the number of tokens, and d the per-layer key/value dimension. For industrial-scale LLMs with N ≫ 8K, this cache often exceeds available GPU memory. Naive cache eviction (FIFO, static windows) irreversibly drops context and degrades model quality, especially for tasks requiring retrieval or long-range dependency tracking.
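
For a rough sense of scale, the back-of-the-envelope sketch below computes the cache footprint for a hypothetical Llama-style configuration (32 layers, 32 KV heads of dimension 128, fp16 storage, 128K-token context); the numbers are illustrative, not taken from the cited papers.

def kv_cache_bytes(n_layers: int, n_tokens: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Keys and values: 2 tensors per layer, one slot per token per KV head."""
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class model at a 128K-token context (fp16, no GQA): ~62.5 GiB,
# well beyond the memory of a single commodity GPU.
print(kv_cache_bytes(n_layers=32, n_tokens=128_000, n_kv_heads=32, head_dim=128) / 2**30)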

Full-context cache selection algorithms are motivated by several desiderata: bounding cache memory growth, preserving the context needed for long-range dependencies and retrieval, maintaining the fidelity of attention outputs, and adapting the retained set to the task and query at hand.

The evolution from uniform or progressive cache reduction (e.g., PyramidKV (Cai et al., 4 Jun 2024)) toward entropy-guided, task-adaptive, and output-aware methods tracks the increasing sophistication and specificity of these selection criteria.

2. Methodological Taxonomy

There are several broad categories of full-context cache selection algorithms, each grounded in distinct selection and compression mechanisms.

2.1 Window and Segment-Based Approaches

WindowKV partitions context into a recent “observation window” and a bucketized “review context” of sliding or fixed-size windows. Selection is governed by a task-type classifier (localization or aggregation) and operates over contiguous blocks, preserving semantic order. Further, intra-group layer sharing reduces redundant compute by computing the selection only on the first layer of each group and sharing the resulting indices with the remaining layers (Zuo et al., 23 Mar 2025).

CacheFormer augments long-short attention with segment-level dynamic retrieval: compressed global attention identifies high-attention segments, which are then fetched/expanded in uncompressed form for subsequent attention. Cache and overlap attention are merged to mitigate fragmentation, allowing for high coverage at subquadratic cost (Singh et al., 18 Apr 2025).

2.2 Token-Level and Head-Wise Selection

TokenSelect constructs sparse, per-head, per-token importance scores using normalized Q·K logits and a head soft-voting mechanism. Selection is non-contiguous and dynamically adapts to query similarity through a "selection cache," which avoids recomputation when the query vector is similar to previous steps (Wu et al., 5 Nov 2024).
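
A minimal sketch of the head soft-voting idea, assuming per-head dot-product logits and a simple sum-of-softmax vote; the paper's normalization details and the selection cache are omitted.

import torch

def select_tokens(q, K, budget):
    """q: (H, d) current per-head query; K: (H, N, d) cached per-head keys.
    Returns the indices of the `budget` tokens kept after head soft-voting."""
    logits = torch.einsum("hd,hnd->hn", q, K)      # per-head Q·K scores
    scores = torch.softmax(logits, dim=-1)         # normalize within each head
    votes = scores.sum(dim=0)                      # soft vote: sum over heads
    return torch.topk(votes, k=min(budget, votes.numel())).indices

H, N, d = 8, 4096, 128
q, K = torch.randn(H, d), torch.randn(H, N, d)
keep = select_tokens(q, K, budget=512)             # non-contiguous token subset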

ZSMerge (ZeroMerge) employs multi-dimensional token importance metrics at head granularity, assigns fine-grained per-head budgets, and utilizes a compensated residual merging mechanism for tokens exceeding the budget. Attention is renormalized for merged slots, ensuring information preservation without retraining (Liu et al., 13 Mar 2025).
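
The following sketch illustrates the general shape of compensated residual merging for a single head: tokens beyond the budget are merged into one residual KV slot with importance-derived weights (a placeholder for ZSMerge's actual multi-dimensional metrics), and the merged count is kept so attention can be renormalized.

import torch

def merge_overflow(K, V, importance, budget):
    """K, V: (N, d) per-head cache; importance: (N,) token scores (budget < N assumed).
    Keeps the top-`budget` tokens and merges the rest into one compensated residual slot."""
    keep = torch.topk(importance, k=budget).indices
    drop = torch.ones(K.size(0), dtype=torch.bool)
    drop[keep] = False
    w = torch.softmax(importance[drop], dim=0).unsqueeze(-1)   # placeholder merge weights
    k_res = (w * K[drop]).sum(dim=0, keepdim=True)             # residual key slot
    v_res = (w * V[drop]).sum(dim=0, keepdim=True)             # residual value slot
    n_merged = int(drop.sum())                                 # count used to renormalize attention
    return torch.cat([K[keep], k_res]), torch.cat([V[keep], v_res]), n_merged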

SAGE-KV performs a one-shot, self-attention-guided, top-k eviction, directly leveraging the model’s own last-token query to determine which tokens can be dropped per head or group, reducing the cache in a single, data-driven pass (Wang et al., 11 Mar 2025).
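
A hedged sketch of one-shot, attention-guided eviction: the last prefill token's per-head query scores all cached keys, and only the top-k positions per head are retained (the paper's head grouping and exact scoring may differ).

import torch

def one_shot_evict(q_last, K, V, keep):
    """q_last: (H, d) last-token queries; K, V: (H, N, d) per-head cache.
    Retains the `keep` highest-scoring positions per head in a single data-driven pass."""
    attn = torch.einsum("hd,hnd->hn", q_last, K) / K.size(-1) ** 0.5
    idx = torch.topk(attn, k=keep, dim=-1).indices.sort(dim=-1).values  # keep positional order
    idx = idx.unsqueeze(-1).expand(-1, -1, K.size(-1))
    return K.gather(1, idx), V.gather(1, idx)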

2.3 Layer- and Entropy-Adaptive Budgeting

PyramidKV and MEDA allocate cache budgets at a per-layer or per-modality level, informed by measured attention entropy: early layers with high entropy/diffuse attention receive larger caches, while later layers with more focused/sparse attention are allocated fewer slots (Cai et al., 4 Jun 2024, Wan et al., 24 Feb 2025). The cache allocation follows a pyramidal/arithmetic progression or softmax of entropy values.

2.4 Output-Aware Pruning

OBCache formulates cache eviction as a structured pruning problem at the layer level, deriving saliency metrics for isolated keys, values, and joint key–value pairs via Optimal Brain Damage–style Taylor expansions, directly measuring the perturbation on attention outputs rather than relying on attention mass heuristics (Gu et al., 9 Oct 2025).

2.5 Architectural and System-Level Solutions

ShadowKV compresses the key cache using low-rank SVD (pre-RoPE), stores only chunk-level "landmarks" and outlier caches on-GPU, while offloading the value cache to system memory. On-the-fly selection reconstructs only the minimal required sparse KV pairs per decoding step by scoring chunk landmarks and asynchronously fetching values (Sun et al., 28 Oct 2024).
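
A simplified sketch of the two ingredients described above, assuming pre-RoPE keys: a truncated SVD of the key cache and chunk-mean "landmarks" used to score chunks at decode time. The rank, chunk size, and top-k values are illustrative placeholders; outlier handling and asynchronous value fetching are omitted.

import torch

def compress_keys(K_pre_rope, rank=160, chunk=8):
    """K_pre_rope: (N, d) pre-RoPE keys. Returns low-rank factors and chunk landmarks."""
    U, S, Vh = torch.linalg.svd(K_pre_rope, full_matrices=False)
    A = U[:, :rank] * S[:rank]                       # (N, rank) per-token factors kept on GPU
    B = Vh[:rank]                                    # (rank, d) shared basis
    n_full = (K_pre_rope.size(0) // chunk) * chunk
    landmarks = K_pre_rope[:n_full].reshape(-1, chunk, K_pre_rope.size(1)).mean(dim=1)
    return A, B, landmarks

def top_chunks(q, landmarks, k=256):
    """Score chunk landmarks against the current query; only these chunks' values are fetched."""
    scores = landmarks @ q                           # (num_chunks,)
    return torch.topk(scores, k=min(k, scores.numel())).indices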

MPIC addresses multimodal caching by partitioning context into position-independent, cacheable image blocks and non-cacheable text, orchestrating parallel recompute and I/O for "linked" KV assembly at inference, thus balancing response time and quality in varied retrieval/generation tasks (Zhao et al., 4 Feb 2025).

XC-Cache replaces prompt-based ICL with encoder-decoder-style cross-attention, caching only final encoder representations, thereby reducing cache size by over two orders of magnitude (Monteiro et al., 23 Apr 2024).

TLinFormer achieves exact full-context awareness via constrained cross- and self-attention layers, architecturally enforcing linear time and static cache structures by design (Tang, 28 Aug 2025).

3. Core Algorithms and Mathematical Formalism

Full-context cache selection mechanisms hinge on mathematical formulations for importance assessment, window forming, and optimal budget distribution.

3.1 Window/Segment Scoring

Let Q ∈ ℝ^(α×d_k) denote the observation-window queries and K ∈ ℝ^(n×d_k) the full-context keys. Attention scores:

A = \mathrm{softmax}(Q K^\top / \sqrt{d_k})

Per-token importance (review context):

t_j = \sum_{i=n-\alpha}^{n-1} A_{i,j}

Window/segment k aggregation:

s_k = \frac{1}{\min(p, \omega)} \cdot \mathrm{sum}\big(\operatorname{Top}\text{-}p(W_k)\big)
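
The sketch below mirrors these three formulas for a single head, taking the observation-window queries, the full-context keys, the review-context window boundaries, and the hyperparameter p as inputs.

import torch

def window_scores(Q_obs, K_full, windows, p):
    """Q_obs: (alpha, d_k) observation-window queries; K_full: (n, d_k) full-context keys.
    `windows`: list of (start, end) ranges over the review context; returns one s_k per window."""
    d_k = Q_obs.size(-1)
    A = torch.softmax(Q_obs @ K_full.T / d_k ** 0.5, dim=-1)   # attention scores
    t = A.sum(dim=0)                                           # per-token importance t_j
    scores = []
    for start, end in windows:
        w = t[start:end]
        top = torch.topk(w, k=min(p, w.numel())).values
        scores.append(top.sum() / min(p, w.numel()))           # window score s_k
    return torch.stack(scores)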

3.2 Entropy-Based Allocation (MEDA, PyramidKV)

Cross-modal entropy at layer l:

E_{CM}^l = E_{TV}^l + E_{VT}^l

Layerwise budget (MEDA):

S_l = S \cdot (L\rho) \cdot \frac{\exp(E_{CM}^l)}{\sum_{k=1}^{L} \exp(E_{CM}^k)}

PyramidKV arithmetic progression:

k^l = k^0 - \frac{k^0 - k^{m-1}}{m}\, l
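
Both allocation rules are compact enough to state directly; the sketch below treats the per-layer entropies, the total budget S, and the ratio ρ as given inputs and follows the formulas above.

import torch

def meda_budgets(entropy, S, rho):
    """entropy: (L,) per-layer cross-modal attention entropies; softmax-weighted budgets S_l."""
    L = entropy.numel()
    return S * (L * rho) * torch.softmax(entropy, dim=0)

def pyramid_budgets(k_first, k_last, m):
    """Arithmetic progression from k^0 at the first layer down toward k^{m-1} at the last."""
    return [k_first - (k_first - k_last) / m * l for l in range(m)]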

3.3 Output-Aware Saliency (OBCache)

Value-pruning saliency (layer ℓ, token i):

\Delta V_i^{(\ell)} = \sum_{j=w}^{s} \big[A_{j,i}^{(\ell)}\big]^2 \cdot \big\| v_i^{(\ell)} \big\|_2^2

Key-pruning saliency:

\Delta K_i^{(\ell)} = \sum_{j=w}^{s} \big[A_{j,i}^{(\ell)} Z_{j,i}^{(\ell)}\big]^2 \cdot \big\| v_i^{(\ell)} - o_j^{(\ell)} \big\|_2^2

Combined joint pruning includes cross-terms for full output impact.
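
A sketch of the value-pruning saliency ΔV for one layer and head, with A the attention weights over query positions w..s and V the cached values; the key and joint variants add the Z and output terms and are omitted here.

import torch

def value_saliency(A, V, w, s):
    """A: (n_q, n_kv) attention weights; V: (n_kv, d) cached values.
    Returns Delta V_i for each cached token i, summed over query rows j = w..s."""
    A_win = A[w:s + 1]
    return (A_win ** 2).sum(dim=0) * V.pow(2).sum(dim=-1)

# Lowest-saliency tokens are the eviction candidates, e.g.:
# evict = torch.topk(value_saliency(A, V, w, s), k=n_evict, largest=False).indices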

3.4 Selection Pseudocode Skeleton

WindowKV prefill loop (core KV cache logic):

for i in 0..n-1:                        # prefill: compute KV for every input token
    for layer in 0..m-1:
        K[layer][i], V[layer][i] ← compute_KV(layer, input_tokens[i])
...
for g in 0..H-1:                        # for each layer group g, select on its first layer l0
    ...
    A ← softmax( Q[l0][obs_tokens] · K[l0][review_tokens]^T / √d_k )
    ...
    I_g ← union of token positions in keep_windows ∪ obs_tokens   # indices shared across the group
    ...

Across methods, selection is similarly cast as top-k or max pooling, potentially chunked, merged, or subjected to per-layer and per-head budget heuristics.

4. Empirical Evaluation and Performance

Full-context cache selection algorithms consistently demonstrate dramatic memory savings (often 8–20×) and large throughput/latency improvements with negligible or minor loss in downstream metrics across diverse benchmarks.

  • WindowKV retains ≈12% of the original KV cache with only a ≈1–2 point drop in F1/ROUGE-L/accuracy vs. FullKV, matching or outperforming SLM, H2O, and PyramidKV on LongBench and Needle-in-a-Haystack (Zuo et al., 23 Mar 2025).
  • MEDA achieves up to 85% KV reduction and 2.8× speedup on multimodal tasks with ≤0.5 drop in task scores (Wan et al., 24 Feb 2025).
  • TokenSelect shows up to 23.84× speedup in attention computation, 2.28× end-to-end acceleration, and higher accuracy than InfLLM on million-token inputs using only a dynamic sparse token subset (e.g., k=2K selected tokens plus a 512-token local window) (Wu et al., 5 Nov 2024).
  • ShadowKV matches full-attention performance at a sparse budget κ≈1.56% (e.g., k=256 chunks per head with chunk size c=8) and delivers up to 3.04× throughput boost and 6× larger batch size versus dense GPU KV caching (Sun et al., 28 Oct 2024).
  • ZSMerge maintains XSum summarization ROUGE (30.60 vs 30.59 FullKV at 5% cache) and attains 20× memory reduction with triple throughput gain at 54k-token context length (Liu et al., 13 Mar 2025).
  • SAGE-KV achieves 4× higher memory efficiency and better accuracy than static selection (StreamingLLM), and a 2× improvement over Quest, with a single pass per layer per sample (Wang et al., 11 Mar 2025).
  • OBCache improves retrieval and perplexity metrics by 5–13 percentage points over attention-weight heuristics and is composable with H2O, SnapKV, and TOVA eviction policies (Gu et al., 9 Oct 2025).

Empirical ablations highlight:

  • The superiority of dynamic and output-aware selection over static windowing, under extreme memory compression.
  • Graceful trade-off curves in speed, memory, and model quality, allowing practitioners to sweep budget ratios for application-specific constraints (a schematic sweep is sketched below).
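
A schematic of such a budget sweep; apply_cache_selection and evaluate are hypothetical stand-ins, not a real API, and would be replaced by whichever selection method and benchmark harness are in use.

# Hypothetical hooks (placeholders only): swap in the actual method and benchmark.
def apply_cache_selection(model, budget_ratio):
    return model

def evaluate(model, data):
    return 0.0, 0.0   # (task score, peak KV memory in GiB)

def sweep_budgets(model, data, budgets=(0.02, 0.05, 0.10, 0.25, 0.50)):
    """Trace the speed/memory/quality trade-off curve across cache budget ratios."""
    curve = []
    for b in budgets:
        compressed = apply_cache_selection(model, budget_ratio=b)
        score, peak_mem = evaluate(compressed, data)
        curve.append({"budget": b, "score": score, "peak_kv_gib": peak_mem})
    return curve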

5. Task Adaptation and Semantic Coherence

Optimal full-context cache selection must preserve semantic continuity for tasks that rely on temporally or locally contiguous information (e.g., factual QA, code, summarization) while also being capable of supporting non-local, sparse dependencies for information retrieval.

WindowKV uses a binary classifier to choose between an information-localization mode (cache the full window for locally grounded tasks) and an aggregation mode (cache only high-importance tokens within each window). In practical settings, per-task adaptation achieves better retention of essential context than naive scoring alone, and intra-group sharing of indices across layers yields computational savings with minimal accuracy penalty (Zuo et al., 23 Mar 2025).

Multimodal cache selection (MEDA, MPIC) further extends these ideas to cross-modal attention entropy and task-aware recompute/reuse splits, enabling efficient and robust mixed-modality context retention (Wan et al., 24 Feb 2025, Zhao et al., 4 Feb 2025).

6. System-Level and Engineering Considerations

State-of-the-art production inference engines for large context LLMs and MLLMs leverage multi-level cache partitioning (GPU/CPU, SSD, network storage), pipelined/asynchronous I/O for value/text parts, and on-demand sparse reconstruction (e.g., ShadowKV’s SVD-key and outlier landmark mechanism). Compression is often combined with chunk- or segment-level localities to minimize random-access latency and further boost throughput.
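
As a minimal illustration of the offload-and-prefetch pattern (an assumed layout, not any specific system's API), the sketch below keeps the value cache in pinned host memory and copies only the selected rows to the GPU on a side stream so the transfer can overlap other GPU work.

import torch

# Assumed layout: values offloaded to pinned host memory; keys/landmarks stay on GPU.
N, d, max_fetch = 131_072, 128, 4_096
V_cpu = torch.randn(N, d, dtype=torch.float16).pin_memory()
staging = torch.empty(max_fetch, d, dtype=torch.float16).pin_memory()
copy_stream = torch.cuda.Stream()

def fetch_values(indices: torch.Tensor) -> torch.Tensor:
    """Gather the selected value rows into pinned memory, then copy them to the GPU
    asynchronously on a side stream. `indices` is a 1-D LongTensor on CPU."""
    n = indices.numel()
    torch.index_select(V_cpu, 0, indices, out=staging[:n])
    with torch.cuda.stream(copy_stream):
        v = staging[:n].to("cuda", non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)   # order attention after the copy
    v.record_stream(torch.cuda.current_stream())           # inform the caching allocator
    return v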

Key engineering practices include:

  • Asynchronous DMA pinning and batch launches (ShadowKV, MPIC).
  • Circular buffer implementation and per-chunk/segment prefetch (ShadowKV, CacheFormer).
  • Zero-shot (“plug-and-play”) compatibility for drop-in cache selection without retraining (ZSMerge, TokenSelect, SAGE-KV, OBCache).
  • Quantitative tuning of hyperparameters (window size, cascade depth, budget %, selection/momentum decay) for peak empirical performance under resource constraints.

7. Practical Guidelines, Limitations, and Future Directions

Recommended practices for full-context cache selection include:

  • Layer-, head-, and task-adaptive budget allocation, leveraging attention entropy or token-impact metrics.
  • Ensuring semantic coherence by favoring contiguous spans/windows in tasks sensitive to contiguity.
  • Employing output-aware saliency metrics for pruning under extreme memory budget or performance constraints.
  • Combining block/segment retrieval with fine-grained token selection for maximal quality at minimal memory (CacheFormer, TokenSelect).
  • Integrating engineering optimizations for parallel I/O, batch launches, and zero-copy memory to exploit modern hardware.

Limitations persist in tasks with highly dynamic or nonstationary context importance, and certain approaches (e.g., one-shot eviction) may degrade if critical information re-emerges in future output queries (Wang et al., 11 Mar 2025). Future work may combine output-aware metrics with lifelong online updating, integrate differentiable cache selectors, or more deeply unify architectural (TLinFormer) and runtime (ShadowKV, MPIC) layers for tighter memory/quality trade-off envelopes.

Full-context cache selection remains an open area of algorithmic and system innovation, central to the scalability and practicality of long-context models across text, code, and multimodal reasoning domains.
