Full-Context Cache Selection
- Full-context cache selection is a set of algorithms and architectures that select, compress, and reuse key-value caches to manage long sequences in Transformer models.
- The methodology spans window-based, token-level, and entropy-adaptive strategies that balance semantic coherence, task adaptivity, and computational efficiency.
- Empirical evaluations show significant memory savings (up to 85%) and speed improvements, enabling robust LLM and MLLM deployments with minimal quality trade-offs.
Full-context cache selection refers to the class of algorithms and system architectures that explicitly select, compress, and/or reuse subsets of the key–value (KV) cache in Transformer-based models to enable efficient, accurate, and scalable inference over long contexts within practical memory budgets. As context lengths and deployment-scale requirements have grown, full-context cache selection has become foundational for both LLM and MLLM deployments, governing inference speed, memory footprint, and attainable sequence lengths.
1. Motivation and Theoretical Underpinnings
Transformer inference with long sequences requires O(L·N·d) memory for the key and value caches, where L is the number of layers, N the number of cached tokens, and d the per-layer KV width (number of KV heads × head dimension). For industrial-scale LLMs with N ≫ 8K, this cache often exceeds available GPU memory. Naive cache eviction (FIFO, static windows) irreversibly drops context and degrades model quality, especially for tasks requiring retrieval or long-range dependency tracking.
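As a rough illustration (a back-of-the-envelope sketch; the model shape below is a hypothetical example, not tied to any cited system), the cache footprint scales as layers × tokens × KV heads × head dimension:

```python
def kv_cache_bytes(n_layers: int, n_tokens: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Approximate KV-cache size: one key and one value vector per layer and token."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return batch * n_layers * n_tokens * per_token_per_layer

# Hypothetical 32-layer model, 8 KV heads of dim 128, fp16, 128K-token context:
print(kv_cache_bytes(32, 128_000, 8, 128) / 2**30)  # ≈ 15.6 GiB per sequence
```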
Full-context cache selection algorithms are motivated by four theoretical desiderata:
- Semantic Coherence: Retain context that preserves contiguous, meaningful spans to avoid semantic fragmentation (Zuo et al., 23 Mar 2025).
- Task Adaptivity: Adapt selection to downstream requirements (QA, summarization, retrieval) (Zuo et al., 23 Mar 2025).
- Attention Density Awareness: Allocate cache as a function of per-layer or per-modality attention entropy (Wan et al., 24 Feb 2025, Cai et al., 4 Jun 2024).
- Output-Aware Structured Pruning: Evict KV-pairs by quantifying impact on attention outputs (not just accumulated attention) (Gu et al., 9 Oct 2025).
The evolution from uniform or progressive cache reduction (e.g., PyramidKV (Cai et al., 4 Jun 2024)) toward entropy-guided, task-adaptive, and output-aware methods tracks the increasing sophistication and specificity in cache selection.
2. Methodological Taxonomy
There are several broad categories of full-context cache selection algorithms, each grounded in distinct selection and compression mechanisms.
2.1 Window and Segment-Based Approaches
WindowKV partitions context into a recent “observation window” and a bucketized “review context” of sliding or fixed-size windows. Selection is governed by a task-type classifier (localization or aggregation), and selection occurs over contiguous blocks, preserving semantic order. Further, intra-group layer sharing reduces redundant compute by processing only the first layer in each group and sharing indices across layers (Zuo et al., 23 Mar 2025).
CacheFormer augments long-short attention with segment-level dynamic retrieval: compressed global attention identifies high-attention segments, which are then fetched/expanded in uncompressed form for subsequent attention. Cache and overlap attention are merged to mitigate fragmentation, allowing for high coverage at subquadratic cost (Singh et al., 18 Apr 2025).
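A minimal sketch of the segment-retrieval idea (not the published CacheFormer implementation; segment size, mean-pooling as the compression, and the top-s count are illustrative assumptions): segments are scored through a compressed representation, and only the highest-attention segments are fetched at full resolution.

```python
import numpy as np

def select_segments(q, K, seg_len=64, top_s=4):
    """Score fixed-length segments via mean-pooled (compressed) keys,
    then return token indices of the top-s segments for uncompressed attention."""
    n, d = K.shape
    n_seg = n // seg_len
    seg_keys = K[:n_seg * seg_len].reshape(n_seg, seg_len, d).mean(axis=1)  # compressed segment keys
    scores = seg_keys @ q / np.sqrt(d)                                      # segment-level attention logits
    best = np.argsort(scores)[-top_s:]                                      # highest-attention segments
    idx = np.concatenate([np.arange(s * seg_len, (s + 1) * seg_len) for s in np.sort(best)])
    return idx  # fetch K[idx], V[idx] in uncompressed form for the next attention pass

# q: current query vector; K: full-context keys (toy shapes)
rng = np.random.default_rng(0)
idx = select_segments(rng.standard_normal(64), rng.standard_normal((4096, 64)))
```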
2.2 Token-Level and Head-Wise Selection
TokenSelect constructs sparse, per-head, per-token importance scores using normalized Q·K logits and a head soft-voting mechanism. Selection is non-contiguous and dynamically adapts to query similarity through a "selection cache," which avoids recomputation when the query vector is similar to previous steps (Wu et al., 5 Nov 2024).
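The per-head scoring and soft-voting step can be sketched as follows (a simplified reading of the mechanism; the normalization details, the omitted selection cache, and the value of k are assumptions):

```python
import numpy as np

def token_select(q_heads, K_heads, k=2048):
    """q_heads: (H, d) current query per head; K_heads: (H, N, d) cached keys.
    Each head produces normalized per-token scores; heads then soft-vote on a shared top-k set."""
    H, N, d = K_heads.shape
    logits = np.einsum('hd,hnd->hn', q_heads, K_heads) / np.sqrt(d)   # per-head Q·K logits
    per_head = np.exp(logits - logits.max(axis=1, keepdims=True))
    per_head /= per_head.sum(axis=1, keepdims=True)                   # normalize within each head
    votes = per_head.mean(axis=0)                                     # soft vote across heads
    return np.sort(np.argsort(votes)[-min(k, N):])                    # non-contiguous selected positions

rng = np.random.default_rng(0)
selected = token_select(rng.standard_normal((8, 64)), rng.standard_normal((8, 8192, 64)), k=1024)
```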
ZSMerge (ZeroMerge) employs multi-dimensional token importance metrics at head granularity, assigns fine-grained per-head budgets, and utilizes a compensated residual merging mechanism for tokens exceeding the budget. Attention is renormalized for merged slots, ensuring information preservation without retraining (Liu et al., 13 Mar 2025).
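One plausible rendering of residual merging with renormalization (a sketch under assumptions; the actual ZSMerge importance metrics and budget rules differ in detail): tokens exceeding a head's budget are folded into a single residual slot whose attention contribution is reweighted by the number of merged tokens.

```python
import numpy as np

def merge_overflow(K, V, keep_idx):
    """Keep the budgeted tokens; merge the rest into one residual KV slot.
    Returns merged caches plus per-slot multiplicities used to renormalize attention."""
    N = K.shape[0]
    drop = np.setdiff1d(np.arange(N), keep_idx)
    K_res, V_res = K[drop].mean(axis=0), V[drop].mean(axis=0)       # compensated residual slot
    K_new = np.vstack([K[keep_idx], K_res[None]])
    V_new = np.vstack([V[keep_idx], V_res[None]])
    counts = np.concatenate([np.ones(len(keep_idx)), [len(drop)]])  # multiplicity per slot
    return K_new, V_new, counts

def attend(q, K, V, counts):
    """Attention where each slot's weight is scaled by its multiplicity before renormalization."""
    w = np.exp(q @ K.T / np.sqrt(K.shape[1])) * counts
    w /= w.sum()
    return w @ V
```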
SAGE-KV performs a one-shot, self-attention-guided, top-k eviction, directly leveraging the model’s own last-token query to determine which tokens can be dropped per head or group, reducing the cache in a single, data-driven pass (Wang et al., 11 Mar 2025).
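A condensed sketch of one-shot, attention-guided top-k eviction (illustrative; per-head granularity and the budget k are assumptions): the last prefill token's query scores the whole cache once, and only the top-k entries per head survive.

```python
import numpy as np

def one_shot_topk(q_last, K, V, k):
    """q_last: (H, d) query of the last prefill token; K, V: (H, N, d) cached keys/values.
    Returns per-head compacted caches after a single scoring pass."""
    H, N, d = K.shape
    scores = np.einsum('hd,hnd->hn', q_last, K) / np.sqrt(d)    # one scoring pass, no iteration
    keep = np.sort(np.argsort(scores, axis=1)[:, -k:], axis=1)  # per-head top-k, original order
    K_kept = np.take_along_axis(K, keep[..., None], axis=1)
    V_kept = np.take_along_axis(V, keep[..., None], axis=1)
    return K_kept, V_kept
```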
2.3 Layer- and Entropy-Adaptive Budgeting
PyramidKV and MEDA allocate cache budgets at a per-layer or per-modality level, informed by measured attention entropy: early layers with high entropy/diffuse attention receive larger caches, while later layers with more focused/sparse attention are allocated fewer slots (Cai et al., 4 Jun 2024, Wan et al., 24 Feb 2025). The cache allocation follows a pyramidal/arithmetic progression or softmax of entropy values.
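The entropy-to-budget mapping can be sketched as follows (a simplified version of the allocation rule described above; the softmax form and the minimum-budget floor are assumptions):

```python
import numpy as np

def layer_budgets(attn_entropy, total_budget, min_per_layer=16):
    """attn_entropy: per-layer attention entropies (higher = more diffuse attention).
    Distributes the total KV budget across layers by a softmax over entropies."""
    w = np.exp(attn_entropy - attn_entropy.max())
    w /= w.sum()
    # Rounding and the floor may slightly over/undershoot the total; treat as a sketch.
    budgets = np.maximum(min_per_layer, np.round(w * total_budget)).astype(int)
    return budgets  # early, high-entropy layers receive proportionally larger caches

print(layer_budgets(np.array([4.1, 3.6, 2.9, 1.8]), total_budget=4096))
```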
2.4 Output-Aware Pruning
OBCache formulates cache eviction as a structured pruning problem at the layer level, deriving saliency metrics for isolated keys, values, and joint key–value pairs via Optimal Brain Damage–style Taylor expansions, directly measuring the perturbation on attention outputs rather than relying on attention mass heuristics (Gu et al., 9 Oct 2025).
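The following is not the closed-form OBD-style derivation used by OBCache, but a brute-force proxy for the same quantity: the saliency of a cached token is the norm of the change in the attention output when that token's KV pair is evicted and the remaining weights are renormalized.

```python
import numpy as np

def output_perturbation_saliency(q, K, V):
    """Exact leave-one-out output change per cached token (reference computation;
    practical methods replace this with first-order / OBD-style approximations)."""
    d = K.shape[1]
    a = np.exp(q @ K.T / np.sqrt(d))
    a /= a.sum()                       # attention weights over the cache
    out = a @ V                        # current attention output
    # Evicting token i rescales the surviving weights by 1/(1 - a_i):
    delta = (a / (1.0 - a))[:, None] * (out[None, :] - V)
    return np.linalg.norm(delta, axis=1)   # larger = more impact on the output if evicted

rng = np.random.default_rng(0)
sal = output_perturbation_saliency(rng.standard_normal(64),
                                   rng.standard_normal((512, 64)),
                                   rng.standard_normal((512, 64)))
```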
2.5 Architectural and System-Level Solutions
ShadowKV compresses the key cache using low-rank SVD (pre-RoPE), stores only chunk-level "landmarks" and outlier caches on-GPU, while offloading the value cache to system memory. On-the-fly selection reconstructs only the minimal required sparse KV pairs per decoding step by scoring chunk landmarks and asynchronously fetching values (Sun et al., 28 Oct 2024).
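A sketch of the two ingredients (illustrative shapes, rank, and pooling; not the ShadowKV codebase): the pre-RoPE key cache is replaced by a rank-r factorization, and chunk-level landmark keys are scored to decide which chunks' values to fetch.

```python
import numpy as np

def lowrank_keys(K_pre_rope, r=64):
    """Compress pre-RoPE keys with a truncated SVD; reconstruct on demand."""
    U, S, Vt = np.linalg.svd(K_pre_rope, full_matrices=False)
    A = U[:, :r] * S[:r]            # (N, r) factor kept on GPU
    B = Vt[:r]                      # (r, d) factor kept on GPU
    return A, B                     # K ≈ A @ B

def select_chunks(q, K, chunk=8, top_c=256):
    """Mean-pooled chunk landmarks; return indices of the chunks whose values to fetch."""
    n, d = K.shape
    landmarks = K[: (n // chunk) * chunk].reshape(-1, chunk, d).mean(axis=1)
    scores = landmarks @ q / np.sqrt(d)
    return np.sort(np.argsort(scores)[-top_c:])  # fetch these chunks' values from host memory

rng = np.random.default_rng(0)
A, B = lowrank_keys(rng.standard_normal((4096, 128)))
chunks = select_chunks(rng.standard_normal(128), A @ B, top_c=32)
```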
MPIC addresses multimodal caching by partitioning context into position-independent, cacheable image blocks and non-cacheable text, orchestrating parallel recompute and I/O for "linked" KV assembly at inference, thus balancing response time and quality in varied retrieval/generation tasks (Zhao et al., 4 Feb 2025).
XC-Cache replaces prompt-based ICL with encoder-decoder-style cross-attention, caching only final encoder representations, thereby reducing cache size by over two orders of magnitude (Monteiro et al., 23 Apr 2024).
TLinFormer achieves exact full-context awareness via constrained cross- and self-attention layers that enforce linear-time inference and a static cache structure by design (Tang, 28 Aug 2025).
3. Core Algorithms and Mathematical Formalism
Full-context cache selection mechanisms hinge on mathematical formulations for importance assessment, window forming, and optimal budget distribution.
3.1 Window/Segment Scoring
Let Q ∈ ℝ^(α×d_k) be the observation-window queries and K ∈ ℝ^(N×d_k) the full-context keys. Attention scores:
A = softmax(Q Kᵀ / √d_k) ∈ ℝ^(α×N)
Per-token importance over the review context:
s_j = Σ_{i=1}^{α} A_{i,j}
Window/segment k aggregation (contiguous window W_k):
S_k = Σ_{j ∈ W_k} s_j, with the highest-scoring windows retained as whole blocks.
3.2 Entropy-Based Allocation (MEDA, PyramidKV)
Cross-modal attention entropy at layer l, with p_{l,j} the normalized attention mass placed on cached token j:
H_l = − Σ_j p_{l,j} log p_{l,j}
Layerwise budget (MEDA), distributing a total budget B by a softmax over layer entropies:
B_l = B · exp(H_l) / Σ_{l′} exp(H_{l′})
PyramidKV arithmetic progression, shrinking the budget linearly with depth (layers l = 0, …, L−1):
B_l = B_max − l · (B_max − B_min) / (L − 1)
3.3 Output-Aware Saliency (OBCache)
OBCache scores cached entries by the perturbation they induce on the attention output o_q = Σ_j A_{q,j} v_j. Evicting token i (and renormalizing the surviving attention weights) changes the output for query q by
Δo_q(i) = (A_{q,i} / (1 − A_{q,i})) · (o_q − v_i).
Value-pruning saliency (layer ℓ, token i) captures the ‖v_i‖-weighted attention mass that token i contributes to the output; key-pruning saliency captures the renormalization of the softmax over the surviving keys. OBCache derives tractable first-order estimates of these quantities via Optimal Brain Damage–style Taylor expansions, and joint key–value pruning additionally includes cross-terms for the full output impact (Gu et al., 9 Oct 2025).
3.4 Selection Pseudocode Skeleton
WindowKV prefill loop (core KV cache logic):
```
for i in 0..n−1:
    for layer in 0..m−1:
        K[layer][i], V[layer][i] ← compute_KV(layer, input_tokens[i])
        ...
for g in 0..H−1:
    ...
    A ← softmax( Q[l0][osc_tokens] · K[l0][review_tokens]^T / √d_k )
    ...
    I_g ← union of token-positions in keep_windows ∪ osc_tokens
    ...
```
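For concreteness, a runnable single-group version of the same logic (a sketch under assumptions: one representative layer scores the review context, windows are contiguous and fixed-size, and the kept indices would be shared by all layers in the group):

```python
import numpy as np

def windowkv_select(Q_obs, K_full, obs_len=32, win=64, keep_windows=4):
    """Q_obs: (obs_len, d) observation-window queries; K_full: (N, d) full-context keys.
    Scores contiguous review windows and returns the token indices to keep (review + observation)."""
    N, d = K_full.shape
    review = K_full[: N - obs_len]
    A = np.exp(Q_obs @ review.T / np.sqrt(d))
    A /= A.sum(axis=1, keepdims=True)                                      # softmax over review tokens
    tok_score = A.sum(axis=0)                                              # per-token importance s_j
    n_win = len(tok_score) // win
    win_score = tok_score[: n_win * win].reshape(n_win, win).sum(axis=1)   # window score S_k
    best = np.sort(np.argsort(win_score)[-keep_windows:])
    kept = np.concatenate([np.arange(k * win, (k + 1) * win) for k in best])
    return np.concatenate([kept, np.arange(N - obs_len, N)])               # review windows ∪ observation window

rng = np.random.default_rng(0)
idx = windowkv_select(rng.standard_normal((32, 64)), rng.standard_normal((2048, 64)))
```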
4. Empirical Evaluation and Performance
Full-context cache selection algorithms consistently demonstrate dramatic memory savings (often 8–20×) and large throughput/latency improvements with negligible or minor loss in downstream metrics across diverse benchmarks.
- WindowKV retains ≈12% of the original KV cache with only a ≈1–2 point drop in F1/ROUGE-L/accuracy vs. FullKV, matching or outperforming SLM, H2O, and PyramidKV on LongBench and Needle-in-a-Haystack (Zuo et al., 23 Mar 2025).
- MEDA achieves up to 85% KV reduction and 2.8× speedup on multimodal tasks with ≤0.5-point drops in task scores (Wan et al., 24 Feb 2025).
- TokenSelect shows up to 23.84× speedup in attention computation, 2.28× end-to-end acceleration, and higher accuracy than InfLLM on million-token inputs using only a dynamic sparse token subset (e.g., k=2K+local512) (Wu et al., 5 Nov 2024).
- ShadowKV matches full-attention performance at a sparse budget κ≈1.56% (e.g., k=256 chunks per head with c=8 tokens per chunk, i.e., ≈2K tokens of a 128K context) and delivers up to 3.04× throughput boost and 6× larger batch size versus dense GPU KV caching (Sun et al., 28 Oct 2024).
- ZSMerge maintains XSum summarization ROUGE (30.60 vs. 30.59 for FullKV at a 5% cache budget) and attains 20× memory reduction with a 3× throughput gain at 54K-token context length (Liu et al., 13 Mar 2025).
- SAGE-KV achieves 4× higher memory efficiency and better accuracy than static selection/StreamingLLM, and a 2× improvement over Quest, with a single pass per layer per sample (Wang et al., 11 Mar 2025).
- OBCache improves both retrieval and perplexity by 5–13 percentage points over attention-weight heuristics and is composable with the H2O, SnapKV, and TOVA eviction strategies (Gu et al., 9 Oct 2025).
Empirical ablations highlight:
- The superiority of dynamic, output-aware selection over static windowing under extreme memory compression.
- Graceful trade-off curves in speed, memory, and model quality, allowing practitioners to sweep budget ratios for application-specific constraints.
5. Task Adaptation and Semantic Coherence
Optimal full-context cache selection must preserve semantic continuity for tasks that rely on temporally or locally contiguous information (e.g., factual QA, code, summarization) while also being capable of supporting non-local, sparse dependencies for information retrieval.
WindowKV uses a binary classifier to select information-localization (cache full window for local tasks) versus aggregation (cache only high-importance tokens within a window). In practical settings, per-task adaptation achieves better retention of essential context than naive scoring alone, and intra-group sharing of indices across layers yields computational savings with minimal accuracy penalty (Zuo et al., 23 Mar 2025).
Multimodal cache selection (MEDA, MPIC) further extends these ideas to cross-modal attention entropy and task-aware recompute/reuse splits, enabling efficient and robust mixed-modality context retention (Wan et al., 24 Feb 2025, Zhao et al., 4 Feb 2025).
6. System-Level and Engineering Considerations
State-of-the-art production inference engines for long-context LLMs and MLLMs leverage multi-level cache partitioning (GPU, CPU, SSD, network storage), pipelined/asynchronous I/O for value/text parts, and on-demand sparse reconstruction (e.g., ShadowKV’s SVD-key and outlier landmark mechanism). Compression is often combined with chunk- or segment-level localities to minimize random-access latency and further boost throughput.
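A minimal sketch of the overlap pattern (PyTorch-based; the tensor names, shapes, and staging buffer are placeholders, and this illustrates the general engineering idea rather than any specific engine): selected value chunks are gathered into pinned host memory and copied on a side stream while the GPU computes on the previous step's data.

```python
import torch

# Hypothetical setup: value cache offloaded to pinned CPU memory, keys/landmarks on GPU.
v_cpu = torch.randn(4096, 8, 128, pin_memory=True)      # (tokens, kv_heads, head_dim)
staging = torch.empty(2048, 8, 128, pin_memory=True)     # reusable pinned staging buffer
copy_stream = torch.cuda.Stream()

def prefetch_values(token_idx: torch.Tensor) -> torch.Tensor:
    """Gather selected values into pinned staging memory, then async-copy to GPU on a side stream."""
    n = token_idx.numel()
    torch.index_select(v_cpu, 0, token_idx, out=staging[:n])    # CPU gather of the sparse selection
    with torch.cuda.stream(copy_stream):
        v_gpu = staging[:n].to("cuda", non_blocking=True)       # overlaps with default-stream compute
    return v_gpu

# Decode step (sketch): score landmarks on GPU, pick indices, then:
#   v = prefetch_values(selected_idx)
#   torch.cuda.current_stream().wait_stream(copy_stream)   # sync before using v in attention
```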
Key engineering practices include:
- Asynchronous DMA pinning and batch launches (ShadowKV, MPIC).
- Circular buffer implementation and per-chunk/segment prefetch (ShadowKV, CacheFormer).
- Zero-shot (“plug-and-play”) compatibility for drop-in cache selection without retraining (ZSMerge, TokenSelect, SAGE-KV, OBCache).
- Quantitative tuning of hyperparameters (window size, cascade depth, budget %, selection/momentum decay) for peak empirical performance under resource constraints.
7. Practical Guidelines, Limitations, and Future Directions
Recommended practices for full-context cache selection include:
- Layer-, head-, and task-adaptive budget allocation, leveraging attention entropy or token-impact metrics.
- Ensuring semantic coherence by favoring contiguous spans/windows in tasks sensitive to contiguity.
- Employing output-aware saliency metrics for pruning under extreme memory budget or performance constraints.
- Combining block/segment retrieval with fine-grained token selection for maximal quality at minimal memory (CacheFormer, TokenSelect).
- Integrating engineering optimizations for parallel I/O, batch launches, and zero-copy memory to exploit modern hardware.
Limitations persist in tasks with highly dynamic or nonstationary context importance, and certain approaches (e.g., one-shot eviction) may degrade if critical information re-emerges in future output queries (Wang et al., 11 Mar 2025). Future work may combine output-aware metrics with lifelong online updating, integrate differentiable cache selectors, or more deeply unify architectural (TLinFormer) and runtime (ShadowKV, MPIC) layers for tighter memory/quality trade-off envelopes.
Full-context cache selection remains an open area of algorithmic and system innovation, central to the scalability and practicality of long-context models across text, code, and multimodal reasoning domains.