KV Cache Reuse in Transformers

Updated 25 January 2026
  • KV Cache Reuse is a methodology that manages intermediate transformer key/value tensors, reducing redundant computations during inference.
  • It employs query-agnostic compression and selective recomputation to maintain high performance while significantly lowering memory footprint and latency.
  • System designs integrate GPU and DRAM/SSD hierarchies with dynamic scheduling to support multi-turn, multi-tenant, and multi-agent scenarios in scalable deployments.

Key-Value (KV) cache reuse encompasses methodologies and system designs that enable LLMs and vision-LLMs (VLMs) to avoid redundant computation by persisting, compressing, and selectively recomputing the intermediate key/value (KV) tensors generated during transformer inference. As LLMs and VLMs scale to longer contexts and serve diverse, multi-query, or multi-tenant scenarios, KV cache reuse has become essential to achieve efficient inference in deployment, with direct impact on memory, throughput, and latency.

1. Foundations of KV Cache Reuse

In Transformer decoders, each token processed results in per-layer key and value tensors that are stored as the KV cache. For a sequence of $n_c$ tokens, $L$ layers, and $H$ heads, this cache scales as $O(L H n_c d)$ memory and induces $O(n_c n_q)$ compute per attention call on each new token (with $d$ the per-head dimension and $n_q$ the number of query tokens in the call). For practical LLMs processing up to 170K tokens, the KV cache alone may exceed the parameter memory footprint (e.g., ~33 GB for Qwen2.5‑14B with 120K tokens in FP16) and dominate latency, particularly in memory-bound environments or large-batch inference (Kim et al., 29 May 2025).
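
As a rough sanity check on this scaling, the snippet below estimates KV-cache size directly from the $O(L H n_c d)$ formula; the configuration values are hypothetical placeholders rather than a specific model's published configuration, and real sizes depend on details such as grouped-query attention.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Keys + values (factor of 2) across layers, heads, and tokens: O(L * H * n_c * d)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Hypothetical 14B-class decoder with grouped-query attention and an FP16 cache:
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=120_000)
print(f"{size / 1e9:.1f} GB")  # tens of GB; grows linearly with context length
```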

Traditional caching enables skipping recomputation of KV pairs for repeated or overlapped prompt prefixes, reducing “prefill” compute and memory (Li et al., 18 Mar 2025). However, effective and robust KV cache reuse—including under partial overlaps, non-strict prefix matches, or multi-query settings—requires principled approaches to selection, compression, and cache management.

2. Query-Agnostic Compression and Reuse Algorithms

A fundamental challenge in KV cache reuse is retaining only the contextually important KV pairs that support robust downstream inference across arbitrary queries (“query-agnostic reuse”), while minimizing both cache size and performance loss. KVzip introduces a “context reconstruction” objective: the model is prompted to repeat the original context, then importance scores for all KV pairs are computed based on maximal cross-attention during forced reconstruction (Kim et al., 29 May 2025). Specifically, for each KV pair $i$, the attention matrix $A_{l,h}$ at layer $l$, head $h$ yields an importance score

$$S_{l,h}[i] = \max_{g, j} \bar{A}_{l,h}[g, j, i]$$

where higher values indicate greater contribution to reconstructing the context.

Retaining the top $r$-fraction of these pairs yields a compressed cache $KV'$, which can be reused for any downstream query $q$ with negligible task degradation, as empirically demonstrated on LLaMA3.1, Qwen2.5, and Gemma3 models with context lengths up to 170K tokens. At compression ratios of 3–4×, task scores (e.g., SQuAD, GSM8K, RepoQA) decrease by less than 1% even at 30% cache retention, and average performance remains ≥98% of baseline. Unlike query-aware eviction (e.g., SnapKV, PyramidKV), which can overfit to one query and degrade by >5% at moderate retention, KVzip’s reconstruction scoring delivers uniform, multi-query robustness (Kim et al., 29 May 2025).
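
A minimal PyTorch-style sketch of this scoring and top-$r$ retention is shown below; the tensor layout and function names are assumptions for illustration, not the KVzip implementation.

```python
import torch

def importance_scores(attn: torch.Tensor) -> torch.Tensor:
    """
    attn: attention weights A_bar for one (layer, head), assumed shape
          (groups, repeat_len, ctx_len), collected while the model is forced
          to reconstruct its own context.
    Returns one score per cached KV pair i: S[i] = max over (g, j) of A_bar[g, j, i].
    """
    return attn.amax(dim=(0, 1))  # shape: (ctx_len,)

def keep_top_fraction(scores: torch.Tensor, r: float) -> torch.Tensor:
    """Boolean mask retaining the top r-fraction of KV pairs by score."""
    k = max(1, int(r * scores.numel()))
    idx = torch.topk(scores, k).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[idx] = True
    return mask

# Usage sketch: per layer/head, prune cached keys/values with the mask.
# attn = ...  # (groups, repeat_len, ctx_len) from the reconstruction pass
# mask = keep_top_fraction(importance_scores(attn), r=0.3)
# k_pruned, v_pruned = k_cache[:, mask], v_cache[:, mask]
```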

3. System Designs and Economic Implications

KV cache reuse can be realized at multiple system granularities: from standalone model-serving nodes with cache co-located in GPU RAM, to scalable cloud deployments with DRAM/SSD hierarchy and cross-session cache sharing (Li et al., 18 Mar 2025, Feng et al., 28 Aug 2025).

A typical reuse pipeline includes the following steps (a minimal sketch follows the list):

  • Storing KV caches for commonly used prompt prefixes or session contexts
  • Indexing cache entries by semantic embedding or prompt hash for efficient lookup
  • Selectively recomputing only modified or newly inserted tokens via partial attention
  • Applying lossy or lossless compression and retention policies to fit device constraints
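
The sketch below illustrates the first three steps under the assumption of a block-granular, prompt-hash-keyed index; the class and helper names are hypothetical.

```python
import hashlib

class PrefixKVCache:
    """Toy prompt-hash index over stored KV blocks; the actual KV tensors,
    eviction, and compression policies are elided."""

    def __init__(self, block_tokens: int = 256):
        self.block_tokens = block_tokens
        self.store: dict[str, object] = {}  # block hash -> KV block handle

    def _block_hash(self, tokens: list[int], end: int) -> str:
        # A block's key hashes its entire prefix, so a hit implies a full prefix match.
        return hashlib.sha256(str(tokens[:end]).encode()).hexdigest()

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks are already cached."""
        matched = 0
        for end in range(self.block_tokens, len(tokens) + 1, self.block_tokens):
            if self._block_hash(tokens, end) in self.store:
                matched = end
            else:
                break
        return matched

    def insert(self, tokens: list[int], kv_blocks: list[object]) -> None:
        for i, blk in enumerate(kv_blocks):
            end = (i + 1) * self.block_tokens
            self.store[self._block_hash(tokens, end)] = blk

# Serving loop: reuse the matched prefix and run prefill only on the suffix, e.g.
# matched = cache.longest_cached_prefix(prompt_tokens)
# prefill(prompt_tokens[matched:], past_kv=load_blocks(prompt_tokens[:matched]))
```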

Cloud-optimized frameworks introduce cost models accounting for GPU compute, network transfer, and storage. The economic viability depends on reuse frequency $f$, prefix lengths $P$, and model size $S$. The break-even reuse fraction $f_{\min}$ can be characterized as

$$f_{\min} \approx \frac{\beta c_1 S P}{\alpha P a S - \gamma c_2 S P}$$

indicating that even modest context reuse rates (10–20%) suffice to yield net cost and latency reductions in practical deployments (Li et al., 18 Mar 2025).

Advanced scheduling policies, such as bi-directional compute/load pipelines (Cake), further optimize Time-to-First-Token (TTFT) by overlapping cache fetches from storage with on-GPU computation and adapting dynamically to bandwidth and compute fluctuations (Jin et al., 2024).
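
The sketch below illustrates the overlap idea in simplified form: an I/O thread streams cached KV chunks from the back of the context while the GPU recomputes chunks from the front in causal order, and each chunk is handled by whichever side reaches it first. The structure and names are illustrative assumptions, not Cake's implementation.

```python
import threading

def pipelined_prefill(chunks, load_kv_from_storage, recompute_kv_on_gpu):
    """Simplified bidirectional compute/load pipeline; real systems also adapt
    chunk sizes and ordering to bandwidth and compute fluctuations."""
    owner = [None] * len(chunks)
    lock = threading.Lock()

    def claim(i, who):
        with lock:
            if owner[i] is None:
                owner[i] = who
                return True
            return False

    def loader():
        for i in reversed(range(len(chunks))):   # back -> front, from storage
            if not claim(i, "load"):
                return                           # met the compute side
            load_kv_from_storage(chunks[i])

    io_thread = threading.Thread(target=loader)
    io_thread.start()
    for i in range(len(chunks)):                 # front -> back, causal recompute
        if not claim(i, "compute"):
            break
        recompute_kv_on_gpu(chunks[i])
    io_thread.join()
    return owner                                 # who produced each chunk's KV
```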

4. Multi-Turn, Multi-Tenant, and Multi-Agent Scenarios

Real-world deployments must handle diverse and dynamic KV reuse patterns:

  • In multi-turn chat, both single-turn and repeated-session prompt KV blocks exhibit heavy-tailed, category-predictable reuse statistics (Zipf $\alpha \approx 1.2$) (Wang et al., 3 Jun 2025)
  • For multi-tenant LLM serving, semantic retrieval combined with edit-based KV recomputation (as in KVShare) enables efficient cache reuse for semantically similar but not byte-identical prompts across users (Yang et al., 17 Mar 2025)
  • In multi-agent pipelines (e.g., code synthesis with selection by an LLM judge), naïve KV reuse strategies can undermine cross-candidate attention critical for correct selection, as diagnosed by drops in Judge Consistency Rate despite stable overall task accuracy; explicit preservation of inter-candidate interaction is required for robust judge-centric inference (Liang et al., 13 Jan 2026)

System-level cache eviction must consider both per-block reuse probability and spatial locality within workload categories, with workload-aware policies significantly outperforming LRU/FIFO under constrained capacity (Wang et al., 3 Jun 2025).
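
A toy illustration of such a workload-aware eviction score, combining recency with a Zipf-style reuse prior and a per-category weight, is sketched below; the field names and weighting scheme are assumptions for illustration, not the policy from the cited work.

```python
import math

def zipf_reuse_prior(rank: int, alpha: float = 1.2) -> float:
    """Heavy-tailed prior on block reuse probability by popularity rank."""
    return 1.0 / (rank ** alpha)

def eviction_score(block: dict, now_s: float, category_weight: dict) -> float:
    """
    Lower score = evict first. Unlike LRU, recency is weighted by the block's
    estimated reuse probability and its workload-category prior.
    Expected keys (assumed): last_used_s, popularity_rank, category.
    """
    recency = math.exp(-(now_s - block["last_used_s"]) / 3600.0)
    prior = zipf_reuse_prior(block["popularity_rank"])
    return recency * prior * category_weight.get(block["category"], 1.0)

# Evict the lowest-scoring blocks until the cache fits its capacity budget, e.g.
# blocks.sort(key=lambda b: eviction_score(b, now_s, {"chat": 1.5, "rag": 1.0}))
```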

5. Specialized Compression and Reuse Methods

A spectrum of approaches—beyond simple prefix caching—achieve more aggressive memory and latency reduction:

  • Cross-layer and Head-wise Reuse
    • Cross-layer schemes (FusedKV, KVSharer) perform learnable or heuristic fusion of lower and middle layer K/V caches to reconstruct higher layers, enabling up to 50% memory reduction with maintained or even improved perplexity (Lin et al., 3 Dec 2025, Yang et al., 2024).
    • Head-wise similarity-driven reuse (KV-CAR) identifies redundant heads across adjacent layers and proxies their K/V instead of recomputing it, complementing autoencoder-based compression to approach 48% reduction with minor perplexity change (Roy et al., 7 Dec 2025); a toy sketch of this idea follows the list below.
  • Dynamic, Segment-wise, or Role-aware Compression
    • Thought-adaptive compression (ThinKV) assigns per-segment quantization and eviction based on chain-of-thought attention sparsity and supports in-place GPU slot reuse, achieving up to 5.8× throughput and <5% memory footprint in reasoning tasks (Ramachandran et al., 1 Oct 2025).
    • RazorAttention exploits attention head heterogeneity: full retention for retrieval heads and aggressive token dropping with compensation for others, reliably achieving ≥70% KV reduction with <2% degradation (Tang et al., 2024).
  • Retrieval-Augmented and Specialized Pipelines
    • Retrieval-augmented generation (Cache-Craft, HyperRAG, KVLink) enables chunk-level KV cache identification and partial recomputation (via attentional CCI/CFO or tri-mask inference), unlocking up to 4–5× compute savings and 2–3× throughput improvements in large-scale RAG scenarios (Agarwal et al., 5 Feb 2025, An et al., 3 Apr 2025, Yang et al., 21 Feb 2025).
    • For infilling tasks, prompt rewriting (EFIM) ensures both prefix and suffix KV reuse, necessitating specialized fragment tokenization pretraining to enable subtoken continuation and yielding up to 98% throughput gains in production code-assistant workloads (Guo et al., 28 May 2025).
    • Vision-LLMs are amenable to layer-wise, position-weighted KV recomputation, as in VLCache, which recomputes only the most error-propagating early vision tokens and leverages dynamic, layer-adaptive budgeting to balance >99% accuracy with ≥1.2–16× TTFT improvement (Qin et al., 15 Dec 2025).
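
As referenced above, the sketch below shows adjacent-layer, head-wise similarity matching in toy form; it is illustrative only and not the KV-CAR (or KVSharer/FusedKV) algorithm.

```python
import torch
import torch.nn.functional as F

def adjacent_layer_head_reuse(k_prev: torch.Tensor, k_curr: torch.Tensor,
                              sim_threshold: float = 0.95) -> dict:
    """
    For each head in the current layer, if its cached keys are nearly identical
    (cosine similarity) to some head in the previous layer, mark it for reuse of
    that head's K/V instead of storing its own.
    k_prev, k_curr: (heads, tokens, head_dim) key caches of adjacent layers.
    Returns {curr_head_index: prev_head_index} for heads that can be proxied.
    """
    reuse = {}
    h_prev = k_prev.shape[0]
    for hc in range(k_curr.shape[0]):
        sims = F.cosine_similarity(
            k_curr[hc].flatten().unsqueeze(0),   # (1, tokens * head_dim)
            k_prev.reshape(h_prev, -1),          # (h_prev, tokens * head_dim)
            dim=1,
        )
        best = torch.argmax(sims).item()
        if sims[best].item() >= sim_threshold:
            reuse[hc] = best  # drop this head's cache; read from the previous layer
    return reuse
```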

6. Trade-Offs, Failure Modes, and Best Practices

Aggressive KV eviction or compression risks under-retaining infrequently attended yet critical context, with variable impact depending on the downstream task and multi-query session patterns. Empirical and benchmark studies (e.g., SCBench) confirm that sub-$O(n)$ memory reuse methods suffer quality collapse across multi-request scenarios unless they either guarantee $O(n)$ memory retention (possibly off-chip) or adopt dynamic sparsity to adapt to unpredictable query-driven attention shifts (Li et al., 2024).

Realized cost and performance gains hinge on cache hit rates, locality of reuse, the proportion of tokens recomputed, and device-specific I/O and (de)compression bandwidth. Systems must monitor reuse frequency, the associated hit rate, and, where relevant, cache eviction and invalidation, adapting compression levels, placement, and cache partitioning accordingly (Li et al., 18 Mar 2025, Feng et al., 28 Aug 2025).

Qualitative best practices emerge:

  • Favor one-time, query-agnostic scoring and eviction (e.g., context reconstruction, maximum-attention, or CCI-informed metrics) over per-query or per-session query-aware schemes to preserve multi-query robustness (Kim et al., 29 May 2025).
  • Combine semantic retrieval, partial recomputation, and compact cache representation to serve multi-tenant, repetitive, or approximate-match workloads efficiently (Yang et al., 17 Mar 2025, Lin et al., 3 Dec 2025).
  • Evaluate not only end-task accuracy but secondary metrics such as TTFT, throughput, Judge Consistency Rate, and per-query cache hit rate under representative workloads (Liang et al., 13 Jan 2026, Li et al., 2024).

Failure modes can include privacy leakage through context that remains recallable after eviction, output drift in judge-centric aggregation, and accuracy loss under unpredictable cross-query shifts in the attention distribution.

7. Future Research Directions and Open Questions

The rapidly expanding design space for KV cache reuse includes:

  • Principled integration of cross-layer fusion with head-/token-level compression and logic for multi-modal or cross-attentional pathways
  • Dynamic, context- or token-aware adjustment of compression and retention, leveraging runtime attention or external signals (e.g., dynamic layer-wise recomputation as in VLCache, or adaptive fusion- and histogram-based token scoring)
  • Interaction with advanced model architectures (Mixture-of-Experts, sliding window attention, SSM layers)
  • Hardware-software co-design for zero-copy slot management, IO-streaming, and cache persistence across DRAM/SSD/network
  • Safeguards and diagnostics for quality invariance in judge-centric, multi-agent, or privacy-sensitive deployments

State-of-the-art methodologies illustrate that substantial memory and throughput improvements—on the order of 2–5×, and, in specialized contexts, up to order-of-magnitude reductions in compute or latency—are achievable while maintaining high-fidelity model outputs, provided that reuse and compression strategies are matched to workload structure, attention distribution, and deployment constraints (Kim et al., 29 May 2025, Ramachandran et al., 1 Oct 2025, Lin et al., 3 Dec 2025, Chen et al., 29 Jul 2025, An et al., 3 Apr 2025, Tang et al., 2024, Roy et al., 7 Dec 2025).
