
KV Caching in Transformers

Updated 6 December 2025
  • KV caching is a technique that stores intermediate key and value projections to accelerate self-attention in transformer models.
  • It enables efficient inference by eliminating redundant recomputation of past projections, reducing each decoding step's attention cost from quadratic to linear in the context length, at the price of a cache that grows linearly with the context.
  • Advanced strategies such as compression, eviction, quantization, and distributed caching drive significant improvements in speed and energy efficiency.

Key–Value (KV) caching is a central optimization enabling efficient inference in transformer-based LLMs, vision-LLMs, and generative autoregressive architectures more broadly. By storing the intermediate key ($K$) and value ($V$) projections of previously seen tokens, this technique allows self-attention at each decoding step to be computed with cost linear, rather than quadratic, in the context length, and it forms the basis for a diverse ecosystem of algorithmic and systems-level innovations in memory scalability, cache compression, and high-throughput deployment.

1. Fundamentals of KV Caching in Transformer Architectures

In the canonical transformer decoder, at each decoding step $t$, the newly produced token's hidden state is projected to a key and a value and appended to layerwise caches $K \in \mathbb{R}^{T \times d}$, $V \in \mathbb{R}^{T \times d}$, with $T$ denoting total context length and $d$ the per-head hidden size. At every step, the query $Q$ is scored against all past keys, $Q K^\top \in \mathbb{R}^{1 \times T}$, followed by a softmax and a value-weighted sum. Storing all $K, V$ pairs makes cache memory grow linearly with sequence length, while total attention compute over a generation grows quadratically, rendering inference memory-bound and potentially intractable for long contexts, especially in high-resolution or multi-modal scenarios (e.g., GUI agents, document understanding, image generation) (Huang et al., 1 Oct 2025).
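
To make the mechanics concrete, the following is a minimal NumPy sketch of a single decoding step for one attention head with a KV cache; the shapes follow the notation above, and all names (decode_step, cache_k, cache_v) are illustrative assumptions rather than any particular library's API.

```python
import numpy as np

def decode_step(q, k_new, v_new, cache_k, cache_v):
    """One autoregressive decoding step for a single attention head.

    q:       (1, d)  query projection of the newly generated token
    k_new:   (1, d)  key projection of the new token
    v_new:   (1, d)  value projection of the new token
    cache_k: (T, d)  cached keys of all previous tokens
    cache_v: (T, d)  cached values of all previous tokens
    """
    d = q.shape[-1]
    # Append the new token's projections to the cache (now length T+1).
    cache_k = np.concatenate([cache_k, k_new], axis=0)
    cache_v = np.concatenate([cache_v, v_new], axis=0)

    # Attention over the full cache: O((T+1) * d) work for this step.
    scores = q @ cache_k.T / np.sqrt(d)       # (1, T+1)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    out = weights @ cache_v                   # (1, d)
    return out, cache_k, cache_v

# Example: head size d = 64, a context of T = 10 already-cached tokens.
d, T = 64, 10
rng = np.random.default_rng(0)
out, ck, cv = decode_step(rng.normal(size=(1, d)), rng.normal(size=(1, d)),
                          rng.normal(size=(1, d)),
                          rng.normal(size=(T, d)), rng.normal(size=(T, d)))
print(out.shape, ck.shape)  # (1, 64) (11, 64)
```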

KV caching amortizes attention computation, ensuring that, after initial prefill, each new token's attention cost is $O(Td)$, and the total attention cost over $T$ steps per head is $O(T^2 d)$. However, as context windows increase, unbounded growth of KV caches presents severe challenges for memory usage, latency, and throughput (Jin et al., 4 Oct 2024, Wang et al., 3 Jun 2025).
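
For a rough sense of scale, the per-sequence cache footprint is $2 \times \text{layers} \times \text{KV heads} \times d \times T \times \text{bytes per element}$ (the factor 2 covers keys and values). The sketch below plugs in 7B-class dimensions (32 layers, 32 KV heads, head size 128, FP16 storage); these dimensions are illustrative assumptions, not figures taken from the cited papers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed 7B-class dimensions (illustrative): 32 layers, 32 KV heads,
# head_dim 128, FP16 (2 bytes per element).
for ctx in (4_096, 32_768, 128_000):
    gib = kv_cache_bytes(32, 32, 128, ctx) / 2**30
    print(f"context {ctx:>7,}: ~{gib:.1f} GiB per sequence")
# -> ~2.0 GiB, ~16.0 GiB, and ~62.5 GiB for the three context lengths
```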

2. Compression and Eviction Algorithms for Efficient KV Caching

To mitigate ballooning memory demand from long contexts, contemporary research has developed advanced KV cache eviction and compression mechanisms:

  • Layer-aware compression: GUI-KV (Huang et al., 1 Oct 2025) underscores the importance of adapting cache budgets per layer; in GUI agents, empirically uniform attention sparsity motivates a flat per-layer allocation, outperforming non-uniform or pyramidal schemes previously used for natural images.
  • Cake-slicing optimization: CAKE (Qin et al., 16 Mar 2025) formalizes cache allocation as a utility-maximizing problem over layer-specific preference scores, derived from the entropy (spatial dispersion) and variance (temporal shift) of recent attention patterns. Allocation proceeds via a cascading, monotonic eviction procedure, ensuring tight adherence to memory constraints and theoretically optimal per-layer slice selection.
  • Personalized per-layer schedules: XKV (Li et al., 8 Dec 2024) demonstrates that cache-value importance varies dramatically per layer, and models this as a discrete knapsack problem. A greedy, heap-based allocation achieves optimal per-layer compression, yielding 61.6% memory reduction and 2.1× efficiency gains over static uniform methods.
  • Redundancy-aware token selection: R-KV (Cai et al., 30 May 2025) and KVCrush (Jha et al., 24 Feb 2025) incorporate both attention-based importance and redundancy (cosine or Hamming similarities in key space), with R-KV in particular pruning highly redundant tokens in chain-of-thought reasoning, enabling over 90% cache reduction with negligible accuracy loss (a generic sketch of this importance-plus-redundancy scoring follows this list).
  • Sparse token and window attention: ALISA (Zhao et al., 26 Mar 2024) introduces Sparse Window Attention (SWA), retaining only a union of locally recent and globally most-attended tokens in the cache, dramatically shrinking KV footprint (to roughly 20% of baseline) while maintaining throughput and accuracy in long-sequence autoregressive settings.
  • Semantic compression: SentenceKV (Zhu et al., 1 Apr 2025) restructures the token-level KV cache into semantically-aggregated sentence blocks, storing compact sentence embeddings on GPU while offloading less critical token KVs to CPU. Decoding retrieves only semantically-relevant tokens, preserving accuracy at high compression levels.
  • Scale-adaptive policies for multi-scale architectures: AMS-KV (Xu et al., 20 Nov 2025) in visual autoregressive models retains only condensed “global” and recency-based “local” tokens at each scale, guided by inter-scale similarity. This dramatically reduces cache and attention cost in coarse-to-fine image synthesis, with up to 84.8% KV memory reduction.
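
Most of the eviction strategies above share a common skeleton: score each cached position (for example by accumulated attention, penalized for redundancy with other keys), always keep a recent local window, and fill the remaining budget with the top-scoring tokens. The sketch below illustrates that skeleton only; the scoring weights, the mean-key redundancy proxy, and all names are assumptions for illustration and do not reproduce the exact algorithm of any cited method.

```python
import numpy as np

def select_kv_budget(attn_scores, keys, budget, recent_window=32, alpha=0.5):
    """Pick which cached positions to keep under a fixed token budget.

    attn_scores: (T,)   accumulated attention each cached token has received
    keys:        (T, d) cached key vectors (used to estimate redundancy)
    budget:      total number of tokens to keep
    """
    T = len(attn_scores)
    if T <= budget:
        return np.arange(T)

    # Always keep the most recent tokens (local window).
    recent = np.arange(max(0, T - recent_window), T)
    remaining_budget = budget - len(recent)

    # Crude redundancy proxy: dot product of each normalized key with the
    # mean normalized key direction (higher = more redundant).
    unit = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-8)
    redundancy = unit @ unit.mean(axis=0)

    # Combined score: high attention importance, low redundancy.
    score = attn_scores - alpha * redundancy
    candidates = np.setdiff1d(np.arange(T), recent)
    top = candidates[np.argsort(score[candidates])[::-1][:remaining_budget]]

    return np.sort(np.concatenate([top, recent]))

# Example: keep 64 of 500 cached tokens.
rng = np.random.default_rng(1)
keep = select_kv_budget(rng.random(500), rng.normal(size=(500, 64)), budget=64)
print(len(keep))  # 64
```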

3. Quantization and Mixed-Precision Strategies

Quantizing keys and values in the cache is an orthogonal—often synergistic—approach to footprint reduction:

  • Joint channel quantization: Coupled Quantization (CQ) (Zhang et al., 7 May 2024) exploits inter-channel dependencies, jointly encoding groups of $c$ channels with $b$ bits ($b/c$ bits per channel), enabling effective quantization down to 1 bit per channel (16× compression) with competitive accuracy.
  • Dynamic channelwise precision boost: The Kitty system (Xia et al., 23 Nov 2025) ranks key channels by sensitivity (a magnitude heuristic) and selectively boosts the top fraction $f$ of channels to 4 bits, with the remainder quantized to 2 bits. By maintaining a unified page-centric memory layout and coalesced access (Triton kernels), Kitty preserves the full 8× memory advantage of 2-bit quantization while nearly eliminating the accuracy gap (<1% drop on challenging reasoning and code tasks); a simplified mixed-precision sketch follows this list.
  • Block and value-specific strategies: Offloading non-critical value vectors, as in KCache (He et al., 28 Apr 2024), and windowed quantization (the Q-Buffer and Sinks in Kitty) further reduce cache cost outside main GPU memory.
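
As a simplified illustration of mixed-precision cache quantization, the sketch below applies per-channel affine (min/max) quantization to a key cache, giving the highest-magnitude channels 4 bits and the rest 2 bits. The magnitude heuristic, bit widths, and all names are assumptions for illustration and should not be read as the CQ or Kitty algorithms.

```python
import numpy as np

def quantize_channel(x, bits):
    """Affine (min/max) quantization of a (T,) channel to `bits` bits."""
    qmax = 2**bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_channel(q, scale, lo):
    return q.astype(np.float32) * scale + lo

def quantize_kv_mixed(keys, high_frac=0.1, high_bits=4, low_bits=2):
    """Quantize a (T, d) key cache channel-wise with mixed precision.

    Channels with the largest average magnitude (a simple sensitivity proxy)
    get `high_bits`; the rest get `low_bits`. Returns the dequantized
    reconstruction so the error can be inspected; a real system would store
    the packed integer codes plus per-channel (scale, lo) metadata.
    """
    T, d = keys.shape
    n_high = max(1, int(high_frac * d))
    sensitivity = np.abs(keys).mean(axis=0)
    high_channels = set(np.argsort(sensitivity)[::-1][:n_high].tolist())

    recon = np.empty_like(keys, dtype=np.float32)
    for c in range(d):
        bits = high_bits if c in high_channels else low_bits
        q, scale, lo = quantize_channel(keys[:, c], bits)
        recon[:, c] = dequantize_channel(q, scale, lo)
    return recon

rng = np.random.default_rng(2)
k = rng.normal(size=(1024, 128)).astype(np.float32)
err = np.abs(quantize_kv_mixed(k) - k).mean()
print(f"mean abs reconstruction error: {err:.3f}")
```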

4. Systems, Caching Layers, and Distributed Inference

Modern LLM serving systems integrate KV caching with sophisticated paged, distributed, and offloaded designs:

  • Page-chunked KV layers: LMCache (Cheng et al., 8 Oct 2025) introduces a cache-aware, connector-driven layer that exposes KV caches as first-class data structures. Batched, chunked I/O, asynchronous compute–I/O pipelining, and modular connectors enable cross-query cache sharing, prefill–decode disaggregation, and robust cache migration, yielding up to 15× throughput boosts at enterprise scale (a minimal sketch of content-addressed prefix lookup follows this list).
  • Workload-aware cache management: Analysis of cloud traces (Wang et al., 3 Jun 2025) reveals that real-world KV reuse patterns are highly skewed, mainly driven by single-turn requests. Probabilistic workload-aware eviction, based on empirical future reuse probability per request type, outperforms classical LRU/LFU/FIFO policies by 1.5–23.9 percentage points in hit rate, reducing TTFT by 28–42%.
  • Multi-tenant and cross-agent reuse: KVShare (Yang et al., 17 Mar 2025) and KVCOMM (Ye et al., 14 Oct 2025) generalize prefix cache sharing to multi-tenant and multi-agent deployments. Dual-Stage High Deviation (DHD) algorithms identify token-level cache recomputation requirements via embedding-level and edit-distance analyses, orchestrating selection and partial recomputation under tight accuracy constraints. KVCOMM's anchor-pool mechanism aligns KV offsets across diverging agent contexts, achieving 70% cross-agent reuse and 6–8× TTFT speedup in collaborative LLM systems.
  • Bidirectional compute–I/O scheduling: Cake (Jin et al., 4 Oct 2024) and LMCache (Cheng et al., 8 Oct 2025) recognize that, for long prefixes, recomputing a KV cache and loading it from storage are complementary options; dynamic “meeting in the middle” algorithms minimize prefill latency by adapting to GPU and I/O availability.
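
A common building block behind these systems is content-addressed prefix caching: the token stream is split into fixed-size chunks, each chunk is keyed by a hash of the full prefix up to and including it, and the serving layer reuses stored KV for the longest matching prefix before prefilling the rest. The sketch below illustrates only that lookup logic; the chunk size, hashing scheme, and store interface are assumptions for illustration and do not reproduce LMCache's or KVShare's actual APIs.

```python
import hashlib

CHUNK = 256  # tokens per cached chunk (illustrative choice)

def chunk_keys(token_ids, chunk=CHUNK):
    """Content-addressed keys: hash of the full prefix up to each chunk boundary."""
    keys, h = [], hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % chunk, chunk):
        h.update(bytes(str(token_ids[start:start + chunk]), "utf-8"))
        keys.append(h.hexdigest())
    return keys

def longest_cached_prefix(token_ids, store, chunk=CHUNK):
    """Return (n_reused_tokens, kv_blocks) for the longest prefix found in `store`.

    `store` maps prefix-hash -> KV block (opaque here); the first missing
    chunk ends reuse, and everything after it must be prefilled normally.
    """
    reused, blocks = 0, []
    for key in chunk_keys(token_ids, chunk):
        if key not in store:
            break
        blocks.append(store[key])
        reused += chunk
    return reused, blocks

# Usage: a second request sharing a long system prompt hits the cached chunks.
store = {}
prompt_a = list(range(1000))
for key in chunk_keys(prompt_a):
    store[key] = f"kv-block:{key[:8]}"       # stand-in for real KV tensors
prompt_b = prompt_a[:700] + [42, 43, 44]     # shares the first 700 tokens
reused, _ = longest_cached_prefix(prompt_b, store)
print(f"reused {reused} of {len(prompt_b)} tokens from cache")  # reused 512 of 703
```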

5. Domain- and Hardware-Specific KV Caching

Specialized applications necessitate further adaptation:

  • GUI-KV (Huang et al., 1 Oct 2025): For GUI agents with highly redundant visual inputs, spatial saliency scoring (residual $\ell_2$-norms) and temporal redundancy projection (QR subspace overlap across frames) enable plug-and-play, training-free cache pruning that not only reduces decoding FLOPs by nearly 40% but actually improves step accuracy over the full-cache baseline (a simplified saliency-scoring sketch follows this list).
  • Edge devices and memory hierarchy: The Kelle system (Xia et al., 16 Oct 2025) co-designs cache and on-chip memory, leveraging eDRAM’s density with two-dimensional adaptive refresh, selective recomputation, and fine-grained attention-driven eviction. Exploiting the volatility/importance trade-off in bit positions and tokens yields 3.94× speedup and 4.46× energy savings on LLaMA2-7B relative to SRAM-only baselines, demonstrating that modest (N’ ≈ 128–512) transient KV budgets with recomputation and adaptive scheduling achieve near-cloud accuracy on resource-constrained hardware.
  • Diffusion LMs: In bidirectional diffusion LLMs, KV caching is challenging because tokens are unmasked non-monotonically. FreeCache (Hu et al., 27 May 2025) introduces block-wise KV approximation: once a block is “clean,” its projections are reused in all subsequent steps, reducing the dominant $\mathcal{O}(L^2)$ cost to $\mathcal{O}(LB)$ for block size $B$, yielding 2–5× speedup with <2% downstream accuracy penalty.
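
To illustrate the kind of spatial-saliency scoring described for GUI-KV, the sketch below ranks visual tokens by the $\ell_2$-norm of their residual-stream update and keeps only the top fraction. This is a deliberately simplified stand-in (the residual definition, keep fraction, and names are assumptions), not the paper's full method, which also includes the temporal QR-subspace step.

```python
import numpy as np

def prune_visual_tokens(hidden, hidden_prev_layer, keep_frac=0.3):
    """Keep the visual tokens whose residual update has the largest L2 norm.

    hidden, hidden_prev_layer: (N, d) hidden states of N visual tokens at the
    current layer and the layer below; the residual norm serves as a crude
    spatial-saliency score for deciding whose KV entries to retain.
    """
    residual = hidden - hidden_prev_layer
    saliency = np.linalg.norm(residual, axis=1)   # (N,)
    k = max(1, int(keep_frac * len(saliency)))
    keep = np.argsort(saliency)[::-1][:k]         # indices of the most salient tokens
    return np.sort(keep)

rng = np.random.default_rng(3)
h, h_prev = rng.normal(size=(1500, 256)), rng.normal(size=(1500, 256))
kept = prune_visual_tokens(h, h_prev, keep_frac=0.3)
print(f"kept {len(kept)} of 1500 visual tokens")  # kept 450 of 1500
```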

6. Empirical Results and Comparative Evaluations

The empirical impact of KV caching and its variants is substantial across inference regimes, domains, and workloads:

| Method/System | Peak Memory Reduction | Throughput Speedup | Accuracy Impact | Context/Task | Reference |
|---|---|---|---|---|---|
| GUI-KV | — | 38.9% decoding-FLOP reduction | +4.1% | GUI-AgentNetBench | (Huang et al., 1 Oct 2025) |
| CAKE | >95% (3.2% cache retained) | >10× | <0.3 pts drop | LongBench (128K ctx) | (Qin et al., 16 Mar 2025) |
| R-KV | 90% | 6.6× | <1% drop | Reasoning/Math | (Cai et al., 30 May 2025) |
| XKV | 61.6% | 2.1× | negligible | LongBench | (Li et al., 8 Dec 2024) |
| LMCache | — | 2.1–4.1× (up to 15×) | <0.5% drop | Enterprise LLM serving | (Cheng et al., 8 Oct 2025) |
| Kitty | 8× | 2.1–4.1× | <1% drop | Reasoning/Code LLM | (Xia et al., 23 Nov 2025) |
| Kelle (eDRAM) | 4.5× energy savings | 3.94× | negligible | Edge LLM | (Xia et al., 16 Oct 2025) |
| SentenceKV | 30–40% | — | <0.5% drop | PG-19, LongBench | (Zhu et al., 1 Apr 2025) |
| KVShare | — | 1.2× | +0.8 BLEU | Multi-tenant LLM | (Yang et al., 17 Mar 2025) |
| FreeCache | — | 2–5× | 1–2% drop | Diffusion LMs | (Hu et al., 27 May 2025) |

These results show state-of-the-art cache compression, lossless or near-lossless accuracy on demanding long-context and reasoning tasks, and system-level throughput or TTFT improvements of 2× to 15× over naive or non-caching baselines.

7. Outlook and Future Directions

KV caching has evolved from a basic memory–compute tradeoff mechanism into a rich research area intersecting cache-aware algorithms, distributed and edge deployment, quantization, redundancy elimination, and content-aware policy design.

Current directions include:

  • Layer-, workload-, and content-adaptive cache budgeting and eviction policies.
  • Sub-2-bit and mixed-precision cache quantization co-designed with memory layouts and kernels.
  • Cross-query, multi-tenant, and multi-agent cache sharing in distributed serving stacks.
  • Domain- and hardware-specific designs for GUI agents, diffusion language models, visual autoregressive models, and edge accelerators.

In sum, KV caching constitutes a critical substrate for scalable, high-performance inference across the expanding landscape of generative AI, and advances in cache-adaptive algorithms and systems will continue to play a pivotal role in realizing the potential of long-context and collaborative foundation models.
