KV Caching in Transformers
- KV caching is a technique that stores intermediate key and value projections to accelerate self-attention in transformer models.
- It enables efficient inference by avoiding recomputation of keys and values for earlier tokens, so each decoding step costs time linear rather than quadratic in the sequence length.
- Advanced strategies such as compression, eviction, quantization, and distributed caching drive significant improvements in speed and energy efficiency.
Key–Value (KV) caching is a central optimization enabling efficient inference in transformer-based LLMs, vision-LLMs, and generative autoregressive architectures more broadly. By storing the intermediate key ($K$) and value ($V$) projections of previously seen tokens, this technique allows self-attention to be computed with linear rather than quadratic complexity per decoding step, and forms the basis for a diverse ecosystem of algorithmic and systems-level innovations in memory scalability, cache compression, and high-throughput deployment.
1. Fundamentals of KV Caching in Transformer Architectures
In the canonical transformer decoder, at each decoding step $t$, the hidden state of the newly produced token is projected to a key and a value, which are appended to the layerwise caches $K \in \mathbb{R}^{T \times d}$ and $V \in \mathbb{R}^{T \times d}$, with $T$ denoting total context length and $d$ the per-head hidden size. At every step, the query $q_t$ is scored against all past keys, $q_t K^\top / \sqrt{d}$, followed by a softmax and a value-weighted sum. Storing all $(K, V)$ pairs makes cache memory grow linearly, and cumulative attention compute quadratically, with sequence length, rendering inference memory-bound and potentially intractable for long contexts, especially in high-resolution or multi-modal scenarios (e.g., GUI agents, document understanding, image generation) (Huang et al., 1 Oct 2025).
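The decoding loop described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not any particular library's implementation: the names `KVCache`, `decode_step`, and `decode_without_cache`, the random weight matrices, and the toy dimension `d = 4` are all assumptions made for the example. It checks that attending over cached keys and values gives the same output as re-projecting the whole prefix.

```python
import math
import random


def matvec(W, x):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """softmax(q K^T / sqrt(d)) V for one query against all given keys."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out


class KVCache:
    """Hypothetical per-layer, per-head cache: one key and one value per token."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_step(x_t, Wq, Wk, Wv, cache):
    """Project only the new token, append its K/V, attend over the cache."""
    q = matvec(Wq, x_t)
    cache.append(matvec(Wk, x_t), matvec(Wv, x_t))
    return attend(q, cache.keys, cache.values)


def decode_without_cache(xs, Wq, Wk, Wv):
    """Reference path: re-project every prefix token before attending."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


random.seed(0)
d = 4
rand_matrix = lambda: [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
Wq, Wk, Wv = rand_matrix(), rand_matrix(), rand_matrix()
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for x in xs:
    cached_out = decode_step(x, Wq, Wk, Wv, cache)

reference_out = decode_without_cache(xs, Wq, Wk, Wv)
max_diff = max(abs(a - b) for a, b in zip(cached_out, reference_out))
```

With the cache, each step performs one key projection and one value projection instead of one per prefix token; the attention itself still touches every cached entry.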
KV caching amortizes attention computation, ensuring that, after initial prefill, each new token’s attention cost is $O(t)$, and total attention cost over $T$ steps per head is $O(T^2)$. However, as context windows increase, unbounded growth of KV caches presents severe challenges for memory usage, latency, and throughput (Jin et al., 4 Oct 2024, Wang et al., 3 Jun 2025).
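To make the growth concrete, the following back-of-the-envelope sketch estimates cache size and attention work. The helper names `kv_cache_bytes` and `total_scores` are hypothetical, and the model shape (32 layers, 32 KV heads, head dimension 128, 16-bit storage, as in a LLaMA2-7B-style configuration) is an assumption chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to hold K and V for every layer, head, and cached token."""
    # The leading factor 2 accounts for one key tensor plus one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch


def total_scores(num_steps):
    """Query-key dot products per head over num_steps decode steps: 1 + 2 + ... + T."""
    return num_steps * (num_steps + 1) // 2


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes per element).
per_token_bytes = kv_cache_bytes(32, 32, 128, seq_len=1)   # 524288 bytes = 0.5 MiB
cache_at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)    # 2147483648 bytes = 2 GiB
scores_at_4k = total_scores(4096)                          # 8390656 dot products per head
```

Under these assumptions the cache grows by half a mebibyte per token and per sequence, which is why long contexts and large batches quickly exhaust accelerator memory.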
2. Compression and Eviction Algorithms for Efficient KV Caching
To mitigate ballooning memory demand from long contexts, contemporary research has developed advanced KV cache eviction and compression mechanisms:
- Layer-aware compression: GUI-KV (Huang et al., 1 Oct 2025) examines how cache budgets should be distributed across layers; in GUI agents, attention sparsity is empirically uniform across layers, which motivates a flat per-layer allocation that outperforms the non-uniform or pyramidal schemes previously used for natural images.
- Cake-slicing optimization: CAKE (Qin et al., 16 Mar 2025) formalizes cache allocation as a utility-maximizing problem over layer-specific preference scores, derived from the entropy (spatial dispersion) and variance (temporal shift) of recent attention patterns. Allocation proceeds via a cascading, monotonic eviction procedure, ensuring tight adherence to memory constraints and theoretically optimal per-layer slice selection.
- Personalized per-layer schedules: XKV (Li et al., 8 Dec 2024) demonstrates that cache-value importance varies dramatically per layer, and models this as a discrete knapsack problem. A greedy, heap-based allocation achieves optimal per-layer compression, yielding memory reduction and efficiency gains over static uniform methods.
- Redundancy-aware token selection: R-KV (Cai et al., 30 May 2025) and KVCrush (Jha et al., 24 Feb 2025) incorporate both attention-based importance and redundancy (cosine or Hamming similarities in key space), with R-KV in particular pruning highly redundant tokens in chain-of-thought reasoning, enabling over cache reduction with negligible accuracy loss.
- Sparse token and window attention: ALISA (Zhao et al., 26 Mar 2024) introduces Sparse Window Attention (SWA), retaining only a union of locally recent and globally most-attended tokens in the cache, dramatically shrinking KV footprint (to of baseline) while maintaining throughput and accuracy in long-sequence autoregressive settings.
- Semantic compression: SentenceKV (Zhu et al., 1 Apr 2025) restructures the token-level KV cache into semantically-aggregated sentence blocks, storing compact sentence embeddings on GPU while offloading less critical token KVs to CPU. Decoding retrieves only semantically-relevant tokens, preserving accuracy at high compression levels.
- Scale-adaptive policies for multi-scale architectures: AMS-KV (Xu et al., 20 Nov 2025) in visual autoregressive models retains only condensed “global” and recency-based “local” tokens at each scale, guided by inter-scale similarity. This dramatically reduces cache and attention cost in coarse-to-fine image synthesis, with up to KV memory reduction.
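Several of the methods above share a common core: keep a recent window plus the tokens that have received the most attention, and evict the rest. The sketch below illustrates that idea only. It is not the algorithm of ALISA, R-KV, or any other cited system; the function names, the use of accumulated attention mass as the importance score, and the toy numbers are all assumptions.

```python
import heapq


def select_tokens_to_keep(attn_scores, budget, recent_window):
    """Return sorted indices of cached tokens to retain.

    attn_scores[i] is the accumulated attention mass received by token i
    (an assumed importance proxy). The last `recent_window` tokens are always
    kept; remaining budget goes to the highest-scoring older tokens.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    recent_window = min(recent_window, budget)
    recent = set(range(n - recent_window, n))
    older = range(n - recent_window)
    heavy = heapq.nlargest(budget - recent_window, older,
                           key=lambda i: attn_scores[i])
    return sorted(recent.union(heavy))


def evict(keys, values, attn_scores, budget, recent_window):
    """Drop every cached key/value pair whose index is not selected."""
    keep = select_tokens_to_keep(attn_scores, budget, recent_window)
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy cache of 8 tokens with one-dimensional keys and values.
keys = [[float(i)] for i in range(8)]
values = [[10.0 * i] for i in range(8)]
scores = [0.9, 0.1, 0.05, 0.6, 0.02, 0.3, 0.2, 0.1]

kept_keys, kept_values, keep = evict(keys, values, scores,
                                     budget=4, recent_window=2)
# keep == [0, 3, 6, 7]: the two most recent tokens plus the two most attended older ones
```

A per-layer or per-head variant would call the same routine with a different `budget` for each layer or head, which is where the allocation schemes discussed above come in.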
3. Quantization and Mixed-Precision Strategies
Quantizing keys and values in the cache is an orthogonal—often synergistic—approach to footprint reduction:
- Joint channel quantization: Coupled Quantization (CQ) (Zhang et al., 7 May 2024) exploits inter-channel dependencies, jointly encoding groups of $c$ channels with $b$ bits ($b/c$ bits/channel), enabling effective quantization down to $1$ bit/channel ($16\times$ compression) with competitive accuracy.
- Dynamic channelwise precision boost: The Kitty system (Xia et al., 23 Nov 2025) ranks key channels by sensitivity (magnitude heuristic) and selectively boosts the top fraction to $4$ bits, with the remainder quantized to $2$ bits. By maintaining a unified page-centric memory layout and coalesced access (Triton kernels), Kitty preserves the full memory advantage of 2-bit quantization while nearly eliminating the accuracy gap ( drop on challenging reasoning and code tasks).
- Block and value-specific strategies: Offloading non-critical value vectors (KCache (He et al., 28 Apr 2024)) and windowed quantization (Q-Buffer, Sinks in Kitty) further reduce cache cost outside the main GPU memory.
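The channelwise mixed-precision idea can be illustrated with a small sketch. It assumes per-channel asymmetric uniform quantization and uses mean absolute magnitude as the sensitivity proxy; the function names, the `boost_fraction` parameter, and the dictionary layout are hypothetical and do not reproduce Kitty's paged memory layout or kernels.

```python
import math


def quantize_channel(values, bits):
    """Asymmetric uniform quantization of one channel to 2**bits levels."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def mixed_precision_quantize(keys, boost_fraction=0.25, low_bits=2, high_bits=4):
    """Quantize a (tokens x channels) key matrix channel by channel.

    The most sensitive channels (largest mean |value|, an assumed heuristic)
    get `high_bits`; every other channel gets `low_bits`.
    """
    num_channels = len(keys[0])
    channels = [[tok[c] for tok in keys] for c in range(num_channels)]
    sensitivity = [sum(abs(v) for v in ch) / len(ch) for ch in channels]
    num_boosted = int(round(boost_fraction * num_channels))
    ranked = sorted(range(num_channels), key=lambda c: sensitivity[c], reverse=True)
    boosted = set(ranked[:num_boosted])
    packed = []
    for c, ch in enumerate(channels):
        bits = high_bits if c in boosted else low_bits
        codes, scale, lo = quantize_channel(ch, bits)
        packed.append({"bits": bits, "codes": codes, "scale": scale, "zero": lo})
    return packed


def dequantize(packed):
    """Rebuild an approximate (tokens x channels) matrix from the packed channels."""
    channels = [[code * p["scale"] + p["zero"] for code in p["codes"]] for p in packed]
    return [list(tok) for tok in zip(*channels)]


# Deterministic toy keys: 16 tokens, 8 channels with growing magnitude.
keys = [[(c + 1) * math.sin(t + c) for c in range(8)] for t in range(16)]
packed = mixed_precision_quantize(keys)
restored = dequantize(packed)
avg_bits = sum(p["bits"] for p in packed) / len(packed)  # 2.5 bits per channel here
```

With a quarter of the channels boosted, the average cost is 2.5 bits per channel, and the reconstruction error of each entry is bounded by half of that channel's quantization step.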
4. Systems, Caching Layers, and Distributed Inference
Modern LLM serving systems integrate KV caching with sophisticated paged, distributed, and offloaded designs:
- Page-chunked KV layers: LMCache (Cheng et al., 8 Oct 2025) introduces a cache-aware, connector-driven layer that exposes KV caches as first-class data structures. Batched, chunked I/O, asynchronous compute–I/O pipelining, and modular connectors enable cross-query cache sharing, prefill–decode disaggregation, and robust cache migration, yielding up to throughput boosts at enterprise scale.
- Workload-aware cache management: Analysis of cloud traces (Wang et al., 3 Jun 2025) reveals that real-world KV reuse patterns are highly skewed, mainly driven by single-turn requests. Probabilistic workload-aware eviction, based on empirical future reuse probability per request type, outperforms classical LRU/LFU/FIFO policies by $1.5$–$23.9$ pp in hit rate, reducing TTFT by $28$–.
- Multi-tenant and cross-agent reuse: KVShare (Yang et al., 17 Mar 2025) and KVCOMM (Ye et al., 14 Oct 2025) generalize prefix cache sharing to multi-tenant and multi-agent deployments. Dual-Stage High Deviation (DHD) algorithms identify token-level cache recomputation requirements via embedding-level and edit distance analyses, orchestrating selection and partial recomputation under tight accuracy constraints. KVCOMM's anchor-pool mechanism aligns KV offsets across diverging agent contexts, achieving cross-agent reuse and $6$– TTFT speedup in collaborative LLM systems.
- Bidirectional compute–I/O scheduling: Cake (Jin et al., 4 Oct 2024) and LMCache (Cheng et al., 8 Oct 2025) recognize that, for long prefixes, computing a KV cache and loading it from storage are complementary; dynamic “meeting in the middle” algorithms minimize prefill latency, adapting to GPU and I/O availability.
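The compute-or-load trade-off can be illustrated with a toy simulation. The sketch assumes the prefix is split into chunks, that the GPU computes chunks front to back (each chunk's KV depends on the chunks before it), and that the I/O path loads stored chunks back to front. The function name, the greedy claiming rule, and the cost models are assumptions for illustration and are not taken from Cake or LMCache.

```python
def meet_in_the_middle(num_chunks, compute_time, load_time):
    """Greedily split a chunked prefix between a compute path and a load path.

    compute_time(i) and load_time(i) give the cost of chunk i in milliseconds.
    Each remaining chunk goes to whichever side would finish it sooner; the
    two sides stop when their ranges meet.
    Returns (computed_chunks, loaded_chunks, finish_time_ms).
    """
    front, back = 0, num_chunks - 1
    gpu_clock, io_clock = 0, 0
    computed, loaded = [], []
    while front <= back:
        gpu_done = gpu_clock + compute_time(front)
        io_done = io_clock + load_time(back)
        if gpu_done <= io_done:
            computed.append(front)
            gpu_clock = gpu_done
            front += 1
        else:
            loaded.append(back)
            io_clock = io_done
            back -= 1
    return computed, loaded, max(gpu_clock, io_clock)


# Assumed costs: later chunks attend to longer prefixes, so compute cost grows
# with the chunk index, while loading a chunk takes a constant 30 ms.
computed, loaded, finish_ms = meet_in_the_middle(
    num_chunks=10,
    compute_time=lambda i: 10 * (i + 1),
    load_time=lambda i: 30,
)
# computed == [0, 1, 2, 3, 4], loaded == [9, 8, 7, 6, 5], finish_ms == 150

compute_only_ms = sum(10 * (i + 1) for i in range(10))  # 550
load_only_ms = 30 * 10                                  # 300
```

In this toy setting the combined schedule finishes in 150 ms, compared with 550 ms for computing everything and 300 ms for loading everything. A real scheduler would re-estimate both costs online as GPU load and I/O bandwidth change.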
5. Domain- and Hardware-Specific KV Caching
Specialized applications necessitate further adaptation:
- GUI-KV (Huang et al., 1 Oct 2025): For GUI agents with highly redundant visual inputs, spatial saliency scoring (residual -norms) and temporal redundancy projection (QR subspace overlap across frames) enable plug-and-play, training-free cache pruning that not only reduces decoding FLOPs by nearly 40% but actually improves step accuracy over the full-cache baseline.
- Edge devices and memory hierarchy: The Kelle system (Xia et al., 16 Oct 2025) co-designs cache and on-chip memory, leveraging eDRAM’s density with two-dimensional adaptive refresh, selective recomputation, and fine-grained attention-driven eviction. Exploiting the volatility/importance trade-off in bit positions and tokens yields speedup and energy savings on LLaMA2-7B relative to SRAM-only baselines, demonstrating that modest (N’ ≈ 128–512) transient KV budgets with recomputation and adaptive scheduling achieve near-cloud accuracy on resource-constrained hardware.
- Diffusion LMs: In bidirectional diffusion LLMs, KV caching is challenging due to non-monotonic token unmasking. FreeCache (Hu et al., 27 May 2025) introduces block-wise KV approximation: once a block is “clean,” its projections are reused in all further steps, reducing dominant costs to , yielding $2$– speedup with downstream accuracy penalty.
6. Empirical Results and Comparative Evaluations
The empirical impact of KV caching and its variants is substantial across inference regimes, domains, and workloads:
| Method/System | Peak Memory Reduction | Throughput Speedup | Accuracy Drop | Context/Task | Reference |
|---|---|---|---|---|---|
| GUI-KV | 38.9% FLOP↓ | – | | GUI-AgentNetBench | (Huang et al., 1 Oct 2025) |
| CAKE | ( cache) | | pts | LongBench (128K ctx) | (Qin et al., 16 Mar 2025) |
| R-KV | 90% | | | Reasoning/Math | (Cai et al., 30 May 2025) |
| XKV | 61.6% | | negligible | LongBench | (Li et al., 8 Dec 2024) |
| LMCache | up to | $2.1$– | | Enterprise LLM | (Cheng et al., 8 Oct 2025) |
| Kitty | | $2.1$– | | Reasoning/Code LLM | (Xia et al., 23 Nov 2025) |
| Kelle (eDRAM) | energy | | negligible | Edge LLM | (Xia et al., 16 Oct 2025) |
| SentenceKV | 30–40% mem savings | – | | PG-19, LongBench | (Zhu et al., 1 Apr 2025) |
| KVShare | – | | BLEU | Multi-tenant LLM | (Yang et al., 17 Mar 2025) |
| FreeCache (DLM) | – | $2$– | $1$– | Diffusion LMs | (Hu et al., 27 May 2025) |
These results show state-of-the-art cache compression, lossless or near-lossless accuracy on demanding long-context and reasoning tasks, and system-level throughput or TTFT improvements from to over naive or non-caching baselines.
7. Outlook and Future Directions
KV caching has evolved from a basic memory–compute tradeoff mechanism into a rich research area intersecting cache-aware algorithms, distributed and edge deployment, quantization, redundancy elimination, and content-aware policy design.
Current directions include:
- Head- and block-level adaptive allocation beyond the layer granularity (Qin et al., 16 Mar 2025, Huang et al., 1 Oct 2025)
- Tighter integration of hardware features (e.g., memory hierarchy, adaptive refresh, eDRAM/DRAM interplay (Xia et al., 16 Oct 2025))
- Plug-and-play composition of compression, quantization, and workflow-aware policies for multi-agent and retrieval-augmented scenarios (Ye et al., 14 Oct 2025, Pan et al., 10 Jul 2025)
- Extension to non-autoregressive (diffusion, bidirectional, multi-scale) models (Hu et al., 27 May 2025, Xu et al., 20 Nov 2025)
- End-to-end optimization jointly considering cache layout, kernel fusion, I/O scheduling, and pipeline orchestration (Cheng et al., 8 Oct 2025, Jin et al., 4 Oct 2024)
- Fully semantic or hierarchical compression schemes (sentence/paragraph/graph-level) that retain contextual and task-relevant information without reverting to fine-grained token-level heuristics (Zhu et al., 1 Apr 2025)
In sum, KV caching constitutes a critical substrate for scalable, high-performance inference across the expanding landscape of generative AI, and advances in cache-adaptive algorithms and systems will continue to play a pivotal role in realizing the potential of long-context and collaborative foundation models.