
Efficient KV Caching for LLM Inference

Updated 17 July 2025
  • Efficient KV Caching is a suite of algorithmic and systems techniques that reduces memory and computational overhead in transformer-based models.
  • It leverages methods such as quantization, adaptive pruning, and merging to compress key–value states for improved throughput.
  • Hybrid CPU/GPU management and workflow-aware policies further optimize caching efficiency for long-sequence and multi-tenant deployments.

Efficient KV caching refers to the suite of algorithmic, architectural, and systems-level techniques designed to reduce the memory and computational overhead of storing and manipulating the key–value (KV) states in transformer-based LLMs during inference. As the context window, batch size, and model size scale, conventional KV caching can quickly become the main bottleneck, imposing linear memory scaling across tokens, layers, and heads and limiting practical deployment of LLMs for long-sequence and high-throughput applications. Recent advances draw on quantization, pruning, cache sharing, hybrid CPU/GPU management, hierarchical reuse, and workflow-aware policies to address diverse usage patterns and system constraints, enabling orders-of-magnitude improvements in memory footprint, throughput, and latency while preserving model quality.

1. Bottlenecks of Standard KV Caching

In standard autoregressive transformer inference, the KV cache persists the projected key and value vectors for each layer and token. The total memory grows as

\text{KV Cache Memory} = 2 \times (\text{bytes per element}) \times b \times s \times d \times l

with b = batch size, s = sequence length, d = embedding dimension, and l = layer count (He et al., 28 Apr 2024). This linear scaling in s and b means that for long documents or multi-turn interaction, KV cache usage can exceed available GPU high-bandwidth memory (HBM), leading to thrashing or costly device–host data movement. Approaches that offload to CPU incur further bandwidth bottlenecks, as the PCIe or NVLink connection cannot keep up with random per-token accesses unless specialized techniques are used (Jiang et al., 26 Nov 2024). The KV cache thus remains one of the chief obstacles to serving large-batch, long-context LLMs at scale.
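
As a concrete illustration of this scaling, the formula can be evaluated directly. The sketch below is a back-of-the-envelope calculator, assuming FP16 storage and illustrative model dimensions (d = 4096, l = 32) that are not taken from any of the cited papers:

```python
def kv_cache_bytes(batch_size, seq_len, hidden_dim, num_layers, bytes_per_element=2):
    """KV cache footprint: 2 (keys + values) x bytes x b x s x d x l."""
    return 2 * bytes_per_element * batch_size * seq_len * hidden_dim * num_layers

# Illustrative 7B-class dimensions with FP16 KV entries.
size = kv_cache_bytes(batch_size=8, seq_len=32_768, hidden_dim=4096, num_layers=32)
print(f"{size / 2**30:.0f} GiB")  # 128 GiB -- far beyond a single GPU's HBM
```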

2. Core Compression and Pruning Methods

Quantization

Quantization reduces the bitwidth of KV entries. Coupled Quantization (CQ) jointly compresses multiple contiguous channels, exploiting the fact that the joint entropy of channel groups grows sublinearly with the number of channels; as a result, CQ achieves low-perplexity, near-lossless KV caches with as little as 1 bit per floating-point channel by learning centroids for multi-channel codebooks (Zhang et al., 7 May 2024). Heterogeneous quantization, as introduced in LeanKV, stores keys at higher precision (e.g., 8 bits) and values at lower precision (e.g., 4 bits), since keys influence the shared softmax denominator and are more sensitive to quantization error (Zhang et al., 4 Dec 2024). AQUA-KV predicts KV entries across layers using lightweight regressors and quantizes only the unpredictable residual, achieving 2–2.5 bits per value while keeping perplexity within 1% of the baseline (Shutova et al., 31 Jan 2025).
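
A minimal sketch of the heterogeneous-precision idea is shown below, using generic per-token symmetric quantization rather than LeanKV's actual scheme; bit-packing and the dynamic per-head adaptation are omitted:

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Per-row symmetric uniform quantization (generic sketch, not LeanKV's exact codec)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax + 1e-8
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Keys kept at 8 bits, values at 4 bits, reflecting their different error sensitivity.
keys = np.random.randn(128, 64).astype(np.float32)    # (cached tokens, head_dim)
values = np.random.randn(128, 64).astype(np.float32)
qk, k_scale = quantize_symmetric(keys, bits=8)
qv, v_scale = quantize_symmetric(values, bits=4)
print("key err:", np.abs(dequantize(qk, k_scale) - keys).mean(),
      "value err:", np.abs(dequantize(qv, v_scale) - values).mean())
```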

Adaptive and Saliency-Aware Pruning

Adaptive pruning determines which KV entries to retain based on attention scores or contextual relevance. Methods such as ZipCache leverage normalized attention scores—corrected for the lower-triangular causal structure—to better identify salient tokens and apply higher-precision quantization or retention to those, while aggressively compressing less important ones (He et al., 23 May 2024). DynamicKV periodically computes global and per-layer token budgets based on attention patterns, adaptively retaining the most relevant tokens for the model's current layer and specific task, attaining 85% of full-KV performance at only 1.7% of the original memory (Zhou et al., 19 Dec 2024).
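
The sketch below illustrates the causal-normalization idea behind this kind of saliency scoring; it is a simplified stand-in for ZipCache's estimator, with a random attention matrix as placeholder data:

```python
import numpy as np

def select_salient_tokens(attn, keep_ratio=0.25):
    """Rank cached tokens by causally normalized attention received and keep the top fraction.

    attn: (n, n) lower-triangular attention probability matrix for one head.
    Dividing by the number of queries that can see each key corrects the bias
    toward early tokens that raw accumulated scores would have.
    """
    n = attn.shape[0]
    received = attn.sum(axis=0)        # total attention mass each key receives
    coverage = np.arange(n, 0, -1)     # key j is visible to n - j queries
    saliency = received / coverage
    k = max(1, int(keep_ratio * n))
    return np.sort(np.argsort(saliency)[-k:])   # indices of KV entries to retain

# Toy example: random causal attention over 16 tokens.
n = 16
logits = np.random.randn(n, n)
logits[np.triu_indices(n, k=1)] = -np.inf                      # causal mask
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(select_salient_tokens(attn))
```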

Merging and Consistency-Preserving Compression

Recent work highlights the problem of "output perturbation": merging or aggressive pruning can misalign model outputs by distorting the attention distribution. KeepKV introduces an electoral-votes mechanism that tracks merging history and adjusts attention scores so that the merged cache yields the same output as the original at each inference step. Its ZIP-Merging formulation guarantees that the attention perturbations introduced by merging are mathematically compensated (Tian et al., 14 Apr 2025).
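
A toy sketch of the vote-tracking intuition is given below: each cache slot records how many original entries it absorbed, and attention scores are scaled by that count. This reproduces the unmerged softmax exactly when the merged keys coincide and only approximately otherwise; it is a simplification, not KeepKV's full ZIP-Merging procedure:

```python
import numpy as np

def attention_with_votes(q, keys, values, votes):
    """Single-query attention over a merged cache whose slots carry vote counts."""
    scores = votes * np.exp(keys @ q / np.sqrt(q.shape[-1]))
    return (scores[:, None] * values).sum(axis=0) / scores.sum()

d, n = 64, 32
rng = np.random.default_rng(0)
q, keys, values = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
votes = np.ones(n)

# Merge entry j into entry i (in practice i, j would come from a similarity search).
i, j = 3, 4
w = votes[[i, j]] / votes[[i, j]].sum()
keys[i], values[i] = w @ keys[[i, j]], w @ values[[i, j]]
votes[i] += votes[j]
keys, values, votes = np.delete(keys, j, 0), np.delete(values, j, 0), np.delete(votes, j)

print(attention_with_votes(q, keys, values, votes)[:4])
```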

3. Cache Organization and Reuse Strategies

Cross-Layer and Depth Sharing

MiniCache and KVSharer target cross-layer redundancy: in middle and deep layers of transformers, KV states of adjacent layers are often highly similar. MiniCache merges such states after decomposing them into magnitude and direction (via the ℓ2 norm and spherical interpolation, SLERP), with a token retention threshold to avoid semantically divergent merges (Liu et al., 23 May 2024). KVSharer further discovers that sharing dissimilar (rather than similar) KV caches across layers can improve compression with minimal accuracy loss, challenging conventional sharing intuition (Yang et al., 24 Oct 2024).
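
The sketch below illustrates the magnitude/direction decomposition with SLERP for merging the KV states of two adjacent layers; it is a simplified rendering of the MiniCache idea and omits the paper's token-retention threshold:

```python
import numpy as np

def slerp(u, v, t=0.5, eps=1e-8):
    """Spherical linear interpolation between the directions of u and v."""
    u = u / (np.linalg.norm(u) + eps)
    v = v / (np.linalg.norm(v) + eps)
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if omega < eps:                               # nearly parallel: plain interpolation
        return (1 - t) * u + t * v
    return (np.sin((1 - t) * omega) * u + np.sin(t * omega) * v) / np.sin(omega)

def merge_adjacent_layers(kv_a, kv_b, t=0.5):
    """Share one SLERP direction per token across two layers, keeping each layer's magnitude."""
    dirs = np.stack([slerp(a, b, t) for a, b in zip(kv_a, kv_b)])
    return dirs, np.linalg.norm(kv_a, axis=-1), np.linalg.norm(kv_b, axis=-1)

# Toy usage: 8 cached tokens with 64-dim key states in two adjacent layers.
layer_a, layer_b = np.random.randn(8, 64), np.random.randn(8, 64)
dirs, mag_a, mag_b = merge_adjacent_layers(layer_a, layer_b)
approx_a = dirs * mag_a[:, None]    # reconstruct layer a's states from the shared directions
```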

Structural Compression and Tree-Based Eviction

TreeKV employs a tree-structured cache with "sparse left, dense right" topology, supported by discrete wavelet analysis. Its cyclic eviction scopes sweep sequentially, evicting the lower-weight token among adjacent pairs. This ensures smooth and balanced retention across contexts, allowing small, fixed-size caches to maintain both global context and recency, generalizing pre-trained models to much longer contexts (up to 16x cache reduction) while outperforming baselines in perplexity and QA tasks (He et al., 9 Jan 2025).
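
A simplified sketch of the cyclic pairwise eviction rule is shown below; the tree topology and wavelet-based token weights of TreeKV are not reproduced, and the weight here is just a placeholder score such as accumulated attention:

```python
import random

class CyclicPairEvictionCache:
    """Fixed-budget cache that, when full, advances a cyclic cursor over adjacent
    entry pairs and evicts the lower-weight member of the current pair."""

    def __init__(self, budget):
        self.budget = budget
        self.entries = []        # list of (token_id, weight, kv_state), oldest first
        self.cursor = 0

    def add(self, token_id, weight, kv_state=None):
        if len(self.entries) >= self.budget:
            self._evict_one()
        self.entries.append((token_id, weight, kv_state))

    def _evict_one(self):
        i = self.cursor % (len(self.entries) - 1)
        victim = i if self.entries[i][1] <= self.entries[i + 1][1] else i + 1
        del self.entries[victim]
        self.cursor = i + 1      # next eviction examines the following pair

# Stream 100 tokens through a 16-slot cache, using a random score as the weight.
cache = CyclicPairEvictionCache(budget=16)
for t in range(100):
    cache.add(token_id=t, weight=random.random())
print([token for token, _, _ in cache.entries])
```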

Semantic and Workflow-Aware Strategies

SentenceKV introduces a semantically grouped caching framework: during prefill, tokens are bucketed into sentence-level units, each represented by a mean key vector. During decoding, only the sentence buckets semantically similar to the current query are retrieved. This approach achieves stable inference latency and 30–97% reductions in memory usage across benchmarks, with high retrieval accuracy on the Needle-In-A-Haystack test (Zhu et al., 1 Apr 2025).
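
A minimal sketch of sentence-level bucketing and similarity-based retrieval is given below; the data layout and the cosine-similarity retrieval rule are illustrative assumptions rather than SentenceKV's exact implementation:

```python
import numpy as np

def build_sentence_buckets(keys, values, sentence_ids):
    """Group per-token KV entries by sentence and summarize each bucket by its mean key."""
    buckets = {}
    for sid in np.unique(sentence_ids):
        m = sentence_ids == sid
        buckets[sid] = {"mean_key": keys[m].mean(axis=0), "keys": keys[m], "values": values[m]}
    return buckets

def retrieve_relevant_kv(query, buckets, top_n=2):
    """Load only the KV entries of the buckets whose mean key is most similar to the query."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    chosen = sorted(buckets, key=lambda s: cosine(query, buckets[s]["mean_key"]), reverse=True)[:top_n]
    return (np.concatenate([buckets[s]["keys"] for s in chosen]),
            np.concatenate([buckets[s]["values"] for s in chosen]))

# Toy usage: 60 prefill tokens forming 6 sentences of 10 tokens each.
keys, values = np.random.randn(60, 64), np.random.randn(60, 64)
sentence_ids = np.repeat(np.arange(6), 10)
buckets = build_sentence_buckets(keys, values, sentence_ids)
k_sel, v_sel = retrieve_relevant_kv(np.random.randn(64), buckets, top_n=2)
print(k_sel.shape, v_sel.shape)   # (20, 64) (20, 64): only two buckets were loaded
```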

For multi-agent workflows, KVFlow employs a tree-structured cache and a steps-to-execution metric, derived from a dynamically updated Agent Step Graph, to prioritize cache entries likely to be needed soon. By combining workflow-aware eviction with proactive background prefetching, it improves over LRU by up to 2.19x in concurrent workflow settings and 1.83x in large-prompt single workflows (Pan et al., 10 Jul 2025).
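
The sketch below illustrates a steps-to-execution signal computed by breadth-first search over a hypothetical adjacency-list Agent Step Graph, with caches of far-from-execution agents evicted first; it is an interpretation of the mechanism described for KVFlow, not its implementation:

```python
from collections import deque

def steps_to_execution(step_graph, active_agents):
    """Breadth-first distance from currently active agents to every reachable agent."""
    dist = {a: 0 for a in active_agents}
    queue = deque(active_agents)
    while queue:
        node = queue.popleft()
        for nxt in step_graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def eviction_order(cached_agents, dist):
    """Evict caches of agents farthest from execution first; unreachable agents go first of all."""
    return sorted(cached_agents, key=lambda a: dist.get(a, float("inf")), reverse=True)

# Toy workflow: planner -> (researcher, coder) -> reviewer.
graph = {"planner": ["researcher", "coder"], "researcher": ["reviewer"], "coder": ["reviewer"]}
dist = steps_to_execution(graph, active_agents=["planner"])
print(eviction_order(["reviewer", "coder", "researcher"], dist))  # reviewer evicted before coder/researcher
```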

4. Hybrid and I/O-Aware Cache Management

When offloading the KV cache to the CPU becomes necessary, I/O-aware strategies are essential. KVPR analyzes the tradeoff between recomputing a portion of the KV cache on the GPU (from transferred activations X) and transferring large KV blocks directly. By formulating a linear program that finds the optimal split point per layer, KVPR overlaps computation with transfer, reduces PCIe bottlenecks, and achieves up to 35.8% latency and 46.2% throughput improvements over standard offloading (Jiang et al., 26 Nov 2024). The system is fully automated, combining profiler, scheduler, and runtime modules.
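
A brute-force sketch of the recompute-versus-transfer tradeoff is shown below. The cost model, bandwidth, and FLOP constants are illustrative assumptions, and the per-layer decision is made by exhaustive search rather than the linear program used in KVPR:

```python
def best_split(num_tokens, hidden_dim, pcie_gb_per_s=16.0, gpu_tflops=50.0, bytes_per_el=2):
    """Choose how many cached tokens to recompute on the GPU from transferred
    activations versus transferring their K/V tensors over PCIe, so that the two
    overlapping streams finish as early as possible."""
    best_s, best_time = 0, float("inf")
    for s in range(num_tokens + 1):
        # Transfer activations for the first s tokens plus K and V for the rest.
        transfer_bytes = bytes_per_el * hidden_dim * (s + 2 * (num_tokens - s))
        transfer_time = transfer_bytes / (pcie_gb_per_s * 1e9)
        # Recompute K and V projections for s tokens: roughly 2 * 2 * s * d^2 FLOPs.
        recompute_time = 4 * s * hidden_dim ** 2 / (gpu_tflops * 1e12)
        step_time = max(transfer_time, recompute_time)   # the two streams run concurrently
        if step_time < best_time:
            best_s, best_time = s, step_time
    return best_s, best_time

s, t = best_split(num_tokens=8192, hidden_dim=4096)
print(f"recompute first {s} tokens on GPU, estimated layer load time {t * 1e3:.2f} ms")
```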

Efficient distributed cache management is highlighted in the context of prefix prefill for RAG and agentic workloads. Real trace analyses show high cache reusability, predominantly sequential access with some random block lookups, and the need for dedicated metadata structures. Both (Zhu et al., 28 May 2025) and (Wang et al., 3 Jun 2025) propose hierarchical and workload-aware management, including customized eviction policies based on fitted reuse probability (using exponential distributions on per-workload categories) and block offset. These achieve near-optimal hit ratios with moderate cache sizes (close to GPU HBM scale), lowering time-to-first-token (TTFT) by up to 41.4% compared with LRU or LFU.

5. Multi-Tenant and Contextual KV Cache Reuse

KV reuse strategies leverage common prefixes or semantically similar prompts across users, which is especially beneficial in cloud or large batch settings. KVLink formalizes this for RAG scenarios: document-level KV caches are precomputed and adjusted at inference by recalculating positional encoding to fit the global context and introducing trainable inter-document link tokens. Question answering accuracy improves by 4% and TTFT is reduced by as much as 96% (Yang et al., 21 Feb 2025). KVShare extends this to multi-tenant LLM deployments. It uses a Dual-Stage High Deviation semantic alignment algorithm to select and edit cached KV pairs from past requests, replacing only what differs (with placeholder tokens recomputed as needed). Relative to full recomputation, KVShare yields up to 9.39x faster TTFT and 1.2x throughput improvements while maintaining high accuracy (Yang et al., 17 Mar 2025).
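
A core ingredient of such document-cache reuse is adjusting positional information without recomputing the keys. The sketch below shows how RoPE-encoded keys can be re-rotated to new global positions, assuming standard rotary embeddings; this mirrors the position-adjustment step described for KVLink, but the code itself is an illustrative reconstruction:

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embedding: pair dims (2i, 2i+1) rotate by positions * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)              # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]           # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def shift_cached_keys(cached_keys, old_positions, new_positions):
    """Re-rotate a precomputed document's RoPE'd keys to new global positions.
    Value states need no adjustment, since RoPE is applied only to queries and keys."""
    return rope_rotate(cached_keys, new_positions - old_positions)

# Toy check: rotating keys encoded at positions 0..9 by +100 matches encoding them at 100..109.
raw = np.random.randn(10, 64)
old = np.arange(10, dtype=float)
k_cached = rope_rotate(raw, old)
k_shifted = shift_cached_keys(k_cached, old, old + 100)
k_direct = rope_rotate(raw, old + 100)
print(np.allclose(k_shifted, k_direct))   # True: same-frequency rotations compose additively
```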

ECHO-LLaMA demonstrates that for pretraining and finetuning, sharing KV caches across higher layers can reduce computational cost and memory by up to 77% during training and yield up to 7% higher inference throughput and up to 14% lower loss, with minimal or positive effect on final language performance (Dialameh et al., 22 May 2025).

6. Impact, Limitations, and Future Directions

Efficient KV caching techniques collectively enable LLM deployment with dramatically lower memory footprints, higher throughput, and reduced latency, even when scaling to million-token contexts or heavy multi-tenant cloud loads. The techniques are orthogonal and composable: quantization/pruning may be combined with hybrid I/O management and reuse-aware scheduling.

Potential limitations include sensitivity in output consistency for naive merging or pruning (addressed by mechanisms such as KeepKV’s ZIP-Merging (Tian et al., 14 Apr 2025)), the need for workload-dependent parameter calibration, and the challenge of predictive metadata and prefetching under nonstationary or adversarial request patterns.

A promising direction is task-adaptive, runtime-dynamic strategies, as seen in DynamicKV (Zhou et al., 19 Dec 2024), and workload-aware eviction policies that combine signals from temporal locality, spatial offsets, and semantic similarity (Wang et al., 3 Jun 2025). The blending of semantic grouping (e.g., SentenceKV) with per-layer and per-token quantization/pruning, and the integration of workflow semantics into cache management (e.g., in KVFlow), anticipate the next phase of scalable and adaptive infrastructure for high-throughput LLM inference.

7. Summary Table: Prominent Efficient KV Caching Approaches (2024–2025)

| Method/Framework | Main Innovation | Typical Compression Ratio | Throughput/Latency Gains | Key Features |
|---|---|---|---|---|
| Coupled Quantization (Zhang et al., 7 May 2024) | Multi-channel codebooks, joint entropy | ≤1 bit/channel | Maintains baseline accuracy | Fisher-guided k-means centroids, dense quantization |
| ZipCache (He et al., 23 May 2024) | Channel-separable, normalized attention | ~5× | Latency −57%, memory −20% | Saliency-based adaptive quantization; FlashAttention compatible |
| MiniCache (Liu et al., 23 May 2024) | Cross-layer (depth) merging | ~5× | 5× throughput, −41% memory | SLERP interpolation, token retention threshold |
| RazorAttention (Tang et al., 22 Jul 2024) | Retrieval-head aware, compensation token | >70% reduction | Negligible accuracy loss | Plug-and-play, headwise cache, FlashAttention compatible |
| KVSharer (Yang et al., 24 Oct 2024) | Dissimilar layer-wise sharing | ~30% KV reduction | ≥1.3× acceleration | Complementary to intra-layer methods |
| KVPR (Jiang et al., 26 Nov 2024) | I/O-aware partial recomputation | Offload-adaptive | −36% latency, +46% throughput | Linear programming scheduler, recompute+transfer overlap |
| LeanKV (Zhang et al., 4 Dec 2024) | Heterogeneous KV precision, dynamic per-head adaptation | 3–5× (up to 11×) | Up to 6.9× throughput | Unified quantization+pruning; page coalescing allocator |
| TreeKV (He et al., 9 Jan 2025) | Tree-structured, wavelet-motivated eviction | 16× | — | Cyclic eviction, supports generation/prefill, generalizes to longer contexts |
| AQUA-KV (Shutova et al., 31 Jan 2025) | Layer-predictive residual quantization | 2–2.5 bits/value | Near-lossless, fast calibration | One-shot, works with existing quantizer backends |
| KVLink (Yang et al., 21 Feb 2025) | Precomputed document caches + position/link adjustment | — | −96% TTFT | Re-rotated positions, link tokens, supports compression |
| KVShare (Yang et al., 17 Mar 2025) | Cross-request edit+reuse, DELTA Tree | 60–90% token reuse | Up to 9.39× faster TTFT, 1.2× throughput | Dual-Stage High Deviation alignment, PartialAttention, scheduler |
| SentenceKV (Zhu et al., 1 Apr 2025) | Sentence-level semantic grouping | 30–97% memory reduction | Stable latency (256K context) | Mean-key vectors, semantic retrieval, token-budgeted load |
| KeepKV (Tian et al., 14 Apr 2025) | Zero-perturbation merging, electoral votes | ~10× memory reduction | >2× throughput | Attention-preserving merge, strict output consistency |
| ECHO-LLaMA (Dialameh et al., 22 May 2025) | Training-side layer cache sharing | Up to 50% compute saved | +7% inference throughput, −14% training loss | Gradual adaptation, competitive downstream metrics |
| KVFlow (Pan et al., 10 Jul 2025) | Agent Step Graph workflow caching | — | 1.83–2.19× speedup in agent workflows | Steps-to-execution, prefetching, fine-grained eviction |

These developments collectively chart the landscape of efficient KV caching research, each addressing distinct aspects of memory, bandwidth, workload patterns, and inference modality in LLM deployment.
