
KV-Cache in Transformer Models

Updated 18 March 2026
  • KV-cache is a storage mechanism in transformer models that caches key and value vectors to streamline autoregressive inference.
  • It minimizes redundant computations by reusing cached contextual representations, enhancing performance in NLP, vision, and multimodal tasks.
  • Research focuses on adaptive retention, quantization, and redundancy pruning to tackle memory and bandwidth bottlenecks in large-scale models.

A key-value cache (“KV-cache”) in the context of transformer-based deep learning models refers to the data structure that stores the “key” and “value” vectors produced at each layer and timestep during autoregressive inference. This cache enables efficient deployments of large-scale models for tasks including natural language generation, vision-language modeling, multimodal reasoning, and video understanding, by preventing redundant recomputation of historical contextual representations. As model context lengths and batch sizes grow, the KV-cache rapidly becomes the dominant source of memory consumption and inference bandwidth, motivating a diverse ecosystem of algorithmic and systems-level approaches to compress, allocate, and repurpose KV-caches.

1. KV-cache Fundamentals and Core Bottlenecks

In transformer models, for each decoding step or token position t, each attention layer computes a query vector Q_t, a key vector K_t, and a value vector V_t from the hidden state h_{t-1}. For autoregressive generation, every new Q_t attends to all previous keys K_{1:t} and values V_{1:t}. Rather than recomputing all past projections at every step, the KV-cache maintains these key and value vectors in memory, typically as tensors of shape (L × T × d), where L is the number of layers, T the number of tokens, and d the hidden (or per-head) dimension.
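The mechanism above can be sketched in a few lines of NumPy. The single-head layout, dictionary-of-lists cache, and function names here are illustrative, not any particular framework's API:

```python
import numpy as np

def decode_step(q_t, k_t, v_t, cache):
    """One autoregressive step: append (k_t, v_t) to the cache, then
    attend q_t over all cached keys/values. All vectors have size d."""
    cache["K"].append(k_t)
    cache["V"].append(v_t)
    K = np.stack(cache["K"])           # (t, d): all keys so far
    V = np.stack(cache["V"])           # (t, d): all values so far
    d = q_t.shape[-1]
    scores = K @ q_t / np.sqrt(d)      # (t,) scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over cached positions
    return weights @ V                 # (d,) attention output

rng = np.random.default_rng(0)
cache = {"K": [], "V": []}
d = 8
for _ in range(4):                     # four decode steps
    q, k, v = rng.standard_normal((3, d))
    out = decode_step(q, k, v, cache)
assert len(cache["K"]) == 4 and out.shape == (d,)
```

Only the new token's projections are computed per step; everything historical is read from the cache, which is exactly what makes the cache the bandwidth bottleneck as t grows.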

The memory and communication overhead of the KV-cache grows linearly with L, T, and d, often exceeding the model weights for long-context and/or high-resolution (e.g., video) workloads. Particularly in vision-LLMs (VLMs), large prompts with thousands of visual tokens result in KV-caches spanning hundreds of GB on modern hardware (Tu et al., 2024, Tao et al., 20 Mar 2025). The computational cost of attention also rises, as each decoding step must aggregate over this growing cache.
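The linear growth is easy to quantify with a back-of-the-envelope calculator. The model configuration below is a hypothetical 7B-class setup chosen for illustration, not a measured figure from any cited paper:

```python
def kv_cache_bytes(n_layers, n_tokens, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers keys AND values; fp16/bf16 = 2 bytes per element.
    return 2 * n_layers * n_tokens * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 7B-class config: 32 layers, 32 KV heads of dim 128,
# a 32k-token context, fp16 storage:
gib = kv_cache_bytes(32, 32_768, 32, 128, 2) / 2**30
print(f"{gib:.0f} GiB per sequence")   # → 16 GiB per sequence
```

At batch size 8 the same arithmetic gives 128 GiB, which is why grouped-query attention (fewer KV heads) and the compression methods below matter in practice.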

2. Adaptive Retention and Allocation Strategies

Adaptive retention strategies address the observation that the importance of cached tokens is highly non-uniform across layers, heads, input modalities, or even across specific tasks. Uniform strategies—keeping the same number or fraction of cached tokens per layer—are generally suboptimal. Recent research emphasizes several principles:

  • Layer-wise heterogeneity: Lorenz curves and Gini coefficients show radically different importance concentration patterns in different layers (“concentrated” versus “dispersed”) (Wang et al., 2024). Approaches such as PrefixKV perform a layer-wise search for the optimal prefix (retention) configuration under a global memory budget, adaptively maximizing cumulative attention mass in each layer via binary search over a global importance threshold.
  • Task-adaptivity: Task-KV and DynamicKV dynamically adjust per-head or per-layer retention based on semantic diversity, activation patterns, or specific task requirements (He et al., 25 Jan 2025, Zhou et al., 2024). For example, Task-KV identifies “heterogeneous” heads (far from the semantic center) that contribute disproportionately to some tasks and assigns them larger budgets, while compressing “non-heterogeneous” heads more aggressively. DynamicKV periodically reallocates retention based on observed layer-wise attention distributions, robustly tracking shifting demands across QA, summarization, or code tasks.
  • Modal and structural awareness: In VLMs, visual tokens and text tokens exhibit different attention and sparsity profiles (Tu et al., 2024). VL-Cache distributes retention adaptively across layers and modalities, measuring attention after the vision prompt and prioritizing tokens critical for downstream language decoding.
  • Windowing and memory-efficient policies: GUI-KV, AMS-KV, and similar approaches exploit application-specific statistical structure—such as strong local scale dependence in multi-scale image transformers or high GUI frame redundancy—to prune redundant cache entries (e.g., by retaining only a short local window, a set of “condensed” global scales, or the most non-redundant keys in a video sequence) (Xu et al., 20 Nov 2025, Huang et al., 1 Oct 2025).
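The layer-wise budgeting idea behind approaches like PrefixKV can be sketched as a binary search over a single importance threshold shared across layers; the random scores and the exact loop below are simplified illustrations, not the paper's procedure:

```python
import numpy as np

def allocate_retention(importance_per_layer, global_budget):
    """Choose how many tokens each layer keeps so the total stays within
    global_budget, by binary-searching one importance threshold shared
    across layers (in the spirit of PrefixKV; simplified sketch)."""
    lo, hi = 0.0, max(s.max() for s in importance_per_layer)
    for _ in range(50):                # bisect the threshold
        mid = (lo + hi) / 2
        kept = sum(int((s >= mid).sum()) for s in importance_per_layer)
        if kept > global_budget:
            lo = mid                   # threshold too low: keeping too many
        else:
            hi = mid                   # feasible: try keeping more
    return [int((s >= hi).sum()) for s in importance_per_layer]

rng = np.random.default_rng(1)
scores = [rng.random(100) for _ in range(4)]   # per-layer token importance
budget = 120                                    # total tokens to keep
alloc = allocate_retention(scores, budget)
assert sum(alloc) <= budget
```

Layers with "concentrated" importance naturally end up keeping fewer tokens under the shared threshold, matching the layer-wise heterogeneity observation above.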

3. Quantization and Low-Rank Compression

Quantization and linear dimensionality reduction attack the KV-cache memory bottleneck by reducing the bit-width and effective rank of stored vectors:

  • Aggressive quantization: Methods such as Coupled Quantization (CQ) take advantage of the mutual information among key/value channels to enable quantization to as little as 1 bit per channel, outperforming prior independent per-channel approaches—especially when channels are “coupled” using block-based codebooks initialized by k-means clustering (Zhang et al., 2024). VidKV pushes video model cache quantization further with mixed-precision, frequency-domain-aware quantizers, and finds that per-channel quantization of value caches is more robust than per-token quantization (Tao et al., 20 Mar 2025).
  • Low-rank and transform coding: PCA- or SVD-based schemes factor the cache, or the key/value projections, into low-rank subspaces, retaining only the most information-rich components. Variants include KQ-SVD (optimal low-rank approximation of the attention matrix) (Lesens et al., 5 Dec 2025), ReCalKV's grouped SVD with head reordering and offline calibration (Yan et al., 30 May 2025), and transform-coding pipelines such as KVTC (PCA decorrelation + quantization + entropy coding) enabling up to 20–40× near-lossless cache compression (Staniszewski et al., 3 Nov 2025). These methods can be combined with postprocessing (e.g., matrix fusion, fused output projections) to avoid any runtime inference overhead.
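A generic per-channel quantizer illustrates the channel-wise schemes discussed above; the symmetric 4-bit design here is a common baseline sketch, not any single paper's method:

```python
import numpy as np

def quantize_per_channel(x, bits=4):
    """Symmetric per-channel quantization of a (tokens, channels) KV
    tensor: one scale per channel, so outlier channels do not blow up
    the error of the others (generic sketch)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=0, keepdims=True) / qmax   # (1, channels)
    scale = np.where(scale == 0, 1.0, scale)              # guard dead channels
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
K = rng.standard_normal((64, 16)).astype(np.float32)
q, s = quantize_per_channel(K, bits=4)
err = np.abs(dequantize(q, s) - K).max()
assert err <= s.max() / 2 + 1e-6   # rounding error bounded per channel
```

Per-token quantization is the same code with `axis=1`; as noted above for VidKV, the better axis can differ between key and value caches.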

4. Redundancy and Importance-Aware Pruning

Token-wise redundancy and semantic similarity are pervasive in long-chain reasoning, video, and GUI contexts. Efficient cache management thus increasingly leverages:

  • Redundancy scoring: R-KV compresses chain-of-thought caches by scoring tokens using both their attention-based importance and their redundancy (cosine similarity to others), achieving 10× reduction with nearly no quality loss in math reasoning LLMs (Cai et al., 30 May 2025). DeltaKV encodes only the residual of each token relative to retrieved historical references, leveraging global similarity and shared latent structure for further compression (Hao et al., 8 Feb 2026).
  • Frequency-domain and “outlier” detection: FlashCache applies a DCT-based low-pass filter to KV tensors, identifies tokens whose energy deviates from the principal (low-frequency) trend as “outliers,” and designates these for retention, all without attention-matrix recomputation (natively compatible with FlashAttention) (Yang et al., 20 Nov 2025).
  • Saliency mixing: GUI-KV blends attention-score and hidden-state norm signals to prioritize visually and semantically important tokens (spatial saliency), while projecting older frames into the present frame’s key subspace to erase redundant history (temporal redundancy scoring) (Huang et al., 1 Oct 2025).
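The importance-versus-redundancy trade-off can be sketched as a greedy selection loop, in the spirit of R-KV's scoring; the mixing weight `alpha` and the greedy formulation are illustrative assumptions, not the published algorithm:

```python
import numpy as np

def redundancy_aware_select(keys, importance, keep, alpha=0.5):
    """Greedily keep `keep` tokens, trading attention importance against
    cosine redundancy with tokens already kept (illustrative sketch)."""
    normed = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    kept, max_sim = [], np.zeros(len(keys))
    for _ in range(keep):
        # High importance is rewarded; similarity to kept tokens is penalized.
        score = alpha * importance - (1 - alpha) * max_sim
        score[kept] = -np.inf                 # never pick a token twice
        i = int(np.argmax(score))
        kept.append(i)
        max_sim = np.maximum(max_sim, normed @ normed[i])
    return sorted(kept)

rng = np.random.default_rng(3)
keys = rng.standard_normal((32, 8))
imp = rng.random(32)
sel = redundancy_aware_select(keys, imp, keep=8)
assert len(set(sel)) == 8
```

The penalty term is what distinguishes this family from pure attention-score eviction: two near-duplicate tokens cannot both spend budget, however important each looks in isolation.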

5. Systems Co-Design and Efficient Serving

Resource constraints and deployment realities motivate KV-cache management mechanisms at the system level:

  • Fused/accelerated kernels: Systems like KVComp fuse block-wise quantization, shared-memory Huffman encoding, and attention mat-vec computation to minimize memory traffic and deliver up to 6.7× cache compression, often matching or surpassing standard cuBLAS performance on large-context workloads (Jiang et al., 30 Aug 2025).
  • Efficient cache loading and offloading: Prefix caching enables cross-turn KV reuse but can introduce I/O bottlenecks. The Cake loader dynamically races between on-GPU recomputation and storage I/O to minimize time-to-first-token, shifting the merge point adaptively with no manual tuning (Jin et al., 2024).
  • Modular memory management: Engines like Sparse-vLLM decouple logical and physical cache layout, facilitating irregular, hybrid, or cross-layer cache allocations in response to model- or task-specific needs (Hao et al., 8 Feb 2026).
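The decoupling of logical and physical cache layout can be illustrated with a minimal block table; this is a generic paged-layout sketch under assumed names (`PagedKVCache`, `append_token`), not any specific engine's implementation:

```python
class PagedKVCache:
    """Logical token positions map to fixed-size physical blocks, so a
    sequence can grow, be pruned, or be freed without contiguous
    reallocation (generic sketch of a paged layout)."""
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = {}            # seq_id -> list of physical block ids
        self.free, self.next_id = [], 0

    def _alloc(self):
        if self.free:               # reuse a freed block first
            return self.free.pop()
        self.next_id += 1
        return self.next_id - 1

    def append_token(self, seq_id, pos):
        """Return (physical block id, slot within block) for position pos."""
        table = self.blocks.setdefault(seq_id, [])
        if pos // self.block_size >= len(table):
            table.append(self._alloc())
        return table[pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        self.free.extend(self.blocks.pop(seq_id, []))

cache = PagedKVCache(block_size=4)
slots = [cache.append_token("s0", p) for p in range(6)]
assert len(cache.blocks["s0"]) == 2   # 6 tokens -> 2 blocks of 4
cache.release("s0")
assert len(cache.free) == 2           # blocks returned to the pool
```

Because the block table is per-sequence, irregular layouts (e.g., different retention per layer or pruned middles) reduce to editing the table rather than moving tensor data.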

6. Emerging Uses: Representation Reuse and Reasoning Control

Recent findings demonstrate additional utility for the KV-cache beyond acceleration:

  • Representation reuse: The cached key and value tensors themselves encode rich contextual state; they can be "flattened" and pooled to form lightweight per-token embeddings (KV-CoE) for confidence estimation or as plug-in representations for downstream tasks, matching or exceeding dedicated embeddings in evaluations (Xing et al., 28 Jan 2026).
  • Adaptive reasoning and resource control: Fast/slow reasoning switching can be triggered by inspecting pooled KV activations to estimate instance difficulty and modulate the depth or verbosity of chain-of-thought traces (KVClassifier), significantly reducing token consumption with negligible accuracy loss (Xing et al., 28 Jan 2026).
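At its core, the representation-reuse idea reduces to pooling cached tensors into per-token vectors. The mean-pool-and-concatenate choice below is a simplified assumption for illustration; the actual flattening/pooling in the cited work may differ:

```python
import numpy as np

def kv_embedding(k_cache, v_cache):
    """Pool cached keys and values across layers into one embedding per
    token (simplified sketch of the representation-reuse idea).
    k_cache, v_cache: arrays of shape (layers, tokens, dim)."""
    pooled_k = k_cache.mean(axis=0)            # (tokens, dim)
    pooled_v = v_cache.mean(axis=0)            # (tokens, dim)
    return np.concatenate([pooled_k, pooled_v], axis=-1)  # (tokens, 2*dim)

rng = np.random.default_rng(4)
K = rng.standard_normal((4, 10, 8))            # 4 layers, 10 tokens, dim 8
V = rng.standard_normal((4, 10, 8))
emb = kv_embedding(K, V)
assert emb.shape == (10, 16)
```

The appeal is that these vectors come for free at inference time: no extra forward pass is needed to obtain a contextual representation of every generated token.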

A table below summarizes typical memory reductions and accuracy tradeoffs achieved by leading approaches:

Method/Reference                        | Typical Memory Reduction | Accuracy Degradation (Task/Model)
PrefixKV (Wang et al., 2024)            | 80% (20% retention)      | ~0.1–0.3 PPL (LLaVA, Qwen)
R-KV (Cai et al., 30 May 2025)          | 90% (10% retention)      | <1% (MATH-500, AIME)
VidKV (Tao et al., 20 Mar 2025)         | 5–6× (1.5–1.66 bits)     | 1–2% (Video-LLMs, GPT-score)
DynamicKV (Zhou et al., 2024)           | 98% (1.7% retention)     | ~10–15% (LongBench, varies)
KeepKV (Tian et al., 14 Apr 2025)       | 90% (10% retention)      | <2 points (ROUGE, QA, LongBench)
DeltaKV (Hao et al., 8 Feb 2026)        | 71% (29% retention)      | <0.5 absolute (LongBench)
KVTC (Staniszewski et al., 3 Nov 2025)  | 20–40×                   | ±1 point (GSM8K, MMLU, Code)
GUI-KV (Huang et al., 1 Oct 2025)       | 38.9% decoding FLOPs     | −4.1% to +4.1% (UI tasks)
ReCalKV (Yan et al., 30 May 2025)       | 50–70%                   | <2–5 points (QA, LongBench)

7. Limitations and Open Challenges

Despite dramatic progress, the field faces persistent challenges:

  • Information loss at extreme compression: For budgets below 1–2%, even adaptive schemes exhibit sharp degradation for tasks requiring preservation of long-range or rare context. Frequency-domain and redundancy-aware metrics soften but do not eliminate this effect (Yang et al., 20 Nov 2025).
  • Integration with efficient attention kernels: Many importance scoring methods require explicit attention matrices, which are unavailable in FlashAttention/SparseAttention. Approaches that leverage low-layer proxy scores, value-norms, or frequency-domain statistics (FlashCache, PureKV) provide a way forward (Jiang et al., 29 Oct 2025, Yang et al., 20 Nov 2025).
  • Task and prompt generality: Static compression/quantization parameters or retention allocations may underperform on out-of-distribution prompts or highly domain-specific tasks. Dynamic, feedback-driven, or learned adaptive policies (DynamicKV, Task-KV) offer improved generalization but can incur tuning and system complexity (Zhou et al., 2024, He et al., 25 Jan 2025).
  • Fusion and composability: Combining quantization, pruning, low-rank approximation, and cache reuse in a single pipeline, while avoiding compounded accuracy loss or interaction artifacts, is a major engineering and theoretical question.
  • Scalability and deployment: Serving engines must support heterogeneous, dynamic, and non-contiguous cache layouts, fuse compression with attention, and adapt automatically to resource constraints—necessitating modular and hardware-aware design (Hao et al., 8 Feb 2026, Jiang et al., 30 Aug 2025).

In conclusion, KV-cache management is now a core subfield of efficient transformer inference. It encompasses a growing set of principled, theoretically justified, and empirically validated algorithms attacking memory, bandwidth, and accuracy bottlenecks across modalities and applications. The ongoing interplay between model design, compression theory, and high-performance systems continues to drive rapid advances, as summarized in the cited work above.

