
KV Cache Growth Control in Transformers

Updated 27 March 2026
  • KV cache growth control is a system of algorithms and strategies that manage the space complexity of key-value caches in Transformer-based models.
  • It employs methods such as token-selective retention, sliding window, and reinforcement learning to balance memory usage and throughput during autoregressive decoding.
  • Advanced techniques including quantization, compression, and hybrid block-wise approaches enable efficient handling of long-context generation in LLMs and multimodal models.

KV cache growth control refers to the set of algorithms, mechanisms, and system strategies designed to bound, compress, or dynamically manage the space complexity of the key-value (KV) cache in Transformer-based models. As context lengths scale into tens or hundreds of thousands of tokens for language, vision, and multimodal models, the linear accumulation of KV pairs during autoregressive decoding becomes the dominant memory and computational bottleneck. Effective KV cache growth control is essential for practical long-context generation, throughput optimization, and resource-efficient inference across modern large models. The following sections present major paradigms, methodologies, theoretical tools, system implementations, and empirical results in KV cache growth control as established by the recent literature.

1. Growth Bottlenecks and Theoretical Framework

The KV cache in Transformer architectures accumulates a new set of key and value vectors per token, per head, and per layer, scaling memory usage as $O(L \cdot d \cdot N)$ for a context of length $L$ (tokens), hidden size $d$, and $N$ attention heads or layers. In unified autoregressive video models, a single 48-frame $384 \times 672$ video can generate $>50$K tokens, roughly an order of magnitude above the typical training window of 4K–8K, so end-to-end inference is dominated by attention over a massive KV cache (Li et al., 7 Jan 2026). For LLMs, this linear growth rapidly outpaces the parameter size and, if uncontrolled, results in either out-of-memory errors or severe throughput degradation (Liu et al., 8 Aug 2025, Li et al., 7 Jan 2026, Kampeas et al., 6 Jan 2026).

Memory scaling formulas:

$$M_{\mathrm{orig}}(L) = L \cdot d \cdot N \cdot B$$

($B$ = bytes per element; e.g., FP16 $B=2$, FP32 $B=4$)

Decoding performance depends on both the raw memory footprint and the memory bandwidth required to load increasingly large K/V matrices at each generation step.
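As a rough illustration of the memory formula above, the sketch below evaluates $L \cdot d \cdot N \cdot B$ for one configuration. The function name `kv_cache_bytes`, the `kv_factor` argument, and the example values ($d = 4096$, $N = 32$) are assumptions made for illustration and do not come from any cited paper.

```python
def kv_cache_bytes(L: int, d: int, N: int, B: int, kv_factor: int = 1) -> int:
    """Evaluate M_orig(L) = L * d * N * B from the text.

    kv_factor is a hypothetical extra knob: set it to 2 if keys and
    values should be counted as two separate tensors of the same size.
    """
    return L * d * N * B * kv_factor


# Hypothetical configuration: 50K tokens, hidden size 4096, N = 32.
L, d, N = 50_000, 4096, 32

fp16_bytes = kv_cache_bytes(L, d, N, B=2)
fp32_bytes = kv_cache_bytes(L, d, N, B=4)

print(f"FP16: {fp16_bytes / 2**30:.1f} GiB")
print(f"FP32: {fp32_bytes / 2**30:.1f} GiB")

# The footprint is linear in L: doubling the context doubles the cache.
assert kv_cache_bytes(2 * L, d, N, B=2) == 2 * fp16_bytes
```

Under these assumed values the FP16 cache already exceeds 12 GiB, which is why bounding $L$ (or the bytes stored per entry) is the main lever discussed in the following sections.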

2. Token-Selective and Budgeted Retention Strategies

A core class of methods enforces a global or per-layer hard budget—either as a total token count or per-interval allocation—via token selection and importance-based eviction. These may be static, requiring no model or parameter updates, or adaptive, depending on runtime statistics:

  • Heavy-hitter tracking: Methods such as H2O and SnapKV-D maintain cumulative attention scores per token and greedily evict those receiving the least attention (Liu et al., 12 Dec 2025). SnapKV-D generalizes the SnapKV prefill scheme to long decoding: every $w$ tokens, it selects cache entries with the highest cumulative windowed attention, preserving the task-critical "heavy hitters." These methods are dominant for reasoning tasks, with SnapKV-D and H2O outperforming alternatives across GSM8K, MATH500, and similar benchmarks for cache budgets $B \geq 256$ (Liu et al., 12 Dec 2025).
  • Sliding window: The EvictOldest/FIFO strategy keeps only the most recent $N_{\max}$ tokens, ensuring contiguous cache layout and preserving positional fidelity, at the cost of discarding long-term context (Poudel, 23 Oct 2025).
  • Hybrid block-wise and anchor retention: LASER-KV implements a protection-divisor budgeting scheme, maintaining both a global anchor region (primacy block) and local sliding window per block, with long-term recall budgeted via Exact-LSH selection. This avoids the positional disruption and semantic loss typical in pure recency or attention-based methods, ensuring stable recall at 128k context (Sood et al., 2 Feb 2026).
  • RL-based adaptive eviction: KV Policy (KVP) agents are trained (offline) via reinforcement learning to rank tokens by estimated future utility, achieving state-of-the-art adaptive budgeted eviction with low overhead and strong generalization to unseen domains (Moschella et al., 10 Feb 2026).
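The budgeted strategies above can be sketched with a single selection function. This is a minimal illustration, not the implementation of H2O, SnapKV-D, or any other cited method: the function name `select_keep_indices` and its parameters `budget` and `recent` are hypothetical, and the cumulative attention scores are assumed to be tracked elsewhere. Setting `recent` equal to `budget` reduces the rule to a pure sliding window.

```python
import numpy as np


def select_keep_indices(cum_attn: np.ndarray, budget: int, recent: int) -> np.ndarray:
    """Pick which cached token positions to keep under a hard budget.

    Keeps the `recent` newest tokens, then fills the rest of the budget
    with the older tokens that have the highest cumulative attention.
    Returned indices are sorted so the kept cache preserves token order.
    """
    n = len(cum_attn)
    if n <= budget:
        return np.arange(n)  # nothing to evict yet
    recent = min(recent, budget)
    recent_idx = np.arange(n - recent, n)
    n_heavy = budget - recent
    older_scores = cum_attn[: n - recent]
    # Ascending argsort; the last n_heavy entries are the heavy hitters.
    order = np.argsort(older_scores, kind="stable")
    heavy_idx = order[len(order) - n_heavy:]
    return np.sort(np.concatenate([heavy_idx, recent_idx]))


# Toy scores for 8 cached tokens (made-up numbers).
scores = np.array([5.0, 0.1, 3.0, 0.2, 0.3, 0.4, 1.0, 0.5])

print(select_keep_indices(scores, budget=4, recent=2))  # heavy hitters + recency
print(select_keep_indices(scores, budget=4, recent=4))  # pure sliding window
```

With these toy scores, the mixed policy keeps positions 0 and 2 (high cumulative attention) alongside the two newest tokens, whereas the pure sliding window keeps only positions 4 through 7 and drops the early high-attention tokens.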

3. Quantization, Compression, and Merging

To address dimensionality and redundancy, several studies introduce lossy and lossless compression layers, quantization, or KV merging schemes:

  • Quantization: PackKV applies per-token, low-bit quantization and bit-packing for both K and V, reducing memory usage by an average of $15.3\times$ (K) and $18.7\times$ (V), while fusing decompression with attention mat-vec computation, resulting in $+75$–$170\%$ throughput (Jiang et al., 30 Dec 2025). VQKV applies vector quantization through codebooks, representing each vector by integer indices and reconstructing at decode time, yielding $>80\%$ memory savings and $>98\%$ accuracy retention (Wang et al., 17 Mar 2026).
  • Dimensional compression and KV reuse: KV-CAR compresses K/V via per-layer autoencoders and provides head-wise reuse, storing only distinct representations when similarity exceeds a high threshold, reducing total cache memory by up to $48\%$ on benchmark LLMs (Roy et al., 7 Dec 2025).
  • Merging with compensation: KeepKV merges less-important cache entries (as determined via similarity) into preserved entries, recording electoral votes and mathematically adjusting attention scoring to achieve zero perturbation at the current step. This outperforms naive merging or pruning, maintaining generation quality even at $10\%$ cache budgets with $>2\times$ throughput gain (Tian et al., 14 Apr 2025). ZSMerge merges residual tokens into fixed slots, compensating attention mass and demonstrating $20:1$ compression with minimal impact on quality or throughput even at $54$k context in LLaMA2-7B (Liu et al., 13 Mar 2025).
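To make the idea of per-token low-bit quantization concrete, here is a minimal sketch of a generic asymmetric uniform quantizer with one scale and offset per token row. It is not PackKV's algorithm: it omits bit-packing and kernel fusion, stores one code per byte, and the names `quantize_per_token` and `dequantize` are hypothetical.

```python
import numpy as np


def quantize_per_token(x: np.ndarray, bits: int = 4):
    """Uniformly quantize each row (token) of x to `bits` bits (bits <= 8).

    Returns integer codes plus the per-token scale and offset needed
    to reconstruct an approximation of x.
    """
    levels = 2**bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    # Guard against constant rows, where hi == lo would give scale 0.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo


def dequantize(codes: np.ndarray, scale: np.ndarray, lo: np.ndarray) -> np.ndarray:
    """Reconstruct approximate values from codes and per-token parameters."""
    return codes.astype(np.float32) * scale + lo


rng = np.random.default_rng(0)
K = rng.standard_normal((8, 64)).astype(np.float32)  # 8 tokens, head dim 64

codes, scale, lo = quantize_per_token(K, bits=4)
K_hat = dequantize(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a step per token.
assert np.all(np.abs(K - K_hat) <= scale / 2 + 1e-5)
print("max abs error:", float(np.abs(K - K_hat).max()))
```

A real system would additionally pack two 4-bit codes into each byte and account for the per-token scale and offset when reporting compression ratios.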

4. Cross-Frame, Multi-Scale, and Modal-Specific Control

KV cache growth is particularly severe in models handling long sequences in vision and video:

  • Spatiotemporal decay and anchoring: PackCache, deployed for unified video models, allocates persistent budget quotas to "semantic anchors" (prompt/image conditions) and applies exponentially decaying budgets over previous frames, guided by empirically observed attention decay (Li et al., 7 Jan 2026). Frames are compacted or evicted according to their temporal lag, while a spatial positional rebase maintains coherent 3D RoPE structure. PackCache achieves up to $3.7\times$ acceleration and a $10\%$ reduction in memory relative to baseline in 48-frame video generation.
  • Multi-scale and layer-aware methods: AMS-KV and ScaleKV, designed for visual autoregressive transformers and next-scale prediction architectures, segment layers into "drafters" (high cache demand, broad attention) and "refiners" (low cache demand, local attention) (Xu et al., 20 Nov 2025, Li et al., 26 May 2025). Early, coarse scales are always cached (condensed scales), and budgets are dynamically assigned via cross-scale KV similarity or per-layer selectivity indices. Memory reductions reach $85\%$ and allow larger batches and higher throughput.
  • Multimodal frequency analysis: FlashCache ranks KV pairs by deviation in frequency domain (rather than attention scores), preserving "outlier" tokens with high criticality for inference. Dynamic per-layer budgets are allocated according to high-frequency "energy," yielding $80\%$ KV memory savings and $1.7\times$ decoding speedup in multimodal models (Yang et al., 20 Nov 2025).
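An exponentially decaying per-frame budget, as described for video models above, could look like the following sketch. The schedule, the function name `frame_budgets`, and the parameters `base_tokens`, `decay`, and `min_tokens` are assumptions chosen for illustration; the actual PackCache schedule is not specified here.

```python
def frame_budgets(num_prev_frames: int, base_tokens: int, decay: float,
                  min_tokens: int = 0) -> list[int]:
    """Token budget per previous frame, indexed by temporal lag.

    Lag 1 is the most recent frame and receives `base_tokens`; each
    additional frame of lag multiplies the budget by `decay`, never
    dropping below `min_tokens`.
    """
    return [
        max(min_tokens, int(base_tokens * decay ** (lag - 1)))
        for lag in range(1, num_prev_frames + 1)
    ]


# Hypothetical numbers: a fixed anchor quota plus six previous frames.
anchor_tokens = 256
budgets = frame_budgets(num_prev_frames=6, base_tokens=1024, decay=0.5, min_tokens=64)

print(budgets)
print("total cache budget:", anchor_tokens + sum(budgets))
```

Because the per-frame budgets form a geometric series, the total cache size stays bounded by roughly `base_tokens / (1 - decay)` plus the anchor quota when `min_tokens` is zero, regardless of how many frames have been generated.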

5. System-Level and Enterprise Solutions

Scaling KV cache beyond a single GPU or across distributed inference necessitates advanced memory management:

  • LMCache: This enterprise-scale caching layer manages KV storage, movement, and orchestration across GPU, CPU, and storage layers (Cheng et al., 8 Oct 2025). A first-class API exposes pinning, lookup, clear, move, and compress operations, supporting features such as watermark-based eviction, adaptive offloading, reference counting, and hybrid device placement. In combination with vLLM, LMCache achieves up to $15\times$ throughput gains and robust, steady-state memory utilization under high concurrency.
  • Joint encoding for high-concurrency serving: Joint Encoding (Fast-Fusion) fuses similar KV-cache blocks (across requests or input chunks) using high-threshold cosine similarity. Blocks with similarity above a threshold $u$ are merged, which shrinks memory by up to $4.38\times$ and substantially boosts throughput without custom hardware or kernel modifications (Kampeas et al., 6 Jan 2026).
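The thresholded cosine-similarity fusion described above can be illustrated with a simple greedy pass. This is a sketch under stated assumptions and not the Joint Encoding algorithm itself: blocks are assumed to be flattened to non-zero vectors, the first block of each group serves as its stored representative, and the names `fuse_similar_blocks`, `reps`, and `assign` are hypothetical.

```python
import numpy as np


def fuse_similar_blocks(blocks: np.ndarray, u: float):
    """Greedily map each block to an earlier, sufficiently similar block.

    blocks: array of shape (n_blocks, dim), each row a flattened KV block.
    u: cosine-similarity threshold above which two blocks share storage.
    Returns (reps, assign): indices of the stored blocks, and for every
    input block the slot in `reps` that it maps to.
    """
    unit = blocks / np.linalg.norm(blocks, axis=1, keepdims=True)
    reps: list[int] = []
    assign: list[int] = []
    for i in range(len(blocks)):
        target = None
        for slot, r in enumerate(reps):
            if float(unit[i] @ unit[r]) > u:
                target = slot  # reuse an existing stored block
                break
        if target is None:
            reps.append(i)  # no close match: store this block itself
            target = len(reps) - 1
        assign.append(target)
    return reps, assign


blocks = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.02, 1.0]])
reps, assign = fuse_similar_blocks(blocks, u=0.99)

print("stored blocks:", reps)
print("assignment:", assign)
print("compression ratio:", len(blocks) / len(reps))
```

In this toy input, four blocks collapse onto two stored representatives. The quadratic scan over representatives is only for clarity; a serving system would need an indexed or approximate similarity search.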

6. Task-Adaptive and Semantic-Aware Policies

Task and data domain critically influence optimal cache management:

  • Task-aware compression: DynamicKV adaptively redistributes per-layer budgets based on observed cross-layer activation and task-driven attention patterns, enabling as little as $0.9$–$1.7\%$ cache retention while maintaining $>85\%$ accuracy on LongBench (Zhou et al., 2024).
  • Semantic- and segment-aware compression: SABlock segments texts into linguistically coherent units, then applies adaptive, budget-driven block-size selection to minimize semantic fragmentation. This yields $99.9\%$ retrieval accuracy with only $96$ entries on Needle-in-a-Haystack and $46\%$ peak memory reduction at $128$k context (Chen et al., 26 Oct 2025).
  • Conversational context via episodic partitioning: EpiCache clusters conversation history into topic-based episodes, with block-wise prefill eviction and adaptive budget allocation according to layer sensitivity, yielding $4$–$6\times$ memory reduction and $2.4\times$ speedups in long conversational QA (Kim et al., 22 Sep 2025).
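Several of the methods above redistribute a fixed global budget across layers. One simple way to do this, shown purely as an illustration, is proportional allocation with largest-remainder rounding so that the per-layer budgets sum exactly to the total. The per-layer scores are assumed to come from some sensitivity or attention statistic, and the function name `allocate_layer_budgets` is hypothetical; none of the cited papers is claimed to use this exact rule.

```python
def allocate_layer_budgets(scores: list[float], total: int) -> list[int]:
    """Split `total` cache entries across layers in proportion to `scores`.

    Uses largest-remainder rounding: every layer gets the floor of its
    proportional share, and leftover entries go to the layers with the
    largest fractional parts.
    """
    s = sum(scores)
    raw = [total * x / s for x in scores]
    base = [int(r) for r in raw]
    leftover = total - sum(base)
    by_fraction = sorted(range(len(scores)),
                         key=lambda i: raw[i] - base[i], reverse=True)
    for i in by_fraction[:leftover]:
        base[i] += 1
    return base


# Hypothetical sensitivity scores for a 4-layer model and a 1000-entry budget.
budgets = allocate_layer_budgets([0.4, 0.3, 0.2, 0.1], total=1000)
print(budgets)
assert sum(budgets) == 1000
```

Layers with higher scores keep more entries, while the exact-sum property keeps the global memory bound intact.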

7. Preservation of Structural Constraints and Practical Guidelines

Multiple studies document the dangers of unprincipled or position-disruptive eviction:

  • Architectural context limits and positional fidelity: Accumulated KV length must not exceed the model's pretrained context window. Non-contiguous pruning (e.g., attention-score-top pruning) can scramble positional encoding (RoPE), sharply degrading output coherence, even with high retention ratios. SlidingWindowGist, which preserves initial contiguous context blocks, offers better coherence at a fraction of the memory (Poudel, 23 Oct 2025).
  • Calibration: Thresholds for memory use, retention, and block size should be set with both model and task requirements in mind, and fidelity metrics (e.g., positional disruption, attention loss) monitored for semantic drift (Poudel, 23 Oct 2025, Sood et al., 2 Feb 2026).

8. Quantitative Results and Comparative Metrics

A non-exhaustive summary of empirical performance:

| Method | Compression Ratio | Speedup | Accuracy/Fidelity Impact | Domain |
|---|---|---|---|---|
| PackCache | $1.7$–$2.2\times$ end-to-end, up to $3.7\times$ tail | up to $3.7\times$ | $<0.2$ FID | video (Li et al., 7 Jan 2026) |
| AMS-KV | $6\times$ ($85\%$ reduction) | $60.5\%$ latency | $<2\%$ FID | visual autoregressive (Xu et al., 20 Nov 2025) |
| KVCrush | $4\times$ | $<0.5\%$ latency | $<1\%$ average accuracy | LLM (Jha et al., 24 Feb 2025) |
| VQKV | $82.8\%$ memory reduction | $4.3\times$ longer context | $<1\%$ average accuracy | LLaMA (Wang et al., 17 Mar 2026) |
| SABlock | $46\%$ memory | $9.5\times$ speed (128k context) | $<1\%$ score drop | LLM (Chen et al., 26 Oct 2025) |
| Joint Encoding | $4.38\times$ | $40\%$ throughput | $<1\%$ | LLM (Kampeas et al., 6 Jan 2026) |
| FlashCache | $80\%$ memory | $1.69\times$ decode | $<0.2\%$ accuracy | multimodal (Yang et al., 20 Nov 2025) |
| EpiCache | $4$–$6\times$ | $2.4\times$ | $>90\%$ full accuracy retained | multi-turn QA (Kim et al., 22 Sep 2025) |
| DynamicKV | $58\times$ ($1.7\%$) | – | $85\%$ full accuracy | LLM (Zhou et al., 2024) |

These results indicate that effective KV cache growth control can deliver $4$–$20\times$ memory savings and $1.5$–$3\times$ throughput improvements while keeping quality within $1$–$2\%$ of full-cache baselines, provided the method is parameterized and calibrated appropriately for the task and architecture.


In summary, KV cache growth control is a critical and mature area of research spanning compaction, quantization, semantic retention, learning-based eviction, system architecture, and algorithmic co-design. The literature demonstrates a broad range of effective strategies, with trade-offs between memory, speed, and downstream task fidelity, all underpinned by rigorous empirical and analytical frameworks (Li et al., 7 Jan 2026, Xu et al., 20 Nov 2025, Jha et al., 24 Feb 2025, Wang et al., 17 Mar 2026, Liu et al., 12 Dec 2025, Poudel, 23 Oct 2025, Sood et al., 2 Feb 2026, Kampeas et al., 6 Jan 2026, Liu et al., 8 Aug 2025, Cheng et al., 8 Oct 2025, Kim et al., 22 Sep 2025, Jiang et al., 30 Dec 2025, Chen et al., 26 Oct 2025, Zhou et al., 2024, Roy et al., 7 Dec 2025, Yang et al., 20 Nov 2025, Moschella et al., 10 Feb 2026, Tian et al., 14 Apr 2025, Li et al., 26 May 2025, Liu et al., 13 Mar 2025).
