KV Cache Learning

Updated 25 February 2026

KV Cache Learning is a set of methods that adaptively manage key-value states in transformer decoders to optimize memory usage and inference efficiency.
It utilizes techniques such as RL-based head selection, global token scoring, and dynamic eviction to compress caches with minimal performance loss.
These approaches enable scalable LLM inference by balancing memory savings with high-quality reasoning and chain-of-thought generation.

Key-Value (KV) Cache Learning refers to a growing body of methods for adaptively controlling the size, retention, content, and reuse of the key-value states stored during transformer-based decoder inference. KV caches are critical for enabling efficient autoregressive generation in LLMs, but their linear growth with sequence length and batch size imposes major memory and throughput bottlenecks, especially in long-context or high-throughput inference. KV cache learning encompasses algorithmic, optimization, and learning-based strategies, often employing reinforcement learning, combinatorial optimization, and representation learning, to maximize memory savings while minimizing degradation in model performance, particularly for reasoning and chain-of-thought applications.

1. Fundamental Problem: Scaling and Bottlenecks of KV Caching

The transformer decoder caches the key and value vectors of every previously processed token so that each new token can attend over the full sequence. For a single sequence, KV cache memory grows as $\mathcal{O}(L \cdot d)$ per layer (where $L$ is sequence length and $d$ is embedding dimension), and this footprint is multiplied by the number of layers and the batch size. In large models it quickly reaches multiple gigabytes per instance, limiting batch size, context window, and even inference feasibility on current hardware. Compute also scales with length: each decoding step attends over all $L$ cached entries, so the total attention cost of generating a sequence grows quadratically with $L$. Cache overhead is particularly acute for reasoning models that generate extended chain-of-thought traces, which not only increase memory usage but also expose the inadequacy of eviction and compression heuristics that suffice for retrieval or next-token tasks (Du et al., 9 Oct 2025, Dong et al., 3 Feb 2026).
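As a rough illustration of the scaling described above, the following sketch estimates the cache footprint for a single sequence. The factor of 2 for keys and values, the per-layer and per-head breakdown, and the example configuration (32 layers, 32 KV heads of dimension 128, 2-byte elements) are illustrative assumptions, not figures taken from the cited papers.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the bytes needed to store keys and values for all layers and positions."""
    # Per token: one key and one value vector (factor 2) per KV head in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    # The cache grows linearly with both sequence length and batch size.
    return per_token * seq_len * batch_size


# Hypothetical model configuration, chosen only to make the arithmetic concrete.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=32_768)
print(f"{size / 2**30:.1f} GiB")  # prints "16.0 GiB"
```

Under these assumptions each token costs 512 KiB, so a single 32K-token sequence already occupies 16 GiB before any batching.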

2. Cache Compression and Adaptive Eviction: Algorithms and Optimization

KV cache learning addresses these bottlenecks by learning efficient compression and eviction policies that retain only those KV entries (tokens, heads, or layers) critical for downstream performance. Core methodologies include:

Head-wise Sparsification and RL-based Selection: RLKV (Du et al., 9 Oct 2025) formulates per-head cache selection as a combinatorial mask optimization, directly coupling each head’s cache retention to chain-of-thought reasoning accuracy. Policy parameters $\alpha$ (obtained by applying a sigmoid to logits $\theta$) are optimized with a clipped PPO objective to discover a sparse set of “reasoning-critical” heads that keep a full cache, while the remaining heads use a streaming cache. Keeping a full cache for only $20$–$50\%$ of the heads is reported to cost $<2\%$ in performance.
Global and Historical Importance Scoring: G-KV (Liao et al., 29 Nov 2025) computes a combined global attention score for each cached token, blending local multi-head scores with a decayed history. Periodic eviction discards the lowest-scoring tokens, retaining an adaptive window. This method outperforms local-only scores, supporting up to $96\%$ cache reduction in long contexts and scaling up batch throughput by $4$–$12\times$.
Golden Labels and Policy Learning: ForesightKV (Dong et al., 3 Feb 2026) constructs a “golden eviction trace” by tracing the true future attention paid to each cache entry, and distills this into a supervised scoring model with a pairwise ranking loss; further refinement uses a PPO-style RL objective focused on preserving prediction quality for low-entropy (certain) tokens, which are empirically crucial for multi-step reasoning.
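
The global-scoring idea in the list above can be sketched in a few lines. This is a minimal illustration, not the G-KV implementation: averaging attention over heads, the exponential `decay` factor, and the `recent_window` that is always retained are all assumptions introduced here.

```python
def local_scores(attn_by_head):
    """Average the newest query's attention over heads: one local score per cached token."""
    num_heads = len(attn_by_head)
    return [sum(column) / num_heads for column in zip(*attn_by_head)]


def update_global_scores(history, local, decay=0.9):
    """Blend each token's previous global score with its newest local score."""
    # Tokens cached since the last update have no history yet, so they start from zero.
    padded = history + [0.0] * (len(local) - len(history))
    return [decay * h + s for h, s in zip(padded, local)]


def evict(scores, budget, recent_window):
    """Return sorted indices to keep: the most recent tokens plus the top-scoring older ones."""
    n = len(scores)
    if n <= budget:
        return list(range(n))
    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))
    # Rank the older tokens by global score and keep as many as the budget still allows.
    older = sorted(range(n - recent_window), key=lambda i: scores[i], reverse=True)
    return sorted(older[: budget - recent_window] + recent)


# Two heads attending over three cached tokens (hypothetical attention weights).
scores = update_global_scores([1.0, 0.2], local_scores([[0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]))
keep = evict(scores, budget=2, recent_window=1)
```

Because the history term decays rather than resets, a token that was heavily attended earlier can survive an eviction round even if the current step ignores it, which is the behavior that separates global scoring from local-only scoring.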

Table: Compression Methods and Core Mechanisms

Method	Compression Type	Optimization Principle
RLKV (Du et al., 9 Oct 2025)	Per-head, binary mask	RL (PPO, L1 sparsity, answer reward)
G-KV (Liao et al., 29 Nov 2025)	Per-token, window+score	Global/historical scoring, RL/distill post-train
ForesightKV (Dong et al., 3 Feb 2026)	Per-token, dynamic	Distilled "golden" ranking, RL (GRPO)
KVP (Moschella et al., 10 Feb 2026)	Per-token, ranking	Per-head RL agents, Plackett-Luce policy
Task-KV (He et al., 25 Jan 2025)	Head semantic split	Semantic center distance, class-specific alloc
XKV (Li et al., 2024)	Per-layer personalized	Combinatorial allocation/greedy per input
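
The Plackett-Luce policy listed for KVP in the table above is a standard distribution over rankings: items are drawn one at a time without replacement, each with probability proportional to the exponential of its score. The sketch below shows sampling and the log-probability that a policy-gradient method would differentiate. How KVP parameterizes the scores is not described here, so the `scores` input and the convention of keeping the first `budget` entries of the ranking are assumptions.

```python
import math
import random


def pl_log_prob(scores, ranking):
    """Log-probability of `ranking` (a permutation of indices) under Plackett-Luce."""
    remaining = list(ranking)
    logp = 0.0
    for idx in ranking:
        # Log-sum-exp over the items not yet placed, stabilized by subtracting the max.
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        logp += scores[idx] - log_z
        remaining.remove(idx)
    return logp


def sample_ranking(scores, rng):
    """Draw a ranking by repeatedly sampling one remaining item with softmax weights."""
    remaining = list(range(len(scores)))
    ranking = []
    while remaining:
        m = max(scores[j] for j in remaining)
        weights = [math.exp(scores[j] - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking


# Hypothetical per-token scores; the top `budget` entries of the ranking would be kept.
rng = random.Random(0)
ranking = sample_ranking([2.0, 0.0, -1.0, 0.5], rng)
kept = sorted(ranking[:2])
```

Sampling a full ranking rather than independent keep/drop decisions gives a policy that always respects the cache budget exactly.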

3. Head, Layer, and Task Adaptivity

Experimental evidence shows attention heads and transformer layers exhibit strong functional heterogeneity: some are indispensable for preserving long-range reasoning integrity (“reasoning-critical”), while others can be safely compressed or pruned (Du et al., 9 Oct 2025, He et al., 25 Jan 2025). Task-KV (He et al., 25 Jan 2025) further defines a semantic separation per layer between “heterogeneous” heads (distant from the semantic centroid, carrying complementary task-specific semantics) and “non-heterogeneous” heads (aggregation/focusing roles). Task-KV allocates the full cache to heterogeneous heads and reserves only a minimal, strategically chosen subset (recent tokens, attention sinks, and “middle activations”) for others, using a data-driven semantic separator. This dynamic allocation recovers full-context performance at $40\%$ memory.
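
A simplified sketch of the head split described above follows. It assumes each head in a layer is summarized by a single vector, measures Euclidean distance to the layer's centroid, and treats the farthest `top_fraction` of heads as heterogeneous. The summary vectors, the distance metric, the fraction, and the sink and recency sizes are all hypothetical, and the "middle activations" component is omitted for brevity.

```python
import math


def split_heads(head_vectors, top_fraction=0.25):
    """Return (set of heterogeneous head indices, per-head distance to the centroid)."""
    n = len(head_vectors)
    dim = len(head_vectors[0])
    centroid = [sum(v[d] for v in head_vectors) / n for d in range(dim)]
    dist = [math.dist(v, centroid) for v in head_vectors]
    # Heads farthest from the centroid are treated as heterogeneous.
    k = max(1, round(top_fraction * n))
    order = sorted(range(n), key=lambda h: dist[h], reverse=True)
    return set(order[:k]), dist


def kept_positions(seq_len, is_heterogeneous, n_sink=4, n_recent=64):
    """Positions a head keeps: everything if heterogeneous, else sinks plus recent tokens."""
    if is_heterogeneous:
        return list(range(seq_len))
    sinks = range(min(n_sink, seq_len))
    recent = range(max(0, seq_len - n_recent), seq_len)
    return sorted(set(sinks) | set(recent))


# Four hypothetical heads in one layer; the last one sits far from the others.
hetero, _ = split_heads([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
plan = {h: len(kept_positions(1024, h in hetero)) for h in range(4)}
```

In this toy layout only one head retains all 1024 positions while the other three keep 68 each, which is how a per-head split can reduce memory without truncating every head uniformly.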

Layer-wise cache size personalization, as in XKV (Li et al., 2024) and PrefixKV (Wang et al., 2024), leverages variations in per-layer importance distributions: a tight overall budget can be distributed greedily, giving cache slots first to the layers with the steepest marginal retention gain. Averaging over multiple sampled tasks yields allocations that are robust to input variations and application domains.
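
One way to picture greedy layer-wise allocation is the sketch below. It assumes each layer comes with a list of per-token importance values and hands out one cache slot at a time to whichever layer would gain the most from its next slot. The input format and the use of the next-largest importance value as the "marginal gain" are assumptions, not the exact procedure of XKV or PrefixKV.

```python
import heapq


def greedy_layer_budgets(layer_importance, total_budget):
    """Split `total_budget` cache slots across layers by largest marginal importance."""
    # Within a layer, slots go to tokens in descending order of importance.
    ranked = [sorted(layer, reverse=True) for layer in layer_importance]
    budgets = [0] * len(ranked)
    # Max-heap (via negated values) of each layer's gain from its next slot.
    heap = [(-imp[0], i) for i, imp in enumerate(ranked) if imp]
    heapq.heapify(heap)
    for _ in range(total_budget):
        if not heap:
            break  # every token in every layer is already retained
        _, layer = heapq.heappop(heap)
        budgets[layer] += 1
        nxt = budgets[layer]
        if nxt < len(ranked[layer]):
            heapq.heappush(heap, (-ranked[layer][nxt], layer))
    return budgets


# Layer 0 concentrates importance on one token; layer 1 spreads it more evenly.
print(greedy_layer_budgets([[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]], total_budget=3))  # prints [1, 2]
```

A layer whose importance is concentrated in a few tokens ends up with a small budget, while a layer with a flatter distribution receives more slots.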

4. Representation Learning and Cache Reuse

Beyond compression, KV caches represent rich contextual embeddings that can be explicitly leveraged for latent reasoning (Kuzina et al., 2 Oct 2025), rapid prefix-reuse (Pandey, 4 Dec 2025), or cross-layer sharing.

Latent Distillation and Representation Compression: KaVa (Kuzina et al., 2 Oct 2025) distills the “compressed” KV-cache of a chain-of-thought teacher into a latent-reasoning student by aligning the student’s abstract latent-token KV traces to those extracted (and redundancy-filtered) from the teacher. The loss combines cross-entropy, latent activation matching, and direct alignment of compressed key/value tensors. This approach closes most of the accuracy gap to full chain-of-thought, with order-of-magnitude efficiency gains.
Cache Recycling: Exact KV recycling (Pandey, 4 Dec 2025) allows rapid inference when a prompt’s prefix matches a cached context, transferring all past key-value tensors for immediate reuse without recomputation. Latency reductions scale linearly with matched prefix length and do not degrade output semantics.
KV Reuse and Dimensional Compression: KV-CAR (Roy et al., 7 Dec 2025) applies per-layer autoencoders to compress K/V vectors along the embedding dimension and further exploits cross-layer similarity among heads for storage sharing. This modular strategy achieves nearly $48\%$ cache reduction at sub-4% accuracy loss across challenging zero-shot benchmarks.
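
The exact-recycling case in the list above reduces to a longest-prefix lookup over token IDs. The sketch below is a toy store, with `PrefixKVStore` and its methods being hypothetical names and the cached tensors represented by an opaque placeholder object; it is not the interface of any particular serving system.

```python
class PrefixKVStore:
    """Toy store mapping a token-ID sequence to the KV tensors computed for it."""

    def __init__(self):
        self._store = {}

    def put(self, token_ids, kv):
        """Remember the KV state produced after processing exactly `token_ids`."""
        self._store[tuple(token_ids)] = kv

    def longest_prefix(self, token_ids):
        """Return (matched_length, kv) for the longest stored sequence that prefixes the prompt."""
        for end in range(len(token_ids), 0, -1):
            key = tuple(token_ids[:end])
            if key in self._store:
                return end, self._store[key]
        return 0, None


store = PrefixKVStore()
store.put([101, 7, 42], kv="kv-for-first-three-tokens")  # placeholder for real tensors

prompt = [101, 7, 42, 9, 13]
matched, kv = store.longest_prefix(prompt)
to_compute = prompt[matched:]  # only the unmatched suffix needs a forward pass
```

Only the suffix after the matched prefix has to be recomputed, which is consistent with latency savings that grow with the matched prefix length. The linear scan over prefix lengths is chosen for clarity; a trie or hashed block table would be the more usual choice at scale.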

5. Practical Impacts, Trade-offs, and Theoretical Limits

KV cache learning enables high-throughput LLM inference under stringent memory budgets, unlocks longer context lengths, and is especially effective for complex reasoning models. Key trade-offs include:

Memory–Quality Pareto: Learned methods (RLKV, ForesightKV, KVP) achieve minimal accuracy loss (<2–3%) at $2$–$4\times$ memory reduction, often surpassing attention-sum, recency, or window-based heuristics, for which accuracy collapses at moderate compression (Du et al., 9 Oct 2025, Dong et al., 3 Feb 2026, Moschella et al., 10 Feb 2026).
Overhead and Integration: Most RL-based or combinatorial learning approaches are trained in an offline phase and introduce negligible inference overhead (<3%) (Dong et al., 3 Feb 2026, Moschella et al., 10 Feb 2026). Many are backward-compatible with paging strategies, quantization, and other deployment- and hardware-level optimizations (Jha et al., 24 Feb 2025).
Theoretical Limits and Multi-Worker Serving: In serving scenarios, neither classic LRU nor static allocation suffices; randomization (marking algorithms) and learning-based routing (queue-aware regression) jointly yield logarithmically competitive cache hit rates, balancing load and minimizing end-to-end latency (Wu et al., 26 Jan 2026).
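
For readers unfamiliar with marking algorithms, the classical randomized marking scheme from the online paging literature is sketched below. Its competitive ratio is logarithmic in the cache size, whereas deterministic policies such as LRU are only linearly competitive. This is the textbook algorithm, not the multi-worker variant of Wu et al., and the `MarkingCache` class name and its hit/miss counters are illustrative.

```python
import random


class MarkingCache:
    """Randomized marking: evict a uniformly random unmarked entry on a miss."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.marked = set()
        self.unmarked = set()
        self.hits = 0
        self.misses = 0

    def request(self, key):
        """Serve one request; return True on a hit and False on a miss."""
        if key in self.marked or key in self.unmarked:
            # A hit marks the entry so it survives the rest of the current phase.
            self.unmarked.discard(key)
            self.marked.add(key)
            self.hits += 1
            return True
        self.misses += 1
        if len(self.marked) + len(self.unmarked) >= self.capacity:
            if not self.unmarked:
                # Every entry is marked: start a new phase by clearing all marks.
                self.unmarked, self.marked = self.marked, set()
            victim = self.rng.choice(tuple(self.unmarked))
            self.unmarked.remove(victim)
        self.marked.add(key)
        return False


cache = MarkingCache(capacity=2, rng=random.Random(0))
for prefix in ["a", "b", "a", "c"]:
    cache.request(prefix)
```

The random choice among unmarked entries is what prevents an adversarial request sequence from forcing a miss on every access, which is the weakness of deterministic eviction.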

6. Extensions and Open Challenges

Ongoing challenges in KV cache learning include meta-learning hyperparameters (cache window size, decay rate), extending adaptation across modalities (AirCache for vision-LLMs (Huang et al., 31 Mar 2025)), and integrating global eviction strategies into pretraining (Liao et al., 29 Nov 2025). There is active exploration of joint optimization combining compression, quantization, and eviction with retrieval or merging layers for extreme long-context inference. Handling dynamic or interactive contexts (fuzzy prefix matching (Pandey, 4 Dec 2025), adaptive routing (Wu et al., 26 Jan 2026)) and scaling to multi-LLM settings remain unsolved in general.

Empirical results indicate robust generalization: e.g., policies trained on chain-of-thought reasoning transfer well to science and code, and per-head ranking agents (KVP) demonstrate zero-shot effectiveness on unseen domains (Moschella et al., 10 Feb 2026). Learned cache management is thus emerging as a cornerstone for efficient, scalable, and deployable LLM applications, especially in contexts demanding verifiable reasoning integrity and controllable quality-memory trade-offs.