KV-Policy in Transformer Models

Updated 21 April 2026

KV-Policy is a set of learnable, data-driven methods that manage transformer caches by prioritizing tokens based on predicted future utility.
It employs techniques like reinforcement learning, salience estimation, and hybrid compression to balance memory constraints with generative fidelity.
These strategies dynamically allocate cache budgets and adapt retention rules to maintain high reasoning and generation quality under strict compute limits.

A key-value policy (KV-Policy, KVP) in the context of transformer models refers to any formalized, often learnable procedure for managing the growth, retention, compression, or eviction of key-value (KV) pairs in the attention cache during autoregressive inference. Modern KVPs have evolved far beyond heuristic FIFO or fixed-window methods; recent literature presents a spectrum of policies encompassing attention-aware pruning, reinforcement learning (RL)-based ranking, salience estimation, budget reallocation, and hybrid quantization approaches. These developments address the central challenge: maintaining high generative or reasoning fidelity under strict KV memory and compute constraints—essential for efficient long-context language, vision, and multi-modal models.

1. The RL Formulation of KV-Policy: Future-Utility-Based Token Ranking

A central advance is casting KV cache management as a reinforcement learning problem whose objective is to learn a per-head, budget-agnostic ranking of cache entries by their predicted future utility. Formally, suppose an autoregressive transformer yields a cache of $n$ KV pairs. The task is to learn a scoring function $f(x_i; \theta)$, where $x_i = (k_i, v_i, pos_i)$, that orders the entries so that, for any budget $b \leq n$, retaining the top $b$ yields minimal downstream quality degradation.

The policy is implemented as a Plackett–Luce distribution over permutations, with a stochastic permutation $\sigma$ sampled as:

$$\pi_\theta(\sigma \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} \frac{\exp(f(x_{\sigma_i};\theta))}{\sum_{j=i}^{n} \exp(f(x_{\sigma_j};\theta))}$$

The expected normalized reward is derived from the ground-truth future attention that each cache element would have received, measured by replaying generation traces and summing over future query attentions. Training uses policy-gradient methods (REINFORCE with a leave-one-out baseline), yielding robust KV eviction agents that generalize across context lengths, tasks, and model families. Each agent is an independent two-layer MLP per layer and head, trained solely on offline traces of $(k, v, pos)$ and future attention matrices. This mechanism, detailed in "Learning to Evict from Key-Value Cache" (Moschella et al., 10 Feb 2026), outperforms both heuristic (e.g., streaming, norm-based) and query/attention-aware eviction methods in long-context settings.
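The Plackett–Luce policy and the leave-one-out baseline can be illustrated with a small NumPy sketch. It is not the authors' implementation: the scores stand in for the MLP outputs $f(x_i;\theta)$, sampling uses the Gumbel-max trick (a standard way to draw from a Plackett–Luce distribution), and `topb_reward` is a hypothetical, simplified reward that measures the share of future attention kept by the top $b$ entries.

```python
import numpy as np

def sample_permutation(scores, rng):
    """Draw a permutation from Plackett-Luce via the Gumbel-max trick."""
    noise = rng.gumbel(size=scores.shape)
    return np.argsort(-(scores + noise))

def log_prob(scores, perm):
    """log pi(sigma) = sum_i [ s_{sigma_i} - log sum_{j>=i} exp(s_{sigma_j}) ]."""
    s = scores[perm]
    m = s.max()  # subtract the max for numerical stability
    suffix_sums = np.cumsum(np.exp(s - m)[::-1])[::-1]
    return float(np.sum(s - (np.log(suffix_sums) + m)))

def leave_one_out_advantages(rewards):
    """Baseline for each sample is the mean reward of the other samples."""
    rewards = np.asarray(rewards, dtype=float)
    baseline = (rewards.sum() - rewards) / (rewards.size - 1)
    return rewards - baseline

def topb_reward(perm, future_attn, b):
    """Assumed reward: fraction of future attention kept by the top-b entries."""
    future_attn = np.asarray(future_attn, dtype=float)
    return float(future_attn[perm[:b]].sum() / future_attn.sum())
```

In a REINFORCE update, each sampled permutation's `log_prob` gradient would be weighted by its leave-one-out advantage. The gradient step itself is omitted here because it depends on the autodiff framework in use.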

2. Salience and Retention-Score Based Policies

Alternative learnable approaches explicitly model per-token "salience" or long-term utility at the time a token enters the cache. The TRIM-KV policy uses a lightweight MLP gate per transformer head, producing a retention score $r_t^{\ell, h}\in [0,1]$ for each token $t$ in layer $\ell$ and head $h$. This scalar decays exponentially as time elapses, so that at a later step $t' \geq t$ the token's effective score is $(r_t^{\ell, h})^{t' - t}$. Tokens are evicted greedily by the smallest decayed retention value when the cache exceeds a strict global budget. During training, gate parameters are optimized with a KL-regularized loss (matching a frozen teacher) together with a capacity loss. This design supports strict, budget-bounded memory usage and negligible compute overhead, and it demonstrates emergent interpretability (learned retentions map to classic memory heuristics) and even a regularizing effect on quality (Bui et al., 3 Dec 2025).
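Retention-based eviction can be sketched as follows. The sketch assumes that a token cached at step `t` with gate output `r` has decayed score `r ** (now - t)`; that form, the function name, and the argument names are illustrative assumptions rather than TRIM-KV's published code.

```python
import numpy as np

def evict_to_budget(retention, birth_step, now, budget):
    """Return the indices (in cache order) of tokens kept under the budget.

    retention  : gate outputs in [0, 1], one per cached token
    birth_step : decoding step at which each token entered the cache
    now        : current decoding step
    budget     : maximum number of tokens to keep
    """
    retention = np.asarray(retention, dtype=float)
    age = now - np.asarray(birth_step)
    decayed = retention ** age  # assumed exponential decay with token age
    if decayed.size <= budget:
        return np.arange(decayed.size)
    # Repeatedly dropping the smallest score at a fixed step is equivalent
    # to keeping the `budget` largest decayed scores.
    keep = np.argsort(-decayed, kind="stable")[:budget]
    return np.sort(keep)
```

For example, with retentions `[0.9, 0.5, 0.99, 0.8]` cached at steps `0..3` and evaluated at step 4, the old token with retention 0.9 has decayed to about 0.66 and is evicted before the newer token with retention 0.8.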

Video and diffusion settings introduce salience estimation heads—trained via distillation from a bidirectional teacher—to directly regress per-token importance, again supporting evict-keep decisions with minimal loss of long-range structural fidelity (Chen et al., 29 Jan 2026).

3. Policy Designs for Specific Generation Workloads

KVPs are often tailored to particular inference regimes. In Chain-of-Thought (CoT) reasoning, Crystal-KV introduces an "answer-first" principle: retain only the think-stage KV entries that contribute to generating the answer tokens. This is operationalized by mapping answer-token attention onto think-stage tokens, computing an aggregate answer-attention score for each token, and classifying tokens as either CrystalKV (scores in the top quantiles) or SlipKV (low-scoring entries that only maintain ephemeral context). Eviction applies a per-head, attention-based LRFU (Least Recently/Frequently Used) strategy, with cache budgets adaptively redistributed across heads and layers according to their dynamic utilization (CRF score). This method achieves strong compression while maintaining, or even improving, final answer accuracy (Wang et al., 5 Jan 2026).
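The LRFU idea can be illustrated with the classical combined recency and frequency (CRF) recurrence, in which older contributions are halved every `1 / lam` steps. Using the attention a token receives as its "hit" weight is an assumption made for this sketch; Crystal-KV's exact scoring rule may differ, and `lam` is a hypothetical tuning knob.

```python
import numpy as np

def update_crf(crf, attn, lam=0.1):
    """Advance CRF scores by one decoding step.

    crf  : current scores, one per cached token in a head
    attn : attention each token received at this step (assumed hit weight)
    lam  : 0 gives a running attention sum (frequency-like);
           larger values weight recent steps more heavily
    """
    decay = 0.5 ** lam
    return decay * np.asarray(crf, dtype=float) + np.asarray(attn, dtype=float)

def keep_top(crf, budget):
    """Indices (in cache order) of the `budget` entries with the highest CRF."""
    crf = np.asarray(crf, dtype=float)
    order = np.argsort(-crf, kind="stable")
    return np.sort(order[:budget])
```

A token that was attended heavily long ago and a token attended moderately but recently can end up with similar scores, which is the behaviour that distinguishes LRFU from pure LRU or LFU.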

In retrieval-augmented and chunked-context settings, InfoFlow KV-Policy frames token selection for recomputation as information-flow maximization. The importance score $s_j$ of each context token $j$ is the aggregate attention mass it receives from the prompt tokens $\mathcal{P}$: $s_j = \sum_{i \in \mathcal{P}} A_{ij}$, where the attention weight $A_{ij}$ from prompt token $i$ to context token $j$ is computed using inference-consistent RoPE positioning. Only the tokens with the top-$k$ scores $s_j$ are recomputed under a global causal mask, yielding efficiency and fidelity improvements on long-context benchmarks and vision-language tasks (Teng et al., 5 Mar 2026).
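The selection step can be sketched in a few lines. The sketch assumes a precomputed attention matrix whose rows are prompt tokens and whose columns are context tokens; how that matrix is aggregated over heads and layers is not specified here, and the function name is hypothetical.

```python
import numpy as np

def select_tokens_to_recompute(attn, k):
    """Pick the k context tokens that receive the most prompt attention.

    attn : array of shape (num_prompt_tokens, num_context_tokens)
    k    : number of context tokens to recompute
    Returns (selected indices in context order, per-token scores).
    """
    attn = np.asarray(attn, dtype=float)
    scores = attn.sum(axis=0)  # attention mass received by each context token
    k = min(k, scores.size)
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top), scores
```

Returning the indices in context order keeps the selected tokens aligned with their original positions, which matters when they are recomputed under a causal mask.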

4. Hybrid and Layer-Adaptive KV-Policies

Several KVPs exploit architectural structure or signal redundancy across the transformer stack. SpindleKV employs a split-layer policy: deep layers apply attention-weight-based eviction (dropping low-contributing tokens), while shallow layers merge highly similar KV vectors via codebook-based replacement (forming basis vectors and storing only indices plus magnitude), exploiting observed high cosine similarity at lower stack levels. Both regimes have specialized selection thresholds and updating algorithms; in grouped-query attention, virtual unfolding ensures per-query granularity (Tang et al., 9 Jul 2025).

AMS-KV extends KVPs to multi-scale image-generation transformers. It retains all KV entries from a small number of "condensed scales" (early, structure-setting scales) and then performs similarity-based adaptive retention across local (intermediate) scales in each layer. An explicit similarity threshold governs whether the KV entries of a previous scale must be kept, supporting dynamic allocation under fixed per-layer budgets and improving the trade-off between memory usage and fidelity metrics such as FID, PSNR, and LPIPS (Xu et al., 20 Nov 2025).

5. Recurrence-Integrated and Hybrid Compression Policies

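LESS combines sparse eviction-based methods (e.g., heavy-hitter, recency, or norm-based pruning) with a constant-size, low-rank recurrence buffer. Whenever a KV entry is pruned from the main sparse cache, the corresponding $(k_i, v_i)$ pair is folded into a learned low-rank buffer $(H, z)$ via nonnegative kernel projections $\phi$ and $\psi$. At subsequent steps, the attention output is a rational combination of the sparse-attended vectors (precise, but covering few tokens) and a low-rank contribution representing information from all pruned tokens:

$$a_t = \frac{\sum_{i \in S_t} \exp(q_t k_i^\top / \sqrt{d})\, v_i + \phi(q_t) H_t}{\sum_{i \in S_t} \exp(q_t k_i^\top / \sqrt{d}) + \phi(q_t) z_t}$$

where $S_t$ is the set of tokens still held in the sparse cache, and $H_t = \sum_{i \notin S_t} \psi(k_i)^\top v_i$ and $z_t = \sum_{i \notin S_t} \psi(k_i)^\top$ accumulate the pruned tokens. This mechanism recovers a smooth trade-off between memory and accuracy, suitable for memory-intensive, long-context inference where some nonlocal recall is crucial (Dong et al., 2024).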
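A minimal NumPy sketch of this hybrid attention follows, for a single head and a single query. The learned kernels are not modeled: the caller passes precomputed nonnegative features `phi_q` and `psi_k`, and the class and function names are hypothetical. The exponentials are left unnormalized for clarity, so a production version would need a numerically stable formulation.

```python
import numpy as np

class LowRankBuffer:
    """Constant-size summary (H, z) of every pruned KV pair."""

    def __init__(self, rank, d_v):
        self.H = np.zeros((rank, d_v))
        self.z = np.zeros(rank)

    def absorb(self, psi_k, v):
        """Fold one pruned (key, value) pair into the buffer."""
        self.H += np.outer(psi_k, v)
        self.z += psi_k

def hybrid_attention(q, K, V, phi_q, buf):
    """Attention over the sparse cache (K, V) plus the low-rank buffer.

    q     : query, shape (d,)
    K, V  : retained keys and values, shapes (m, d) and (m, d_v)
    phi_q : nonnegative kernel features of the query, shape (rank,)
    """
    d = q.shape[-1]
    w = np.exp(K @ q / np.sqrt(d))       # exact weights for retained tokens
    numerator = w @ V + phi_q @ buf.H    # sparse part + low-rank part
    denominator = w.sum() + phi_q @ buf.z
    return numerator / denominator
```

If the kernel features reproduced the exponential weights exactly, that is $\phi(q)\psi(k)^\top = \exp(q k^\top/\sqrt{d})$, the output would equal full softmax attention over all tokens. In practice the learned features only approximate that contribution.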

6. Adaptivity and Budget Reallocation

Configurable budget allocation across heads, layers, or time is a recurring theme. Crystal-KV's adaptive scheduler redistributes global capacity in proportion to recent cumulative importance scores, enlarging the budgets of active heads and layers and shrinking those of inactive ones (Wang et al., 5 Jan 2026). AMS-KV enforces per-layer maximums and further tunes allocation based on per-scale similarity demands (Xu et al., 20 Nov 2025). KVP frameworks often produce a total ranking or importance scoring, which can be leveraged for dynamic budget reallocation in distributed or throughput-critical environments (Moschella et al., 10 Feb 2026).
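Proportional reallocation can be sketched as below. The per-head floor and the largest-remainder rounding are assumptions chosen so that the integer budgets sum exactly to the global budget; neither detail is taken from the cited papers.

```python
import numpy as np

def reallocate_budgets(importance, total_budget, floor=1):
    """Split `total_budget` cache slots across heads by importance.

    importance   : nonnegative score per head (e.g. recent cumulative attention)
    total_budget : total number of cache slots available (integer)
    floor        : minimum slots guaranteed to every head (assumed safeguard)
    """
    importance = np.asarray(importance, dtype=float)
    n = importance.size
    if total_budget < floor * n:
        raise ValueError("total_budget is too small for the per-head floor")
    spare = total_budget - floor * n
    total = importance.sum()
    share = importance / total if total > 0 else np.full(n, 1.0 / n)
    raw = share * spare
    alloc = np.floor(raw).astype(int)
    # Give leftover slots to the heads with the largest fractional remainders.
    leftover = int(spare - alloc.sum())
    order = np.argsort(-(raw - alloc), kind="stable")
    alloc[order[:leftover]] += 1
    return alloc + floor
```

With importance scores `[3, 1, 0, 0]` and 12 slots, the two inactive heads keep only the floor of one slot each, while the remaining eight slots are split 6 to 2 between the active heads.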

7. Comparative Results and Limitations

Empirical results consistently show that learned KVPs outperform both attention-aware methods (which require full-context attention recomputation) and simple attention-free heuristics, especially in low-memory regimes and on challenging reasoning benchmarks. For example, KVP achieves higher accuracy at fixed cache budgets, lower perplexity degradation, and better ROUGE-L on summarization than existing methods (Moschella et al., 10 Feb 2026). TRIM-KV, via selective retention, even surpasses full-cache baselines, an effect attributed to implicit regularization through the suppression of off-task or noisy context (Bui et al., 3 Dec 2025). Hybrid policies such as LESS close a substantial fraction of the pruning-induced quality gap (Dong et al., 2024).

Limitations include the independent-per-head agent paradigm (potentially missing cross-head correlations), incompatibility with some kernel-optimized attention implementations, and the challenge of extending token-utility prediction to domains where future importance is highly query-dependent or multimodal. Dynamic, learned allocation of cache budget—potentially in an end-to-end fashion—remains an open research direction (Moschella et al., 10 Feb 2026). Extensions to non-text modalities, hierarchical caches, or hybrid compression–eviction policies are active areas for future KVP development.

KVP thus encompasses a broad and evolving set of memory-constrained inference strategies, unifying learnable, data-driven, and hybrid-analytic approaches for efficient, high-quality transformer decoding across language, vision, and multi-scale domains.