KV Prediction in Neural Architectures
- KV prediction is a technique to approximate key and value tensors in attention mechanisms, enabling dynamic caching and efficient inference.
- Approaches include auxiliary-model approximation, selective cache eviction, adaptive cache refresh, and multi-view kernel completion, all aimed at reducing computational load.
- Empirical studies in language models and vision transformers demonstrate up to 4× speedups and 60% memory reductions with minimal accuracy loss.
Key-Value (KV) prediction encompasses a suite of algorithmic and modeling strategies that infer, select, or approximate the key and value tensors used for attention in neural architectures, typically transformers and kernel machines. In contemporary research, KV prediction serves as a critical mechanism to reduce memory and compute overhead, enable dynamic caching strategies, and optimize inference latency—especially at scale or under resource constraints. Methods range from approximating full KV states via auxiliary models, to data-driven eviction, to adaptive refresh and multi-view kernel completion. This article provides a rigorous survey of KV prediction techniques across LLMs, vision transformers, diffusion models, and kernel methods.
1. Fundamentals of KV Prediction
In transformer architectures, each attention layer computes queries, keys, and values to perform contextual information integration. During autoregressive inference, historical keys and values are cached to eliminate recomputation, creating a KV cache whose size increases linearly with context length and model depth. KV prediction strategies intervene by forecasting or selectively updating this cache, with the goals of memory reduction, computational acceleration, or imputation when data is missing. In kernel methods, KV prediction refers to the completion or inference of missing kernel matrix entries, often via transfer learning across multiple data views.
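The caching mechanism can be made concrete with a minimal single-head sketch; all shapes, weights, and the `KVCache` class below are illustrative rather than drawn from any particular model.

```python
# Minimal sketch of KV caching in single-head autoregressive attention (NumPy).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    def __init__(self, d_model):
        self.keys = np.zeros((0, d_model))    # grows by one row per processed token
        self.values = np.zeros((0, d_model))

    def append(self, k, v):
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def decode_step(x, Wq, Wk, Wv, cache):
    """One autoregressive step: project the new token, extend the cache,
    and attend over all cached keys/values instead of recomputing them."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache.append(k, v)
    attn = softmax(q @ cache.keys.T / np.sqrt(x.shape[-1]))
    return attn @ cache.values

# Usage: the cache grows linearly with the number of decoded tokens.
d = 16
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = KVCache(d)
for _ in range(4):
    out = decode_step(rng.standard_normal((1, d)), Wq, Wk, Wv, cache)
print(cache.keys.shape)  # (4, 16): one cached key per processed token
```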
2. KV Prediction for Fast Inference
One central application is reducing time to first token (TTFT) in LLMs, particularly on edge devices. KV Prediction (Horton et al., 10 Oct 2024) introduces an auxiliary model $A$ and a predictor that process all prompt tokens and produce a predicted KV cache for a frozen high-capacity base model $B$. The prediction process is formalized as learning per-layer linear maps $\hat{K}_B^{(\ell)} = K_A^{(\ell)} W_K^{(\ell)}$ and $\hat{V}_B^{(\ell)} = V_A^{(\ell)} W_V^{(\ell)}$ linking auxiliary and base model caches. The predicted cache is used only for generation of the first output token; thereafter, true KV states are constructed as normal.
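A minimal sketch of this cache-to-cache mapping, assuming a one-to-one layer correspondence and NumPy placeholders for the learned maps (the paper's exact layer pairing and training procedure are not reproduced here):

```python
# Hedged sketch of the KV Prediction idea: a small auxiliary model produces a KV
# cache, and learned per-layer linear maps project it into the shape of the frozen
# base model's cache. Dimensions, names, and the layer pairing are assumptions.
import numpy as np

def predict_base_cache(aux_cache, W_K, W_V):
    """aux_cache: list of (K_A, V_A) per layer, each of shape (T, d_aux).
    W_K, W_V: per-layer linear maps of shape (d_aux, d_base)."""
    predicted = []
    for (K_A, V_A), Wk, Wv in zip(aux_cache, W_K, W_V):
        predicted.append((K_A @ Wk, V_A @ Wv))  # \hat{K}_B, \hat{V}_B per layer
    return predicted

# The predicted cache primes the base model for the first output token only;
# subsequent tokens append exact base-model KV states as usual.
```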
Empirically, KV Prediction yields a Pareto frontier in the efficiency-accuracy tradeoff, improving relative downstream accuracy on TriviaQA at reduced TTFT FLOPs budgets and yielding relative gains on HumanEval code completion compared to equivalent-size LLM baselines. The method accelerates TTFT on real hardware while maintaining a plug-in design requiring no base model retraining (Horton et al., 10 Oct 2024).
3. Data-Driven Cache Selection and Eviction
KV prediction also resides at the heart of methods designed to identify and retain only the subset of cache entries likely to influence future attention. SAGE-KV (Wang et al., 11 Mar 2025) leverages the sparsity of attention weight matrices in long-context LLMs to perform a one-time, per-layer, top-$k$ selection after prefill, evicting tokens and (optionally) attention heads judged unimportant by the last token's attention, $s_i = \sum_h A^{(h)}_{t,i}$, where $t$ indexes the last prompt token and the sum runs over heads; head importance is likewise measured by summed attention activity. SAGE-KV matches or exceeds full-attention accuracy with substantially smaller working sets, attains memory savings over the prior dynamic method Quest at comparable accuracy, and strictly outperforms static selection (StreamingLLM) (Wang et al., 11 Mar 2025). Such policies exploit the LLM's implicit knowledge of salient dependencies, treating cache eviction as a form of learned KV prediction.
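The selection rule can be illustrated with a short sketch; the tensor layout and the use of a single last-token query are assumptions for illustration rather than the method's exact implementation:

```python
# Illustrative sketch of attention-guided top-k KV selection after prefill, in the
# spirit of SAGE-KV: score each cached token by the last token's attention to it
# (summed over heads) and keep only the top-k entries.
import numpy as np

def topk_kv_selection(attn, K, V, k):
    """attn: (H, T, T) post-prefill attention weights; K, V: (T, d) cache tensors."""
    last_row = attn[:, -1, :]                 # each head's attention from the last token
    scores = last_row.sum(axis=0)             # summed over heads -> per-token importance
    keep = np.sort(np.argsort(scores)[-k:])   # indices of the k most-attended tokens
    return K[keep], V[keep], keep

# Example: keep 32 of 128 prompt tokens.
H, T, d = 8, 128, 64
rng = np.random.default_rng(0)
attn = rng.random((H, T, T))
K, V = rng.standard_normal((T, d)), rng.standard_normal((T, d))
K_small, V_small, kept = topk_kv_selection(attn, K, V, k=32)
print(K_small.shape)  # (32, 64)
```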
Task-KV (He et al., 25 Jan 2025) differentiates KV cache budgets by semantically classifying attention heads as heterogeneous or non-heterogeneous via their Euclidean distance from the semantic center, $d_h = \lVert v_h - \bar{v} \rVert_2$, with $v_h$ the attention-head semantic vector and $\bar{v}$ the layer's centroid. Heterogeneous heads retain full KV histories, while other heads cache only "attention sinks," recent tokens, and task-adaptive "middle activations": contextually selected intermediate tokens with high historical attention weight. Task-KV achieves large KV memory reductions at near-full-KV accuracy on multi-task LLMs, substantially outperforming uniform cache allocation (He et al., 25 Jan 2025).
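A minimal sketch of the heterogeneous/non-heterogeneous split, assuming a quantile threshold on the centroid distance (the paper's actual budget-allocation rule may differ):

```python
# Sketch of semantic head separation in the style of Task-KV: heads whose semantic
# vectors lie far from the layer centroid get full KV budgets; the rest keep only
# sinks, recent tokens, and middle activations. Threshold choice is an assumption.
import numpy as np

def split_heads_by_heterogeneity(head_vectors, quantile=0.75):
    """head_vectors: (H, d) per-head semantic vectors for one layer."""
    centroid = head_vectors.mean(axis=0)
    dist = np.linalg.norm(head_vectors - centroid, axis=1)  # Euclidean distance d_h
    threshold = np.quantile(dist, quantile)
    heterogeneous = np.where(dist >= threshold)[0]  # full KV history
    homogeneous = np.where(dist < threshold)[0]     # sinks + recent + middle activations
    return heterogeneous, homogeneous
```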
4. Adaptive KV Prediction in Diffusion and Vision Models
In diffusion LLMs, Elastic-Cache (Nguyen-Tri et al., 16 Oct 2025) adaptively predicts and refreshes KV cache entries based on actual cache drift per layer and token. Instead of naively recomputing the cache at every step, Elastic-Cache decides when and where to update by monitoring a drift statistic $\delta^{(\ell)}$ of the most-attended token, triggering a selective cache update from the lowest layer $\ell$ at which $\delta^{(\ell)}$ exceeds a threshold $\tau$. Block-wise caching of inactive MASK tokens further reduces redundancy. Elastic-Cache delivers substantial decoding speedups while preserving accuracy, and is justified by monotonicity lemmas on KV drift and its conservative estimation via most-attended-token statistics (Nguyen-Tri et al., 16 Oct 2025).
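A hedged sketch of the drift-triggered refresh logic, assuming a cosine-distance drift metric and a fixed threshold; both choices are illustrative rather than the paper's exact statistic:

```python
# Drift-triggered cache refresh in the spirit of Elastic-Cache: compare the
# most-attended token's freshly computed key against its cached key per layer, and
# refresh from the shallowest layer whose drift exceeds a threshold tau.
import numpy as np

def first_drifting_layer(cached_keys, fresh_keys, attn_weights, tau=0.05):
    """cached_keys/fresh_keys: lists of (T, d) arrays per layer; attn_weights: (T,)."""
    star = int(np.argmax(attn_weights))            # index of the most-attended token
    for layer, (Kc, Kf) in enumerate(zip(cached_keys, fresh_keys)):
        a, b = Kc[star], Kf[star]
        drift = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
        if drift > tau:
            return layer   # refresh this layer and all deeper layers
    return None            # drift small everywhere: reuse the cache as-is
```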
In scale-adaptive visual autoregressive transformers, AMS-KV (Xu et al., 20 Nov 2025) predicts, prioritizes, and prunes KV states across multi-scale generation. The design hinges on measuring inter-scale similarity of KV states to determine which scales and transformer layers demand dense versus windowed cache storage. AMS-KV achieves substantial memory reductions and lower self-attention latency, with negligible effect on image generation quality (Xu et al., 20 Nov 2025).
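One way to picture such an inter-scale similarity test is the sketch below; the cosine criterion, pooling, and threshold are assumptions for illustration, not the paper's exact metric:

```python
# Illustrative inter-scale similarity check in the spirit of AMS-KV: if key states
# of adjacent scales are highly similar for a layer, a windowed cache suffices;
# otherwise the layer keeps a dense KV cache.
import numpy as np

def needs_dense_cache(K_prev_scale, K_curr_scale, threshold=0.9):
    """K_prev_scale, K_curr_scale: (T, d) key states of adjacent scales for one layer."""
    a, b = K_prev_scale.mean(axis=0), K_curr_scale.mean(axis=0)  # pooled key vectors
    sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return sim < threshold  # low similarity -> keep a dense KV cache for this layer
```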
5. KV Prediction in Kernel and Multi-View Models
Outside neural-attention models, KV prediction addresses the kernel matrix completion problem, where missing rows and columns must be inferred in a multi-view setting. Cross-View Kernel Transfer (CVKT) (Huusari et al., 2019) solves this by learning a transfer transformation that aligns a reconstructed "proxy" kernel for the target view, produced by projecting concatenated features from all other views, with the observed entries of that view. The objective maximizes centered kernel alignment, $\mathrm{A}(K_1, K_2) = \langle K_1^c, K_2^c \rangle_F / (\lVert K_1^c \rVert_F \lVert K_2^c \rVert_F)$ with $K^c$ the centered kernel, optimized via manifold gradient methods. CVKT empirically attains superior completion accuracy and maintains downstream classification performance under high missingness (Huusari et al., 2019). This methodology generalizes KV prediction to nonlinear feature spaces and data imputation.
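A short sketch of the centered kernel alignment criterion that CVKT maximizes; the kernels below are random placeholders, and only the alignment computation itself is shown:

```python
# Centered kernel alignment (CKA) between two kernel matrices.
import numpy as np

def centered_kernel_alignment(K1, K2):
    n = K1.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K1c, K2c = H @ K1 @ H, H @ K2 @ H
    num = np.sum(K1c * K2c)                    # Frobenius inner product <K1c, K2c>_F
    return num / (np.linalg.norm(K1c) * np.linalg.norm(K2c) + 1e-12)

# Placeholder data: a "target view" kernel and a noisy transferred proxy kernel.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
K_target = X @ X.T
Y = X + 0.1 * rng.standard_normal(X.shape)
K_proxy = Y @ Y.T
print(round(centered_kernel_alignment(K_target, K_proxy), 3))
```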
6. Comparative Empirical Outcomes
The following table summarizes empirical highlights from representative works:
| Method | Context/Domain | Memory/Latency Benefit | Accuracy Retention |
|---|---|---|---|
| KV Prediction | LLM TTFT | On-device TTFT speedup | Relative gains over equal-size baselines (Horton et al., 10 Oct 2024) |
| SAGE-KV | Long-context LLMs | Smaller working set, no per-token overhead | Matches full attention (Wang et al., 11 Mar 2025) |
| Task-KV | LLMs/task-aware | KV memory reduction | Matches full KV, outperforms SnapKV (He et al., 25 Jan 2025) |
| AMS-KV | Multi-scale VAR | Lower memory and self-attention latency | Preserves image detail, beats SWA/STA (Xu et al., 20 Nov 2025) |
| Elastic-Cache | Diffusion LLMs | Decoding speedup | No quality loss (HumanEval/GSM8K/MathVista) (Nguyen-Tri et al., 16 Oct 2025) |
| CVKT | Multi-view kernels | N/A (completion) | Best CA/ARE, matches full-kernel accuracy (Huusari et al., 2019) |
Performance is context-sensitive: for memory-bounded inference, SAGE-KV and Task-KV optimize cache retention with attention-guided and semantic strategies. For generation latency, Elastic-Cache and AMS-KV algorithmically minimize redundant computation. For missing-data imputation, CVKT achieves state-of-the-art completion fidelity.
7. Limitations, Open Problems, and Future Directions
Current KV prediction techniques are subject to trade-offs between approximation error and computational gain, especially visible in deep-layer value prediction, where errors propagate (Horton et al., 10 Oct 2024), and in sensitivity to task complexity (as in Task-KV's semantic head separation (He et al., 25 Jan 2025)). Predictive strategies like auxiliary-model approximation depend on architectural alignment, and their extension to nonlinear predictors or joint decoding remains an open direction. In attention compression and eviction, further theoretical characterization of long-range dependency preservation is pending. For multi-view kernels, transferability across heterogeneous domains and missingness patterns remains a constraint (Huusari et al., 2019).
Future work may include:
- Nonlinear or cross-layer predictors for cache approximation (Horton et al., 10 Oct 2024)
- Dynamic blending of predicted and real caches for long dialogues (Horton et al., 10 Oct 2024)
- Joint learning of cache policies with decoding strategies (Nguyen-Tri et al., 16 Oct 2025)
- Tighter theoretical bounds relating cache approximation loss to downstream metrics (Horton et al., 10 Oct 2024)
- Generalization of semantic cache allocation to multimodal and hierarchical transformers (He et al., 25 Jan 2025)
- Broader application of cross-view kernel completion to non-Euclidean or structured similarity matrices (Huusari et al., 2019)
A plausible implication is that KV prediction, as a family of techniques, will become foundational to efficient, adaptive, and robust inference in future large-scale and resource-limited deployments across diverse model architectures.