KV-off Decoding in LLMs
- KV-off decoding is a set of strategies that selectively disable, compress, or offload key–value caches to reduce memory and compute overhead or to guarantee deterministic, reproducible outputs.
- Techniques such as no-caching, cache pruning, semantic aggregation, and hardware-efficient methods achieve up to 45× speedup and significant memory savings.
- These methods are applied in both autoregressive and non-autoregressive models, enabling efficient long-sequence inference and improved hardware utilization.
Key–Value (KV)-off decoding refers to a set of strategies and mechanisms for reducing or disabling the use of the key–value cache during LLM inference, with the objective of reducing memory/computation overhead, improving hardware utilization, or ensuring reproducible deterministic output. These methods contrast with the standard “KV-on” paradigm, where per-token key and value projections from all past steps are cached and reused to avoid recomputation. Recent advances show that various forms of KV-off decoding—ranging from cache pruning, aggregation, or compression to strategic cache refresh schedules or even fully disabling caching—can yield substantial efficiency and reproducibility benefits across autoregressive and non-autoregressive (e.g., diffusion-based) LLMs.
1. Definitions, Modalities, and Taxonomy
KV-off decoding encompasses a spectrum of approaches where the full per-token key (K) and value (V) cache is either compressed, partially evicted, dynamically offloaded, or disabled for all or part of the decoding process. The main modalities include:
- No-caching (use_cache=False): Each generation step recomputes attention over the entire prefix, leading to deterministic and fully reproducible outputs but incurring increased latency. Representative of the evaluation protocol in emotion classification pipelines such as EmoLoom-2B (Li et al., 3 Jan 2026).
- Cache pruning (compression): Selectively evicting low-contribution KV entries using estimation techniques (e.g., expected attention), retaining only the most relevant past states in the cache (Devoto et al., 1 Oct 2025).
- Block-wise or group-wise caching: Aggregating or tying KV heads (e.g., Grouped-Tied Attention and Grouped Latent Attention) to minimize data transfer and storage (Zadouri et al., 27 May 2025).
- Semantic aggregation and offloading: Storing only compressed high-level (e.g., sentence-level) semantic vectors on device, offloading detailed KV to CPU and retrieving only on demand; exemplified by SentenceKV (Zhu et al., 1 Apr 2025).
- Dynamic or adaptive refresh: Selective, attention-driven KV refresh in non-autoregressive architectures to avoid redundant recomputation (Nguyen-Tri et al., 16 Oct 2025).
- Sparse/dynamic retrieval and offloading: Maintaining a small, dynamically constructed retrieval cache on fast device memory, with bulk KV offloaded to host, as in TriForce speculative decoding (Sun et al., 2024).
Autoregressive vs. non-autoregressive contexts: KV-off can operate in both AR and diffusion-based models. Diffusion LLMs particularly benefit due to the high redundancy in their block-parallel decoding schedules (Wu et al., 28 May 2025, Nguyen-Tri et al., 16 Oct 2025).
2. Core Methodologies and Algorithmic Mechanisms
Direct KV-Off (use_cache=False)
Disables all caching in the inference loop. At each decoding step, the model runs a full forward pass over all available tokens, recomputing attention from scratch without ever reusing K/V matrices. This yields identical outputs across training and evaluation, eliminates stochastic variation due to hardware backends, and is mandated for strict protocol-faithful evaluation in screening pipelines (Li et al., 3 Jan 2026).
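A minimal sketch of this mode using the Hugging Face transformers generation API (the "gpt2" checkpoint is only a placeholder; any causal LM exposes the same flag):

```python
# Minimal sketch of KV-off generation via Hugging Face transformers.
# "gpt2" is a placeholder checkpoint; the flag behaves the same for any causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("Disabling the KV cache means that", return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=False,   # greedy decoding, no sampling randomness
        use_cache=False,   # KV-off: every step re-attends over the full prefix
    )

print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because nothing is reused across steps, each new token costs a full forward pass over the prefix; the payoff, as in the EmoLoom-2B protocol, is that decoding exercises the same forward computation as training-time evaluation.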
Cache Pruning and Compression by Expected Attention
Expected Attention (Devoto et al., 1 Oct 2025) estimates the post-hoc importance of each KV pair under a probabilistic model of future queries. The method models the future query distribution as Gaussian and computes, in closed form, the expected exponential score of each key, yielding a principled expected attention score. Low-contribution KV entries are pruned periodically, reducing memory use by up to 50–75% with <1% downstream quality loss for standard settings, and the approach is agnostic to attention mechanism implementation (including Flash Attention).
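The closed form follows from the Gaussian moment-generating function: for q ~ N(μ, Σ), E[exp(qᵀk/√d)] = exp(μᵀk/√d + kᵀΣk/(2d)). The sketch below is a schematic re-implementation of that score-and-prune idea, not the authors' code; the query mean and covariance are assumed to be estimated elsewhere (e.g., from recent decoding steps), and the 50% keep fraction is only an example setting:

```python
# Schematic sketch of expected-attention-style KV pruning (not the paper's
# exact implementation): score each cached key by the closed-form expectation
# of exp(q·k / sqrt(d)) under an assumed Gaussian future-query distribution.
import numpy as np

def expected_attention_scores(keys, q_mean, q_cov, d):
    # E[exp(q·k/√d)] for q ~ N(q_mean, q_cov) is the Gaussian MGF at k/√d:
    # exp(q_mean·k/√d + kᵀ q_cov k / (2d)).
    scaled = keys / np.sqrt(d)                        # (n_keys, d)
    mean_term = scaled @ q_mean                       # (n_keys,)
    var_term = 0.5 * np.einsum("nd,de,ne->n", scaled, q_cov, scaled)
    return np.exp(mean_term + var_term)

def prune_kv(keys, values, q_mean, q_cov, keep_frac=0.5):
    d = keys.shape[-1]
    scores = expected_attention_scores(keys, q_mean, q_cov, d)
    n_keep = max(1, int(keep_frac * len(keys)))
    keep = np.argsort(scores)[-n_keep:]               # retain highest-scoring KV pairs
    return keys[keep], values[keep]

# Toy usage: 128 cached KV pairs of dimension 64, query statistics assumed known.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
q_mu, q_cov = rng.normal(size=64) * 0.1, 0.05 * np.eye(64)
K2, V2 = prune_kv(K, V, q_mu, q_cov, keep_frac=0.5)
print(K2.shape, V2.shape)                             # (64, 64) (64, 64)
```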
Semantic Aggregation and Sentence-Level Caching
SentenceKV (Zhu et al., 1 Apr 2025) compresses token-level K/V into sentence-level semantic representations, storing only these aggregates on the GPU. Token K/V pairs with high importance (by backward attention from a sliding window) are offloaded to CPU and brought back only when semantically relevant sentences are hit during decoding. This reduces GPU-resident cache growth from O(L) to O(S), where S is the number of sentences, ensuring flat decoding latency and substantial memory savings (up to 40% on long contexts).
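A conceptual sketch of the sentence-level retrieval step, under assumptions not drawn from the paper (mean-of-keys sentence representations, dot-product matching, a fixed top-S sentence budget); SentenceKV's actual aggregation, importance scoring, and CPU-offloading logic differ in detail:

```python
# Conceptual sketch of sentence-level KV retrieval (not SentenceKV's code):
# token-level K/V are grouped by sentence, each sentence is summarized by the
# mean of its keys, and only the top-matching sentences' tokens are gathered
# back for attention at a given decoding step.
import numpy as np

def build_sentence_index(keys, sentence_ids):
    # sentence_ids[i] gives the sentence index of token i.
    sents = np.unique(sentence_ids)
    reps = np.stack([keys[sentence_ids == s].mean(axis=0) for s in sents])
    return sents, reps        # reps stay on-device; token K/V could be offloaded

def retrieve_token_kv(query, keys, values, sentence_ids, sents, reps, top_s=2):
    # Rank sentences by similarity between the current query and their
    # aggregated key representation, then gather that subset's token K/V.
    sims = reps @ query
    chosen = sents[np.argsort(sims)[-top_s:]]
    mask = np.isin(sentence_ids, chosen)
    return keys[mask], values[mask]

rng = np.random.default_rng(1)
K, V = rng.normal(size=(200, 64)), rng.normal(size=(200, 64))
sent_ids = np.repeat(np.arange(10), 20)               # 10 sentences x 20 tokens
sents, reps = build_sentence_index(K, sent_ids)
Kr, Vr = retrieve_token_kv(rng.normal(size=64), K, V, sent_ids, sents, reps)
print(Kr.shape)                                       # (40, 64): tokens of 2 sentences
```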
Hardware-Efficient and Grouped/Tied Attention
“KV-off” in hardware-optimized architectures (Zadouri et al., 27 May 2025) refers to compressing/tying K and V projections or reorganizing the attention computation so fewer bytes are moved per token, thus saturating compute more effectively and improving kernel throughput. Grouped-Tied Attention ties K/V and reuses positional encodings, effectively halving the per-token KV cache. Grouped Latent Attention stores only a small device-local latent cache and absorbs the majority of up-projection weights, further lowering the per-token fetch and enabling higher degrees of tensor parallelism.
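The tying idea can be illustrated with a deliberately simplified attention module in which one projection serves as both key and value, so a single tensor per token needs to be cached and moved; this is a hypothetical sketch of the general principle only, not the GTA/GLA kernels (head grouping, positional-encoding reuse, and causal masking are all omitted):

```python
# Hypothetical illustration of the K/V tying idea: a single cached state per
# token acts as both key and value, roughly halving per-token KV cache traffic
# relative to caching separate K and V tensors.
import torch
import torch.nn as nn

class TiedKVAttentionSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.h, self.dh = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.kv_proj = nn.Linear(d_model, d_model, bias=False)  # one shared K=V projection
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, cache=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)
        kv = self.kv_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)
        if cache is not None:                       # cache holds a single tensor,
            kv = torch.cat([cache, kv], dim=2)      # not separate K and V
        # Causal masking omitted for brevity.
        attn = torch.softmax(q @ kv.transpose(-1, -2) / self.dh**0.5, dim=-1)
        out = (attn @ kv).transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), kv                 # kv is the new (tied) cache

x = torch.randn(1, 8, 256)
layer = TiedKVAttentionSketch()
y, cache = layer(x)
print(y.shape, cache.shape)   # torch.Size([1, 8, 256]) torch.Size([1, 4, 8, 64])
```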
Blockwise and Adaptive KV Cache for Diffusion LLMs
- Fast-dLLM (Wu et al., 28 May 2025): Implements a blockwise approximate KV cache, reusing “stale” cache for non-active blocks over several diffusion steps, justified empirically by low cosine drift of non-central K/V, with ≤2 point accuracy loss and speedups up to 27.6×.
- Elastic-Cache (Nguyen-Tri et al., 16 Oct 2025): Adaptively recomputes K/V states for subsets of tokens/layers according to an attention-aware drift test, leveraging the observation that the most-attended token provides a robust lower bound on cache drift. Distant MASK tokens are block-cached. The schedule is layer-aware, achieving up to 45× speedups in long-sequence decoding for diffusion models; a schematic sketch of such a refresh decision follows this list.
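A caricature of an attention-aware refresh decision, with an invented drift threshold tau and a placeholder recompute function; Elastic-Cache's actual layer-aware, block-structured schedule is more involved:

```python
# Simplified caricature of an attention-aware refresh test (not the authors'
# algorithm): recompute only the most-attended cached token's key, measure its
# drift against the cached copy, and refresh the block's K/V only when the
# drift exceeds a threshold; otherwise reuse the stale cache for this step.
import numpy as np

def should_refresh(cached_keys, attn_weights, recompute_key_fn, tau=0.02):
    star = int(np.argmax(attn_weights))               # most-attended cached token
    fresh = recompute_key_fn(star)                    # recompute just that key
    cos = fresh @ cached_keys[star] / (
        np.linalg.norm(fresh) * np.linalg.norm(cached_keys[star]) + 1e-8)
    return (1.0 - cos) > tau                          # drift test

# Toy usage with a fake "recompute" that perturbs the cached key slightly.
rng = np.random.default_rng(2)
K = rng.normal(size=(64, 32))
attn = rng.random(64)
refresh = should_refresh(K, attn, lambda i: K[i] + 0.001 * rng.normal(size=32))
print("refresh cache this step:", refresh)
```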
Sparse Retrieval and Multi-Memory Hierarchies
TriForce (Sun et al., 2024) offloads the full KV cache to host (CPU) memory, constructs a small retrieval cache via chunk-based vector similarity, and fetches K/V blocks on demand for speculation. Only the most contextually relevant cache blocks are loaded onto device, and hierarchical speculation guarantees that generation quality is maintained without drift.
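A schematic of the chunk-retrieval step under assumed parameters (chunk size, device budget in chunks, mean-key chunk summaries); the real system pairs this fetch with hierarchical speculative decoding, which is not modeled here:

```python
# Schematic chunk-based retrieval cache (not the actual TriForce system): the
# full K/V cache lives in host memory, contiguous position chunks are
# summarized by their mean key, and only the top-scoring chunks that fit a
# fixed device budget are fetched for the current step.
import numpy as np

CHUNK = 16            # tokens per chunk (hypothetical choice)
DEVICE_BUDGET = 4     # number of chunks kept on device (hypothetical choice)

def select_chunks(host_keys, query):
    n_chunks = len(host_keys) // CHUNK
    chunk_keys = host_keys[: n_chunks * CHUNK].reshape(n_chunks, CHUNK, -1)
    reps = chunk_keys.mean(axis=1)                    # per-chunk summary key
    scores = reps @ query
    return np.sort(np.argsort(scores)[-DEVICE_BUDGET:])  # chunk ids to fetch

def fetch(host_keys, host_values, chunk_ids):
    idx = np.concatenate([np.arange(c * CHUNK, (c + 1) * CHUNK) for c in chunk_ids])
    return host_keys[idx], host_values[idx]           # "transfer" host -> device

rng = np.random.default_rng(3)
hK, hV = rng.normal(size=(2048, 64)), rng.normal(size=(2048, 64))
ids = select_chunks(hK, rng.normal(size=64))
dK, dV = fetch(hK, hV, ids)
print(ids, dK.shape)                                  # 4 chunks -> (64, 64)
```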
3. Performance Implications and Empirical Benchmarks
KV-off decoding strategies yield dramatic savings in memory and variable speedups, as summarized below (all numbers are drawn from reported results):
| Approach | Context Lengths | Memory Saving | Speedup | Quality Delta |
|---|---|---|---|---|
| SentenceKV | 32K–256K | 30–40% (GPU) | Stable latency | PPL within 0.1; matches full KV accuracy (Zhu et al., 1 Apr 2025) |
| Fast-dLLM | up to 1024 | Blockwise cache reuse | Up to 27.6× | ≤2-point accuracy loss (Wu et al., 28 May 2025) |
| Elastic-Cache | 512+ | Redundant FLOPs cut by 80–90% | Up to 45× | ≤1% accuracy change (Nguyen-Tri et al., 16 Oct 2025) |
| Expected Attention | 120K | 8–15 GB on 8B model (50–90%) | Net speedup when I/O-bound | ≤1% quality loss (50% prune) (Devoto et al., 1 Oct 2025) |
| Hardware-efficient attention (GLA/GTA) | - | 44–50% per-device KV cache reduction | 1.2–2.0× throughput | No measurable quality loss (Zadouri et al., 27 May 2025) |
| TriForce | 128K | Offloads 100k+ tokens | 7.78× (offloading), 2.3× (on-chip) | Identical, lossless (Sun et al., 2024) |
| EmoLoom-2B KV-off | n/a | 0% (full recompute) | None (1.2–1.5× higher latency) | Eliminates seed/hardware variance (Li et al., 3 Jan 2026) |
Table: Summary of key performance results (see referenced papers for exact experimental settings).
In all cases, appropriate integration of the KV-off technique is necessary to maintain accuracy; e.g., over-aggressive pruning or poor chunking can degrade retrieval or QA accuracy (Zhu et al., 1 Apr 2025, Devoto et al., 1 Oct 2025).
4. Theoretical Considerations and Algorithmic Properties
Memory Complexity and Scaling
- Standard KV caching: Memory complexity O(L·H·d) grows linearly with the accumulated context length L, attention heads H, and head dimension d.
- SentenceKV and semantic caching: O(S·H·d) + O(τ·d), where S is sentence count and τ is per-step token budget, reducing active on-device cache size by orders of magnitude for long texts (Zhu et al., 1 Apr 2025).
- Hardware-efficient attention: Achieves further constant-factor reduction by grouping/tied heads and storing only a subset of latent cache per tensor-parallel rank (Zadouri et al., 27 May 2025).
- Cache pruning: Memory savings scale roughly linearly with the target compression fraction; empirical sweet spots are 2–4× compression before accuracy loss becomes significant (Devoto et al., 1 Oct 2025). A worked cache-sizing sketch follows this list.
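As a worked example, the arithmetic below uses an assumed 8B-class configuration (32 layers, 8 grouped KV heads, head dimension 128, fp16); real models differ, so the figures are illustrative only:

```python
# Worked sizing sketch under assumed hyperparameters (a Llama-8B-like config
# with grouped-query attention: 32 layers, 8 KV heads, head dim 128, fp16).
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store layers * kv_heads * head_dim elements per token.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * per_token

full = kv_cache_bytes(120_000)
pruned = kv_cache_bytes(120_000) * 0.5                # e.g. 2x compression
print(f"full cache: {full / 2**30:.1f} GiB")          # ~14.6 GiB
print(f"50% pruned: {pruned / 2**30:.1f} GiB")        # ~7.3 GiB
```

At 120K tokens the full cache for this assumed configuration is on the order of 15 GB, which illustrates why 50–90% pruning translates into multi-gigabyte savings per sequence.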
Cache Drift and Safe Eviction
All effective KV-off mechanisms are predicated on bounding the impact of omitted information. Theoretical results (e.g., Fast-dLLM’s KL/TV bounds for high-confidence parallel decoding (Wu et al., 28 May 2025)) and drift tests (Elastic-Cache (Nguyen-Tri et al., 16 Oct 2025)) support aggressive caching strategies by leveraging the concentration of attention or measured cache stability. Gaussian attention estimation in Expected Attention further provides closed-form safety criteria for pruning (Devoto et al., 1 Oct 2025).
5. Applications and Protocol Implications
- Long-context and large-batch inference: Enables practical deployment of LLMs on very long sequences (100K+ tokens) and high concurrency in online serving settings by mitigating HBM exhaustion and latency bottlenecks (Sun et al., 2024, Zhu et al., 1 Apr 2025, Zadouri et al., 27 May 2025).
- Training–inference equivalence and reproducibility: Protocols such as EmoLoom-2B (Li et al., 3 Jan 2026) require KV-off for exact matching between training-time and eval-time execution, eliminating backend variance and ensuring output invariance under seed/system changes.
- Diffusion LLMs: Aggressive KV-off (blockwise caching, adaptive refresh) is necessary for tractable inference due to the iterative bidirectional nature and the potential for redundant computation across many denoising steps (Wu et al., 28 May 2025, Nguyen-Tri et al., 16 Oct 2025).
Experimental outcomes show that with appropriate safety mechanisms, most KV-off methods preserve core modeling metrics (PPL, pass@1, accuracy) to within 1–2% even at extreme memory reductions, and sometimes improve practical throughput by more than an order of magnitude.
6. Limitations, Assumptions, and Future Directions
Core Assumptions
- Efficacy depends on the actual redundancy and structure in LLM attention patterns; in adversarial or highly nonlocal contexts, drift-based methods may be less safe (Nguyen-Tri et al., 16 Oct 2025).
- Semantic chunking techniques (SentenceKV) assume effective segmentation and sufficient granularity; equal-sized bucketing is consistently inferior (Zhu et al., 1 Apr 2025).
- Adaptive refresh and expected attention depend on accurate statistics of hidden states and attention drift, which may suffer under nonstationary or highly heterogeneous data.
- Some techniques require hardware/software adaptation in kernel implementation (GTA/GLA), or runtime hooks for safe cache management (KVPress) (Devoto et al., 1 Oct 2025, Zadouri et al., 27 May 2025).
Research Directions
- Integration with speculative decoding, mixture-of-experts, and block-sparse architectures is an active area, aiming to amortize recompute cost or further reduce cache granularity (Sun et al., 2024, Nguyen-Tri et al., 16 Oct 2025).
- Extension to settings with very small per-device caches, multi-user systems, or mobile inference, where aggressive KV-off strategies are especially valuable.
- Auto-tuning of refresh thresholds, semantic keeping factors, and per-head budgets, possibly using train-time supervision or meta-optimization (Zhu et al., 1 Apr 2025, Nguyen-Tri et al., 16 Oct 2025).
7. Historical Context and Standardization
The move towards KV-off decoding has accelerated as LLM context lengths and deployment scale have increased, exposing memory and latency as principal bottlenecks. While early approaches were primarily heuristic or fixed-ratio pruning, recent research demonstrates systematic, theory-informed approaches spanning from cache compression (Expected Attention (Devoto et al., 1 Oct 2025)) and group-based hardware adaptation (Zadouri et al., 27 May 2025) to architecture-specific refresh protocols (Elastic-Cache (Nguyen-Tri et al., 16 Oct 2025)) and pipeline-intrinsic design (EmoLoom-2B (Li et al., 3 Jan 2026)). The standardization of KV-off as both an engineering principle and a reproducibility prerequisite is being solidified in both open-source libraries (e.g., KVPress) and protocol-driven pipelines.
For further in-depth methodological details, benchmarks, and code, see the referenced works: (Zhu et al., 1 Apr 2025, Devoto et al., 1 Oct 2025, Wu et al., 28 May 2025, Nguyen-Tri et al., 16 Oct 2025, Sun et al., 2024, Zadouri et al., 27 May 2025, Li et al., 3 Jan 2026).