
Rolling KV Cache for Efficient LLM Inference

Updated 7 October 2025
  • Rolling KV Cache is a dynamic memory management strategy in LLMs that preserves high-fidelity recent activations while compressing older key-value pairs.
  • It combines quantization, sliding-window retention, and selective merging techniques to significantly reduce memory footprint and improve decoding speed with minimal accuracy loss.
  • This approach adapts retention policies based on token importance and system constraints, enabling scalable long-context processing and efficient transformer inference.

A rolling KV cache is a memory management strategy in LLM inference that continuously updates, compresses, and selectively maintains key–value (KV) pairs generated by the transformer’s attention mechanism across potentially very long sequences. As LLMs process increasing context lengths, the rolling KV cache approach seeks to efficiently “roll” the cache—preserving high-fidelity recent activations, aggressively compressing or merging older content, dynamically adjusting retention, and aligning storage with practical inference constraints. This enables scalable, high-throughput, and low-latency inference while maintaining accuracy with a bounded memory budget.

1. Design Principles and Motivations

The rolling KV cache paradigm arises from three core challenges:

  • Linearly growing memory footprint: In standard transformer decoding, each generated token adds a new KV pair at every layer, resulting in $O(NL)$ memory for sequence length $N$ and $L$ layers.
  • Temporal token importance: Due to locality in the attention mechanism, recent tokens disproportionately influence predictions; early tokens exert diminishing influence but cannot be naively evicted without potential context loss.
  • Dynamic workload and system constraints: Serving environments must handle variable context lengths, batch sizes, and hardware heterogeneity; static or uniform cache strategies are inadequate.

The defining principle of a rolling KV cache is differential treatment of time-ordered tokens—recent tokens are kept with higher fidelity (often full precision and uncompressed), while distant tokens are quantized, compressed, downsampled, or merged, always in a manner that balances memory, accuracy, and throughput.
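
To make the differential-treatment principle concrete, the sketch below (a hypothetical illustration, not taken from any of the cited papers) keeps the most recent `window` tokens at full precision, down-casts older entries as they leave the window, and evicts the oldest compressed entries once a global budget is exceeded; the class and parameter names are illustrative assumptions.

```python
import torch

class RollingKVCache:
    """Minimal sketch of a rolling KV cache: recent tokens stay in full
    precision, older tokens are compressed (here: a simple fp16 round-trip
    as a stand-in for real quantization/merging) under a bounded budget."""

    def __init__(self, window: int = 128, budget: int = 1024):
        self.window = window                      # recent tokens kept at full fidelity
        self.budget = budget                      # hard cap on total cached tokens
        self.recent_k, self.recent_v = [], []     # full-precision tail
        self.old_k, self.old_v = [], []           # compressed head

    def _compress(self, t: torch.Tensor) -> torch.Tensor:
        # Stand-in for real compression: lossy round-trip through fp16.
        return t.to(torch.float16).to(t.dtype)

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.recent_k.append(k)
        self.recent_v.append(v)
        # Roll: once the full-precision window overflows, compress the oldest
        # recent entry and move it into the compressed region.
        if len(self.recent_k) > self.window:
            self.old_k.append(self._compress(self.recent_k.pop(0)))
            self.old_v.append(self._compress(self.recent_v.pop(0)))
        # Enforce the global budget by evicting the oldest compressed entry.
        if len(self.old_k) + len(self.recent_k) > self.budget:
            self.old_k.pop(0)
            self.old_v.pop(0)

    def keys_values(self):
        k = torch.stack(self.old_k + self.recent_k)
        v = torch.stack(self.old_v + self.recent_v)
        return k, v
```

In use, the cache is appended to once per decoding step (`cache.append(k_t, v_t)`), and `keys_values()` returns the bounded-size tensors consumed by attention.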

2. Quantization and Sliding-Window Strategies

Quantization and sliding-window retention play a central role in rolling KV caches.

  • SKVQ ("Sliding-window Key and Value Cache Quantization") (Duanmu et al., 10 May 2024):
    • The most recent $w$ tokens are always kept at full precision, leveraging the observation that attention heavily favors recent context (“locality of attention”).
    • Older tokens are quantized to extremely low bitwidths (as low as 2 bits for keys and 1.5 bits for values), with group-wise channel reordering to maximize quantization uniformity and clipped dynamic quantization within groups to mitigate outlier effects. This is formalized as:

    $$f(\alpha, X) = \operatorname{clamp}\!\left( \left\lfloor \frac{X - z}{h} \right\rfloor,\ 0,\ 2^{N} - 1 \right), \qquad h = \alpha\,\frac{\max(X) - \min(X)}{2^{N} - 1}, \qquad z = \frac{\alpha \min(X)}{h}$$

    This ensures rolling preservation of high-precision tokens at the cache tail, enabling context lengths of up to 1M tokens on an 80GB GPU for a 7B model and yielding up to 7× decoding speedup with negligible (<5%) accuracy loss on LongBench. (A minimal sketch of this clipped group-wise quantization follows this list.)

  • RotateKV (Su et al., 25 Jan 2025) advances extreme low-bit quantization via outlier-aware rotations (using the fast Walsh–Hadamard transform), pre-RoPE grouped-head rotation, and attention-sink-aware quantization. These techniques minimize quantization error in a rolling cache by adapting to channel-wise and head-wise statistics, achieving <0.3 PPL degradation with a 2-bit cache and 3.97× memory reduction.

  • LeanKV (Zhang et al., 4 Dec 2024) dynamically partitions tokens by significance and precision, combining high-precision storage for recent and important tokens with fine-grained lower precision or pruning for older/unimportant ones. This mixed-precision, rolling quantization is orchestrated by an on-GPU unified memory manager that compacts memory to contiguous regions.
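
As a rough illustration of the clipped dynamic quantization above, the helper below quantizes a tensor group-wise to $N$ bits with a clipping factor $\alpha$. It is a minimal sketch under simplifying assumptions (no channel reordering, no bit packing, and a conventional asymmetric-quantization reading of the zero point); the function name and signature are illustrative, not SKVQ's actual API.

```python
import torch

def clipped_quantize(x: torch.Tensor, n_bits: int = 2, alpha: float = 0.95,
                     group_size: int = 64):
    """Group-wise clipped dynamic quantization: scale h and zero-point z are
    computed per group, with the range shrunk by alpha to suppress outliers.
    Assumes x.numel() is divisible by group_size."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)                        # [num_groups, group_size]
    qmax = 2 ** n_bits - 1
    h = alpha * (g.max(dim=1, keepdim=True).values
                 - g.min(dim=1, keepdim=True).values) / qmax
    h = h.clamp_min(1e-8)                                # avoid division by zero
    z = alpha * g.min(dim=1, keepdim=True).values / h    # zero point in integer units
    q = torch.clamp(torch.floor(g / h - z), 0, qmax)     # integer codes in [0, 2^N - 1]
    dequant = (q + z) * h                                # approximate reconstruction
    return q.to(torch.uint8).reshape(orig_shape), dequant.reshape(orig_shape)
```

Values outside the clipped range saturate at the code boundaries, which is the intended trade-off: a slightly tighter range for the bulk of entries at the cost of clipping rare outliers.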

3. Merging, Pruning, and Structured Replacement Techniques

In addition to quantization, merging and structured selection methods support rolling KV cache management:

  • KVMerger (Wang et al., 11 Jul 2024):

    • Adapts locality-constrained agglomerative clustering to identify consecutive key states (tokens) with high mutual cosine similarity, creating “merging sets”.
    • Within each set, applies a Gaussian kernel weighted merge:

    $$k_{\mathrm{merge}} = w_p k_p + \sum_{i \ne p} w_i k_i, \qquad w_i = \frac{g_{pi}}{\sum_j g_{pj}}, \qquad g_{pi} = \exp\!\left(-\frac{\|k_p - k_i\|^2}{2\sigma^2}\right)$$

    This selective, locality-preserving rolling merge maintains fidelity with the original attention output, outperforming eviction-based and earlier merging baselines at both 50% and 35% cache budgets. (A sketch of this Gaussian-kernel merge appears after this list.)

  • WeightedKV (Yuan et al., 3 Mar 2025):

    • Retains keys of important tokens (anchors), discards less important keys, and “rolls” their values into adjacent anchors via a convex combination weighted by attention scores, e.g.:

    $$\tilde{v} \approx \frac{e^{q^\top k_1 / \sqrt{d}}}{e^{q^\top k_1 / \sqrt{d}} + e^{q^\top k_2 / \sqrt{d}}}\, v_1 + \frac{e^{q^\top k_2 / \sqrt{d}}}{e^{q^\top k_1 / \sqrt{d}} + e^{q^\top k_2 / \sqrt{d}}}\, v_2$$

    Experimental results show that this “rolling merge” yields lower perplexity under tight cache budgets than eviction or coarse merging, in particular because value representations are less redundant and therefore need to be preserved rather than discarded.

  • KeepKV (Tian et al., 14 Apr 2025):

    • Adopts an “Electoral Votes” mechanism for tracking merging history and introduces zero inference-perturbation merging (ZIP-mg), ensuring the merged cache’s contribution to attention is mathematically equivalent (perturbation-free) to the full cache at the current step.
    • This is critical for rolling caches since error does not accumulate as the cache is repeatedly merged/compressed:

    $$o_t = \frac{\sum_{i=1}^{t} p_i s_i^t v_i}{\sum_{i=1}^{t} p_i s_i^t}$$

    Under rolling operation, multi-step attentions are managed via EMA prediction to keep perturbation bounded as the cache “rolls”.
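
The Gaussian-kernel weighted merge used by KVMerger-style methods can be sketched as follows. This is an illustrative reading of the $k_{\mathrm{merge}}$ formula in which the pivot key is included in the normalization; the clustering that forms each merging set and the choice of pivot and bandwidth $\sigma$ are assumed to be given.

```python
import torch

def gaussian_merge(keys: torch.Tensor, pivot: int = 0, sigma: float = 1.0) -> torch.Tensor:
    """Merge a set of key states into one representative key using Gaussian
    kernel weights centred on a pivot key: states closer to the pivot
    receive larger weight in the convex combination."""
    k_p = keys[pivot]                                    # pivot key, shape [d]
    dists = ((keys - k_p) ** 2).sum(dim=-1)              # squared distances to pivot
    g = torch.exp(-dists / (2 * sigma ** 2))             # Gaussian kernel weights
    w = g / g.sum()                                      # normalize to a convex combination
    return (w.unsqueeze(-1) * keys).sum(dim=0)           # weighted merge of the set

# Example: merge four consecutive, mutually similar key states into one slot.
keys = torch.randn(4, 64)
merged = gaussian_merge(keys, pivot=0, sigma=0.5)
```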

4. Dynamic, Task-Adaptive, and Layerwise Eviction/Retention

Rolling KV cache management benefits significantly from task- and layer-aware retention, and from dynamically adapting budgets over time:

  • DynamicKV (Zhou et al., 19 Dec 2024):

    • Implements a per-layer, per-head, task-adaptive rolling retention schedule by pooling recent attention scores at every layer/head, then selecting top-scoring tokens according to a rolling, periodically-updated budget:

    $$bs = (wt - ws) \cdot r_{\max}, \qquad A_l' = \operatorname{TopK}(A_l, bs), \qquad KV_l' = KV_l[A_l'.\mathrm{indices}]$$

    The “rolling update” every $m$ layers leverages statistical attention pooling (e.g., a moving window) to adapt the cache’s contents to the ongoing task’s distribution, allowing compression to as low as 1.7% of the full cache while retaining roughly 85% of full-cache task performance. (A sketch of this pooled top-k selection appears after this list.)

  • CAKE (Qin et al., 16 Mar 2025):

    • Frames cache budget allocation as a “cake-slicing problem”: each layer $\ell$ is allocated a budget $B_\ell$ proportional to a dynamic preference $P_\ell = \mathcal{H}^{1/\tau_1} \cdot \mathcal{V}^{1/\tau_2}$, where $\mathcal{H}$ is the entropy of spatial attention and $\mathcal{V}$ is the temporal variance. A per-token eviction indicator $I[n] = \mathrm{Mean}(A_{\mathrm{recent}}(:,n)) + \gamma \cdot \mathrm{Var}(A_{\mathrm{recent}}(:,n))$ allows retention of tokens whose importance shifts over time, cascading the rolling update across layers.
  • SpindleKV (Tang et al., 9 Jul 2025):
    • Targets deep versus shallow layers with distinct rolling reduction: attention weight-based eviction in deep layers (where attention is sparse), and codebook-based representation merging in shallow layers (where redundancy is high within token representations). The codebook strategy learns basis vectors and proxies for similar tokens, with algorithms controlling the rolling addition and replacement online.
  • KVCompose (Akulov et al., 5 Sep 2025):
    • Aggregates per-head, per-layer attention-derived importance scores, sequentially selects top tokens for each head, and then composes "composite tokens"—aligning across heads so as to maintain uniform tensor layouts across a rolling cache.
    • Global budget allocation (“layer-adaptive retention”) then assigns composite retention dynamically, favoring those layers whose context is more informative to the task, supporting both scalability and rolling updates under high compression.
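
A pooled top-k retention step of the kind used by DynamicKV-style schedules can be sketched as below. The per-layer budget computation, pooling window, and periodic rolling update are simplified assumptions here; only the core select-by-pooled-attention operation is shown.

```python
import torch

def topk_retain(keys: torch.Tensor, values: torch.Tensor,
                attn: torch.Tensor, budget: int):
    """Keep only the `budget` tokens with the highest pooled recent attention.
    keys/values: [seq, d]; attn: [recent_queries, seq] attention weights of
    the most recent queries onto the cached tokens."""
    scores = attn.mean(dim=0)                                  # pool attention over recent queries
    budget = min(budget, scores.numel())
    keep = torch.topk(scores, budget).indices.sort().values    # keep indices in token order
    return keys[keep], values[keep], keep

# Example: retain 64 of 1024 cached tokens for one layer/head.
seq, d = 1024, 128
keys, values = torch.randn(seq, d), torch.randn(seq, d)
attn = torch.softmax(torch.randn(32, seq), dim=-1)             # recent attention rows
k_kept, v_kept, idx = topk_retain(keys, values, attn, budget=64)
```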

5. Practical Systems, Storage Hierarchies, and High-Throughput Scheduling

Rolling KV caches are implemented in practice through advanced system strategies that balance computation, memory hierarchy, and IO bottlenecks:

  • Cake system (Jin et al., 4 Oct 2024) for long-context LLM inference:
    • Splits a long prompt into chunks, begins forward cache computation from the front on GPU, while fetching precomputed caches for the tail from disk in parallel, with compute and IO pointers marching toward each other (“bidirectional scheduling”).
    • Dynamically determines the meeting point (merge) of the pointers to minimize overall Time-To-First-Token (TTFT), achieving average 2.6× TTFT reduction.
  • AdaptCache (Feng et al., 28 Aug 2025):
    • Implements a utility-maximizing hierarchical storage: for each cache entry, selects compression method/ratio and placement (DRAM/SSD) according to predicted frequency of reuse (obtained by offline profiling and exponential fit), quality loss, and device bandwidth.
    • The utility of each entry is:

    $$\mathrm{Utility}(i) = \mathrm{Freq}(i) \times \left[\alpha \cdot \mathrm{Quality}(i, M_i, R_i) - \frac{\mathrm{size}(i, M_i, R_i)}{\mathrm{Bandwidth}}\right]$$

    In rolling scenarios, this enables maximal DRAM hit rates by adaptively compressing and placing entries as the cache “rolls” with ongoing requests; the resulting delay reductions range from 1.43–2.4× at constant quality. (A sketch of this utility-based placement appears after this list.)

  • FlowKV multi-turn isolation (Liu et al., 21 May 2025):

    • For multi-turn conversations, isolates previously compressed cache segments (“locking in” past turns), and only applies compression to new KV pairs from the latest turn, perfectly fitting the rolling paradigm by preventing error accumulation (catastrophic forgetting) as sessions evolve.
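
The utility-driven placement described for AdaptCache can be sketched as a simple scoring and greedy DRAM-first loop. The entry fields, bandwidth constant, and placement policy below are illustrative assumptions; the actual system additionally chooses a compression method and ratio per entry from offline profiles.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    name: str
    freq: float          # predicted reuse frequency (from profiling)
    quality: float       # predicted quality after chosen compression
    size_gb: float       # size after compression, in GB

def utility(e: CacheEntry, alpha: float, bandwidth_gbps: float) -> float:
    """Utility(i) = Freq(i) * [alpha * Quality(i) - size(i) / Bandwidth]."""
    return e.freq * (alpha * e.quality - e.size_gb / bandwidth_gbps)

def place_entries(entries, dram_capacity_gb, alpha=1.0, bandwidth_gbps=12.0):
    """Greedy DRAM-first placement: highest-utility entries fill DRAM,
    the remainder spill to SSD."""
    ranked = sorted(entries, key=lambda e: utility(e, alpha, bandwidth_gbps),
                    reverse=True)
    dram, ssd, used = [], [], 0.0
    for e in ranked:
        if used + e.size_gb <= dram_capacity_gb:
            dram.append(e.name)
            used += e.size_gb
        else:
            ssd.append(e.name)
    return dram, ssd

# Example: three cached contexts competing for 4 GB of DRAM.
entries = [CacheEntry("chat-A", 5.0, 0.98, 2.0),
           CacheEntry("doc-B", 1.0, 0.95, 3.0),
           CacheEntry("tool-C", 3.0, 0.90, 1.5)]
dram, ssd = place_entries(entries, dram_capacity_gb=4.0)
```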

6. Theoretical, Streaming, and Cross-Layer Considerations

  • BalanceKV (Han et al., 11 Feb 2025) introduces discrepancy-theoretic streaming selection: recursively partitioning and merging tokens in a streaming fashion to maintain an $\epsilon$-approximate attention computation in sublinear space. The selection is geometric and query-agnostic, ensuring that error remains controlled as the cache “rolls” forward, with provable space–approximation bounds.
  • xKV (Chang et al., 24 Mar 2025) leverages alignment of dominant singular vectors (principal subspaces) across adjacent layers’ KV caches. Joint SVD over grouped layers finds a low-rank subspace that compactly codes the entire rolling cache, allowing consolidation across layers with up to 6.8× compression and improved accuracy over standard single-layer SVD.
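
A cross-layer low-rank consolidation in the spirit of xKV can be sketched with a single SVD over a group of adjacent layers' caches. The grouping, rank selection, and reconstruction path are simplified assumptions here; only the shared-subspace idea is shown.

```python
import torch

def joint_lowrank_compress(layer_caches, rank: int = 32):
    """Stack the K (or V) caches of a group of adjacent layers, run one SVD
    over the concatenation, and keep a shared low-rank basis plus per-layer
    coefficients. layer_caches: list of [seq, d] tensors."""
    stacked = torch.cat(layer_caches, dim=0)              # [num_layers * seq, d]
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    basis = Vh[:rank]                                      # shared principal subspace [rank, d]
    coeffs = [c @ basis.T for c in layer_caches]           # per-layer coefficients [seq, rank]
    return basis, coeffs

def reconstruct(basis, coeffs):
    return [c @ basis for c in coeffs]                     # approximate per-layer caches

# Example: compress the K caches of 4 adjacent layers into one rank-32 subspace.
caches = [torch.randn(512, 128) for _ in range(4)]
basis, coeffs = joint_lowrank_compress(caches, rank=32)
approx = reconstruct(basis, coeffs)
```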

7. Rolling KV Cache Application and Future Directions

The rolling KV cache framework enables extremely long-context LLM inference (often up to hundreds of thousands or millions of tokens on a single GPU), enhances cache reusability and sharing under server workloads, and achieves graceful performance degradation or even improvements under tight resource constraints. Methods that blend quantization, selective merging, careful cache scheduling, and dynamic and hierarchical budgeting form the state of the art.

A plausible implication is that future research will further integrate query-aware and query-agnostic selection, unify structured and unstructured retention mechanisms, and tightly couple rolling caches with retrieval, storage, and multi-user scheduling policies in distributed deployments. Robustness to error accumulation, adaptability to heterogeneous hardware, and support for task-adaptive and user-personalized long-term contexts are likely to be primary foci. The rolling KV cache thus represents the current frontier in scalable, efficient transformer-based LLM serving.
