
FastKV: Efficient KV Cache Compression

Updated 5 December 2025
  • FastKV is a KV cache compression framework that decouples prefill computation from decoding memory reduction using independent TSP and KV retention rates.
  • It employs a two-stage token-selective propagation mechanism ensuring early layers use full context while later layers focus on salient tokens.
  • Experimental results demonstrate up to 2.87× speedup in decoding and minimal accuracy loss on long-context benchmarks.

FastKV is a KV cache compression framework for LLM inference, architected to accelerate both prefill and decoding stages under long-context scenarios. The method is designed around the insight that early model layers benefit from full-sequence context to determine critical dependencies, whereas later layers’ token importance stabilizes, enabling aggressive reduction in context and cache. FastKV explicitly decouples the reduction of prefill computational cost (via context propagation rate, or “TSP rate”) from the reduction of KV cache memory/traffic in decoding (“KV retention rate”), thus overcoming critical coupling limitations in prior works (Jo et al., 3 Feb 2025).

1. Motivation and Limitations of Previous Approaches

Long-context LLM inference involves two dominant computational bottlenecks: prefill (quadratic in prompt size due to full self-attention in each layer) and decoding (which demands linear memory bandwidth proportional to the growing KV cache). As prompt lengths scale into the hundreds of thousands or millions of tokens, prefill latency becomes prohibitive and memory usage during decoding becomes the system bottleneck.

Previous attempts at KV cache compression fall into two main categories:

  • Decoding-only pruning methods: Approaches such as StreamingLLM, SnapKV, and H2O prune unimportant cache entries after prefill. While they reduce decoding memory and traffic, they leave prefill computational cost unchanged.
  • Prefill-aware pruning methods: Approaches such as GemFilter and PyramidInfer aggressively reduce the effective prompt length during prefill to shrink both computation and cache. However, these approaches tie the prefill reduction rate to the cache compression rate across all layers, often necessitating early-layer pruning that irreversibly removes important context and causes accuracy degradation.

FastKV is designed to decouple these two rates, overcoming inherent limitations and accuracy trade-offs in prior frameworks (Jo et al., 3 Feb 2025).

2. FastKV Architecture and Token-Selective Propagation Mechanism

FastKV introduces a two-stage prefill pipeline with independent KV compression, enabling flexible trade-offs between speed, memory, and accuracy:

  • Stage 1 (Layers 1…ℓ_TSP): Each early Transformer layer executes full self-attention over the entire input prompt (N tokens), allowing the model to fully assess contextual dependencies and token significance.
  • Stage 2 (Layers ℓ_TSP+1…L): Subsequent layers process only a “salient” subset of tokens identified as important by the Token-Selective Propagation (TSP) criterion at ℓ_TSP, together with a window of most recent tokens. All downstream computation and stored KV cache depend only on this reduced set.

Token saliency at the TSP layer, ℓ_TSP, is computed by averaging each token's attention weight across heads from the perspective of the latest observed tokens:

$$S_i^{(\ell_{TSP})} = \frac{1}{H} \sum_{h=1}^{H} s_i^{(\ell_{TSP},\,h)},$$

where

$$s_i^{(\ell_{TSP},\,h)} = \text{Pool}\!\left(\sum_{n=0}^{N_{obs}-1} \text{Att}_{\ell_{TSP}}\big[h,\, N_I - n,\, i + m\big]\right).$$

The top R_TSP · N tokens ranked by S_i^(ℓ_TSP), together with a window of the most recent tokens, are propagated forward, forming the reduced index set P for subsequent layers.
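
To make the selection concrete, the sketch below scores tokens by pooled attention mass from the most recent observation window and keeps the top R_TSP fraction plus a recent window. It is an illustrative PyTorch rendering of the equations above, not the reference implementation; the pooling kernel size, observation-window length, and recent-window length are assumed values.

import torch

def select_salient_tokens(attn, r_tsp, n_obs=8, window=32, pool_kernel=7):
    # attn: [H, N, N] post-softmax attention weights of the TSP layer.
    H, N, _ = attn.shape
    # Sum the attention mass that the last n_obs query positions place on each key token.
    obs = attn[:, -n_obs:, :].sum(dim=1)                      # [H, N]
    # Smooth scores over neighboring tokens (the Pool(...) term), then average over heads.
    pooled = torch.nn.functional.max_pool1d(
        obs.unsqueeze(1), kernel_size=pool_kernel, stride=1,
        padding=pool_kernel // 2).squeeze(1)                  # [H, N]
    scores = pooled.mean(dim=0)                               # [N], i.e. S_i
    # Keep the top R_TSP * N salient tokens plus the most recent window.
    k = min(max(1, int(r_tsp * N)), N - window)
    salient = torch.topk(scores[:-window], k).indices
    recent = torch.arange(N - window, N)
    return torch.cat([salient, recent]).unique()              # sorted index set P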

Independent KV entry pruning is performed at each layer with a user-specified retention rate R_KV, further reducing the cache footprint.

3. Decoupling Prefill Computation and KV Cache Budget

FastKV’s central design leverages two independent control knobs:

  • TSP rate R_TSP: Determines the ratio of tokens propagated past the TSP layer for post-ℓ_TSP computation. Lowering this rate directly reduces prefill cost by restricting the number of tokens processed in later layers.
  • KV retention rate R_KV: Specifies the fraction of cached KV entries retained per layer, controlling inference-time memory and bandwidth. This parameter can be set orthogonally to R_TSP.

Each layer, regardless of its position relative to ℓ_TSP, selects the top-scoring R_KV · (context size) entries for caching using a defined importance function f(K, V, Att), e.g., based on average attention mass.
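
As one minimal sketch of this per-layer pruning, the snippet below takes the importance score to be the average attention mass each key receives and always retains a recent window (a SnapKV-style convention); the scoring choice and window size are assumptions for illustration, not necessarily the exact f(K, V, Att) used by FastKV.

import torch

def prune_kv(K, V, attn, r_kv, window=32):
    # K, V: [H, N, d] keys/values for one layer after prefill; attn: [H, N, N].
    H, N, d = K.shape
    # Importance of each key position = mean attention mass over heads and queries.
    importance = attn.mean(dim=(0, 1))                        # [N]
    budget = max(window, int(r_kv * N))
    # Top-scoring entries outside the always-kept recent window.
    top = torch.topk(importance[:-window], budget - window).indices
    keep = torch.cat([top, torch.arange(N - window, N)]).unique()
    return K[:, keep, :], V[:, keep, :]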

This decoupling allows practitioners to optimize for diverse objectives (e.g., aggressive decoding compression with conservative prefill or vice versa) without incurring the fundamental tradeoffs characteristic of previous methods (Jo et al., 3 Feb 2025).
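
To illustrate the two knobs operating independently, a hypothetical configuration object might look as follows; the class, field names, and defaults are illustrative only and not the released API.

from dataclasses import dataclass

@dataclass
class FastKVConfig:
    # Hypothetical settings object; the real integration may expose these differently.
    tsp_layer: int = 16         # ℓ_TSP: last layer that attends over the full prompt
    tsp_rate: float = 0.20      # R_TSP: fraction of tokens propagated past ℓ_TSP
    kv_retention: float = 0.10  # R_KV: fraction of KV entries cached per layer

# The two rates can be chosen independently for different objectives:
decode_heavy = FastKVConfig(tsp_rate=0.50, kv_retention=0.05)   # conservative prefill, aggressive cache
prefill_heavy = FastKVConfig(tsp_rate=0.10, kv_retention=0.30)  # aggressive prefill, conservative cache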

FastKV Prefill and Decoding Algorithm (Sketch)

for layer in 1 .. L:
    # Layers 1..ℓ_TSP attend over the full prompt; later layers see only the reduced set.
    X, Att, K, V = SelfAttentionLayer_layer(X)
    # Independent per-layer KV pruning with retention rate R_KV.
    K_pruned, V_pruned = top_k_by_importance(K, V, Att, R_KV)
    store(K_pruned, V_pruned, layer)
    if layer == ℓ_TSP:
        P = select_salient_tokens(Att, R_TSP)  # salient + recent token indices
        X = X[P]                               # propagate only the selected tokens onward

4. Computational Efficiency and Scaling

FastKV results in substantial reductions in both prefill and decoding computation:

  • Prefill Compute:
    • Baseline: O(L · N² · d)
    • FastKV: O(ℓ_TSP · N² · d) + O((L − ℓ_TSP) · (R_TSP · N)² · d)
    • For typical parameters (ℓ_TSP ≈ L/2, R_TSP = 20%), net prefill compute is about 60% of the baseline (see the check below).
  • Decoding Memory/Bandwidth: Only R_KV · N cache entries per layer, leading to an order-of-magnitude reduction in memory traffic.
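
As a back-of-the-envelope check of the expressions above, the snippet below evaluates the attention-only FLOP ratio for the quoted settings. It comes out to roughly 0.52 for the quadratic attention term alone, so the ~60% overall figure presumably also reflects per-token costs (e.g., FFN) that scale linearly with the number of propagated tokens; the layer count used here is an assumed example.

def attn_flop_ratio(num_layers, ell_tsp, r_tsp):
    # Baseline: L * N^2 * d.  FastKV: ell_tsp * N^2 * d + (L - ell_tsp) * (r_tsp * N)^2 * d.
    # The common factor N^2 * d cancels when taking the ratio.
    return (ell_tsp + (num_layers - ell_tsp) * r_tsp ** 2) / num_layers

L = 32                                             # e.g., a 32-layer 8B model (assumed)
print(f"{attn_flop_ratio(L, L // 2, 0.20):.2f}")   # -> 0.52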

Empirical wall-clock measurements (LLaMA-3.1-8B-Instruct with a 128K-token context) show speedups of 1.82× for prefill and 2.87× for decoding compared to the full-context baseline.

5. Experimental Results and Comparative Performance

FastKV was evaluated on LLaMA-3.1-8B-Instruct and Mistral-8B-Instruct (32–36 layers, GQA attention, 128K context) across standardized long-context benchmarks:

  • LongBench: ≤1% average score drop relative to full-context. Competing prefill-aware methods drop 5–12%.
  • RULER (retrieval, 128K): Achieves 75.6% (FastKV) vs. 86.0% (full context) at 10% KV retention, outperforming decoding-only pruning.
  • Needle-In-A-Haystack: FastKV attains 99.9% retrieval accuracy vs. 99.0% for full baseline, indicating robust token retention in retrieval settings.
  • Latency: At 128K context, both prefill and decoding achieve over 2× speedup in wall-clock token-generation (Jo et al., 3 Feb 2025).

These performance metrics support FastKV's effectiveness in balancing cache compression with negligible degradation of long-context reasoning and retrieval capabilities.

6. Practical Implementation and Tuning

FastKV is implemented as a modular wrapper for existing Transformer stacks:

  • Integration: At each layer, the framework computes saliency and importance scores, compresses tokens at ℓ_TSP, and prunes KV entries at every layer.
  • Tuning ℓ_TSP: Set by minimizing the normalized L2 distance between final hidden states from FastKV and the full-context baseline across a calibration set, with a typical threshold ε ≈ 10⁻³ (see the calibration sketch after this list).
  • Tuning R_TSP and R_KV: Practically, start at R_TSP = 20% and R_KV = 10%. These rates can be adjusted independently to target a desired compute/memory/accuracy tradeoff.
  • Compatibility: Orthogonal to low-level kernel optimizations (FlashAttention2/3), quantization schemes, batch size scaling, and paged attention engines (vLLM, SGLang), enabling stacking with state-of-the-art infrastructure.
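
One possible shape for the ℓ_TSP calibration loop referenced above is sketched below. It assumes a hypothetical callable run_prefill(prompt, ell_tsp) that returns the final-layer hidden state of the last prompt token, with ell_tsp=None meaning the full-context baseline; neither the callable nor the loop is taken from the released implementation.

import torch

def tune_tsp_layer(run_prefill, calibration_prompts, num_layers, eps=1e-3):
    # Pick the earliest TSP layer whose final hidden state stays within eps of the
    # full-context baseline on a calibration set (illustrative sketch).
    for ell in range(1, num_layers + 1):
        worst = 0.0
        for prompt in calibration_prompts:
            h_full = run_prefill(prompt, ell_tsp=None)   # full-context reference
            h_tsp = run_prefill(prompt, ell_tsp=ell)     # TSP applied at layer ell
            # Normalized L2 distance between final hidden states.
            dist = (torch.norm(h_tsp - h_full) / torch.norm(h_full)).item()
            worst = max(worst, dist)
        if worst <= eps:
            return ell          # earliest layer meeting the threshold
    return num_layers           # fall back: no token reduction during prefill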

FastKV is distinct from:

  • Quantization-focused methods (e.g., KVLinC, RotateKV, Kitty): These approaches compress the KV cache by reducing numerical bitwidth (e.g., 2-bit/4-bit per entry), whereas FastKV's primary mechanism is token-level pruning and selective propagation, not just bitwidth reduction (Saxena et al., 6 Oct 2025, Xia et al., 23 Nov 2025, Su et al., 25 Jan 2025).
  • Eviction-based two-stage methods (e.g., RocketKV): RocketKV employs prompt eviction (coarse) and hybrid sparse attention (fine) but does not perform quantization or decouple cache retention between prefill/decoding in the same manner (Behnam et al., 19 Feb 2025).

A potential implication is the straightforward combinability of FastKV's salient-propagation strategy with quantization or sparse-kernel techniques, yielding even greater memory and compute efficiency.

Table: Comparison of FastKV and Representative Prior Approaches

Approach        | Core Mechanism                 | Prefill Acceleration | Decoding Memory Saving | Control Coupling | Quantization
FastKV          | TSP pruning + decoupled KV     | Yes                  | Yes                    | Decoupled        | No
RocketKV        | Eviction + hybrid sparse attn  | No                   | Yes                    | Coupled          | No
KVLinC / Kitty  | Bitwidth quantization          | No                   | Yes                    | N/A              | Yes
StreamingLLM    | Decoding-only pruning          | No                   | Yes                    | Coupled          | No
GemFilter       | Aggressive prefill pruning     | Yes                  | Yes                    | Coupled          | No

This structural differentiation demonstrates FastKV's unique position in the landscape of KV compression schemes, offering both practical efficiency and robustness for long-context LLM deployment (Jo et al., 3 Feb 2025).
