FastKV: Efficient KV Cache Compression
- FastKV is a KV cache compression framework that decouples prefill computation from decoding memory reduction using independent TSP and KV retention rates.
- It employs a two-stage Token-Selective Propagation (TSP) prefill in which early layers attend over the full context while later layers process only the salient tokens.
- Experimental results demonstrate up to 2.87× speedup in decoding and minimal accuracy loss on long-context benchmarks.
FastKV is a KV cache compression framework for LLM inference, architected to accelerate both prefill and decoding stages under long-context scenarios. The method is designed around the insight that early model layers benefit from full-sequence context to determine critical dependencies, whereas later layers’ token importance stabilizes, enabling aggressive reduction in context and cache. FastKV explicitly decouples the reduction of prefill computational cost (via context propagation rate, or “TSP rate”) from the reduction of KV cache memory/traffic in decoding (“KV retention rate”), thus overcoming critical coupling limitations in prior works (Jo et al., 3 Feb 2025).
1. Motivation and Limitations of Previous Approaches
Long-context LLM inference involves two dominant computational bottlenecks: prefill (quadratic in prompt size due to full self-attention in each layer) and decoding (which demands linear memory bandwidth proportional to the growing KV cache). As prompt lengths scale into the hundreds of thousands or millions of tokens, prefill latency becomes prohibitive and memory usage during decoding becomes the system bottleneck.
Previous attempts at KV cache compression fall into two main categories:
- Decoding-only pruning methods: Approaches such as StreamingLLM, SnapKV, and H2O prune unimportant cache entries after prefill. While they reduce decoding memory and traffic, they leave prefill computational cost unchanged.
- Prefill-aware pruning methods: Approaches such as GemFilter and PyramidInfer aggressively reduce the effective prompt length during prefill to shrink both computation and cache. However, they tie the prefill reduction rate to the cache compression rate across all layers, often forcing early-layer pruning that irreversibly discards important context and degrades accuracy.
FastKV is designed to decouple these two rates, overcoming inherent limitations and accuracy trade-offs in prior frameworks (Jo et al., 3 Feb 2025).
2. FastKV Architecture and Token-Selective Propagation Mechanism
FastKV introduces a two-stage prefill pipeline with independent KV compression, enabling flexible trade-offs between speed, memory, and accuracy:
- Stage 1 (Layers 1…ℓ_TSP): Each early Transformer layer executes full self-attention over the entire input prompt (N tokens), allowing the model to fully assess contextual dependencies and token significance.
- Stage 2 (Layers ℓ_TSP+1…L): Subsequent layers process only a “salient” subset of tokens identified as important by the Token-Selective Propagation (TSP) criterion at ℓ_TSP, together with a window of most recent tokens. All downstream computation and stored KV cache depend only on this reduced set.
Token saliency at the TSP layer, ℓ_TSP, is computed by averaging, over attention heads, the attention weight each token receives from the latest observed tokens:

$$
s_i \;=\; \frac{1}{H\,|W|}\sum_{h=1}^{H}\sum_{j \in W} A^{(h)}_{\ell_{\mathrm{TSP}}}[\,j,\,i\,],
$$

where H is the number of attention heads, W is the observation window of most recent query positions, and A^{(h)}_{ℓ_TSP}[j, i] is the attention weight that query j assigns to token i in head h at layer ℓ_TSP.

The top tokens by s_i, up to the budget set by the TSP rate R_TSP, along with the most recent window, are propagated, forming the reduced index set for subsequent layers.

Independent KV entry pruning is performed at each layer with a user-specified retention rate R_KV, further reducing the cache footprint.
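As an illustration of this selection step, the sketch below scores tokens by the attention they receive from a recent observation window and keeps the top-R_TSP fraction plus the recent window itself. It is a minimal sketch in PyTorch: the tensor layout, window size, and head pooling are assumptions for clarity, not the paper's exact configuration.

```python
import torch

def select_salient_tokens(attn, r_tsp, recent_window=32):
    """Choose the token indices propagated past the TSP layer.

    attn:          attention probabilities at layer ℓ_TSP, shape (H, N, N)
    r_tsp:         fraction of prompt tokens to propagate (TSP rate)
    recent_window: number of most recent tokens that are always kept
    """
    H, N, _ = attn.shape
    # Saliency: attention each key token receives from the most recent
    # query positions, averaged over heads and window queries.
    saliency = attn[:, -recent_window:, :].mean(dim=(0, 1))           # (N,)
    budget = max(int(N * r_tsp) - recent_window, 0)
    top = torch.topk(saliency[: N - recent_window], k=budget).indices
    recent = torch.arange(N - recent_window, N)
    return torch.cat([top, recent]).sort().values                     # keep prompt order

# Stand-in example: keep 25% of a 512-token prompt
attn = torch.softmax(torch.randn(8, 512, 512), dim=-1)
print(select_salient_tokens(attn, r_tsp=0.25).shape)   # torch.Size([128])
```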
3. Decoupling Prefill Computation and KV Cache Budget
FastKV’s central design leverages two independent control knobs:
- TSP rate R_TSP: Determines the fraction of tokens propagated past the TSP layer for post-ℓ_TSP computation. Lowering this rate directly reduces prefill cost by restricting the number of tokens processed in later layers.
- KV retention rate R_KV: Specifies the fraction of cached KV entries kept per layer, controlling inference-time memory and bandwidth. This parameter can be set orthogonally to R_TSP.
Each layer, regardless of its position relative to ℓ_TSP, selects the top-scoring N·R_KV entries for caching using a defined importance function, e.g., based on average attention mass.
This decoupling allows practitioners to optimize for diverse objectives (e.g., aggressive decoding compression with conservative prefill or vice versa) without incurring the fundamental tradeoffs characteristic of previous methods (Jo et al., 3 Feb 2025).
FastKV Prefill and Decoding Algorithm (Sketch)
```
# P: index set of propagated (salient + recent) tokens, fixed at layer ℓ_TSP
for layer in 1 .. L:
    if layer <= ℓ_TSP:
        X, Att, K, V = SelfAttentionLayer_layer(X)        # full-context prefill
    else:
        X, Att, K, V = SelfAttentionLayer_layer(X[P])     # only salient/recent tokens
    K_pruned, V_pruned = top_k_by_importance(K, V, Att, R_KV)
    if layer == ℓ_TSP:
        P = select_salient_tokens(Att, R_TSP)             # update token set
    store(K_pruned, V_pruned, layer)
```
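The per-layer pruning step (top_k_by_importance in the sketch) can be illustrated as below. This is a hedged Python/PyTorch sketch: the tensor shapes, observation-window size, and pooling over all attention heads (rather than per-KV-head selection under GQA) are simplifying assumptions, not FastKV's exact implementation.

```python
import torch

def top_k_by_importance(k, v, attn, r_kv, obs_window=32):
    """Keep the top-scoring fraction r_kv of this layer's KV entries.

    k, v:  key/value tensors for one layer, shape (H_kv, N, d_head)
    attn:  attention probabilities of the same layer, shape (H, N, N)
    r_kv:  KV retention rate (fraction of cache entries kept)
    """
    _, N, _ = k.shape
    # Importance: average attention mass each key position receives from
    # the most recent queries, pooled over all attention heads.
    score = attn[:, -obs_window:, :].mean(dim=(0, 1))                  # (N,)
    keep = torch.topk(score, k=max(int(N * r_kv), 1)).indices.sort().values
    return k[:, keep, :], v[:, keep, :]
```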
4. Computational Efficiency and Scaling
FastKV results in substantial reductions in both prefill and decoding computation:
- Prefill compute:
  - Baseline: every one of the L layers attends over the full prompt, so self-attention cost scales as L · N².
  - FastKV: the first ℓ_TSP layers attend over all N tokens, while the remaining L − ℓ_TSP layers process only R_TSP · N tokens, so cost scales as ℓ_TSP · N² + (L − ℓ_TSP) · (R_TSP · N)².
  - The resulting fraction of baseline attention compute is ℓ_TSP/L + (1 − ℓ_TSP/L) · R_TSP², so placing the TSP layer mid-network with a moderate R_TSP removes a large share of prefill cost (see the sketch below).
- Decoding memory/bandwidth: only R_KV · N cache entries per layer are read at each decoding step, leading to an order-of-magnitude lower memory traffic at typical retention rates.
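The ratio implied by this cost model can be checked with a few lines of arithmetic. The parameters in the example are illustrative only (not the paper's reported settings), and only the attention term is counted:

```python
def prefill_attention_ratio(num_layers, tsp_layer, r_tsp):
    """FastKV self-attention cost as a fraction of the full-context baseline.

    Baseline: num_layers * N^2
    FastKV:   tsp_layer * N^2 + (num_layers - tsp_layer) * (r_tsp * N)^2
    The prompt length N cancels out of the ratio.
    """
    return tsp_layer / num_layers + (1 - tsp_layer / num_layers) * r_tsp ** 2

# Illustrative only: a 32-layer model with the TSP layer at 16 and R_TSP = 0.25
# retains ~53% of the baseline attention compute.
print(prefill_attention_ratio(32, 16, 0.25))   # 0.53125
```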
Empirical wall-clock measurements (LLaMA-3.1-8B-Instruct with a 128K-token context) show speedups of 1.82× for prefill and 2.87× for decoding compared to the full-context baseline.
5. Experimental Results and Comparative Performance
FastKV was evaluated on LLaMA-3.1-8B-Instruct and Mistral-8B-Instruct (32–36 layers, GQA attention, 128K context) across standardized long-context benchmarks:
- LongBench: ≤1% average score drop relative to full-context. Competing prefill-aware methods drop 5–12%.
- RULER (retrieval, 128K): Achieves 75.6% (FastKV) vs. 86.0% (full context) at 10% KV retention, outperforming decoding-only pruning.
- Needle-In-A-Haystack: FastKV attains 99.9% retrieval accuracy vs. 99.0% for full baseline, indicating robust token retention in retrieval settings.
- Latency: At 128K context, both prefill and decoding achieve over 2× speedup in wall-clock token-generation (Jo et al., 3 Feb 2025).
These performance metrics support FastKV's effectiveness in balancing cache compression with negligible degradation of long-context reasoning and retrieval capabilities.
6. Practical Implementation and Tuning
FastKV is implemented as a modular wrapper for existing Transformer stacks:
- Integration: At each layer, the framework computes saliency and importance scores, compresses tokens at ℓ_TSP, and prunes KV entries at every layer.
- Tuning ℓ_TSP: Set by minimizing the normalized L2 distance between the final hidden states produced by FastKV and by the full-context baseline over a calibration set (a calibration sketch follows this list).
- Tuning R_TSP and R_KV: Start from moderate rates and adjust each independently, since R_TSP governs prefill compute and R_KV governs decoding memory; together they target a desired compute/memory/accuracy trade-off.
- Compatibility: Orthogonal to low-level kernel optimizations (FlashAttention2/3), quantization schemes, batch size scaling, and paged attention engines (vLLM, SGLang), enabling stacking with state-of-the-art infrastructure.
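A minimal sketch of the calibration criterion from the ℓ_TSP tuning step, assuming final hidden states have already been collected for both configurations; the helper name, shapes, and stand-in data are illustrative, not FastKV's tooling:

```python
import torch

def tsp_calibration_error(h_full, h_tsp):
    """Normalized L2 distance between final hidden states of the full-context
    baseline (h_full) and a run with TSP applied at a candidate layer (h_tsp),
    averaged over a calibration set. Lower means the candidate layer is safer.

    h_full, h_tsp: lists of tensors of shape (d,), one per calibration prompt.
    """
    errs = [(a - b).norm() / a.norm() for a, b in zip(h_full, h_tsp)]
    return torch.stack(errs).mean()

# Usage idea: sweep candidate TSP layers, score each, and pick the earliest
# layer whose error stays below a chosen threshold. Stand-in data:
h_full = [torch.randn(4096) for _ in range(4)]
h_tsp = [h + 0.01 * torch.randn(4096) for h in h_full]
print(tsp_calibration_error(h_full, h_tsp))
```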
7. Comparison to Related Techniques and Distinctions
FastKV is distinct from:
- Quantization-focused methods (e.g., KVLinC, RotateKV, Kitty): These approaches compress the KV cache by reducing numerical bitwidth (e.g., 2-bit/4-bit per entry), whereas FastKV's primary mechanism is token-level pruning and selective propagation, not just bitwidth reduction (Saxena et al., 6 Oct 2025, Xia et al., 23 Nov 2025, Su et al., 25 Jan 2025).
- Eviction-based two-stage methods (e.g., RocketKV): RocketKV employs prompt eviction (coarse) and hybrid sparse attention (fine) but does not perform quantization or decouple cache retention between prefill/decoding in the same manner (Behnam et al., 19 Feb 2025).
A potential implication is the straightforward combinability of FastKV's salient-propagation strategy with quantization or sparse-kernel techniques, yielding even greater memory and compute efficiency.
Table: Comparison of FastKV and Representative Prior Approaches
| Approach | Core Mechanism | Prefill Acceleration | Decoding Memory Saving | Control Coupling | Quantization |
|---|---|---|---|---|---|
| FastKV | TSP pruning + decoupled KV | Yes | Yes | Decoupled | No |
| RocketKV | Eviction + hybrid sparse attn | No | Yes | Coupled | No |
| KVLinC / Kitty | Bitwidth quantization | No | Yes | N/A | Yes |
| StreamingLLM | Decoding-only pruning | No | Yes | Coupled | No |
| GemFilter | Aggressive prefill pruning | Yes | Yes | Coupled | No |
This structural differentiation demonstrates FastKV's unique position in the landscape of KV compression schemes, offering both practical efficiency and robustness for long-context LLM deployment (Jo et al., 3 Feb 2025).