
Sparse KV Cache for Scalable LLMs

Updated 2 March 2026
  • Sparse KV Cache is a memory management scheme that replaces dense storage with a compressed selection of key-value pairs to boost efficiency.
  • Techniques such as dynamic token selection, low-rank compression, and distributed sharding can reduce memory footprint by up to 10× while maintaining performance.
  • This approach enables long-context processing and scalable multi-GPU inference in transformers and mixture-of-experts models with minimal accuracy loss.

A sparse KV (Key-Value) cache is a memory management scheme for transformers and LLMs designed to drastically reduce the memory, communication, and computational overhead of storing and using past key–value states during inference and generation. Rather than retaining a dense history of every past token's K/V activations at every layer, sparse KV cache systems maintain a carefully selected or compressed subset, often using dynamic importance criteria or structured approximations. Recent advances leverage combinatorial eviction, low-rank or dictionary-based representations, token selection using attention statistics, and cache sharing across layers or expert partitions. Sparse KV caches are critical for scaling context lengths, supporting multi-GPU inference, and enabling practical deployment of large mixture-of-experts (MoE) and vision-LLMs.

1. Sparse KV Cache: Definitions and Motivations

A sparse KV cache replaces the full retention of all key–value pairs $\{(K_i, V_i)\}_{i=1}^{n}$ with a pruned or encoded subset $\{(K_i, V_i) : i \in S\}$, where $S \subset \{1, \dots, n\}$ and $|S| \ll n$ (Li et al., 2024). Sparse caches can be implemented by (i) selecting a static or dynamic subset of past token positions, (ii) compressing representations (e.g., low-rank projections, dictionary codes), (iii) evicting low-importance entries under a cache management policy, or (iv) combining these approaches.
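
The subset-selection step above can be sketched in a few lines of NumPy. This is a generic illustration of retaining $\{(K_i, V_i) : i \in S\}$ under a top-$|S|$ importance criterion, not the implementation of any cited system; the scoring function is assumed to be given.

```python
import numpy as np

def sparse_kv_select(keys, values, scores, budget):
    """Retain only the `budget` highest-scoring (K, V) pairs.

    keys, values: (n, d) arrays of cached per-token key/value states.
    scores:       (n,) importance estimate per token (assumed given,
                  e.g. accumulated attention mass).
    Returns the retained index set S (sorted) and the gathered subsets.
    """
    n = keys.shape[0]
    if budget >= n:
        return np.arange(n), keys, values
    # Top-k by score; sort so the original token order is preserved.
    S = np.sort(np.argpartition(scores, -budget)[-budget:])
    return S, keys[S], values[S]

# Toy example: 8 cached tokens, keep a budget of 3.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
scores = np.array([0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15])
S, K_s, V_s = sparse_kv_select(K, V, scores, budget=3)
print(S)  # → [0 3 6]
```

Real systems differ mainly in how `scores` is produced and in whether discarded entries are dropped outright, compressed, or absorbed into an auxiliary structure.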

Key motivational factors:
  • Memory footprint: a dense cache grows linearly with context length and quickly dominates GPU memory at long contexts.
  • Communication: in multi-GPU and MoE serving, moving full KV states between devices becomes a bandwidth bottleneck.
  • Compute and latency: per-step attention cost scales with the number of cached tokens, so pruning the cache directly reduces decoding latency.

2. Core Sparse KV Caching Methodologies

Sparse KV cache techniques can be grouped as follows:

Static and Dynamic Token Selection

  • Static patterns: Fixed rules such as A-shape, Tri-shape, sliding window, "sink tokens," or deterministic block sampling. These are simpler to implement but degrade markedly in multi-turn or distribution-shifted use (Li et al., 2024).
  • Dynamic patterns: Use per-token scores (e.g., attention saliency, frequency, recency, redundancy) to retain or evict cache entries on the fly. Methods like DynamicKV, PiKV Scheduling, and MInference adapt allocation per layer, task, or session, typically using top-k or thresholding (Jiang et al., 29 Oct 2025, Liu et al., 2 Aug 2025, Zhou et al., 2024, Zhao et al., 2024).
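
To make the dynamic-pattern idea concrete, the following toy cache evicts the token with the lowest accumulated attention mass whenever a fixed budget is exceeded. This is an illustrative heavy-hitter-style policy under simplified assumptions (single head, no positional encoding), not the code of DynamicKV, PiKV Scheduling, or MInference.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class ToyDynamicCache:
    """Fixed-budget KV cache with score-based eviction."""

    def __init__(self, budget, dim):
        self.budget = budget
        self.K = np.empty((0, dim))
        self.V = np.empty((0, dim))
        self.acc = np.empty(0)  # accumulated attention mass per token

    def step(self, q, k, v):
        # Append the new token, attend over the cache, then evict
        # the least-attended token if the budget is exceeded.
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        self.acc = np.append(self.acc, 0.0)
        attn = softmax(self.K @ q / np.sqrt(len(q)))
        self.acc += attn
        out = attn @ self.V
        if len(self.acc) > self.budget:
            victim = int(np.argmin(self.acc))
            keep = np.arange(len(self.acc)) != victim
            self.K, self.V, self.acc = self.K[keep], self.V[keep], self.acc[keep]
        return out

cache = ToyDynamicCache(budget=4, dim=3)
rng = np.random.default_rng(3)
for _ in range(10):
    out = cache.step(*rng.normal(size=(3, 3)))
# The cache never grows beyond its budget of 4 tokens.
```

Per-layer or per-task budget allocation, as in the methods above, amounts to choosing `budget` adaptively rather than fixing it globally.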

Low-Rank, Dictionary, and Sparse Coding

  • Low-rank latent codes: Project KV tensors into a compact subspace and use a RoPE-free metric for token selection; full reconstruction is performed only for selected tokens (SALS) (Mu et al., 28 Oct 2025).
  • Sparse coding & dictionaries: Each KV vector is represented as a sparse linear combination over a small, pre-trained dictionary (Lexico, CSR); enables fixed-ratio compression with input-agnostic dictionaries (Kim et al., 2024, Zhang et al., 2024).
  • Residual/delta encoding: Represent each KV as a compressed residual relative to a small set of similar historical references (DeltaKV) (Hao et al., 8 Feb 2026).
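
The sparse-coding variant can be illustrated with plain matching pursuit over a random unit-norm dictionary. The dictionary here is random for demonstration; Lexico and CSR learn theirs offline, and their encoders are more elaborate.

```python
import numpy as np

def matching_pursuit(x, D, n_atoms):
    """Greedily encode x as a sparse combination of at most `n_atoms`
    columns of dictionary D (columns assumed unit-norm).
    Reconstruction: D[:, idx] @ coef.
    """
    residual = x.astype(float).copy()
    idx, coef = [], []
    for _ in range(n_atoms):
        corr = D.T @ residual               # correlation with each atom
        j = int(np.argmax(np.abs(corr)))    # best-matching atom
        idx.append(j)
        coef.append(corr[j])
        residual = residual - corr[j] * D[:, j]
    return np.array(idx), np.array(coef)

rng = np.random.default_rng(1)
D = rng.normal(size=(16, 64))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms
x = 2.0 * D[:, 5] - 1.5 * D[:, 40]    # a vector that is 2-sparse in D
idx, coef = matching_pursuit(x, D, n_atoms=4)
x_hat = D[:, idx] @ coef
err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
```

Storing only the (index, coefficient) pairs per KV vector is what yields a fixed compression ratio: four atoms here instead of 16 dense channels.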

Scheduling and Sharding

  • Adaptive scheduling: Retain or evict tokens/pages based on utility scoring (e.g., AdaKV, LRU+, QUEST), often per-expert and per-GPU in distributed MoE systems such as PiKV (Liu et al., 2 Aug 2025).
  • Hybrid block-layered and reuse architectures: Share (K,V) buffers across multiple sparse layers using a "KV oracle" from prior full self-attention layers (HySparse), achieving order-of-magnitude reductions in KV storage (Gao et al., 3 Feb 2026).
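
The cross-layer reuse idea can be sketched as a blockwise "oracle": a full-attention layer's attention matrix yields per-block importance scores, and subsequent sparse layers gather only tokens from the top-scoring blocks. This is a loose sketch of the concept, not HySparse's actual selection logic.

```python
import numpy as np

def select_blocks(attn, block_size, top_k):
    """Average the attention mass received by each contiguous block of
    cached tokens, keep the top_k blocks, and return their token indices.

    attn: (num_queries, n) row-stochastic attention weights from a
          full-attention layer, used here as the importance oracle.
    """
    n = attn.shape[-1]
    n_blocks = n // block_size
    trimmed = attn[..., : n_blocks * block_size]
    block_mass = trimmed.reshape(-1, n_blocks, block_size).sum(-1).mean(0)
    top = np.sort(np.argsort(block_mass)[-top_k:])  # keep original order
    return np.concatenate(
        [np.arange(b * block_size, (b + 1) * block_size) for b in top]
    )

# Two queries over 8 cached tokens; mass concentrates on blocks 0 and 3.
attn = np.array([[0.4, 0.3, 0.0, 0.0, 0.0, 0.0, 0.2, 0.1],
                 [0.3, 0.3, 0.1, 0.0, 0.0, 0.0, 0.1, 0.2]])
idx = select_blocks(attn, block_size=2, top_k=2)
print(idx)  # → [0 1 6 7]
```

Sparse layers that share this index set attend over far fewer tokens while reusing a single stored KV buffer.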

Cache Compression and Acceleration

  • Quantization and hybrid quantization/pruning, with differing precision for K versus V (e.g., LeanKV) (Zhang et al., 2024).
  • Sparse storage formats (e.g., CSR in SWAN) that permit decompression-free attention over compressed entries (S et al., 24 Nov 2025).
  • Fused GPU kernels and on-device compaction (e.g., Sparse-vLLM, PiKVpress) that translate sparsity into realized throughput (Hao et al., 8 Feb 2026, Liu et al., 2 Aug 2025).

3. Algorithmic Approaches and System Architecture

A wide spectrum of algorithmic designs underpin sparse KV caches. Representative methods include:

| Method | Token Selection | Compression/Pruning | Layer/Expert Sharing |
|---|---|---|---|
| PiKV | Routing + adaptive scheduling | Hierarchical (LoRA, SVD, Pyramid) | Expert-sharded, distributed |
| PureKV | Lower-layer attention, cross-layer V-norm | Pruning w/ recent + top-h window | Compatible with efficient accelerator layers |
| SALS | Top-k in latent (low-rank) space | Per-token latent projection; reconstruct only important tokens | Reconstruct as needed, no full cache |
| HashEvict | LSH-based pre-attention dissimilarity | Dynamic token eviction | N/A |
| HySparse | Full-attention "oracle" provides block-level importance | KV cache reused across block layers | Sparse layers share blockwise full-attention KV |
| DynamicKV | Prefill attention statistics, periodically updated per-layer budget | Adaptive per-layer budget | Automatic cross-layer allocation |
| LESS | Low-rank accumulator for "evicted" K/V info | Coupled with any sparse eviction policy | Recovers all history; no unqueryable tokens |
| CSR/Lexico | Matching pursuit over learned dictionary | Sparse code; 1–4 atoms per token | Layer-merged and online-updated dictionaries |
| SWAN | Top-k dimension mask after orthogonal rotation | All KVs stored in sparse (CSR) format | Small dense buffer + sparse historical cache |

These methods are typically integrated into production codebases and inference frameworks, with substantial emphasis placed on memory manager efficiency, GPU kernel fusion, and plug-and-play compatibility with existing high-throughput attention backends (FlashAttention, vLLM, kvpress) (Liu et al., 2 Aug 2025, Hao et al., 8 Feb 2026, S et al., 24 Nov 2025).

4. Empirical Performance and Quality Trade-offs

Rigorous experimental evaluation on standard benchmarks demonstrates that sparse KV cache methods can achieve significant reductions in GPU memory footprint, attention and I/O cost, and end-to-end latency, with minimal or controlled degradation of accuracy and perplexity. Notable findings:
  • HySparse's layer-hybrid cache sharing yields roughly 10× reductions in KV storage (Gao et al., 3 Feb 2026).
  • RocketKV reports a 3.7× speedup at a 1.1% accuracy drop in a two-stage, training-free pipeline (Behnam et al., 19 Feb 2025).
  • HashEvict achieves 30–70% cache reduction with lightweight LSH-based scoring (Liu et al., 2024).
  • Lexico and CSR maintain robust quality even at 1–2 bit/channel effective densities (Kim et al., 2024, Zhang et al., 2024).

5. Specialized Sparse KV Cache Systems and Architectural Variants

Distributed and Mixture-of-Experts (MoE) Systems

PiKV introduces expert-sharded caching, distributed across GPUs using hash-based sharding and token–expert routing. This supports both memory partitioning (each GPU holds only a fraction of the total cache by expert and token index) and accelerated communication for large-scale MoEs (Liu et al., 2 Aug 2025).
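
The memory-partitioning aspect can be illustrated with a deterministic hash placement: every rank computes the same shard for a given (expert, token) pair without coordination, so each GPU stores only its fraction of the cache. The hashing scheme below is an assumption for illustration; PiKV's actual placement policy may differ.

```python
import hashlib

def shard_for(expert_id: int, token_id: int, num_gpus: int) -> int:
    """Deterministic GPU shard for one (expert, token) KV entry."""
    h = hashlib.sha256(f"{expert_id}:{token_id}".encode()).digest()
    return int.from_bytes(h[:8], "big") % num_gpus

# Every rank derives the same placement table independently.
placement = {(e, t): shard_for(e, t, num_gpus=4)
             for e in range(4) for t in range(8)}
```

A lookup then routes each KV read or write to exactly one rank, which keeps per-GPU memory at roughly 1/num_gpus of the total cache.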

HySparse and PureKV exploit cross-layer and cross-block sharing: sparse attention layers reuse a reduced KV cache emitted by a periodic full-attention layer, often with Top-K block or token selection to maximize reuse and minimize redundant storage (Gao et al., 3 Feb 2026, Jiang et al., 29 Oct 2025).

Vision-Language and Video Models

PureKV addresses joint spatial–temporal attention sparsity and cache pruning for VLLM architectures, with specific masking to respect both video frame and sequence locality, and per-layer importance transfer (using lower-layer statistics for higher layers) compatible with FlashAttention (Jiang et al., 29 Oct 2025).

Quantization, Dictionary, and Latent-Encoding Approaches

LeanKV uses per-token and per-head dynamic thresholding, differing quantization degree for K vs. V, and specialized on-GPU compaction to translate head and token sparsity into throughput, outperforming static or globally uniform pruning/quantization (Zhang et al., 2024). CSR and Lexico provide learned, highly compressive sparse encodings via matching pursuit and universal dictionaries, achieving robust compression even at 1 bit/channel densities (Zhang et al., 2024, Kim et al., 2024).
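
The "differing quantization degree for K vs. V" idea amounts to giving keys, which drive the attention scores, a finer grid than values. The symmetric per-vector quantizer below is a simplified stand-in for LeanKV's actual scheme.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization to `bits` bits.
    Returns (int8 codes, scale); dequantize as codes * scale."""
    levels = 2 ** (bits - 1) - 1
    m = np.abs(x).max()
    scale = m / levels if m > 0 else 1.0
    codes = np.clip(np.round(x / scale), -levels, levels).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(2)
k, v = rng.normal(size=64), rng.normal(size=64)
k_codes, k_scale = quantize(k, bits=8)   # finer grid for keys
v_codes, v_scale = quantize(v, bits=4)   # coarser grid for values
k_err = np.abs(k - k_codes * k_scale).max()
v_err = np.abs(v - v_codes * v_scale).max()
# Worst-case per-channel error is bounded by half the quantization step,
# so the 8-bit keys reconstruct far more precisely than the 4-bit values.
```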

Tree-Structured and Beehive Caches

TreeKV encodes long-range tokens coarsely and recent tokens with dense resolution using a moving, cyclic eviction scheme, equivalent to building an implicit wavelet (dyadic) tree. This allows smooth importance decay and multi-resolution retention, in contrast to regionally or query-biased strategies (He et al., 9 Jan 2025). BUZZ segments cached tokens into stride-aligned blocks and prunes by localized heavy-hitter selection within each block, maintaining sliding window fidelity for recency (Zhao et al., 2024).
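
The blockwise heavy-hitter idea in BUZZ can be sketched as: keep every token inside a recent sliding window, plus the single highest-scoring token of each stride-aligned block of the older history. This is a rough sketch of the policy; the actual method has additional detail.

```python
import numpy as np

def blockwise_keep(scores, block_size, window):
    """Indices to retain: the full recency window plus one local
    heavy hitter per stride-aligned block of the older history."""
    n = len(scores)
    old = max(0, n - window)
    keep = set(range(old, n))                    # recency window
    for start in range(0, old, block_size):
        block = scores[start:min(start + block_size, old)]
        keep.add(start + int(np.argmax(block)))  # local heavy hitter
    return sorted(keep)

scores = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.1, 0.5, 0.6, 0.4, 0.2])
print(blockwise_keep(scores, block_size=3, window=4))  # → [1, 4, 6, 7, 8, 9]
```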

Residual and Similarity-Based Compression

DeltaKV stores each new KV as a quantized residual from the mean of its k-nearest neighbors within a sparsely-sampled reference set, providing substantial compression while being robust in long-range, multi-turn, and irregular token dependencies. Hardware-optimized engines such as Sparse-vLLM and PiKVpress provide fused kernels and indirect memory access for non-contiguous, compressed cache layouts (Hao et al., 8 Feb 2026, Liu et al., 2 Aug 2025).
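
The reference-based residual scheme can be sketched as: find the k nearest references, subtract their mean, and quantize only the residual. The quantizer and reference selection below are simplified assumptions, not DeltaKV's actual format.

```python
import numpy as np

def delta_encode(x, refs, k, bits=4):
    """Encode x as (nearest-reference indices, quantized residual, scale)."""
    dist = np.linalg.norm(refs - x, axis=1)
    idx = np.argsort(dist)[:k]            # k nearest references
    residual = x - refs[idx].mean(axis=0)
    levels = 2 ** (bits - 1) - 1
    m = np.abs(residual).max()
    scale = m / levels if m > 0 else 1.0
    codes = np.round(residual / scale).astype(np.int8)
    return idx, codes, scale

def delta_decode(refs, idx, codes, scale):
    return refs[idx].mean(axis=0) + codes * scale

rng = np.random.default_rng(4)
refs = rng.normal(size=(32, 8))           # sparsely sampled reference set
x = rng.normal(size=8)
idx, codes, scale = delta_encode(x, refs, k=4)
x_hat = delta_decode(refs, idx, codes, scale)
# Per-channel error is bounded by half the quantization step.
```

Compression comes from storing only the reference indices and a low-bit residual; when nearby tokens are similar, the residual has small magnitude and quantizes cheaply.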

6. Extensions, Limitations, and Deployment Considerations

Key design and practical considerations for deploying sparse KV cache systems include:
  • Budget selection: the retention budget trades memory savings against accuracy, and often must be set per layer, task, or session rather than globally.
  • Retention policy: static patterns are simplest but degrade under multi-turn or distribution-shifted use; dynamic and learned schemes recover quality at the cost of scoring overhead.
  • Backend compatibility: methods must integrate with high-throughput attention kernels (FlashAttention, vLLM) and support non-contiguous, compressed cache layouts.

7. Summary Table: Representative Sparse KV Cache Approaches

| System/Method | Token Selection | Compression/Encoding | Key Features / Reference |
|---|---|---|---|
| PiKV | Adaptive scheduling, routing | Hierarchical compression | MoE, expert-sharded, distributed (Liu et al., 2 Aug 2025) |
| PureKV | Cross-layer attn/V-norm | Pruning + spatial–temporal masking | VLLMs, FlashAttention compatible (Jiang et al., 29 Oct 2025) |
| SALS | Top-k in latent space | Low-rank projection, selective reconstruction | SOTA memory and kernel speed (Mu et al., 28 Oct 2025) |
| HySparse | Oracle token selection | Blockwise cache sharing | Layer-hybrid (full/sparse), 10× reduction (Gao et al., 3 Feb 2026) |
| DynamicKV | Attention-based dynamic | Per-layer, periodic allocation | Task-aware, extreme compression (Zhou et al., 2024) |
| HashEvict | LSH Hamming pre-attention | Dynamic token eviction | Lightweight, 30–70% reduction (Liu et al., 2024) |
| LESS | Any eviction policy + low-rank accumulator | Low-rank memory absorbs evictions | No irrecoverable drops; all tokens queryable (Dong et al., 2024) |
| SWAN | Top-k dimension mask post-rotation | Direct inference in CSR format | Decompression-free KV access (S et al., 24 Nov 2025) |
| Lexico/CSR | Matching pursuit, dictionary | 1–4-term sparse code per KV | 1–2 bit/channel, universal across tasks (Kim et al., 2024, Zhang et al., 2024) |
| BUZZ | Partitioned local max/interval | Beehive, sliding window | $O(\log N)$ time, high long-context quality (Zhao et al., 2024) |
| RocketKV | SnapKV++ + hybrid top-$k$ | Two-stage, training-free | 3.7× speedup at 1.1% accuracy drop (Behnam et al., 19 Feb 2025) |
| TreeKV | Cyclic attention-importance eviction | Tree-structured multi-scale | Smooth retention, wavelet-motivated (He et al., 9 Jan 2025) |
| DeltaKV | Reference-based residual | Per-token mean + quantized delta | Sparse-vLLM kernel suite (Hao et al., 8 Feb 2026) |
| LeanKV | Attention-guided budget per token/head | Hybrid quantization/pruning | On-GPU compaction/parallelization (Zhang et al., 2024) |

In conclusion, the sparse KV cache paradigm is now fundamental for efficient inference and serving of long-context LLMs, complex MoE architectures, and VLLMs at scale. The rich landscape ranges from rigid static patterns to multi-level dynamic/adaptive schemes, orthogonal projection and sparse coding, to cross-layer cache sharing and residual-based encodings. State-of-the-art systems consistently deliver order-of-magnitude memory savings with small and controllable performance impact, and offer operators multiple dimensions—budget, compression method, and retention policy—to optimally trade accuracy for efficiency in large-scale deployment.
