Sparse KV Cache for Scalable LLMs
- Sparse KV Cache is a memory management scheme that replaces dense storage with a compressed selection of key-value pairs to boost efficiency.
- Techniques such as dynamic token selection, low-rank compression, and distributed sharding can reduce memory footprint by up to 10× while maintaining performance.
- This approach enables long-context processing and scalable multi-GPU inference in transformers and mixture-of-experts models with minimal accuracy loss.
A sparse KV (Key-Value) cache is a memory management scheme for transformers and LLMs designed to drastically reduce the memory, communication, and computational overhead of storing and using past key–value states during inference and generation. Rather than retaining a dense history of every past token's K/V activations at every layer, sparse KV cache systems maintain a carefully selected or compressed subset, often using dynamic importance criteria or structured approximations. Recent advances leverage combinatorial eviction, low-rank or dictionary-based representations, token selection using attention statistics, and cache sharing across layers or expert partitions. Sparse KV caches are critical for scaling context lengths, supporting multi-GPU inference, and enabling practical deployment of large mixture-of-experts (MoE) and vision-LLMs.
1. Sparse KV Cache: Definitions and Motivations
A sparse KV cache replaces the full retention of all past key–value pairs $\{(k_i, v_i)\}_{i=1}^{n}$ with a pruned or encoded subset $\{(\tilde{k}_j, \tilde{v}_j)\}_{j=1}^{m}$, where $m \ll n$ and memory scales with $m$ rather than the full context length (Li et al., 2024). Sparse caches can be implemented by (i) selecting a static/dynamic subset of past token positions, (ii) compressing representations (e.g., low-rank, dictionary codes), (iii) evicting low-importance entries per a cache management policy, or (iv) combining these approaches.
Key motivational factors:
- Memory scaling: Full caches grow linearly with context length for every layer and head; sparse caches can achieve up to 10–40× reduction without substantial quality loss (Mu et al., 28 Oct 2025, Zhou et al., 2024, Behnam et al., 19 Feb 2025, Zhang et al., 2024).
- Bandwidth and latency constraints: Reduced data transfer per attention call increases throughput, especially in multi-GPU/multi-node deployments (Liu et al., 2 Aug 2025, Jiang et al., 29 Oct 2025).
- Specialized architectures: MoE and VLLMs require cache partitioning and sharding to match their execution patterns (Liu et al., 2 Aug 2025, Jiang et al., 29 Oct 2025).
- Long-context capabilities: Efficient sparse caches enable context lengths in the 100k–1M range on commodity hardware (Zhou et al., 2024, Kim et al., 2024, Zhang et al., 2024, He et al., 9 Jan 2025, Gao et al., 3 Feb 2026).
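The memory-scaling motivation above can be made concrete with back-of-envelope arithmetic. A minimal sketch, using an illustrative dense-transformer shape (the layer/head counts below are assumptions, not drawn from any cited system):

```python
# Back-of-envelope KV cache size for a dense transformer.
# The formula, not the particular numbers, is the point.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2 tensors (K and V) per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative config: 32 layers, 32 KV heads, head_dim 128, fp16, 128k tokens.
full = kv_cache_bytes(32, 32, 128, 128_000)
print(f"dense cache:  {full / 2**30:.1f} GiB")    # 62.5 GiB for one sequence

# A 10x-sparse cache retaining only 12.8k token positions:
sparse = kv_cache_bytes(32, 32, 128, 12_800)
print(f"sparse cache: {sparse / 2**30:.1f} GiB")  # 6.2 GiB
```

This is why even a single long-context sequence can exhaust one GPU with a dense cache, while a 10× sparse cache fits comfortably.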
2. Core Sparse KV Caching Methodologies
Sparse KV cache techniques can be grouped as follows:
Static and Dynamic Token Selection
- Static patterns: Fixed rules such as A-shape, Tri-shape, sliding window, "sink tokens," or deterministic block sampling. These are simple to implement but degrade markedly in multi-turn or distribution-shifted use (Li et al., 2024).
- Dynamic patterns: Use per-token scores (e.g., attention saliency, frequency, recency, redundancy) to retain or evict cache entries on the fly. Methods like DynamicKV, PiKV Scheduling, and MInference adapt allocation per layer, task, or session, typically using top-k or thresholding (Jiang et al., 29 Oct 2025, Liu et al., 2 Aug 2025, Zhou et al., 2024, Zhao et al., 2024).
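A minimal sketch of the dynamic pattern: keep a recent window plus the top-scoring older tokens ranked by accumulated attention mass. The scoring rule, window size, and function names here are illustrative, not taken from any single cited method:

```python
import numpy as np

def evict_to_budget(keys, values, attn_history, budget, recent_window=4):
    """Keep the `recent_window` newest tokens plus the highest-scoring
    older tokens, so that at most `budget` entries remain.
    Assumes budget > recent_window."""
    n = keys.shape[0]
    if n <= budget:
        return keys, values, np.arange(n)
    scores = attn_history.sum(axis=0)      # total attention each token received
    old = np.arange(n - recent_window)     # candidates for eviction
    top_old = old[np.argsort(scores[old])[::-1][: budget - recent_window]]
    keep = np.sort(np.concatenate([top_old, np.arange(n - recent_window, n)]))
    return keys[keep], values[keep], keep

# Toy example: 8 cached tokens, budget of 5.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
attn = rng.random(size=(3, 8))             # 3 recent queries x 8 keys
K2, V2, kept = evict_to_budget(K, V, attn, budget=5)
print(kept)                                # 4 recent indices + 1 salient older one
```

Real systems replace the raw attention sum with smoothed or per-head statistics and run the policy incrementally, but the top-k-plus-recency skeleton is the common core.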
Low-Rank, Dictionary, and Sparse Coding
- Low-rank latent codes: Project KV tensors into a compact subspace, use RoPE-free metric for token selection; full reconstruction only for selected tokens (SALS) (Mu et al., 28 Oct 2025).
- Sparse coding & dictionaries: Each KV vector is represented as a sparse linear combination over a small, pre-trained dictionary (Lexico, CSR); enables fixed-ratio compression with input-agnostic dictionaries (Kim et al., 2024, Zhang et al., 2024).
- Residual/delta encoding: Represent each KV as a compressed residual relative to a small set of similar historical references (DeltaKV) (Hao et al., 8 Feb 2026).
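The low-rank latent-scoring idea can be sketched as follows: score keys cheaply in a rank-$r$ subspace and reconstruct full vectors only for selected tokens. A plain SVD stands in here for the learned, RoPE-free projections used by systems like SALS; all names are illustrative:

```python
import numpy as np

def fit_projection(K, rank):
    # Orthonormal basis mapping d-dim keys into an r-dim latent space.
    _, _, Vt = np.linalg.svd(K, full_matrices=False)
    return Vt[:rank].T                     # shape (d, r)

def compress(K, P):
    return K @ P                           # (n, r) latent codes

def select_topk(q, K_latent, P, k):
    # Approximate attention scores entirely in the latent space:
    # K_latent @ (P.T @ q) ~= K @ q when K lies near the subspace.
    scores = K_latent @ (P.T @ q)
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
K = rng.normal(size=(64, 32))              # 64 cached keys, head_dim 32
P = fit_projection(K, rank=8)              # 4x smaller scoring path
Kl = compress(K, P)
idx = select_topk(rng.normal(size=32), Kl, P, k=8)
print(idx)                                 # 8 selected token indices
```

Only the `k` selected tokens ever need full-precision reconstruction, so the dense cache is never materialized on the scoring path.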
Scheduling and Sharding
- Adaptive scheduling: Retain or evict tokens/pages based on utility scoring (e.g., AdaKV, LRU+, QUEST), often per-expert and per-GPU in distributed MoE systems such as PiKV (Liu et al., 2 Aug 2025).
- Hybrid block-layered and reuse architectures: Share (K,V) buffers across multiple sparse layers using a "KV oracle" from prior full self-attention layers (HySparse), achieving order-of-magnitude reductions in KV storage (Gao et al., 3 Feb 2026).
Cache Compression and Acceleration
- Hierarchical/Modular Compression: Systems such as PiKV can switch between LoRA, SVD, block-PCA, or distillation in the cache serving pipeline, allowing multiple selectable schemes (Liu et al., 2 Aug 2025).
- Decompression-free pruning: SWAN prunes the rotated representations directly and supports decompression-free access for efficiency (S et al., 24 Nov 2025).
- Efficient attention kernel integration: Compatibility with FlashAttention, CUDA sparse kernels, and frameworks like Nvidia kvpress or Sparse-vLLM is prioritized (Liu et al., 2 Aug 2025, S et al., 24 Nov 2025, Hao et al., 8 Feb 2026).
3. Algorithmic Approaches and System Architecture
A wide spectrum of algorithmic designs underpin sparse KV caches. Representative methods include:
| Method | Token Selection | Compression/Pruning | Layer/Expert Sharing |
|---|---|---|---|
| PiKV | Routing + adaptive scheduling | Hierarchical (LoRA, SVD, Pyramid) | Expert-sharded, distributed |
| PureKV | Lower-layer attention, cross-layer V-norm | Pruning w/ recent+top-h window | Compatible with efficient accelerator layers |
| SALS | Top-k in latent (low-rank) space | Per-token latent projection; reconstruct only important | Reconstruct as needed, no full cache |
| HashEvict | LSH-based pre-attention dissimilarity | Dynamic token eviction | N/A |
| HySparse | Full-attn “oracle” provides block-level importance | KV cache reused across block layers | Sparse layers share blockwise full-attn KV |
| DynamicKV | Prefill attention statistics, periodically updated budget per layer | Adaptive per-layer budget | Automatic cross-layer allocation |
| LESS | Integrates low-rank accumulator for “evicted” K/V info | Coupled with any sparse eviction | Recovers all history, no unqueryable tokens |
| CSR/Lexico | Matching pursuit over learned dictionary | Sparse code; 1–4 atoms per token | Layer-merged and online-updated dictionaries |
| SWAN | Top-k dimension mask after orthogonal rotation | All KVs stored in sparse format (CSR) | Small dense buffer + sparse historical cache |
These methods are typically integrated into production codebases and inference frameworks, with substantial emphasis placed on memory manager efficiency, GPU kernel fusion, and plug-and-play with existing high-throughput attention backends (FlashAttention, vLLM, kvpress) (Liu et al., 2 Aug 2025, Hao et al., 8 Feb 2026, S et al., 24 Nov 2025).
4. Empirical Performance and Quality Trade-offs
Rigorous experimental evaluation on standard benchmarks demonstrates that sparse KV cache methods can achieve significant reductions in GPU memory footprint, attention and I/O cost, and end-to-end latency, with minimal or controlled degradation of accuracy and perplexity. Notable findings:
- Memory savings: PiKV (LoRA+AdaKV) shrinks the per-GPU cache from a $24$ GB baseline to $6.2$ GB; HySparse cuts total cache entries from $49N$ to $5N+44k$, roughly an order of magnitude for long sequences; Lexico and CSR reach compression of 10× and beyond at 1–2 bits per channel (Liu et al., 2 Aug 2025, He et al., 9 Jan 2025, Zhang et al., 2024, Gao et al., 3 Feb 2026).
- Latency and throughput: PiKV runs 1.7× faster at 64k context; PureKV delivers a 3.16× prefill speedup; SALS achieves 5.7× operator-level and 1.4–4.5× end-to-end speedups; DeltaKV outpaces vLLM on $512$k contexts (Liu et al., 2 Aug 2025, Jiang et al., 29 Oct 2025, Mu et al., 28 Oct 2025, Hao et al., 8 Feb 2026).
- Accuracy retention: a <1–2% absolute drop in ROUGE or QA F1/accuracy is typical at compression ratios of 2× and above; Lexico sustains $90$–$95$% of baseline performance at $15$–$25$% of the memory (Kim et al., 2024).
- Scalability: PiKV, HySparse, and DeltaKV support linear or near-linear throughput scaling in multi-GPU deployment (Liu et al., 2 Aug 2025, Gao et al., 3 Feb 2026, Hao et al., 8 Feb 2026).
- Robustness: TreeKV preserves high performance under distribution shift and in extremely long-sequence settings, outperforming static or globally-biased strategies (He et al., 9 Jan 2025). DynamicKV retains 85% of full-cache performance with only 1.7% of the cache (Zhou et al., 2024); BUZZ maintains >99% summarization accuracy at a 2.5× memory reduction (Zhao et al., 2024).
5. Specialized Sparse KV Cache Systems and Architectural Variants
Distributed and Mixture-of-Experts (MoE) Systems
PiKV introduces expert-sharded caching, distributed across GPUs using hash-based sharding and token–expert routing. This supports both memory partitioning (each GPU holds only a fraction of the total cache by expert and token index) and accelerated communication for large-scale MoEs (Liu et al., 2 Aug 2025).
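Hash-based expert sharding of this kind can be sketched in a few lines. The hash function and entry layout below are illustrative assumptions, not PiKV's actual implementation:

```python
import hashlib

def shard_for(expert_id: int, token_id: int, n_gpus: int) -> int:
    # Deterministically pin each (expert, token) cache entry to one GPU,
    # so every device holds only its slice of the global cache.
    h = hashlib.sha256(f"{expert_id}:{token_id}".encode()).digest()
    return int.from_bytes(h[:8], "little") % n_gpus

# Route 1000 cache entries for expert 3 across 4 GPUs:
counts = [0] * 4
for tok in range(1000):
    counts[shard_for(3, tok, 4)] += 1
print(counts)   # roughly balanced, ~250 entries per GPU
```

Because the mapping is deterministic, any rank can compute where a remote entry lives without a directory lookup, which keeps the routing metadata off the critical path.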
HySparse and PureKV exploit cross-layer and cross-block sharing: sparse attention layers reuse a reduced KV cache emitted by a periodic full-attention layer, often with Top-K block or token selection to maximize reuse and minimize redundant storage (Gao et al., 3 Feb 2026, Jiang et al., 29 Oct 2025).
Vision-Language and Video Models
PureKV addresses joint spatial–temporal attention sparsity and cache pruning for VLLM architectures, with specific masking to respect both video frame and sequence locality, and per-layer importance transfer (using lower-layer statistics for higher layers) compatible with FlashAttention (Jiang et al., 29 Oct 2025).
Quantization, Dictionary, and Latent-Encoding Approaches
LeanKV uses per-token and per-head dynamic thresholding, differing quantization degree for K vs. V, and specialized on-GPU compaction to translate head and token sparsity into throughput, outperforming static or globally uniform pruning/quantization (Zhang et al., 2024). CSR and Lexico provide learned, highly compressive sparse encodings via matching pursuit and universal dictionaries, achieving robust compression even at 1 bit/channel densities (Zhang et al., 2024, Kim et al., 2024).
Tree-Structured and Beehive Caches
TreeKV encodes long-range tokens coarsely and recent tokens with dense resolution using a moving, cyclic eviction scheme, equivalent to building an implicit wavelet (dyadic) tree. This allows smooth importance decay and multi-resolution retention, in contrast to regionally or query-biased strategies (He et al., 9 Jan 2025). BUZZ segments cached tokens into stride-aligned blocks and prunes by localized heavy-hitter selection within each block, maintaining sliding window fidelity for recency (Zhao et al., 2024).
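The multi-resolution idea behind TreeKV can be sketched as a dyadic retention schedule: recent tokens kept densely, and each band further into the past kept at half the previous resolution. The exact schedule below is an illustrative stand-in, not TreeKV's algorithm:

```python
def dyadic_keep(n_tokens, dense_window=8):
    """Return sorted indices retained under a wavelet-like schedule:
    the newest `dense_window` tokens at full resolution, then bands of
    doubling stride (2, 4, 8, ...) reaching back to token 0."""
    keep = set(range(max(0, n_tokens - dense_window), n_tokens))
    stride, hi = 2, n_tokens - dense_window
    while hi > 0:
        lo = max(0, hi - dense_window * stride)
        keep.update(range(lo, hi, stride))   # coarser with every band back
        hi, stride = lo, stride * 2
    return sorted(keep)

kept = dyadic_keep(64)
print(len(kept), kept[:6])   # 25 of 64 tokens kept, oldest ones sparsest
```

The retained set grows roughly logarithmically with context length while importance decays smoothly with distance, which is the property that distinguishes this family from fixed-window or globally top-k schemes.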
Residual and Similarity-Based Compression
DeltaKV stores each new KV as a quantized residual from the mean of its k-nearest neighbors within a sparsely-sampled reference set, providing substantial compression while being robust in long-range, multi-turn, and irregular token dependencies. Hardware-optimized engines such as Sparse-vLLM and PiKVpress provide fused kernels and indirect memory access for non-contiguous, compressed cache layouts (Hao et al., 8 Feb 2026, Liu et al., 2 Aug 2025).
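Reference-based residual compression of this flavor can be sketched as follows; the quantizer, step size, and neighbor count are illustrative choices, not the paper's:

```python
import numpy as np

def encode(v, refs, k=2, step=0.05):
    # Store v as (neighbor ids, int8 residual) relative to the mean
    # of its k nearest vectors in the reference set.
    d = np.linalg.norm(refs - v, axis=1)
    nn = np.argsort(d)[:k]
    base = refs[nn].mean(axis=0)
    q = np.clip(np.round((v - base) / step), -127, 127).astype(np.int8)
    return nn, q

def decode(nn, q, refs, step=0.05):
    return refs[nn].mean(axis=0) + q.astype(np.float32) * step

rng = np.random.default_rng(2)
refs = rng.normal(size=(16, 32)).astype(np.float32)   # sampled reference set
v = refs[3] + 0.01 * rng.normal(size=32).astype(np.float32)
nn, q = encode(v, refs)
err = np.abs(decode(nn, q, refs) - v).max()
print(err)   # bounded by step/2 when the residual is not clipped
```

Each cached vector then costs a couple of reference indices plus one byte per channel, and redundancy across similar tokens (common in long multi-turn contexts) makes the residuals small.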
6. Extensions, Limitations, and Deployment Considerations
Key design and practical considerations for deploying sparse KV cache systems include:
- Adaptive vs. fixed budgets: Dynamic policies (DynamicKV, AdaKV) outperform static retention or block-size allocation, particularly on complex, unpredictable workloads (Zhou et al., 2024, Liu et al., 2 Aug 2025).
- Accuracy vs. compression trade-offs: All current techniques exhibit a graceful degradation curve, with sharp loss only when approaching extreme compression or omitting cross-layer sharing or dynamic buffer resizing (Mu et al., 28 Oct 2025, Kim et al., 2024, Liu et al., 2 Aug 2025).
- Compatibility with optimized kernels: Preference for designs amenable to CSR/sparse-dense matrix multiplication and direct integration with accelerator-resident kernels (S et al., 24 Nov 2025, Hao et al., 8 Feb 2026).
- Multi-turn/followup robustness: Dynamic and hybrid approaches adapt best to shifting token importance in long or multi-turn interactions (Li et al., 2024, Zhou et al., 2024).
- Hardware scalability: Distributed sharding, parallel scheduling, and efficient memory remapping are essential for scaling to multi-channel, multi-node, and multi-expert serving (Liu et al., 2 Aug 2025, Zhang et al., 2024).
- Training-free vs. model-specific adaptation: Most sparse KV cache systems require no model retraining (e.g., SWAN, TreeKV, CSR, Lexico) and can be plugged in at inference for any compatible transformer; others require calibration runs or offline dictionary construction (S et al., 24 Nov 2025, He et al., 9 Jan 2025, Kim et al., 2024, Zhang et al., 2024).
7. Summary Table: Representative Sparse KV Cache Approaches
| System/Method | Token Selection | Compression/Encoding | Key Features / Reference |
|---|---|---|---|
| PiKV | Adaptive scheduling, routing | Hierarchical compression | MoE, expert-sharded, distributed (Liu et al., 2 Aug 2025) |
| PureKV | Cross-layer attn/V-norm | Pruning + spatial-temporal masking | VLLMs, FlashAttention compatible (Jiang et al., 29 Oct 2025) |
| SALS | Top-k in latent space | Low-rank projection, selective reconstruct | SOTA memory and kernel speed (Mu et al., 28 Oct 2025) |
| HySparse | Oracle token selection | Blockwise cache sharing | Layer-hybrid (full/sparse), 10× reduction (Gao et al., 3 Feb 2026) |
| DynamicKV | Attention-based dynamic | Per-layer, periodic allocation | Task-aware, extreme compression (Zhou et al., 2024) |
| HashEvict | LSH Hamming pre-attention | Dynamic token eviction | Lightweight, 30–70% reduction (Liu et al., 2024) |
| LESS | Any eviction policy + low-rank accumulator | Low-rank memory absorbs evictions | No irrecoverable drops, all tokens queryable (Dong et al., 2024) |
| SWAN | Top-k dimension mask post-rotation | Direct inference with CSR format | Decompression-free KV access (S et al., 24 Nov 2025) |
| Lexico/CSR | Matching pursuit, dictionary | 1–4 term sparse code per KV | 1–2bit/channel, universal across tasks (Kim et al., 2024, Zhang et al., 2024) |
| BUZZ | Partitioned local max/interval | Beehive, sliding window | Low-overhead updates, high long-context quality (Zhao et al., 2024) |
| RocketKV | SnapKV++ + hybrid top-k | Two-stage, training-free | 3.7× speedup at 1.1% drop (Behnam et al., 19 Feb 2025) |
| TreeKV | Cyclic attn-importance eviction | Tree-structured multi-scale | Smooth retention, wavelet-motivated (He et al., 9 Jan 2025) |
| DeltaKV | Reference-based residual | Per-token mean+quantized delta | Sparse-vLLM kernel suite (Hao et al., 8 Feb 2026) |
| LeanKV | Attention significance (budgeted) per token/head | Hybrid quantization/prune | On-GPU compaction/parallelization (Zhang et al., 2024) |
In conclusion, the sparse KV cache paradigm is now fundamental to efficient inference and serving of long-context LLMs, complex MoE architectures, and VLLMs at scale. The landscape ranges from rigid static patterns through multi-level dynamic/adaptive schemes, orthogonal projection, and sparse coding, to cross-layer cache sharing and residual-based encodings. State-of-the-art systems consistently deliver order-of-magnitude memory savings with small, controllable performance impact, and give operators multiple dimensions (budget, compression method, and retention policy) along which to trade accuracy for efficiency in large-scale deployment.