PiKV: Efficient KV Cache for MoE
- PiKV is a key-value cache management system for mixture-of-experts architectures that reduces memory and communication bottlenecks during long-context inference.
- It integrates expert-sharded storage, cache-aware routing, adaptive scheduling, and compression to efficiently serve sparse MoE models in distributed multi-GPU setups.
- Empirical results show up to 3.9× memory reduction and 1.7× latency improvement, demonstrating its practical benefits in large-scale MoE deployments.
PiKV is a KV-cache management system for sparse Mixture-of-Experts (MoE) LLMs that treats the key-value cache as a distributed, query-driven, optimized data store rather than as a passive attention buffer. It is presented as a parallel and distributed KV cache serving framework tailored to MoE architecture, with four core elements: expert-sharded KV storage, PiKV routing, PiKV Scheduling, and PiKV Compression (Liu et al., 2 Aug 2025). Its stated objective is to reduce the memory and communication bottlenecks created by long-context autoregressive inference, especially in multi-GPU and multi-node settings where MoE computation is sparse but KV storage typically remains dense and globally synchronized.
1. Concept and motivation
PiKV is designed for the setting in which LLMs scale in both parameter count and context length, causing KV cache storage to become a dominant systems bottleneck. In autoregressive inference, each token attends to all previous tokens through cached keys and values. With sequence length $L$, hidden dimension $d$, and multi-head attention, the in-memory KV cache grows as $O(L \cdot d)$ per layer. The paper gives a concrete example of a “7B-scale MoE model with 128K context and 16 experts” whose full KV cache uses more than 24 GB (Liu et al., 2 Aug 2025).
The system is motivated by a structural mismatch in MoE inference. Sparse MoE layers activate only a small subset of experts for each token, but conventional KV caching remains dense. For each token in the input sequence, an MoE router selects the top-$k$ experts, and attention is computed only through those selected experts.
This yields sparse computation, but the KV caches are typically still stored as full, unpartitioned histories and are often replicated or synchronized across devices. PiKV is defined against this paradox: computation is sparse, but KV storage and access patterns are dense and global (Liu et al., 2 Aug 2025).
The framework therefore co-designs three decisions that are often treated independently: how tokens are routed to experts and KV shards, how KV entries are compressed, and how KV entries are retained or evicted under memory pressure. The paper formulates this as a joint optimization over the routing policy, the compression function, and the scheduling policy.
A plausible implication is that PiKV should be understood less as a single compression algorithm than as a serving substrate for MoE inference (Liu et al., 2 Aug 2025).
2. System architecture
PiKV is organized around four modules that jointly mediate all KV reads and writes during decoding.
| Module | Function | Main objective |
|---|---|---|
| Expert-sharded KV storage | Partition KV by token index and expert ID across devices | Lower per-GPU memory and improve locality |
| PiKV Routing | Choose experts in a cache-aware manner | Reduce token-to-KV access cost |
| PiKV Scheduling | Retain or evict KV pages adaptively | Stay within memory budget with high hit-rate |
| PiKV Compression | Compress KV representations on insertion | Reduce storage and read bandwidth |
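The execution framework is described as an asynchronous decoding loop. Given a query stream, an expert set, and a shard size, PiKV initializes a distributed cache, routing policy, scheduler, and compressor, then iterates over decoding steps. In each step the query is routed to experts, the new KV entry is compressed and inserted into the owning shard, the scheduler updates which entries stay resident, and attention is computed over the relevant shards.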
In this arrangement, PiKV sits between the attention layers and device memory: Shard(t, e) maps token-expert pairs to shards, C[e][s] denotes per-expert per-shard KV buffers, the compressor acts on insertion, the scheduler manages cache lifetime, and attention consumes only the relevant sharded cache (Liu et al., 2 Aug 2025).
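The loop described above can be sketched in a few lines. This is a minimal single-process illustration, not the paper's implementation: the names `decode`, `shard_of`, `route`, and `compress` are hypothetical, the shard mapping is an assumed round-robin rule, and a bounded `deque` stands in for the scheduler by dropping the oldest entry when a shard is full.

```python
from collections import defaultdict, deque


def shard_of(t, e, n_token_shards):
    # Hypothetical mapping: spread an expert's tokens round-robin over its shards.
    return t % n_token_shards


def decode(queries, route, compress, n_token_shards=2, shard_capacity=4):
    """Run a toy decode loop and return the cache C[e][s] after all steps."""
    # C[e][s] is a bounded buffer, so the oldest entry is dropped when it is full.
    cache = defaultdict(lambda: defaultdict(lambda: deque(maxlen=shard_capacity)))
    for t, q in enumerate(queries):
        for e in route(q):                      # 1. pick experts for this query
            s = shard_of(t, e, n_token_shards)  # 2. locate the owning shard
            cache[e][s].append(compress(q))     # 3. compress on insertion
            # 4. attention would read only cache[e][...] here (omitted)
    return cache


# Toy usage: integer "queries", one expert per query, identity compression.
cache = decode(range(10), route=lambda q: [q % 3], compress=lambda q: q)
```

In a real deployment each `cache[e][s]` would live on a specific GPU, and steps 1 to 4 would overlap asynchronously rather than run in sequence.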
This architecture is explicitly targeted at distributed MoE deployments. The paper states that PiKV is an open-source software library and also reports an integration with NVIDIA kvpress through PiKVpress (Liu et al., 2 Aug 2025).
3. Sharding and routing mechanisms
The storage layer replaces dense replicated KV with expert-sharded KV storage. Instead of storing the full sequence history on every device, PiKV partitions KV entries by both token index and expert ID, assigning each shard to a specific GPU. Given a KV pair $(k_t, v_t)$ at time $t$ for expert $e$, the shard index $\mathrm{Shard}(t, e)$ is computed from the token index, the expert ID, the number of token shards, and the number of expert shards. Each shard is implemented as a circular buffer with capacity $S$, allowing $O(1)$ insertion and overwrite of the oldest entries when full (Liu et al., 2 Aug 2025).
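A circular-buffer shard of this kind can be sketched as follows. The class `CircularShard` and the function `shard_index` are illustrative names; in particular, the mapping in `shard_index` is an assumed example of combining token and expert indices and is not the formula from the paper.

```python
class CircularShard:
    """Fixed-capacity ring buffer of (token_index, key, value) entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0  # total number of entries ever inserted

    def insert(self, t, key, value):
        # O(1): the write position wraps around and overwrites the oldest slot.
        self.slots[self.count % self.capacity] = (t, key, value)
        self.count += 1

    def entries(self):
        # Return live entries ordered from oldest to newest.
        if self.count <= self.capacity:
            return self.slots[:self.count]
        start = self.count % self.capacity
        return self.slots[start:] + self.slots[:start]


def shard_index(t, e, n_token_shards, n_expert_shards):
    # Hypothetical mapping, NOT the paper's formula: one shard per
    # (expert group, token group) pair, flattened to a single integer.
    return (e % n_expert_shards) * n_token_shards + (t % n_token_shards)


shard = CircularShard(capacity=3)
for t in range(5):
    shard.insert(t, key=f"k{t}", value=f"v{t}")
live_tokens = [entry[0] for entry in shard.entries()]  # tokens 0 and 1 were overwritten
```

Overwriting in place keeps insertion cost constant and bounds per-shard memory, at the price of discarding the oldest entries regardless of their importance; the scheduling policies discussed later address that trade-off.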
If compression reduces the KV width from $d$ to $d'$, the memory of a shard holding $S$ entries scales as $O(S \cdot d')$ rather than $O(S \cdot d)$.
This localizes KV ownership to particular devices, but it also introduces the possibility of remote fetches when a routed expert’s shard resides elsewhere. PiKV addresses that systems problem through cache-aware routing and scheduling rather than through storage alone (Liu et al., 2 Aug 2025).
PiKV Routing extends ordinary MoE gating into a cache-aware routing policy. The routing function maps each query to a small subset of experts, and therefore to the KV shards those experts own. The paper enumerates several routing mechanisms, including base hash or round-robin, TopK softmax, TopK with load balance, cache-aware PiKVRouter, entropy-penalized load balancing, RL-adaptive gating, and hierarchical coarse-to-fine routing (Liu et al., 2 Aug 2025).
In PiKVRouter, expert scoring is modified by a cache-miss penalty, which lowers the score of an expert whose KV shard is likely to miss in cache.
This expresses the central design choice of PiKV routing: prefer experts whose caches are warm, while still respecting MoE gating. The paper also presents load-balancing penalties and entropy penalties, suggesting that routing in PiKV is a systems-level policy rather than a purely modeling-level decision (Liu et al., 2 Aug 2025).
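A cache-miss penalty of this kind can be illustrated with a small top-k router. The scoring rule below (gate logit minus a weighted miss probability) is an assumption made for the sketch, as are the names `cache_aware_topk`, `miss_prob`, and `penalty`; the paper's exact scoring expression is not reproduced here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def cache_aware_topk(gate_logits, miss_prob, k=2, penalty=1.0):
    """Return (expert_ids, weights) for the k best penalised experts."""
    # Assumed rule: gate logit minus penalty * estimated cache-miss probability.
    scores = [g - penalty * m for g, m in zip(gate_logits, miss_prob)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])  # renormalise over chosen experts
    return top, weights


gate_logits = [2.0, 1.9, 0.5, 0.1]
miss_prob = [0.9, 0.0, 0.1, 0.0]   # expert 0 has a cold cache
plain, _ = cache_aware_topk(gate_logits, miss_prob, penalty=0.0)
aware, _ = cache_aware_topk(gate_logits, miss_prob, penalty=2.0)
```

With `penalty=0.0` the router reduces to ordinary top-k gating; raising the penalty moves traffic away from the cold expert 0 and toward experts whose shards are already resident, which is the trade-off between model preference and cache locality that the text describes.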
For memory traffic, the paper contrasts dense and sparse MoE attention. With dense access, each decoding step reads the KV history of every expert, whereas with sparse PiKV routing it reads only the shards of the routed experts. The paper quantifies the resulting reduction factor, states a heuristic cache hit-rate approximation, and derives a throughput scaling factor.
These expressions formalize PiKV’s claim that sparse expert activation should induce sparse KV access as well (Liu et al., 2 Aug 2025).
4. Scheduling and compression
PiKV Scheduling governs which KV pages remain in GPU memory under a fixed budget. The paper describes page-based scheduling rather than token-only policies, aligning the eviction unit with the sharded storage layout. Each page or token group is assigned a utility score, and low-utility pages are evicted. Enumerated policies include attention-based scoring, age-based scheduling, MLP-based scoring, planning-based scheduling, LRU, LRU+, AdaKV, and Duo (Liu et al., 2 Aug 2025).
In the adaptive case, the utility of a page is computed from features such as recency, frequency, and attention, and the policy is adjusted using the measured cache hit-rate relative to a target hit-rate. The stated purpose is to maintain a high hit-rate while respecting a memory ceiling (Liu et al., 2 Aug 2025).
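Utility-based eviction with hit-rate feedback can be sketched as below. Everything specific here is an assumption for illustration: the linear utility in `utility`, the feedback rule in `adapt_budget`, and all function and parameter names. The paper's adaptive formula is not reproduced.

```python
FEATURES = ("recency", "frequency", "attention")


def utility(page, weights):
    # Assumed linear score over the features named in the text.
    return sum(weights[name] * page[name] for name in FEATURES)


def evict_to_budget(pages, weights, budget):
    """Keep the `budget` highest-utility pages; return (kept_ids, evicted_ids)."""
    ranked = sorted(pages, key=lambda pid: utility(pages[pid], weights), reverse=True)
    return ranked[:budget], ranked[budget:]


def adapt_budget(budget, hit_rate, target, step=1, max_budget=64):
    # Assumed feedback rule: grow the resident set while the measured hit-rate
    # is below target (never past the memory ceiling), shrink it when above.
    if hit_rate < target:
        return min(max_budget, budget + step)
    if hit_rate > target:
        return max(1, budget - step)
    return budget


pages = {
    "a": {"recency": 0.9, "frequency": 0.1, "attention": 0.2},
    "b": {"recency": 0.1, "frequency": 0.8, "attention": 0.9},
    "c": {"recency": 0.2, "frequency": 0.1, "attention": 0.1},
}
weights = {"recency": 1.0, "frequency": 1.0, "attention": 1.0}
kept, evicted = evict_to_budget(pages, weights, budget=2)
```

Working at page granularity keeps the eviction unit aligned with the sharded layout, so a single decision frees a contiguous block rather than scattered tokens.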
The memory model decomposes total per-GPU KV memory into a token-buffer term and a page-buffer term, whose sum is the total. Minimizing this total with respect to shard capacity yields a closed-form optimal capacity and a corresponding minimum memory footprint.
This is one of the key analytical design rules in the PiKV system (Liu et al., 2 Aug 2025).
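To make the shape of this optimization concrete, the sketch below assumes a generic cost with one term that grows linearly in shard capacity and one that shrinks inversely with it. This form is an assumption chosen because it produces the square-root behavior the article attributes to the optimal shard size; the coefficients `a` and `b` are placeholders, and the paper's actual token-buffer and page-buffer terms are not reproduced.

```python
import math


def total_memory(s, a, b):
    # Assumed form: one term grows with shard capacity, the other shrinks with it.
    return a * s + b / s


def optimal_capacity(a, b):
    # Setting d/ds (a*s + b/s) = a - b/s**2 to zero gives s* = sqrt(b/a).
    return math.sqrt(b / a)


def minimum_memory(a, b):
    # Substituting s* back into the total gives 2*sqrt(a*b).
    return 2.0 * math.sqrt(a * b)


a, b = 2.0, 8.0
s_star = optimal_capacity(a, b)
best = minimum_memory(a, b)
```

Under this assumed form, capacities on either side of `s_star` cost more memory: too small a shard inflates the inverse term, too large a shard inflates the linear one.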
PiKV Compression reduces each KV vector to a lower-dimensional representation before storage. The reconstruction error is defined as the discrepancy between the original vector and the output of a decoder applied to its compressed form. The paper lists LoRA, LoRA++, PyramidKV, ChunkKV, truncated SVD, FastV, distillation, and structured pruning as supported compression families (Liu et al., 2 Aug 2025).
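The encoder, decoder, and reconstruction error can be illustrated with a simple linear projection. This is a stand-in and not one of the paper's listed methods: it projects onto a random orthonormal basis built by Gram-Schmidt, uses the transpose as the decoder, and measures the error with the Euclidean norm, which is an assumed choice. All function names are hypothetical.

```python
import math
import random


def orthonormal_rows(d_out, d_in, seed=0):
    """Build d_out orthonormal vectors of length d_in (requires d_out <= d_in)."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < d_out:
        v = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
        for r in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def compress(x, proj):
    # Encoder: width d -> d'
    return [sum(p * a for p, a in zip(row, x)) for row in proj]


def decompress(z, proj):
    # Decoder: width d' -> d, using the transpose of the projection
    return [sum(z[i] * proj[i][j] for i in range(len(proj))) for j in range(len(proj[0]))]


def reconstruction_error(x, proj):
    x_hat = decompress(compress(x, proj), proj)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat)))


x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]
half = orthonormal_rows(4, 8)   # store 4 numbers instead of 8
err = reconstruction_error(x, half)
```

Because the rows are orthonormal, the error can never exceed the norm of the original vector and drops to zero when the projection keeps the full width. Learned or data-dependent methods such as truncated SVD aim to reach a lower error at the same width.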
The latency model separates read and decode costs and combines them into a total per-step latency. For a given compression ratio, the paper also derives an idealized speedup.
The paper therefore treats compression as a first-class systems control knob rather than as a secondary optimization (Liu et al., 2 Aug 2025).
5. Analytical behavior and empirical results
The analytical framing of PiKV centers on memory, communication, and throughput. Relative to dense replicated KV, the paper states that PiKV’s sharded and compressed memory shows sublinear, square-root growth at the optimal shard size. The communication analysis is not summarized by a single closed form, but the system analysis reports that PiKV communication patterns are 58–62% local, 28–32% remote, and 8–12% broadcast/sync, with 45–52% communication efficiency improvement compared to naive distributed KV (Liu et al., 2 Aug 2025).
The experimental evaluation covers Switch-Transformer-1.6T, GLaM-1.2T, PaLM-540B in expert-parallel form, and Mixtral-8x7B on multi-GPU clusters with NVLink and InfiniBand HDR200. Across these models, the paper reports up to 3.9× memory reduction, up to 1.7× latency improvement, and competitive accuracy (Liu et al., 2 Aug 2025).
A model-level summary given in the paper is as follows.
| Model | Throughput | Memory | Accuracy drop | Latency |
|---|---|---|---|---|
| Switch-1.6T | 2.8× | 3.2× | 1.2% | 2.1× |
| GLaM-1.2T | 2.3× | 2.9× | 0.8% | 1.9× |
| PaLM-540B | 3.1× | 3.5× | 1.5% | 2.4× |
| Mixtral-8x7B | 2.5× | 2.8× | 1.1% | 2.0× |
On LongBench, the paper reports accuracy within about 1.2% of full KV on all datasets, memory reduction around 2.9×, and throughput improvement around 2.4×. For NarrativeQA and HotpotQA, it reports 77.2% and 73.3% for PiKV, compared with 66.3% and 60.0% for H2O (Liu et al., 2 Aug 2025).
The paper also emphasizes scaling with context length. For Switch-1.6T, throughput improvement grows from about 1.4× at 4K context to about 2.7× at 64K, and latency is described as sublinear in sequence growth, whereas competitors show near-quadratic scaling (Liu et al., 2 Aug 2025).
Ablation studies attribute different benefits to the individual modules. Adaptive routing reduces accuracy degradation from about 1.3% to about 0.6% versus baseline gating. Compression provides the largest memory gains, up to about 2.8× memory reduction, with about 1.9% accuracy drop at aggressive settings. Scheduling improves KV cache hit-rate from about 78% to about 94%. The combined system yields about 1.9× memory reduction, about 20% latency improvement, about 89% hit-rate, and about 1.0–1.2% accuracy drop from full KV (Liu et al., 2 Aug 2025).
6. Deployment, limitations, and nomenclature
PiKV is explicitly positioned as an inference-time framework for MoE architectures rather than as a training-time method. The paper states that it is most beneficial when context lengths are large, especially 32K–128K, and when multi-GPU or multi-node deployments are dominated by KV-memory footprint and KV-related communication (Liu et al., 2 Aug 2025). The practical guidance recommends using expert-sharded KV with the provided shard mapping, setting shard capacity close to the analytically optimal value, starting from TopK softmax or load-balanced TopK routing, enabling cache-aware PiKVRouter when cache misses are high, and choosing scheduling and compression policies according to the desired trade-off between accuracy and memory savings (Liu et al., 2 Aug 2025).
The limitations identified in the paper are primarily systems-level. For small workloads or low concurrency, the overhead of routing, complex scheduling, and compression may outweigh the benefit. Very aggressive compression can cause noticeable accuracy degradation, reaching 10–13% in extreme experiments. Complex routing strategies such as RL-based or hierarchical methods may introduce extra overhead. The paper also notes edge cases in which irregular routing or skewed expert use can create hot-spot shards or communication spikes if routing is not well tuned (Liu et al., 2 Aug 2025).
Future work is framed around online adaptation, hierarchical memory tiers such as GPU+CPU+NVMe, and tighter integration with training-time sparsity. The project is described as “a living project” aiming to become “a comprehesive KV Cache management system for MoE Architectures” (Liu et al., 2 Aug 2025).
Within the broader “PiK*” naming landscape, PiKV is distinct from several unrelated methods whose names can appear superficially similar. “Parameter-Inverted Image Pyramid Networks” defines PIIP for multi-scale vision backbones rather than KV caching (Zhu et al., 2024). “PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch” introduces PiKa alignment datasets and explicitly notes that there is no component in that paper called “PiKV” (Yin et al., 8 Oct 2025). “APIK: Active Physics-Informed Kriging Model with Partial Differential Equations” uses PIK and APIK for PDE-informed Gaussian-process modeling, again unrelated to MoE cache serving (Chen et al., 2020). This suggests that, in current arXiv usage, PiKV refers specifically to the MoE-oriented KV cache serving system rather than to a shared family of methods.