
PiKV: Efficient KV Cache for MoE

Updated 4 July 2026
  • PiKV is a key-value cache management system for mixture-of-experts architectures that reduces memory and communication bottlenecks during long-context inference.
  • It integrates expert-sharded storage, cache-aware routing, adaptive scheduling, and compression to efficiently serve sparse MoE models in distributed multi-GPU setups.
  • Empirical results show up to 3.9× memory reduction and 1.7× latency improvement, demonstrating its practical benefits in large-scale MoE deployments.

PiKV is a KV-cache management system for sparse Mixture-of-Experts (MoE) LLMs that treats the key-value cache as a distributed, query-driven, optimized data store rather than as a passive attention buffer. It is presented as a parallel and distributed KV cache serving framework tailored to MoE architectures, with four core elements: expert-sharded KV storage, PiKV Routing, PiKV Scheduling, and PiKV Compression (Liu et al., 2 Aug 2025). Its stated objective is to reduce the memory and communication bottlenecks created by long-context autoregressive inference, especially in multi-GPU and multi-node settings where MoE computation is sparse but KV storage typically remains dense and globally synchronized.

1. Concept and motivation

PiKV is designed for the setting in which LLMs scale in both parameter count and context length, causing KV cache storage to become a dominant systems bottleneck. In autoregressive inference, each token attends to all previous tokens through cached keys and values,

$$\text{Attn}(x_t) = \sum_{\tau < t} \alpha_\tau \cdot V_\tau, \quad (K_\tau, V_\tau) \in \text{KVCache}.$$

With sequence length $L$, hidden dimension $d$, and multi-head attention, the in-memory KV cache grows as $O(Ld)$ per layer. The paper gives a concrete example of a “7B-scale MoE model with 128K context and 16 experts” whose full KV cache uses more than 24GB (Liu et al., 2 Aug 2025).
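The linear growth in sequence length can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: the function name `kv_cache_bytes` and the layer count, KV-head count, head dimension, and 2-byte element size are assumptions chosen for the example, not the configuration measured in the paper, so the totals are not expected to match the paper's 24GB figure.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    Per token and per layer, the cache holds one key and one value vector
    of width num_kv_heads * head_dim, so the total is linear in seq_len.
    """
    per_token_per_layer = 2 * num_kv_heads * head_dim  # keys + values
    return seq_len * num_layers * per_token_per_layer * bytes_per_elem


# Hypothetical configuration (an assumption, not the paper's model).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(seq_len, **cfg) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.2f} GiB")
# prints 0.50 GiB, 4.00 GiB, and 16.00 GiB respectively
```

Under these assumed settings, going from 4K to 128K context multiplies the cache by 32, which is the pressure that motivates sharding and compressing the cache rather than replicating it.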

The system is motivated by a structural mismatch in MoE inference. Sparse MoE layers activate only a small subset of experts for each token, but conventional KV caching remains dense. For a token sequence $\{x_t\}_{t=1}^{L}$, an MoE router selects the top-$k$ experts $\mathcal{R}(x_t)$, and attention is written as

$$\text{Attn}(x_t) = \sum_{e \in \mathcal{R}(x_t)} \sum_{\tau < t} \alpha_{\tau}^{(e)} \cdot \text{value}_\tau^{(e)}.$$

This yields sparse computation, but the KV caches are typically still stored as full, unpartitioned histories and are often replicated or synchronized across devices. PiKV is defined against this paradox: computation is sparse, but KV storage and access patterns are dense and global (Liu et al., 2 Aug 2025).

The framework therefore co-designs three decisions that are often treated independently: how tokens are routed to experts and KV shards, how KV entries are compressed, and how KV entries are retained or evicted under memory pressure. The paper formulates this as a joint optimization over routing $\mathcal{R}$, compression $\mathcal{C}$, and scheduling $\mathcal{S}$,

LL1

A plausible implication is that PiKV should be understood less as a single compression algorithm than as a serving substrate for MoE inference (Liu et al., 2 Aug 2025).

2. System architecture

PiKV is organized around four modules that jointly mediate all KV reads and writes during decoding.

| Module | Function | Main objective |
| --- | --- | --- |
| Expert-sharded KV storage | Partition KV by token index and expert ID across devices | Lower per-GPU memory and improve locality |
| PiKV Routing | Choose experts in a cache-aware manner | Reduce token-to-KV access cost |
| PiKV Scheduling | Retain or evict KV pages adaptively | Stay within memory budget with high hit-rate |
| PiKV Compression | Compress KV representations on insertion | Reduce storage and read bandwidth |

The execution framework is described as an asynchronous decoding loop. Given a query stream, an expert set, and a shard size, PiKV initializes a distributed cache, routing policy, scheduler, and compressor, then iterates over decoding steps. In this arrangement, PiKV sits between the attention layers and device memory: Shard(t, e) maps token-expert pairs to shards, C[e][s] denotes per-expert per-shard KV buffers, the compressor acts on insertion, the scheduler manages cache lifetime, and attention consumes only the relevant sharded cache (Liu et al., 2 Aug 2025).

This architecture is explicitly targeted at distributed MoE deployments. The paper states that PiKV is an open-source software library and also reports an integration with Nvidia kvpress through PiKVpress (Liu et al., 2 Aug 2025).

3. Sharding and routing mechanisms

The storage layer replaces dense replicated KV with expert-sharded KV storage. Instead of storing the full sequence history on every device, PiKV partitions KV entries by both token index and expert ID, assigning each shard to a specific GPU. Given a KV pair at time $t$ for expert $e$, the shard index is defined as

LL8

where LL9 is the number of token shards and dd0 is the number of expert shards. Each shard is implemented as a circular buffer with a fixed capacity, allowing $O(1)$ insertion and overwrite of the oldest entries when full (Liu et al., 2 Aug 2025).
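The circular-buffer behaviour of a single shard can be sketched as follows. This is a minimal illustration, not the PiKV library's implementation: the class name `RingShard` and its methods are hypothetical, keys and values are stored as opaque Python objects, and the mapping from token-expert pairs to shards is deliberately left out.

```python
class RingShard:
    """Fixed-capacity KV shard that overwrites its oldest entry when full.

    Hypothetical illustration of a circular buffer; not the PiKV API.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.slots = [None] * capacity  # each slot holds (token_index, key, value)
        self.next_slot = 0              # position of the next write
        self.size = 0                   # number of live entries

    def insert(self, token_index, key, value):
        """Constant-time insert. Returns the evicted token index, or None."""
        old = self.slots[self.next_slot]
        self.slots[self.next_slot] = (token_index, key, value)
        self.next_slot = (self.next_slot + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return None if old is None else old[0]

    def entries(self):
        """Live entries ordered from oldest to newest."""
        if self.size < self.capacity:
            return self.slots[:self.size]
        return self.slots[self.next_slot:] + self.slots[:self.next_slot]


shard = RingShard(capacity=3)
evicted = [shard.insert(t, f"k{t}", f"v{t}") for t in range(5)]
print(evicted)                             # [None, None, None, 0, 1]
print([t for t, _, _ in shard.entries()])  # [2, 3, 4]
```

Overwriting in place keeps insertion cost independent of the shard's fill level, at the price of a purely age-based eviction order inside the shard; smarter retention decisions are left to the scheduler.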

If compression reduces the KV width from dd3 to dd4, the memory of a shard is

dd5

This localizes KV ownership to particular devices, but it also introduces the possibility of remote fetches when a routed expert’s shard resides elsewhere. PiKV addresses that systems problem through cache-aware routing and scheduling rather than through storage alone (Liu et al., 2 Aug 2025).

PiKV Routing extends ordinary MoE gating into a cache-aware routing policy. The routing function is written as dd6 with dd7 and dd8. The paper enumerates several routing mechanisms, including base hash or round-robin, TopK softmax, TopK with load balance, cache-aware PiKVRouter, entropy-penalized load balancing, RL-adaptive gating, and hierarchical coarse-to-fine routing (Liu et al., 2 Aug 2025).

In PiKVRouter, expert scoring is modified by a cache-miss penalty:

dd9

This expresses the central design choice of PiKV routing: prefer experts whose caches are warm, while still respecting MoE gating. The paper also presents load-balancing penalties and entropy penalties, suggesting that routing in PiKV is a systems-level policy rather than a purely modeling-level decision (Liu et al., 2 Aug 2025).
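The idea of trading gate score against cache warmth can be illustrated with a toy router. The sketch assumes a simple linear form (gate logit minus a weighted per-expert miss rate); the exact penalty used by PiKVRouter is not reproduced here, and the function name `cache_aware_top_k` and the `penalty` parameter are hypothetical.

```python
def cache_aware_top_k(gate_logits, miss_rate, k, penalty=1.0):
    """Pick k experts by gate logit minus a cache-miss penalty.

    gate_logits: per-expert scores from the ordinary MoE gate.
    miss_rate:   per-expert estimate in [0, 1] of how often that expert's
                 KV shard misses (assumed to be tracked elsewhere).
    penalty:     assumed linear weight; 0 recovers plain top-k gating.
    """
    if len(gate_logits) != len(miss_rate):
        raise ValueError("need one miss rate per expert")
    adjusted = [g - penalty * m for g, m in zip(gate_logits, miss_rate)]
    ranked = sorted(range(len(adjusted)), key=lambda e: adjusted[e], reverse=True)
    return ranked[:k]


gate_logits = [2.0, 1.9, 0.5, 0.1]
miss_rate = [0.9, 0.1, 0.2, 0.0]   # expert 0 has a cold cache

print(cache_aware_top_k(gate_logits, miss_rate, k=2, penalty=0.0))  # [0, 1]
print(cache_aware_top_k(gate_logits, miss_rate, k=2, penalty=3.0))  # [1, 3]
```

With no penalty the router follows the gate alone; with a large penalty the cold expert 0 is dropped in favour of experts whose caches are warm, which is the trade-off between model preference and access cost that the penalty weight controls.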

For memory traffic, the paper contrasts dense and sparse MoE attention. With dense access,

O(Ld)O(Ld)0

whereas with sparse PiKV routing,

O(Ld)O(Ld)1

The reduction factor is therefore approximately O(Ld)O(Ld)2. The paper further states a heuristic cache hit-rate approximation of O(Ld)O(Ld)3, and derives a throughput scaling factor

O(Ld)O(Ld)4

These expressions formalize PiKV’s claim that sparse expert activation should induce sparse KV access as well (Liu et al., 2 Aug 2025).

4. Scheduling and compression

PiKV Scheduling governs which KV pages remain in GPU memory under a fixed budget. The paper describes page-based scheduling rather than token-only policies, aligning the eviction unit with the sharded storage layout. Each page or token group is assigned a utility O(Ld)O(Ld)5, and low-utility pages are evicted. Enumerated policies include attention-based scoring (O(Ld)O(Ld)6), age-based scheduling, MLP-based scoring (O(Ld)O(Ld)7), planning-based scheduling, LRU, LRU+, AdaKV, and Duo (Liu et al., 2 Aug 2025).

The adaptive case is summarized as

O(Ld)O(Ld)8

where O(Ld)O(Ld)9 are features such as recency, frequency, and attention, {xt}t=1L\{x_t\}_{t=1}^{L}0 is the measured cache hit-rate, and {xt}t=1L\{x_t\}_{t=1}^{L}1 is the target hit-rate. The stated purpose is to maintain a high hit-rate while respecting a memory ceiling (Liu et al., 2 Aug 2025).
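A utility-driven eviction step with hit-rate feedback might look like the following. This is a sketch under stated assumptions rather than the paper's scheduler: the utility is a fixed linear combination of (recency, frequency, attention) features instead of a learned scorer, and the feedback rule that widens or narrows the retained page budget is a hypothetical stand-in. All function names are illustrative.

```python
def page_utility(features, weights):
    """Linear utility over (recency, frequency, attention) features."""
    return sum(w * f for w, f in zip(weights, features))


def evict_to_budget(pages, weights, budget):
    """Keep the `budget` highest-utility pages.

    pages: dict mapping page id -> feature tuple.
    Returns (kept_ids, evicted_ids), both ordered by descending utility.
    """
    ranked = sorted(pages, key=lambda p: page_utility(pages[p], weights), reverse=True)
    return ranked[:budget], ranked[budget:]


def adjust_budget(budget, hit_rate, target, step, max_budget):
    """Assumed feedback rule: retain more pages when the measured hit-rate
    is below target, fewer when above, never exceeding the memory ceiling."""
    if hit_rate < target:
        return min(max_budget, budget + step)
    return max(1, budget - step)


pages = {
    "p0": (0.9, 0.2, 0.7),
    "p1": (0.1, 0.1, 0.1),
    "p2": (0.5, 0.9, 0.3),
    "p3": (0.2, 0.3, 0.05),
}
weights = (0.5, 0.3, 0.2)

kept, evicted = evict_to_budget(pages, weights, budget=2)
print(kept, evicted)  # ['p0', 'p2'] ['p3', 'p1']
print(adjust_budget(2, hit_rate=0.78, target=0.90, step=1, max_budget=3))  # 3
```

Separating scoring from the budget controller keeps the memory ceiling a hard constraint (`max_budget`) while letting the hit-rate target influence how aggressively pages are dropped below it.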

The memory model decomposes total per-GPU KV memory into token-buffer and page-buffer terms:

{xt}t=1L\{x_t\}_{t=1}^{L}2

{xt}t=1L\{x_t\}_{t=1}^{L}3

so that

{xt}t=1L\{x_t\}_{t=1}^{L}4

Minimizing with respect to shard capacity {xt}t=1L\{x_t\}_{t=1}^{L}5 yields

{xt}t=1L\{x_t\}_{t=1}^{L}6

and the corresponding minimum

{xt}t=1L\{x_t\}_{t=1}^{L}7

This is one of the key analytical design rules in the PiKV system (Liu et al., 2 Aug 2025).

PiKV Compression reduces each KV vector before storage:

{xt}t=1L\{x_t\}_{t=1}^{L}8

The reconstruction error is defined as

{xt}t=1L\{x_t\}_{t=1}^{L}9

with decoder kk0. The paper lists LoRA, LoRA++, PyramidKV, ChunkKV, truncated SVD, FastV, distillation, and structured pruning as supported compression families (Liu et al., 2 Aug 2025).
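The compress-then-reconstruct pattern can be shown with a deliberately simple stand-in. The sketch below is none of the listed families (LoRA, PyramidKV, truncated SVD, and so on): it uses a fixed orthonormal projection that averages adjacent coordinate pairs, halving the vector width, and it assumes the reconstruction error is the squared Euclidean distance between a vector and its decoded version. Function names are hypothetical.

```python
import math

_SQRT2 = math.sqrt(2.0)


def compress(x):
    """Project a length-d vector (d even) onto d/2 orthonormal directions,
    each the normalized sum of one adjacent coordinate pair."""
    if len(x) % 2:
        raise ValueError("vector length must be even")
    return [(x[i] + x[i + 1]) / _SQRT2 for i in range(0, len(x), 2)]


def decompress(z):
    """Apply the transpose of the projection: each code value is spread
    back over its coordinate pair."""
    out = []
    for c in z:
        out.extend((c / _SQRT2, c / _SQRT2))
    return out


def reconstruction_error(x):
    """Squared Euclidean distance between x and its round-trip reconstruction."""
    x_hat = decompress(compress(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


x = [1.0, 1.0, 2.0, 4.0]
print(len(compress(x)))         # 2: half the original width
print(decompress(compress(x)))  # approximately [1.0, 1.0, 3.0, 3.0]
print(reconstruction_error(x))  # approximately 2.0
```

The first pair is reconstructed exactly because its two coordinates are equal, while the second pair loses the difference between 2.0 and 4.0. That is the general shape of the trade-off: width is halved, and the error depends on how much of the vector lies outside the retained subspace.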

The latency model separates read and decode costs:

kk1

kk2

and thus

kk3

For compression ratios kk4, the idealized speedup is

kk5

The paper therefore treats compression as a first-class systems control knob rather than as a secondary optimization (Liu et al., 2 Aug 2025).

5. Analytical behavior and empirical results

The analytical framing of PiKV centers on memory, communication, and throughput. Relative to dense replicated KV, the paper states that PiKV’s sharded and compressed memory grows according to

kk6

with sublinear square-root behavior at the optimal shard size. The communication analysis is not summarized by a single closed-form, but the system analysis reports that PiKV communication patterns are 58–62% local, 28–32% remote, and 8–12% broadcast/sync, with 45–52% communication efficiency improvement compared to naive distributed KV (Liu et al., 2 Aug 2025).

The experimental evaluation covers Switch-Transformer-1.6T, GLaM-1.2T, PaLM-540B in expert-parallel form, and Mixtral-8x7B on multi-GPU clusters with NVLink and InfiniBand HDR200. Across these models, the paper reports up to 3.9× memory reduction, up to 1.7× latency improvement, and competitive accuracy (Liu et al., 2 Aug 2025).

A model-level summary given in the paper is as follows.

| Model | Throughput | Memory | Accuracy drop | Latency |
| --- | --- | --- | --- | --- |
| Switch-1.6T | 2.8× | 3.2× | 1.2% | 2.1× |
| GLaM-1.2T | 2.3× | 2.9× | 0.8% | 1.9× |
| PaLM-540B | 3.1× | 3.5× | 1.5% | 2.4× |
| Mixtral-8x7B | 2.5× | 2.8× | 1.1% | 2.0× |

On LongBench, the paper reports accuracy within about 1.2% of full KV on all datasets, memory reduction around 2.9×, and throughput improvement around 2.4×. For NarrativeQA and HotpotQA, it reports 77.2% and 73.3% for PiKV, compared with 66.3% and 60.0% for H2O (Liu et al., 2 Aug 2025).

The paper also emphasizes scaling with context length. For Switch-1.6T, throughput improvement grows from about 1.4× at 4K context to about 2.7× at 64K, and latency is described as sublinear in sequence growth, whereas competitors show near-quadratic scaling (Liu et al., 2 Aug 2025).

Ablation studies attribute different benefits to the individual modules. Adaptive routing reduces accuracy degradation from about 1.3% to about 0.6% versus baseline gating. Compression provides the largest memory gains, up to about 2.8× memory reduction, with about 1.9% accuracy drop at aggressive settings. Scheduling improves KV cache hit-rate from about 78% to about 94%. The combined system yields about 1.9× memory reduction, about 20% latency improvement, about 89% hit-rate, and about 1.0–1.2% accuracy drop from full KV (Liu et al., 2 Aug 2025).

6. Deployment, limitations, and nomenclature

PiKV is explicitly positioned as an inference-time framework for MoE architectures rather than as a training-time method. The paper states that it is most beneficial when context lengths are large, especially 32K–128K, and when multi-GPU or multi-node deployments are dominated by KV-memory footprint and KV-related communication (Liu et al., 2 Aug 2025). The practical guidance recommends using expert-sharded KV with the provided shard mapping, setting shard capacity close to

kk7

starting from TopK softmax or load-balanced TopK routing, enabling cache-aware PiKVRouter when cache misses are high, and choosing scheduling and compression policies according to the desired trade-off between accuracy and memory savings (Liu et al., 2 Aug 2025).

The limitations identified in the paper are primarily systems-level. For small workloads or low concurrency, the overhead of routing, complex scheduling, and compression may outweigh the benefit. Very aggressive compression can cause noticeable accuracy degradation, reaching 10–13% in extreme experiments. Complex routing strategies such as RL-based or hierarchical methods may introduce extra overhead. The paper also notes edge cases in which irregular routing or skewed expert use can create hot-spot shards or communication spikes if routing is not well tuned (Liu et al., 2 Aug 2025).

Future work is framed around online adaptation, hierarchical memory tiers such as GPU+CPU+NVMe, and tighter integration with training-time sparsity. The project is described as “a living project” aiming to become “a comprehesive KV Cache management system for MoE Architectures” (Liu et al., 2 Aug 2025).

Within the broader “PiK*” naming landscape, PiKV is distinct from several unrelated methods whose names can appear superficially similar. “Parameter-Inverted Image Pyramid Networks” defines PIIP for multi-scale vision backbones rather than KV caching (Zhu et al., 2024). “PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch” introduces PiKa alignment datasets and explicitly notes that there is no component in that paper called “PiKV” (Yin et al., 8 Oct 2025). “APIK: Active Physics-Informed Kriging Model with Partial Differential Equations” uses PIK and APIK for PDE-informed Gaussian-process modeling, again unrelated to MoE cache serving (Chen et al., 2020). This suggests that, in current arXiv usage, PiKV refers specifically to the MoE-oriented KV cache serving system rather than to a shared family of methods.
