ShadowServe: LLM Caching System
- ShadowServe is a prefix-caching system that offloads KV cache processing to SmartNICs, reducing GPU interference and enabling high-throughput LLM serving.
- Its architecture separates functions between a host-based control plane and a SmartNIC-based data plane, employing a chunked processing pipeline for parallel operations.
- Empirical analysis shows that ShadowServe significantly lowers time-per-output-token and time-to-first-token while improving overall throughput compared to GPU-based approaches.
ShadowServe is a prefix-caching system tailored for LLM serving that targets two traditional bottlenecks: limited network bandwidth and compute interference on the GPU. It introduces an architecture based on functional disaggregation, with a host-based control plane and a SmartNIC-based data plane, together with a chunked processing pipeline and a minimal-copy memory management scheme. These mechanisms collectively enable interference-free, high-throughput key-value (KV) cache fetching for distributed prefix caching, a crucial capability for pipelined and long-context LLM serving.
1. Architectural Overview
At the core of ShadowServe is a strict separation between a host-resident control plane and a SmartNIC-resident data plane. The control plane, running on the host CPU, orchestrates request scheduling and tracks KV cache availability. When it detects that a prefix cache for a given request exists in remote storage, it intercepts the request, removes it from the GPU's immediate processing path, and forwards a fetch task asynchronously to the SmartNIC, so cache fetching overlaps with, rather than delays, subsequent GPU computation.
The data plane is fully offloaded to the SmartNIC, where it executes all data-path operations: fetching compressed KV caches from network storage, decompressing the cached tensors, dequantizing data, and using direct memory access (DMA) to transfer tensors into GPU memory space. Host–SmartNIC coordination employs lightweight metadata and completion notifications, implemented, for example, via PCIe channels with DOCA Comch. The split ensures that all performance-critical and data-heavy operations are completely isolated from the resources (CPU, GPU) responsible for model execution.
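The hand-off can be pictured with a short host-side sketch. The code below is an illustrative approximation, not ShadowServe's actual API: the PrefixIndex, SmartNicChannel, and GpuScheduler names and their methods are assumptions, and the DOCA Comch details are abstracted behind the channel object.

```python
# Hypothetical sketch of the host-side control plane hand-off; all class and
# method names are illustrative, not ShadowServe's actual interfaces.
from dataclasses import dataclass

@dataclass
class FetchDescriptor:
    request_id: str
    cache_key: str          # hash of the matched prompt prefix
    num_tokens: int         # length of the cached prefix
    gpu_buffer_offset: int  # where the DMA stage should place the KV tensors

class ControlPlane:
    def __init__(self, prefix_index, nic_channel, gpu_scheduler):
        self.prefix_index = prefix_index    # tracks which prefixes exist in remote storage
        self.nic_channel = nic_channel      # lightweight metadata/completion channel to the SmartNIC
        self.gpu_scheduler = gpu_scheduler  # normal model-execution queue

    async def admit(self, request):
        hit = self.prefix_index.lookup(request.prompt)
        if hit is None:
            # No cached prefix: schedule a regular prefill on the GPU.
            return await self.gpu_scheduler.enqueue(request)

        # Cache hit: hand the data-path work to the SmartNIC and keep the GPU free.
        desc = FetchDescriptor(request.id, hit.key, hit.num_tokens,
                               self.gpu_scheduler.reserve_kv_space(hit.num_tokens))
        await self.nic_channel.submit(desc)          # asynchronous; returns immediately
        await self.nic_channel.wait_complete(desc)   # completion notice: KV cache is in GPU memory
        # The GPU now only runs the remaining (non-cached) prefill and decoding.
        return await self.gpu_scheduler.enqueue(request, reuse_prefix=hit)
```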
2. SmartNIC Offload and Pipeline Design
ShadowServe specifically targets SmartNICs such as the NVIDIA BlueField‑3 Data Processing Unit (DPU), which integrate multiple Arm cores, hardware decompression accelerators, and a DMA engine. The offloading of decompression—traditionally performed on the GPU—to the SmartNIC’s dedicated hardware removes the interference between GPU model computation and decompression-related kernel launches.
The architecture employs a chunked pipelining strategy. Each KV cache is partitioned into fixed-size chunks (256 tokens per chunk), and each chunk advances in order through four pipeline stages executed entirely on the SmartNIC:
- Network Fetch: The SmartNIC receives compressed KV cache chunks via its internal TCP/IP stack.
- Lossless Decompression: Hardware accelerators decompress the received data.
- Dequantization: SmartNIC Arm cores execute vectorized instructions to restore cache chunks to their original tensor format.
- DMA Transfer: Data is transferred from the SmartNIC’s local memory to the target GPU memory through peer-to-peer DMA.
Because the stages are pipelined, different chunks can occupy distinct stages at the same time, keeping the SmartNIC's compute and accelerators busy and minimizing stage-level idle time.
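A minimal sketch of this overlap structure follows. The stage functions (fetch_chunk, hw_decompress, dequantize, dma_to_gpu) are placeholders standing in for the SmartNIC's network stack, hardware decompression engine, Arm-core dequantization, and peer-to-peer DMA; the thread-and-queue machinery is only an analogy for the on-DPU pipeline, not its real implementation.

```python
# Illustrative four-stage chunked pipeline; stage functions are placeholders.
import queue
import threading

CHUNK_TOKENS = 256  # fixed chunk size used for pipelining

def stage(in_q, out_q, fn):
    """Pull items from in_q, apply this stage's work, push results downstream."""
    while True:
        item = in_q.get()
        if item is None:                 # sentinel: shut down and propagate
            if out_q is not None:
                out_q.put(None)
            return
        result = fn(item)
        if out_q is not None:
            out_q.put(result)

def run_pipeline(chunk_ids, fetch_chunk, hw_decompress, dequantize, dma_to_gpu):
    # Bounded queues give simple backpressure between stages.
    q_ids, q_net, q_raw, q_full = (queue.Queue(maxsize=4) for _ in range(4))
    workers = [
        threading.Thread(target=stage, args=(q_ids, q_net, fetch_chunk)),    # network fetch
        threading.Thread(target=stage, args=(q_net, q_raw, hw_decompress)),  # lossless decompression
        threading.Thread(target=stage, args=(q_raw, q_full, dequantize)),    # dequantization
        threading.Thread(target=stage, args=(q_full, None, dma_to_gpu)),     # DMA to GPU memory
    ]
    for w in workers:
        w.start()
    for cid in chunk_ids:
        q_ids.put(cid)   # while chunk k is fetched, chunk k-1 can be decompressing, etc.
    q_ids.put(None)
    for w in workers:
        w.join()
```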
3. Technical Innovations
ShadowServe’s major technical contributions are the chunked pipeline and minimal-copy memory management scheme:
- Chunked Pipeline: Unlike sequential KV cache prefetching, chunking lets network fetch, decompression, dequantization, and DMA transfer run in parallel across different chunks. Buffer partitioning assigns each stage its own designated memory region, avoiding intra-stage data copying; the dequantization buffer occupies roughly half the space of the full-size buffer (since quantized data is smaller than the restored tensors), which improves processing efficiency.
- Minimal-Copy Memory Management: To avoid the expensive memory allocation, registration, and copying overheads endemic to platforms like BlueField‑3, the system pre-partitions and pins memory buffers for each pipeline stage on both the SmartNIC and the attached GPU. Data proceeds directly from one pipeline stage to the next without redundant copies, which limits memory fragmentation and overhead and sustains high throughput even when SmartNIC memory is scarce. Decompression, dequantization, and DMA buffers are laid out so data progresses directly between stages.
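A hedged sketch of this pre-partitioning idea is shown below. The BufferPool class, slot count, and per-token sizes are illustrative assumptions rather than ShadowServe's actual allocator; the half-size dequantization region simply mirrors the buffer-occupancy point made above.

```python
# Sketch of pre-partitioned, pre-registered per-stage buffers; sizes and names
# are illustrative assumptions. The key idea: allocate and register once at
# startup, then hand fixed regions between stages so no per-chunk allocation
# or copy occurs.
from dataclasses import dataclass

CHUNK_TOKENS = 256
BYTES_PER_TOKEN_FULL = 2 * 4096 * 2   # K+V, hidden size 4096, fp16 (illustrative only)

@dataclass
class StageBuffers:
    network: memoryview     # compressed chunk as received from remote storage
    dequant_in: memoryview  # decompressed but still quantized data (~half of full size)
    full: memoryview        # dequantized tensors, staged for peer-to-peer DMA to the GPU

class BufferPool:
    """Pre-partitions one pinned region per pipeline slot at startup."""
    def __init__(self, num_slots: int):
        full_sz = CHUNK_TOKENS * BYTES_PER_TOKEN_FULL
        dequant_sz = full_sz // 2      # quantized representation is roughly half-size
        net_sz = dequant_sz            # compressed payload fits in the quantized size (assumption)
        self._slots = []
        for _ in range(num_slots):
            backing = bytearray(net_sz + dequant_sz + full_sz)  # one registration per slot
            view = memoryview(backing)
            self._slots.append(StageBuffers(
                network=view[:net_sz],
                dequant_in=view[net_sz:net_sz + dequant_sz],
                full=view[net_sz + dequant_sz:],
            ))

    def slot(self, chunk_index: int) -> StageBuffers:
        # Chunks reuse slots round-robin; each stage writes only into its own region,
        # so data moves forward without intermediate copies or re-registration.
        return self._slots[chunk_index % len(self._slots)]
```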
4. Performance Analysis
Empirical evaluation demonstrates that ShadowServe achieves substantial gains in LLM inference performance over prior systems (e.g., CacheGen-Async) that use the GPU for data-plane decompression:
- Loaded Time-Per-Output-Token (TPOT): Under heavy load, ShadowServe achieves up to 2.2× lower TPOT than GPU-decompression approaches. Because the GPU is not interrupted by decompression kernels, the latency between consecutive output tokens stays low.
- Time-To-First-Token (TTFT): In bandwidth-constrained conditions (≤20 Gbps), ShadowServe reduces TTFT by up to 1.38×, significantly decreasing the end-user’s perceived latency from request submission to the availability of the first output token.
- Throughput: These latency savings yield an overall throughput improvement of up to 1.35×, as idle time in model execution is curtailed and decompression no longer competes for GPU resources.
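For context, the two latency metrics compose into end-to-end request latency in the standard way; the relation below is a generic serving identity, not a formula from the ShadowServe paper, and it shows why lowering both TTFT and TPOT directly shortens each request and frees capacity for higher throughput.

```latex
% End-to-end latency of a request that generates N output tokens:
% TTFT covers prefill plus the first token; each later token adds one TPOT.
T_{\text{request}} \approx \text{TTFT} + (N - 1)\cdot \text{TPOT}
```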
Graphical evaluations (e.g., latency curves, TPOT/TTFT heatmaps across bandwidths) in the published analysis corroborate these improvements.
5. Comparison to Prior Approaches
ShadowServe is benchmarked primarily against advanced GPU-based KV cache fetching systems such as CacheGen-Async. In such legacy designs, decompression kernels interfere with model computation, resulting in increased TPOT and TTFT, particularly under constrained bandwidth or extended prompts.
ShadowServe's exclusive SmartNIC offload—employing fully parallelized chunked pipelines and bypassing host and GPU involvement until decompressed data is ready for decoding—consistently yields lower observed latencies and higher throughput across tested workloads. The systematic isolation of decompression/dequantization from GPU-backed computation introduces a practical, scalable design that is beneficial for multi-user LLM deployments, especially as context length and concurrency demands increase.
6. Significance and Implications
The engineering design of ShadowServe reflects key trends in distributed deep-learning infrastructure: functional disaggregation, accelerator-aware data path design, and memory-efficient pipelining. By shifting non-model-related workloads—KV cache decompression, dequantization, and movement—off critical GPU resources, distributed prefix caching becomes not only more performant but also more predictable, with well-defined separation of concerns.
This suggests a broader architectural paradigm for scalable LLM serving: offload as much “data-plane” logic as possible to SmartNICs equipped with domain-specific acceleration, while confining GPUs strictly to the decoding and model execution critical path. A plausible implication is that as SmartNIC/DPU capabilities grow, further workload splitting may become tractable, altering both the scaling properties and throughput ceilings of future LLM serving platforms.