ShadowServe: LLM Caching System

Updated 28 September 2025
  • ShadowServe is a prefix-caching system that offloads KV cache processing to SmartNICs, reducing GPU interference and enabling high-throughput LLM serving.
  • Its architecture separates functions between a host-based control plane and a SmartNIC-based data plane, employing a chunked processing pipeline for parallel operations.
  • Empirical analysis shows that ShadowServe significantly lowers time-per-output-token and time-to-first-token while improving overall throughput compared to GPU-based approaches.

ShadowServe is a prefix-caching system tailored for LLM serving that targets two long-standing bottlenecks of distributed prefix caching: limited network bandwidth and compute interference on the GPU. It introduces an architecture based on functional disaggregation into a host-based control plane and a SmartNIC-based data plane, together with a chunked processing pipeline and a minimal-copy memory management scheme. These techniques collectively enable interference-free, high-throughput key-value (KV) cache fetching for distributed prefix caching, a crucial capability for pipelined and long-context LLM serving.

1. Architectural Overview

At the core of ShadowServe is a strict separation between a host-resident control plane and a SmartNIC-resident data plane. The control plane, running on the host CPU, orchestrates request scheduling and tracks KV cache availability. When it detects that a prefix cache for a given request exists in remote storage, it intercepts the request, removes it from the GPU’s immediate processing path, and forwards it asynchronously to the SmartNIC, so the KV cache fetch overlaps with other work instead of blocking subsequent computation.
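As a rough illustration of this decision path, the Python sketch below shows a control plane checking a prefix index and handing a hit off to the SmartNIC. The prefix index, GPU scheduler, and SmartNIC channel are hypothetical stand-ins, not interfaces described in the paper.

```python
# Minimal sketch of the host-side control-plane decision path (illustrative
# only; the prefix index, scheduler, and SmartNIC channel are hypothetical).
import hashlib
from dataclasses import dataclass

@dataclass
class FetchDescriptor:
    request_id: str
    prefix_hash: str
    num_cached_tokens: int
    gpu_block_ids: list          # destination KV blocks reserved on the GPU

class ControlPlane:
    def __init__(self, prefix_index, smartnic_channel, gpu_scheduler):
        self.prefix_index = prefix_index          # maps prefix hash -> remote cache entry
        self.smartnic_channel = smartnic_channel  # async queue toward the SmartNIC data plane
        self.gpu_scheduler = gpu_scheduler        # regular GPU prefill/decode scheduler

    async def admit(self, request_id, prompt_tokens):
        prefix_hash = hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()
        entry = self.prefix_index.get(prefix_hash)
        if entry is None:
            # No remote prefix hit: schedule a regular GPU prefill.
            self.gpu_scheduler.enqueue_prefill(request_id, prompt_tokens)
            return
        # Remote hit: take the request off the GPU's immediate path, reserve
        # destination KV blocks, and hand the fetch to the SmartNIC asynchronously.
        blocks = self.gpu_scheduler.reserve_kv_blocks(entry.num_tokens)
        await self.smartnic_channel.put(FetchDescriptor(
            request_id, prefix_hash, entry.num_tokens, blocks))
```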

The data plane is fully offloaded to the SmartNIC, where it executes all data-path operations: fetching compressed KV caches from network storage, decompressing the cached tensors, dequantizing data, and using direct memory access (DMA) to transfer tensors into GPU memory space. Host–SmartNIC coordination employs lightweight metadata and completion notifications, implemented, for example, via PCIe channels with DOCA Comch. The split ensures that all performance-critical and data-heavy operations are completely isolated from the resources (CPU, GPU) responsible for model execution.
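The completion side of this coordination can be pictured as in the following sketch; the poll-based channel and scheduler methods are assumed placeholders for whatever PCIe message transport (such as DOCA Comch) and host bookkeeping the real system uses.

```python
# Hedged sketch of host-side completion handling. Field names and the
# poll-based channel are illustrative assumptions, not the paper's wire format.
from dataclasses import dataclass, field

@dataclass
class Completion:
    request_id: str
    gpu_block_ids: list = field(default_factory=list)  # KV blocks now populated in GPU memory
    status: str = "ok"                                  # "ok" or an error code

def drain_completions(channel, gpu_scheduler):
    """Poll completion notifications from the SmartNIC and release the
    corresponding requests into the GPU decode queue."""
    while (msg := channel.poll()) is not None:
        done = Completion(**msg)
        if done.status == "ok":
            gpu_scheduler.mark_kv_ready(done.request_id, done.gpu_block_ids)
        else:
            # If the fetch failed, fall back to recomputing the prefix on the GPU.
            gpu_scheduler.enqueue_prefill_fallback(done.request_id)
```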

2. SmartNIC Offload and Pipeline Design

ShadowServe specifically targets SmartNICs such as the NVIDIA BlueField‑3 Data Processing Unit (DPU), which integrate multiple Arm cores, hardware decompression accelerators, and a DMA engine. The offloading of decompression—traditionally performed on the GPU—to the SmartNIC’s dedicated hardware removes the interference between GPU model computation and decompression-related kernel launches.

The architecture employs a chunked pipelining strategy. Each KV cache is partitioned into fixed-size chunks of 256 tokens, and each chunk advances in order through four pipeline stages executed entirely on the SmartNIC:

  1. Network Fetch: The SmartNIC receives compressed KV cache chunks via its internal TCP/IP stack.
  2. Lossless Decompression: Hardware accelerators decompress the received data.
  3. Dequantization: SmartNIC Arm cores execute vectorized instructions to restore cache chunks to their original tensor format (an illustrative kernel is sketched after this list).
  4. DMA Transfer: Data is transferred from the SmartNIC’s local memory to the target GPU memory through peer-to-peer DMA.
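Conceptually, the dequantization step is a simple affine transform. The int8-plus-per-channel-scale scheme in this NumPy sketch is an assumption for illustration, since the exact quantization format is not detailed here, and NumPy stands in for the vectorized Arm code on the SmartNIC.

```python
# Illustrative dequantization kernel; the int8 + per-channel scale/zero-point
# scheme is an assumption, not the paper's exact format.
import numpy as np

def dequantize_chunk(quantized, scale, zero_point):
    """quantized: (tokens, dim) int8; scale, zero_point: (dim,) float16.
    Returns the restored (tokens, dim) float16 KV tensor chunk."""
    return (quantized.astype(np.float16) - zero_point) * scale
```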

This can be formalized as:

$$\text{KV Cache} \rightarrow \text{Fetch} \rightarrow \text{Decomp.} \rightarrow \text{Dequant.} \rightarrow \text{DMA Transfer} \rightarrow \text{GPU Memory}$$

The pipeline’s parallelization ensures that different chunks can simultaneously reside in distinct stages, allowing for maximal utilization of SmartNIC compute and minimizing stage-level idle times.
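This overlap can be mimicked with a toy Python model, shown below, in which each stage is a worker thread connected by bounded queues so that several chunks are in flight at once. The four stage callables are placeholders, not the paper's DOCA-based implementation.

```python
# Toy model of the four-stage chunked pipeline. Bounded queues provide
# back-pressure, loosely mirroring fixed per-stage buffers on the SmartNIC.
import queue
import threading

CHUNK_TOKENS = 256  # chunk granularity used by ShadowServe

def _stage(work_fn, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:           # sentinel: propagate shutdown and exit
            if outbox is not None:
                outbox.put(None)
            break
        result = work_fn(item)
        if outbox is not None:
            outbox.put(result)

def run_pipeline(chunk_descriptors, fetch, decompress, dequantize, dma_to_gpu):
    """fetch/decompress/dequantize/dma_to_gpu are placeholder callables,
    one per pipeline stage, applied to every 256-token chunk."""
    q_in, q_dec, q_deq, q_dma = (queue.Queue(maxsize=2) for _ in range(4))
    workers = [
        threading.Thread(target=_stage, args=(fetch, q_in, q_dec)),
        threading.Thread(target=_stage, args=(decompress, q_dec, q_deq)),
        threading.Thread(target=_stage, args=(dequantize, q_deq, q_dma)),
        threading.Thread(target=_stage, args=(dma_to_gpu, q_dma, None)),
    ]
    for w in workers:
        w.start()
    for desc in chunk_descriptors:   # different chunks occupy different stages concurrently
        q_in.put(desc)
    q_in.put(None)                   # start the shutdown cascade
    for w in workers:
        w.join()
```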

3. Technical Innovations

ShadowServe’s major technical contributions are the chunked pipeline and minimal-copy memory management scheme:

  • Chunked Pipeline: Unlike sequential KV cache prefetching, this methodology ensures that network fetch, decompression, dequantization, and DMA transfer execute in parallel across segmented chunks. Formally, for each chunk of size

$$\text{Size}_{\text{DMA}} = N_{\text{tokens}} \times \text{Token Dimensionality} \times \text{Bytes per Element},$$

buffer partitioning is designed so that each operation accesses its own designated memory region, avoiding extra data copying. The dequantization buffer occupies only half of the full-size buffer, since quantized data is smaller than its dequantized form, which further improves memory efficiency.

  • Minimal-Copy Memory Management: To avoid the memory allocation, registration, and copying overheads endemic to platforms like BlueField‑3, the system pre-partitions and pins memory buffers for each pipeline stage on both the SmartNIC and the attached GPU. Data proceeds directly from one pipeline stage to the next without redundant copies, which minimizes memory fragmentation and overhead and sustains high throughput even when SmartNIC memory is constrained. Decompression, dequantization, and DMA buffers are laid out to support this seamless inter-stage progression; a simplified sizing and layout sketch follows below.
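The sketch below works through the per-chunk size formula and a pre-partitioned buffer arena. The model dimensions and the int8-to-fp16 ratio are illustrative assumptions rather than values taken from the paper.

```python
# Worked example of Size_DMA and a pre-partitioned, minimal-copy buffer layout.
# Model dimensions and dtypes below are assumed for illustration only.
CHUNK_TOKENS = 256          # tokens per chunk
TOKEN_DIM = 2 * 32 * 128    # per-token KV width: 2 (K and V) x 32 heads x head_dim 128 (assumed)
FULL_BYTES = 2              # fp16 element after dequantization
QUANT_BYTES = 1             # int8 element before dequantization (assumed scheme)

# Size_DMA = N_tokens x token dimensionality x bytes per element
size_dma = CHUNK_TOKENS * TOKEN_DIM * FULL_BYTES           # full-precision chunk: 4.0 MiB here
size_dequant_in = CHUNK_TOKENS * TOKEN_DIM * QUANT_BYTES   # quantized chunk: half of size_dma

print(f"DMA buffer per chunk:           {size_dma / 2**20:.1f} MiB")
print(f"Dequantization input per chunk: {size_dequant_in / 2**20:.1f} MiB")

# Pre-partition one arena into fixed per-stage regions so each stage reads and
# writes only its own slice and no intermediate copies are needed.
# (bytearray + memoryview stand in for pinned SmartNIC/GPU-visible memory.)
arena = memoryview(bytearray(size_dequant_in + size_dma))
dequant_in = arena[:size_dequant_in]   # decompressor writes quantized data here
dma_out = arena[size_dequant_in:]      # dequantizer writes fp16 here; the DMA engine reads it
```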

4. Performance Analysis

Empirical evaluation demonstrates that ShadowServe achieves substantial gains in LLM inference performance over prior approaches (e.g., CacheGen-Async) that use the GPU for data-plane decompression:

  • Loaded Time-Per-Output-Token (TPOT): Under heavy load, ShadowServe achieves up to 2.2× lower TPOT than GPU-decompression approaches. Because the GPU is not interrupted by decompression kernels, the latency between consecutive output tokens is minimized.
  • Time-To-First-Token (TTFT): In bandwidth-constrained conditions (≤20 Gbps), ShadowServe achieves up to 1.38× lower TTFT, significantly decreasing the end-user’s perceived latency from request submission to the first output token.
  • Throughput: These latency savings yield an overall throughput improvement of up to 1.35×, as idle time in model execution is curtailed and decompression no longer competes for GPU resources.

Graphical evaluations (e.g., latency curves, TPOT/TTFT heatmaps across bandwidths) in the published analysis corroborate these improvements.

5. Comparison to Prior Approaches

ShadowServe is benchmarked primarily against advanced GPU-based KV cache fetching systems such as CacheGen-Async. In such designs, decompression kernels interfere with model computation, resulting in increased TPOT and TTFT, particularly under constrained bandwidth or with long prompts.

ShadowServe's exclusive SmartNIC offload—employing fully parallelized chunked pipelines and bypassing host and GPU involvement until decompressed data is ready for decoding—consistently yields lower observed latencies and higher throughput across tested workloads. The systematic isolation of decompression/dequantization from GPU-backed computation introduces a practical, scalable design that is beneficial for multi-user LLM deployments, especially as context length and concurrency demands increase.

6. Significance and Implications

The engineering design of ShadowServe reflects key trends in distributed deep-learning infrastructure: functional disaggregation, accelerator-aware data path design, and memory-efficient pipelining. By shifting non-model-related workloads—KV cache decompression, dequantization, and movement—off critical GPU resources, distributed prefix caching becomes not only more performant but also more predictable, with well-defined separation of concerns.

This suggests a broader architectural paradigm for scalable LLM serving: offload as much “data-plane” logic as possible to SmartNICs equipped with domain-specific acceleration, while confining GPUs strictly to the decoding and model execution critical path. A plausible implication is that as SmartNIC/DPU capabilities grow, further workload splitting may become tractable, altering both the scaling properties and throughput ceilings of future LLM serving platforms.
