KVDirect: Distributed LLM Inference

Updated 8 November 2025
  • KVDirect is a distributed inference framework for large language models that uses a tensor-centric KV cache transfer protocol to enable scalable, disaggregated serving.
  • It introduces pull-based cache transfer and dynamic GPU scheduling to reduce latency by up to 55% and maximize hardware utilization.
  • KVDirect transforms LLM inference by decoupling prefill and decode stages, providing elastic resource allocation and improved performance in multi-node GPU clusters.

KVDirect is a distributed inference framework for LLMs that enables scalable, disaggregated serving by optimizing key-value (KV) cache transfer between prefill and decode workers. By introducing a tensor-centric communication protocol, a custom GPU resource scheduling library, and pull-based KV cache transfer, KVDirect allows the prefill and decode stages of LLM inference to be elastically scheduled across multiple nodes, reducing resource contention, maximizing hardware utilization, and minimizing end-to-end latency. This design stands in contrast to previous systems that were restricted to single-node deployments or suffered from inefficient cache transfer and scheduling, and establishes a new baseline for scalable, high-performance LLM serving in distributed GPU clusters (Chen et al., 13 Dec 2024).

1. Disaggregated LLM Inference and System Motivation

Standard LLM inference pipelines consist of two phases: prefill, which processes the prompt to initialize the KV cache and produce the first token, and decode, which autoregressively generates subsequent tokens using the accumulated KV cache. Traditionally, both phases are colocated on a single worker, leading to static resource partitioning, poor elasticity, and suboptimal response times, especially under mixed or bursty workloads.

Disaggregated inference separates the execution of prefill and decode. Prefill and decode workers can be dynamically provisioned and independently scheduled, supporting pipeline parallelism and resource heterogeneity. However, previous disaggregated frameworks have been confined to single-node deployments due to the communication overhead of transferring KV caches. Inefficiencies in existing message-passing systems (NCCL, UCX, MPI) and the lack of fast, direct GPU-to-GPU data movement have limited scalability and led to idle compute resources during long memory-bound operations. KVDirect directly addresses these limitations by rearchitecting the computation–communication boundary and focusing on efficient distributed KV cache transfer.

2. Tensor-Centric KV Cache Transfer Protocol

At the core of KVDirect is a tensor-centric strategy for communicating KV caches between prefill and decode workers. Instead of treating the KV cache as a monolithic or serialized blob, KVDirect exchanges structured metadata, enabling direct transfer of (possibly non-contiguous) tensor blocks.

Key design features:

  • On connection (Connect()), prefill workers send the decode worker a metadata descriptor: base memory address, multi-dimensional shape, and stride information (e.g., for batch, layer, head, and token axes).
  • The decode worker computes offsets and sizes for each requested block using the formula

    $$\text{Offset} = \sum_i \text{index}_i \times \text{stride}_i$$

    with per-block sizes computed as

    $$\text{BlockSize} = L \times H \times D \times \text{datatype size}$$

  • Multiple adjacent blocks are merged into coalesced Remote Direct Memory Access (RDMA) transactions when contiguous in physical memory, minimizing transaction counts and maximizing PCIe/network bandwidth.
  • The protocol transmits all layout metadata in a single upfront message, avoiding repeated synchronizations for each block or layer.
  • Unlike standard message-passing, where only ~13% of time is spent transferring data (with much wasted on synchronization and memory operations), KVDirect maximizes actual bandwidth use, achieving up to 22 GB/s with a 400 Gbps NIC.

This tensor-centric logic is essential for high-throughput serving, as the KV cache is typically fragmented into hundreds of tensor blocks, and operations must not block on each block's transfer or launch spurious GPU kernels.
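
To make the offset arithmetic and coalescing concrete, the following is a minimal Python sketch of how a decode worker might turn the metadata descriptor (base address, strides, block size) into coalesced RDMA read requests. The names (`KVLayout`, `plan_reads`) and the flat byte-addressed layout are illustrative assumptions, not the KVDirect API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class KVLayout:
    """Metadata descriptor announced by the prefill worker (illustrative)."""
    base_addr: int            # remote GPU base address
    strides: Tuple[int, ...]  # byte stride per axis (e.g., batch, layer, head, token)
    block_bytes: int          # L * H * D * dtype_size for one transferable block

def block_offset(index: Tuple[int, ...], layout: KVLayout) -> int:
    """Offset = sum_i index_i * stride_i, in bytes."""
    return sum(i * s for i, s in zip(index, layout.strides))

def plan_reads(indices: List[Tuple[int, ...]], layout: KVLayout) -> List[Tuple[int, int]]:
    """Map requested block indices to coalesced (remote_addr, length) RDMA reads.

    Blocks that turn out to be contiguous in remote memory are merged into a
    single transaction, which is the coalescing optimization described above.
    """
    reads: List[Tuple[int, int]] = []
    for off in sorted(block_offset(idx, layout) for idx in indices):
        addr = layout.base_addr + off
        if reads and reads[-1][0] + reads[-1][1] == addr:
            last_addr, last_len = reads[-1]
            reads[-1] = (last_addr, last_len + layout.block_bytes)  # merge adjacent block
        else:
            reads.append((addr, layout.block_bytes))
    return reads

# Example: four token blocks from one sequence, layer 0, head 0.
layout = KVLayout(base_addr=0x7F0000000000,
                  strides=(1 << 30, 1 << 20, 1 << 14, 4096),
                  block_bytes=4096)
print(plan_reads([(0, 0, 0, t) for t in range(4)], layout))  # one coalesced 16 KiB read
```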

3. Custom Distributed Communication Library and Dynamic GPU Scheduling

KVDirect is built on a specialized communication library implementing direct GPU-to-GPU RDMA and supporting dynamic resource allocation, in contrast to static, graph-based approaches in libraries such as NCCL or UCX:

  • Connection management: Decoders maintain persistent connections to all active prefill workers, coordinated via an external scheduler. GPU-NIC affinity is enforced to maximize link utilization.
  • Elastic resource pool: Workers can be added or removed at runtime without reinitialization or communication graph rebuild. This dynamic elasticity enables fine-grained adaptation to workload changes and failure scenarios.
  • Transfer API: High-level routines expose Connect(), Transfer(), and Complete() operations for orchestrating cache handoff. Read transactions are asynchronous and pipelined; completions are ACKed for resource safety.
  • Multi-rail networking: Supports aggregating bandwidth across multiple network interfaces per node.

KVDirect's infrastructure integrates with commonly used LLM schedulers, such as vLLM, exposing tensor-centric APIs for pluggable communication in inference pipelines.
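
As a rough illustration of how the Connect(), Transfer(), and Complete() routines compose on the decode side, the sketch below shows one plausible orchestration. The `comm` object, method placement, signatures, and argument names are assumptions made for exposition; the actual library API may differ.

```python
def pull_kv_cache(comm, prefill_worker_id, layout, ready_block_ids):
    """Hypothetical decode-side handoff built from the three routines above."""
    # Reuse (or establish) the persistent RDMA connection to this prefill worker;
    # the external scheduler supplies the worker identity and GPU-NIC affinity.
    conn = comm.Connect(prefill_worker_id)

    # Issue asynchronous, pipelined read transactions for the announced blocks.
    # Destination buffers are allocated here, on the decode worker, only when needed.
    handles = [conn.Transfer(block_id, layout) for block_id in ready_block_ids]

    # Wait for completions and acknowledge them so the prefill worker can safely
    # release its copy of the transferred blocks.
    for handle in handles:
        conn.Complete(handle)
```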

4. Pull-Based KV Cache Transfer and Lifetime Optimization

A key architectural difference in KVDirect is its adoption of a pull-mode KV cache transfer strategy, in contrast to conventional push-based communication:

  • In push mode, the prefill worker computes and sends each tensor block to the decode worker as soon as it is available. Both workers must reserve memory for the entire cache until all transfers complete, increasing latency and peak memory requirements.
  • In pull mode (the KVDirect default), the prefill worker signals completion via block IDs, and the decode worker initiates a burst transfer for all needed blocks, allocating memory only when required. Memory is released promptly, shortening the time that critical resources are held in use (see the sketch after this list).
  • Empirically, pull mode reduces the average cache lifetime and GPU idling, enabling higher concurrency and alleviating queuing bottlenecks, especially for long prompts or high query-per-second (QPS) scenarios.
  • Pipelined communication offers only minimal additional gains on top of the pull-based schedule, since transmission time is already negligible under the optimized protocol.
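
The toy sketch below illustrates the pull-mode bookkeeping pattern: the prefill side only publishes block IDs, while the decode side defers buffer allocation until it issues its burst of reads. The queue-based signaling and the commented-out `rdma_read_burst`/`ack` calls are stand-ins for illustration, not the actual implementation.

```python
import queue

def prefill_worker(ready_ids: "queue.Queue[int]", n_blocks: int) -> None:
    """Pull mode: compute KV blocks, keep them in local GPU memory, publish IDs only."""
    for block_id in range(n_blocks):
        # ... run prefill attention for this chunk; the KV block stays resident locally ...
        ready_ids.put(block_id)      # signal availability; no tensor data is pushed
    ready_ids.put(None)              # sentinel: prefill has finished

def decode_worker(ready_ids: "queue.Queue[int]") -> None:
    """Pull mode: collect IDs, then allocate and pull everything in one burst."""
    pending = []
    while (block_id := ready_ids.get()) is not None:
        pending.append(block_id)     # bookkeeping only; no decode-side memory yet
    # Allocate destination buffers now and fetch all blocks as coalesced RDMA reads,
    # then acknowledge so the prefill worker can free its copies immediately.
    buffers = {bid: bytearray(4096) for bid in pending}   # stand-in for GPU allocation
    # rdma_read_burst(buffers); ack(pending)              # hypothetical transfer + ACK
```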

5. Quantitative Performance Analysis

KVDirect demonstrates substantial improvements relative to prior distributed inference systems:

  • Latency: Up to 55% reduction in per-request latency compared to the vLLM baseline under matched resource conditions. For extended prompts or high concurrency, vLLM latency increases by 12×, while KVDirect's latency remains essentially unchanged.
  • Cache transfer time: Reduced to as little as 1% of the total request latency, essentially eliminating cache transfer as a bottleneck for distributed scheduling.
  • Bandwidth: Achieves up to 22 GB/s on 400 Gbps hardware, near peak utilization, while UCX is capped at ~4 GB/s under similar conditions.
  • Coalescing optimization: Merges adjacent transfers, yielding up to 1.32× further speedup.
  • Scaling behavior: Increasing decode workers reduces decode tokens-per-second contention by up to 58%; more prefill workers decrease prefill delays by up to 4×\times, supporting fine-tuned worker ratios for varied workload mixes.
  • Pull vs. push: Pull mode offers a 25.5% average speedup over push on synthetic and real workloads.

6. Comparison with Prior Systems

The following table summarizes the landscape of KV cache transfer frameworks:

| System | Distributed? | GPU Scheduling | Comm Bandwidth Utilization | Push/Pull | KV Cache Transfer Optimized? | Latency Reduction |
|---|---|---|---|---|---|---|
| DistServe | No (single-node) | No | Moderate | Push | No | Low |
| Splitwise | No (single-node) | No | Low | Push | No | Moderate |
| vLLM | No | Partially | N/A | N/A | N/A | Baseline |
| Mooncake | Partial | No | Low (CPU bottleneck) | N/A | No | N/A |
| KVDirect | Yes | Yes | High (~22 GB/s) | Pull | Yes (tensor-centric) | Up to 55% |

Prior systems such as DistServe and Splitwise are confined to single-node instances and lack transfer optimization. Mooncake, Memserve, and SGLang prioritize CPU-centric KV handling and cannot leverage full GPU interconnect. KVDirect uniquely enables distributed elasticity, direct GPU scheduling, and high-efficiency transfer, representing the first practical solution for cross-node disaggregated LLM inference with these properties.

7. Implementation Considerations and Integration

KVDirect is open-source and adopts a modular design for integration with established inference schedulers and hardware. Critical engineering aspects include:

  • Support for multi-node and multi-GPU deployments with low-level memory registration and RDMA.
  • Robustness to heterogeneous cluster topologies, GPU–NIC affinity, and node-level elasticity.
  • Compatibility with dynamic workloads, providing stable time-to-first-token (TTFT) and time-between-tokens (TBT) under high concurrency and varying prompt lengths.
  • Transparency to the LLM forward pass; application code need only interact with the cache transfer and scheduling layers.
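
As one simplified way to express the GPU-NIC affinity mentioned above, a deployment script might pin each worker process to a single GPU and to the RDMA NIC on the same PCIe switch. The topology table and the `KVDIRECT_RDMA_DEVICE` variable name below are hypothetical; real deployments would derive the mapping from the node's actual topology (e.g., via `nvidia-smi topo -m`).

```python
# Hypothetical per-worker environment for an 8-GPU, 4-NIC node; the GPU-to-NIC
# pairing assumes two GPUs share each PCIe switch with one NIC.
GPU_TO_NIC = {0: "mlx5_0", 1: "mlx5_0", 2: "mlx5_1", 3: "mlx5_1",
              4: "mlx5_2", 5: "mlx5_2", 6: "mlx5_3", 7: "mlx5_3"}

def worker_env(gpu_id: int) -> dict:
    """Environment variables for launching one prefill or decode worker."""
    return {
        "CUDA_VISIBLE_DEVICES": str(gpu_id),          # pin the process to one GPU
        "KVDIRECT_RDMA_DEVICE": GPU_TO_NIC[gpu_id],   # hypothetical: names the affine NIC
    }

for gpu in range(8):
    print(gpu, worker_env(gpu))
```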

8. Implications, Limitations, and Future Directions

By resolving the KV cache transfer bottleneck, KVDirect enables new paradigms for distributed, disaggregated LLM inference. Immediate consequences include:

  • Practical scaling to large clusters, supporting higher QPS, longer contexts, and larger numbers of active prompts without memory deadlocks or throughput collapse.
  • Lower-latency online inference for conversational, code-completion, or retrieval-augmented LLM workloads.
  • Improved GPU utilization, reduced stranding of compute and memory resources, and elastic adaptation to workload bursts.

A plausible implication is that, as attention-based architectures continue to drive the need for large, persistent caches, systems-level innovations akin to KVDirect will become increasingly critical for operational efficiency in cloud-scale and multi-tenant serving environments. Further research may explore integration with advanced KV compression (quantization, PCA-based transform coding), automated scheduling of compute and communication, and extensions to multi-modal and visual transformer pipelines.

KVDirect thus defines a new class of distributed, tensor-centric inference systems, setting a reference standard for scalable, flexible, and efficient LLM serving in contemporary AI infrastructure.
