
DistServe: Optimized Disaggregated LLM Serving

Updated 19 January 2026
  • DistServe is an architectural framework that disaggregates LLM serving by decoupling prompt encoding (prefill) and autoregressive generation (decoding) across dedicated GPUs.
  • It uses simulation-driven planning of per-phase parallelism and GPU allocation to maximize the sustainable request rate under strict TTFT and TPOT latency objectives.
  • Evaluations show DistServe serves up to 4.48× higher request rates, or meets up to 10.2× more stringent SLOs, than conventional colocation-based systems.

DistServe is an architectural framework for LLM serving that explicitly disaggregates the prefill (prompt encoding) and decoding (autoregressive generation) stages across distinct GPU instances, enabling tightly optimized resource allocation, parallelism configuration, and latency SLO (service level objective) compliance. DistServe departs from conventional LLM serving systems that colocate prefill and decoding on the same resources and batch both phases together, addressing major interference and coupling limitations inherent in monolithic serving architectures (Zhong et al., 2024).

1. Architectural Rationale: Disaggregation of Prefill and Decoding

Modern auto-regressive LLM inference consists of a compute-heavy prefill phase, which encodes the user's entire prompt in one Transformer pass to emit the initial output token, followed by an iterative decoding phase that generates each subsequent token autoregressively with an evolving KV-cache. Conventional architectures batch both phases on shared GPUs, aiming for maximum aggregate throughput; however, this policy introduces two fundamental drawbacks:

  • Prefill–decoding interference: Large prompt batches (prefill) delay decoding requests, inflating time-per-output-token (TPOT), while queued decoding slows first-token emission (TTFT).
  • Resource coupling: Shared allocation binds parallelism and GPU usage, though prefill and decoding exhibit disparate compute, memory, and SLO requirements.
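The two phases described above can be illustrated with a toy generation loop (a purely numeric stand-in, not real model code): prefill consumes the whole prompt in one compute-heavy pass, while decode emits one token per pass against a growing KV-cache.

```python
# Toy illustration of prefill vs. decode (not DistServe code).
def toy_forward(tokens, kv_cache):
    # Stand-in for a Transformer pass: "attends" over cache + new tokens.
    kv_cache.extend(tokens)            # append new keys/values
    return sum(kv_cache) % 50257       # fake next-token id (vocab-size modulus)

def generate(prompt_ids, max_new_tokens):
    kv_cache = []
    # Prefill: one pass over the entire prompt -> first output token.
    next_id = toy_forward(prompt_ids, kv_cache)
    out = [next_id]
    # Decode: iterative, memory-bound passes, one token each.
    for _ in range(max_new_tokens - 1):
        next_id = toy_forward([next_id], kv_cache)
        out.append(next_id)
    return out
```

In a colocated system both loops share the same GPUs and batches; DistServe runs them on separate instances.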

DistServe segregates prefill and decoding onto dedicated GPU instances, eliminating phase interference. This permits phase-tailored selection of parallelism degrees (tensor/pipeline splits), and dynamic per-phase allocation to maximize per-GPU “goodput”—the sustainable request rate meeting both TTFT and TPOT SLOs (Zhong et al., 2024).

2. Latency Objectives and Goodput Formulation

LLM application responsiveness is measured by two phase-specific latency metrics:

  • Time to First Token (TTFT): Wall-clock interval from request arrival to first output token emission.
  • Time Per Output Token (TPOT): Mean wall-clock latency per subsequent token.
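Given per-request timestamps, both metrics (and the joint SLO attainment used below) follow directly; a minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    arrival: float        # request arrival time (s)
    token_times: list     # emission time of each output token (s)

def ttft(r: RequestTrace) -> float:
    # Time to First Token: arrival -> first emitted token.
    return r.token_times[0] - r.arrival

def tpot(r: RequestTrace) -> float:
    # Mean latency per token after the first.
    gaps = [b - a for a, b in zip(r.token_times, r.token_times[1:])]
    return sum(gaps) / len(gaps)

def slo_attainment(traces, tau1, tau2) -> float:
    # Fraction of requests meeting BOTH the TTFT and TPOT SLOs.
    ok = sum(1 for r in traces if ttft(r) <= tau1 and tpot(r) <= tau2)
    return ok / len(traces)
```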

Given target SLOs τ₁ (TTFT) and τ₂ (TPOT), and an attainment goal α (the fraction of requests that must be compliant), DistServe maximizes the sustainable per-GPU request rate R_req such that P(TTFT ≤ τ₁ ∧ TPOT ≤ τ₂) ≥ α, formulated as:

G(K_p, c_p, K_d, c_d) = min{ g_p(K_p, c_p), g_d(K_d, c_d) } / (K_p + K_d)

where K_p, K_d are the GPU counts and c_p, c_d the parallelism configurations assigned to the prefill and decoding phases, and g_p, g_d are the corresponding per-phase goodputs.

Subject to:

  • K_p + K_d ≤ N_max (total GPU budget)
  • Memory constraints per configuration
  • g_p and g_d determined via phase-specific discrete-event simulation under actual workload and SLO conditions

This optimization is solved by enumerating feasible parallelism, simulating phase throughputs, and selecting the allocation that maximizes goodput (Zhong et al., 2024).
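The optimization described above can be sketched as brute-force enumeration over GPU splits and parallelism configurations; `simulate_goodput` below is a placeholder for the paper's discrete-event simulator, not its actual interface:

```python
import itertools

def plan(n_max, parallel_configs, simulate_goodput):
    """Pick (K_p, c_p, K_d, c_d) maximizing min-phase goodput per GPU.

    simulate_goodput(phase, k, c) -> max request rate (req/s) the phase
    sustains within its SLO; a stand-in for the paper's simulator.
    """
    best, best_score = None, -1.0
    for k_p, k_d in itertools.product(range(1, n_max), repeat=2):
        if k_p + k_d > n_max:
            continue  # exceeds GPU budget
        for c_p, c_d in itertools.product(parallel_configs, repeat=2):
            g_p = simulate_goodput("prefill", k_p, c_p)
            g_d = simulate_goodput("decode", k_d, c_d)
            score = min(g_p, g_d) / (k_p + k_d)   # goodput per GPU
            if score > best_score:
                best_score, best = score, (k_p, c_p, k_d, c_d)
    return best, best_score
```

The min reflects that the slower phase bottlenecks end-to-end request completion.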

3. Placement, Parallelism Strategy, and Bandwidth Considerations

DistServe employs a two-phase placement and parallelism search:

  • Phase 1: Enumerate all valid combinations of parallelism (tensor/pipeline degrees, GPU count) within memory limits for prefill and decoding, simulating per-batch latency using M/D/1 queue models.
  • Phase 2: For clusters with high cross-node bandwidth (e.g., InfiniBand at >200 Gbps), arbitrary placement suffices. For bandwidth-constrained clusters, corresponding pipeline stages of prefill and decoding are co-located on the same node so that KV-cache traffic traverses intra-node NVLink (~600 GB/s), minimizing communication overhead.

The algorithm executes in O(N·M²) time, enabling rapid replanning on commodity CPUs (Zhong et al., 2024). KV-cache transfer is shown to be negligible (<0.1% of end-to-end latency; under 30 ms for 95% of requests) when using NVLink-aware placement.
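For reference, the M/D/1 model used in Phase 1 has a closed-form mean delay (the Pollaczek–Khinchine result specialized to deterministic service); a sketch of how per-phase latency might be estimated from it:

```python
def md1_mean_latency(arrival_rate, service_time):
    """Mean sojourn time in an M/D/1 queue.

    Poisson arrivals at `arrival_rate` (req/s); deterministic service of
    `service_time` seconds per request (e.g., one simulated prefill batch).
    """
    rho = arrival_rate * service_time            # utilization
    if rho >= 1.0:
        return float("inf")                      # unstable: queue grows unboundedly
    wait = rho * service_time / (2 * (1 - rho))  # M/D/1 mean queueing delay
    return wait + service_time                   # queueing + service
```

Feeding such estimates into the goodput search lets DistServe reject configurations whose queueing delay would violate the TTFT SLO.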

4. Performance Evaluation and Comparative Metrics

DistServe was evaluated on a cluster of 32 A100-80GB GPUs (4 nodes × 8 GPUs each), across diverse workloads:

| Workload | TTFT SLO (τ₁) | TPOT SLO (τ₂) | SLO Attainment | Per-GPU Request Rate |
|---|---|---|---|---|
| Chatbot (13B–175B) | 0.2–0.4 s | 0.1–0.2 s | >90% | Up to 4.48× vLLM |
| Code completion (66B) | 0.125 s | 0.2 s | >90% | — |
| Long-text summarization | 15 s | 0.15 s | >90% | — |

Notable quantitative improvements over monolithic systems (vLLM):

  • Up to 4.48× higher per-GPU sustainable request rate under identical SLO constraints
  • Meets up to 10.2× more stringent SLOs (lower τ₁, τ₂) at 90% attainment
  • KV-cache transfer accounts for <0.1% of total latency and completes in <30 ms for 95% of requests
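A back-of-the-envelope estimate makes the low transfer overhead plausible; the model dimensions below are illustrative assumptions (roughly 13B-scale), not figures from the paper:

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    # 2x for keys and values; fp16 (2 bytes) by default.
    return 2 * n_layers * n_heads * head_dim * seq_len * dtype_bytes

def transfer_ms(n_bytes, bandwidth_gbs=600):
    # Bandwidth in GB/s (NVLink-class); result in milliseconds.
    return n_bytes / (bandwidth_gbs * 1e9) * 1e3

# Illustrative 13B-scale config: 40 layers, 40 heads of dim 128, 1K prompt.
size = kv_cache_bytes(seq_len=1024, n_layers=40, n_heads=40, head_dim=128)
ms = transfer_ms(size)   # well under a typical 0.2 s TTFT budget
```

Under these assumptions a ~0.8 GB cache moves in roughly 1–2 ms over NVLink, consistent with the <30 ms tail reported above.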

5. Limitations and Subsequent Systems

Disaggregation in DistServe introduces KV-cache communication overhead, making sufficient intra- and inter-node bandwidth a prerequisite. The system uses first-come, first-served (FCFS) scheduling; preemptive or priority-aware algorithms could further reduce tail latency.

Limitations:

  • Lack of built-in fault tolerance (e.g., a decoding-instance failure can block multiple in-flight prefill phases)
  • Assumes a stable workload; simulation fidelity degrades under highly variable inputs

Potential future directions include preemptive scheduling, replica-level fault recovery, adaptation to heterogeneous GPU clusters, and online auto-replanning to accommodate workload shifts (Zhong et al., 2024).

DistServe’s architectural principles influenced subsequent systems:

  • P/D-Serve (Jin et al., 2024) optimized grouping by scenario, dynamic P/D ratio adjustment, and high-throughput block-free KVCache transfer over RDMA, achieving 6.7× throughput versus aggregated serving and robust large-scale production deployment.
  • TokenScale (Lai et al., 3 Dec 2025) addressed burst backpressure by introducing proactive token-level autoscaling (Token Velocity) and convertible decoders for rapid elasticity, improving SLO attainment and cost relative to DistServe, BlitzScale, and AIBrix.

6. Load Balancing, Distributed Dispatching, and Theoretical Foundations

DistServe’s design reflects broader principles of distributed load balancing under partial information, as studied in the parallel server model (Goren et al., 2020). Naive schemes can degrade under multi-dispatcher settings due to herding and incast effects. The Tidal Water Filling (TWF) approach coordinates dispatch probabilities based on collective water-level minimization, yielding lower mean response time and thinner latency tails than classical JSQ(d), JIQ, and LSQ policies.

A DistServe-style load balancer that aggregates partial queue-length information and employs stochastic water-filling approaches centralized optimality while retaining strong stability and throughput guarantees, demonstrating that informed distributed algorithms can outperform both information-poor and naive policies in large-scale LLM serving (Goren et al., 2020).
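For contrast with water-filling, the classical JSQ(d) ("power of d choices") baseline cited above fits in a few lines; queue lengths here are arbitrary in-flight job counts:

```python
import random

def jsq_d(queue_lengths, d=2, rng=random):
    """Join-the-Shortest-of-d-Queues: sample d servers uniformly at
    random and dispatch to the least loaded one; returns its index."""
    sampled = rng.sample(range(len(queue_lengths)), d)
    return min(sampled, key=lambda i: queue_lengths[i])
```

JSQ(d) needs only d queue probes per dispatch, but with multiple independent dispatchers it suffers the herding effects that water-filling coordination avoids.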

7. Significance and Implications

DistServe establishes the viability and necessity of phase-disaggregated LLM serving to decouple latency objectives, eliminate phase interference, and rigorously optimize resource utilization. Its simulator-driven placement and explicit goodput maximization enable substantial improvements in strict SLO environments. The framework’s influence is evident in subsequent large-scale production deployments, as well as in theoretical and practical advancements in load balancing and autoscaling for disaggregated AI infrastructure.
