Prefill-Decode Disaggregation Architecture

Updated 29 December 2025
  • Prefill-Decode Disaggregation is an architectural paradigm that decouples LLM inference into a parallel, compute-bound prefill phase and a sequential, memory-bound decode phase.
  • The approach increases throughput, reduces latency, and improves scalability by enabling independent resource scaling and specialized hardware for each phase, as demonstrated by significant empirically measured gains.
  • Key techniques include KV-cache management, dynamic scheduling, and phase-aware hardware specialization, yielding performance improvements such as a more than 7.4× increase in requests per second (RPS) and 12× tighter 99th-percentile latency SLOs.

Prefill-Decode Disaggregation (PD Disaggregation) is an architectural paradigm for efficient inference serving of large, decoder-only Transformer-based LLMs that physically and logically separates the compute-bound “prefill” phase of prompt ingestion from the memory-bound, sequential “decode” phase of token generation. By decoupling these two heterogeneous stages onto independently scaled resource pools or services, PD Disaggregation eliminates detrimental resource contention, enables hardware specialization per phase, and offers substantial, quantifiable improvements in throughput, latency, scalability, and operational cost relative to traditional monolithic GPU inference deployments (Kumar et al., 16 Oct 2025).

1. Prefill and Decode: Computational Dichotomy

In GPT-style decoder-only models, inference for a single request is inherently two-phased:

  • Prefill Phase: The model ingests a prompt of length $N_p$, executing a fully parallelized forward pass through all Transformer layers to produce both the first output token and the key/value (KV) cache needed for attention during autoregressive generation. Each layer computes:
    • $Q = XW_Q$, $K = XW_K$, $V = XW_V$,
    • followed by multi-head attention and feed-forward modules.
    • This phase is FLOP-intensive, compute-bound, and highly parallel, dominated by large matrix-matrix operations and batch-friendly GEMMs.
    • The operational metric is Time to First Token: $T_{\text{first}} = T_{\text{prefill}}(N_p)$.
  • Decode Phase: The model autoregressively generates output tokens, one at a time, using the previously constructed KV cache and the model weights:
    • Each decode step retrieves the cached K/V, computes matrix-vector attention, applies a feed-forward block, and performs softmax sampling.
    • This phase is memory-bandwidth-bound, with low intra-token parallelism; it is characterized by high-frequency, small-batch, bandwidth-limited memory accesses.
    • The operational metric is Inter-token Latency: $L_{\text{inter}}$ (a minimal code sketch contrasting the two phases follows this list).
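
The following is a minimal, self-contained Python sketch of the two phases for a toy single-head attention layer. It is an illustration only (random weights; no sampling, output projection, or layer stacking), intended to show why prefill is a single large matrix-matrix pass while each decode step is a matrix-vector pass over a growing KV cache.

```python
# Minimal sketch, not a production kernel: a toy single-head attention layer
# contrasting the parallel prefill pass with the sequential decode loop.
# All weights, shapes, and the omitted sampling step are illustrative assumptions.
import numpy as np

d_model = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prefill(prompt_embeddings):
    """Compute-bound phase: one batched matrix-matrix pass over all N_p tokens,
    producing the attention output at the last position plus the KV cache."""
    X = prompt_embeddings                      # (N_p, d_model)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # large, batch-friendly GEMMs
    attn = softmax(Q @ K.T / np.sqrt(d_model)) @ V
    return attn[-1], (K, V)                    # KV cache handed to decode

def decode_step(token_embedding, kv_cache):
    """Memory-bound phase: one matrix-vector pass per generated token,
    reading the whole KV cache and appending one new entry."""
    K, V = kv_cache
    q = token_embedding @ W_q                  # (d_model,)
    k, v = token_embedding @ W_k, token_embedding @ W_v
    K, V = np.vstack([K, k]), np.vstack([V, v])
    attn = softmax(q @ K.T / np.sqrt(d_model)) @ V
    return attn, (K, V)

# Usage: prefill once over the prompt, then decode autoregressively.
first_out, cache = prefill(rng.standard_normal((16, d_model)))
out, cache = decode_step(first_out, cache)     # in a real model: sample + embed
```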

Resource contention arises in monolithic systems because the GPU must trade off peak FLOP utilization (best for prefill) against peak memory bandwidth (best for decode); the result is underutilization, reported as low as 0.2% of peak GPU utilization in overload scenarios, and high, unpredictable tail latencies (Kumar et al., 16 Oct 2025).

2. System Architecture and Service Decomposition

PD Disaggregation decomposes LLM inference serving into a modular, microservices-style architecture comprising three principal tiers:

| Component | Role/Optimization | Hardware/Interface |
| --- | --- | --- |
| Prefill Workers | Prompt ingestion, full parallel pass, KV-cache build | Compute-optimized GPUs (e.g., A100/H100); publish-KV API |
| Centralized KV Store | Stores/shards KV caches across layers/requests; enables paged/radix sharing | High-bandwidth interconnect (NVLink, RDMA); sharded, versioned block index |
| Decode Workers | Token-by-token generation, memory-intensive, speculative decoding | Memory-optimized GPUs (e.g., A10/T4); fetches from KV store |

A front-end load balancer or smart router mediates request routing: new prompts are directed to prefill workers, subsequent decode calls tagged by context ID are dispatched to decode workers. The centralized KV store isolates compute and memory traffic, supports prefix sharing to reduce memory usage, and provides asynchronous, zero-copy transfers (Kumar et al., 16 Oct 2025).

This enables independent provisioning and autoscaling of resources for each phase—prefill and decode clusters can be sized, run, and optimized based on their characteristic bottlenecks and workload patterns.
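
A minimal sketch of such a router is shown below, with hypothetical worker handles and a context-ID-keyed sticky placement for decode calls; the interfaces are assumptions for illustration, not a particular serving stack's API.

```python
# Minimal sketch (assumed interfaces, not a specific serving stack's API): a
# front-end router that sends new prompts to prefill workers and dispatches
# follow-up decode calls, keyed by context ID, to a sticky decode worker that
# already holds (or has prefetched) the request's KV blocks.
import itertools
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    context_id: str
    prompt: Optional[str] = None   # set only on the initial (prefill) call

class Router:
    def __init__(self, prefill_workers, decode_workers):
        self._rr_prefill = itertools.cycle(prefill_workers)
        self._rr_decode = itertools.cycle(decode_workers)
        self._placement = {}       # context_id -> decode worker (sticky routing)

    def route(self, req: Request):
        if req.prompt is not None:                   # new prompt: prefill tier
            return next(self._rr_prefill)
        return self._placement.setdefault(           # continuation: decode tier
            req.context_id, next(self._rr_decode))

# Usage with hypothetical worker handles:
router = Router(prefill_workers=["prefill-0", "prefill-1"],
                decode_workers=["decode-0", "decode-1", "decode-2"])
print(router.route(Request("ctx-42", prompt="Explain PD disaggregation")))
print(router.route(Request("ctx-42")))               # later decode call, sticky
```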

3. Performance Modeling and Resource Utilization

PD Disaggregation allows each phase’s resource allocation to be optimized according to its dominant constraint:

  • Prefill Latency: $T_{\text{prefill}}(N_p) \propto \frac{N_p \cdot d_{\text{model}}^2}{\text{TFLOPS}_{\text{GPU}}}$
  • Decode Token Latency: $L_{\text{inter}} \approx \frac{\text{Size}(\text{KV\_block})}{B_{\text{HBM}}} + \frac{d_{\text{model}}^2}{\text{TFLOPS}_{\text{decode}}}$

The total sequence latency for $N_o$ output tokens is:

$T_{\text{total}}(N_p, N_o) = T_{\text{prefill}}(N_p) + N_o \cdot L_{\text{inter}}$

Throughput can trade off against latency via batching and pipelining, while network overheads for KV transfer amortize over longer outputs: per-token transfer drops below $0.2$ ms for $N_o > 50$ with 400 GB/s NVLink interconnects and 100 MB/layer KV blocks (Kumar et al., 16 Oct 2025). Empirical “knee” points show diminishing returns on latency as decode clusters are scaled and network/cache-lookup bottlenecks come to dominate.
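
As a concrete illustration of these formulas, the sketch below plugs in assumed constants (effective TFLOPS, HBM bandwidth, KV-block size, layer count, and a rough per-layer FLOP multiplier); none of the numbers are measurements from the cited systems.

```python
# Worked example of the latency model above, using illustrative (assumed)
# hardware numbers and a rough FLOP multiplier; constants are not measurements.
N_p = 2048                 # prompt tokens
N_o = 256                  # generated tokens
d_model = 4096
n_layers = 32
tflops_prefill = 1000e12   # effective FLOP/s on the prefill worker
tflops_decode = 300e12     # effective FLOP/s available per decode step
hbm_bandwidth = 3.0e12     # bytes/s of HBM bandwidth on the decode worker
kv_block_bytes = 100e6     # KV bytes read per layer per decode step (illustrative)
c = 24                     # rough FLOP multiplier per token per layer (attention + MLP)

# T_prefill ~ N_p * d_model^2 * n_layers * c / TFLOPS   (compute-bound)
t_prefill = N_p * d_model**2 * n_layers * c / tflops_prefill

# L_inter ~ KV-read time + per-token compute time        (bandwidth-bound)
l_inter = (n_layers * kv_block_bytes / hbm_bandwidth
           + d_model**2 * n_layers * c / tflops_decode)

t_total = t_prefill + N_o * l_inter
print(f"T_prefill ~ {t_prefill*1e3:.1f} ms, "
      f"L_inter ~ {l_inter*1e3:.2f} ms, "
      f"T_total ~ {t_total:.2f} s")
```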

Resource-level optimizations unlocked by PD Disaggregation include:

  • Maximal tensor-core tiling and FlashAttention for prefill, keeping memory stalls below 5%.
  • KV paging and radix prefix sharing reduce the decode working set by roughly 50%, effectively doubling memory bandwidth per token (a minimal prefix-sharing sketch follows this list).
  • Speculative decoding can overlap draft-model and main-model computations.

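The following is a minimal sketch of block-granular prefix sharing in the spirit of paged/radix KV caching; the block size, ID scheme, and index layout are assumptions, not a specific system's implementation.

```python
# Minimal sketch (assumption: block-granular prefix sharing, in the spirit of
# paged/radix KV caching; not a specific system's implementation). Requests whose
# prompts share a token-block prefix reuse the stored KV block IDs instead of
# recomputing and re-storing them.
BLOCK = 16  # tokens per KV block (illustrative)

class PrefixKVIndex:
    def __init__(self):
        # Maps a tuple of prompt-token blocks -> list of KV block IDs.
        self._index: dict[tuple, list[int]] = {}
        self._next_id = 0

    def _blocks(self, tokens):
        return [tuple(tokens[i:i + BLOCK]) for i in range(0, len(tokens), BLOCK)]

    def lookup_or_allocate(self, tokens):
        """Return (shared_block_ids, new_block_ids): the longest cached block
        prefix is reused, and only the remaining blocks need prefill compute."""
        shared, new = [], []
        prefix = ()
        for blk in self._blocks(tokens):
            prefix = prefix + (blk,)
            if prefix in self._index:
                shared = list(self._index[prefix])
            else:
                self._next_id += 1
                new.append(self._next_id)
                self._index[prefix] = shared + new
        return shared, new

# Usage: the second request shares the first 32-token prefix, so only its
# tail blocks are prefilled and written to the KV store.
idx = PrefixKVIndex()
idx.lookup_or_allocate(list(range(48)))
shared, new = idx.lookup_or_allocate(list(range(32)) + [999] * 32)
print(len(shared), "blocks reused,", len(new), "blocks to prefill")
```
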
Empirical results include a more than 7.4× increase in requests per second (RPS), >90% TTFT SLO compliance, and 12× tighter 99th-percentile latency SLOs for disaggregated serving systems (DistServe, AIBrix, Dynamo) relative to monolithic deployments (Kumar et al., 16 Oct 2025).

4. Scheduling, Autoscaling, and Deployment

PD Disaggregation decouples service scheduling, enabling distinct autoscalers for each phase:

  • Prefill: Scale on FLOP saturation; e.g., increase the number of prefill workers $N_{\text{prefill}}$, since $T_{\text{prefill}} \propto 1/N_{\text{prefill}}$.
  • Decode: Scale on HBM bandwidth and KV-cache hit rates; batch continuous decode iterations to maximize utilization (a sketch of such phase-specific scaling signals follows this list).

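Below is a minimal policy sketch of these phase-specific scaling signals; the metrics interface, thresholds, and SLO values are assumptions for illustration, not a particular orchestrator's controller.

```python
# Minimal sketch of phase-specific autoscaling signals (assumed metrics API,
# thresholds, and SLOs; a policy illustration, not a specific orchestrator).
from dataclasses import dataclass

@dataclass
class PrefillMetrics:
    flop_utilization: float     # fraction of peak tensor-core FLOPs in use
    ttft_p99_ms: float          # 99th-percentile time-to-first-token

@dataclass
class DecodeMetrics:
    hbm_bw_utilization: float   # fraction of peak HBM bandwidth in use
    kv_hit_rate: float          # KV-store / local cache hit rate
    itl_p99_ms: float           # 99th-percentile inter-token latency

def scale_prefill(n_workers: int, m: PrefillMetrics,
                  ttft_slo_ms: float = 500.0) -> int:
    # Prefill is compute-bound: add workers when FLOPs saturate or the TTFT SLO
    # slips, since T_prefill shrinks roughly as 1/N_prefill.
    if m.flop_utilization > 0.85 or m.ttft_p99_ms > ttft_slo_ms:
        return n_workers + 1
    if m.flop_utilization < 0.40 and n_workers > 1:
        return n_workers - 1
    return n_workers

def scale_decode(n_workers: int, m: DecodeMetrics,
                 itl_slo_ms: float = 50.0) -> int:
    # Decode is bandwidth-bound: scale on HBM pressure and cache behaviour,
    # not on FLOPs.
    if (m.hbm_bw_utilization > 0.85 or m.itl_p99_ms > itl_slo_ms
            or m.kv_hit_rate < 0.5):
        return n_workers + 1
    if m.hbm_bw_utilization < 0.40 and n_workers > 1:
        return n_workers - 1
    return n_workers
```
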
Key deployment practices include:

  • Provision a high-bandwidth network backbone (RDMA/NVLink) to keep $T_{\text{xfer}} < 1\,\text{ms}$ end-to-end.
  • Accept eventual consistency in the KV store, but enforce eviction policies (LRU/cost-aware) to bound staleness (a minimal cost-aware eviction sketch follows this list).
  • Isolate failures: a decode-only outage should not block prefill ingress.
  • Monitor phase-wise TTFT, inter-token latency, cache miss/hit, and network latency histograms for anomaly detection (Kumar et al., 16 Oct 2025).
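
The following is a minimal sketch of the cost-aware eviction policy referenced above; the scoring function, weights, and block-metadata layout are illustrative assumptions rather than a specific system's design.

```python
# Minimal sketch (assumed policy, not a specific system's design): cost-aware
# eviction for a centralized KV store. Blocks are scored by recency plus a
# weighted recompute cost (longer prefixes cost more prefill FLOPs to rebuild),
# and the lowest-scoring blocks are evicted first when capacity is exceeded.
import time
from dataclasses import dataclass, field

@dataclass(order=True)
class KVBlockMeta:
    score: float                                   # higher = more worth keeping
    block_id: int = field(compare=False)
    nbytes: int = field(compare=False)
    recompute_flops: float = field(compare=False)

class CostAwareKVStore:
    def __init__(self, capacity_bytes: int, cost_weight: float = 1e-9):
        self.capacity = capacity_bytes
        self.used = 0
        self.w_cost = cost_weight                  # trades recency vs. recompute cost
        self.blocks: dict[int, KVBlockMeta] = {}

    def _score(self, recompute_flops: float) -> float:
        return time.monotonic() + self.w_cost * recompute_flops

    def put(self, block_id: int, nbytes: int, recompute_flops: float):
        self.blocks[block_id] = KVBlockMeta(
            self._score(recompute_flops), block_id, nbytes, recompute_flops)
        self.used += nbytes
        self._evict_if_needed()

    def touch(self, block_id: int):
        meta = self.blocks.get(block_id)
        if meta is not None:                       # refresh recency on access
            meta.score = self._score(meta.recompute_flops)

    def _evict_if_needed(self):
        while self.used > self.capacity and self.blocks:
            victim = min(self.blocks.values())     # lowest combined score first
            self.used -= victim.nbytes
            del self.blocks[victim.block_id]
```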

Batching and speculative decoding are essential for high throughput; speculative pipelines overlap early draft-model decode with accurate main-model computation to further reduce inter-token latency.
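
A minimal sketch of the greedy-acceptance variant of speculative decoding is given below; `draft_model` and `target_model` are hypothetical callables, and production systems typically use probabilistic acceptance with batched verification kernels.

```python
# Minimal sketch of speculative decoding (greedy acceptance only; draft_model
# and target_model are hypothetical interfaces, not a specific library's API).
def speculative_step(tokens, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then verify them with one batched target pass.
    Accepted tokens cost roughly one target forward instead of k sequential ones."""
    # 1) Draft phase: a small model proposes k tokens autoregressively.
    draft = []
    ctx = list(tokens)
    for _ in range(k):
        nxt = draft_model(ctx)              # greedy next token from the draft model
        draft.append(nxt)
        ctx.append(nxt)

    # 2) Verify phase: the target model scores all k drafted positions in one
    #    parallel, prefill-like pass and returns k+1 greedy predictions.
    target_preds = target_model(list(tokens), draft)

    # 3) Accept the longest prefix where draft and target agree, then take the
    #    target's own token at the first disagreement (always >= 1 token of progress).
    accepted = []
    for i, tok in enumerate(draft):
        if target_preds[i] == tok:
            accepted.append(tok)
        else:
            accepted.append(target_preds[i])
            break
    else:
        accepted.append(target_preds[k])    # all drafts accepted: bonus token
    return list(tokens) + accepted
```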

5. Phase-Aware Hardware Specialization

Disaggregation enables hardware specialization per phase. Prefill chips maximize systolic arrays, favor cheaper GDDR memory, and de-emphasize bandwidth; decode chips shrink compute resources but retain full high-bandwidth HBM (see SPAD architecture) (Zhang et al., 9 Oct 2025). Prefill-optimized hardware can attain 8% higher throughput at half the cost; decode-optimized hardware saves 28% TDP with marginal performance loss.

Edge deployment variants exploiting PD Disaggregation use dynamic partial reconfiguration (DPR) to hot-swap attention engines, maximizing LUT/URAM utilization in FPGAs for each phase (Zhang et al., 12 Dec 2025).

6. Advanced Scheduling, Dynamic Balancing, and Adaptation

Modern serving systems extend base PD Disaggregation with dynamic scheduling and phase-aware migration strategies:

  • ARES introduces predictive decode-phase balancing via lightweight length predictors operating on LLM hidden states to anticipate future token load, reducing TPOT tail by 74.77% and more than doubling goodput under long-output workloads (Wang et al., 15 Oct 2025).
  • TaiChi unifies PD Disaggregation and aggregation, with scheduling mechanisms (“latency shifting,” flowing-decode, and length-aware prefill) and configurable resource ratios, enabling hybrid configurations that achieve up to 77% higher goodput under balanced SLOs (Wang et al., 4 Aug 2025).
  • DOPD dynamically derives and maintains the optimal P/D instance ratio online, using ARIMA-based forecasting of arrival rate and input and output lengths, thereby achieving up to 1.5× higher goodput and >99% SLO attainment even under workload bursts (Liao et al., 26 Nov 2025).
  • Phase-aware pruning strategies further optimize the model for each disaggregated phase, reducing computation, memory, and communication demands with stage-specific block pruning and token-aware KV cache transmission (Zhang et al., 29 Aug 2025).

PD Disaggregation is also extensible: for large multimodal models, EPD Disaggregation further splits encoding (for images/audio) from prefill and decode, yielding even greater memory and batch-size gains (Singh et al., 25 Dec 2024); for online/offline collocation workloads, latency-strict vs. relaxed pool partitioning (OOCO) enables strict SLO compliance while maximizing offline throughput (Wu et al., 26 Nov 2025).
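
As a back-of-envelope illustration of the dynamic P/D sizing idea above (a generic sketch, not DOPD's ARIMA-based algorithm; all capacities and workload figures are assumed), pool sizes can be derived from forecast demand as follows.

```python
# Generic illustration of deriving a prefill/decode instance ratio from forecast
# workload; the per-instance token throughputs and workload numbers are assumptions.
import math

def pd_instance_ratio(req_per_s: float, avg_prompt_tokens: float,
                      avg_output_tokens: float,
                      prefill_tokens_per_s: float,
                      decode_tokens_per_s: float) -> tuple[int, int]:
    """Size each pool so that forecast demand fits within per-instance capacity."""
    prefill_demand = req_per_s * avg_prompt_tokens    # prompt tokens/s to ingest
    decode_demand = req_per_s * avg_output_tokens     # output tokens/s to generate
    n_prefill = math.ceil(prefill_demand / prefill_tokens_per_s)
    n_decode = math.ceil(decode_demand / decode_tokens_per_s)
    return n_prefill, n_decode

# Usage with illustrative numbers: 20 req/s, 2k-token prompts, 256-token outputs,
# a prefill instance ingesting 40k tokens/s and a decode instance emitting 4k tokens/s.
print(pd_instance_ratio(20, 2048, 256, 40_000, 4_000))   # -> (2, 2)
```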

7. Significance, Limitations, and Forward Directions

PD Disaggregation has become the dominant industrial deployment paradigm for large-context and throughput-sensitive LLM inference, enabling modular, autoscalable, cloud-native serving at enterprise scale (Kumar et al., 16 Oct 2025).

Key benefits, directly verified in empirical evaluations:

  • Predictable and low tail latency for both TTFT and per-token decode.
  • Order-of-magnitude throughput gains and near-linear scalability.
  • Hardware and cloud cost reductions exceeding 40% through phase-specialized resource provisioning (Zhang et al., 9 Oct 2025).
  • Immediate applicability to advanced serving stacks (DistServe, Dynamo, Mooncake, vLLM, etc.) and extension to retrieval-augmented scenarios (Liu et al., 1 Dec 2025).

Current limitations include increased system complexity; elastic scaling and adaptive scheduling policies are essential to avoid producer-consumer imbalance and resource underutilization under skewed or bursty workloads (Mitra et al., 5 Jun 2025, Liao et al., 26 Nov 2025). Further scaling requires robust network infrastructure for low-latency KV transfer, and ongoing research targets optimal pruning, phase-aware quantization, and dynamic adaptation in multi-tenant and edge environments.

PD Disaggregation, together with its variants and extensions, is fundamental for the efficient, scalable, and cost-effective serving of ever-larger LLMs in modern cloud and edge inference deployments.
