Disaggregated LLM Serving Infrastructure

Updated 24 November 2025
  • Disaggregated LLM serving infrastructure is a modular approach that partitions inference stages (prefill and decode) across heterogeneous resources to meet specific compute and memory demands.
  • It leverages phase and compute–memory disaggregation along with optimized communication and adaptive scheduling to enhance throughput and reduce latency.
  • Empirical results demonstrate up to 2× throughput gains and 30–40% cost savings, validating its efficacy for scalable, production-grade AI deployments.

Disaggregated LLM serving infrastructure partitions the major computational and memory-intensive phases of LLM inference—such as prefill (prompt encoding) and decode (autoregressive token generation)—onto distinct machine pools, accelerators, and memory tiers. This architectural paradigm addresses the inherent resource heterogeneity in LLM workloads, unlocks high hardware utilization, and enables scalable, cost-effective deployment at production scale, as demonstrated by systems such as DistServe, HexGen-2, BanaServe, DéjàVu, Mooncake, MegaScale-Infer, and xDeepServe (Zhong et al., 18 Jan 2024, Jiang et al., 11 Feb 2025, He et al., 15 Oct 2025, Strati et al., 4 Mar 2024, Qin et al., 24 Jun 2024, Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025). The design combines optimized communication strategies, adaptive scheduling, and fault-tolerant mechanisms to meet stringent service-level objectives (SLOs) for both latency and throughput over large, distributed, and often heterogeneous accelerator clusters.

1. Architectural Principles and Taxonomy

Disaggregation in LLM serving takes several primary forms:

  • Phase disaggregation: prefill (prompt encoding) and decode (autoregressive token generation) run on separate instance pools sized to their distinct compute and memory profiles.
  • Compute–memory disaggregation: KV caches and activations are held in a dedicated cache tier (HBM, CPU RAM, SSD) decoupled from the compute engines that produce and consume them.
  • Module-level disaggregation: attention and feed-forward or MoE expert paths within a transformer layer are split across heterogeneous pools with tailored communication.

These approaches are architected via a multi-tier serving stack. Typical layers include (a) a parameter store for model weights, (b) a cache tier for activations, (c) distributed compute engines (grouped by phase), and (d) high-speed network fabrics (NVLink, InfiniBand, RoCE) coordinating phase transitions and cache transfer.
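For concreteness, the tier structure can be written down as a small deployment description; the class and field names below are illustrative stand-ins, not the configuration schema of any particular system.

```python
from dataclasses import dataclass, field

# Illustrative four-tier disaggregated serving stack.  All names and
# defaults are hypothetical; real systems define their own schemas.

@dataclass
class ParameterStore:              # (a) read-mostly model weights
    backend: str = "object-store"
    replication: int = 2

@dataclass
class CacheTier:                   # (b) KV-cache / activation tier, hot to cold
    media: tuple = ("HBM", "CPU-DRAM", "SSD")
    eviction_policy: str = "lru"

@dataclass
class ComputePool:                 # (c) phase-grouped compute engines
    phase: str                     # "prefill" or "decode"
    accelerators: int = 8
    tensor_parallel: int = 4

@dataclass
class Fabric:                      # (d) fabric for phase transitions / cache transfer
    kind: str = "RDMA"             # e.g. NVLink, InfiniBand, RoCE
    gbps_per_node: float = 400.0

@dataclass
class ServingStack:
    weights: ParameterStore = field(default_factory=ParameterStore)
    kv_cache: CacheTier = field(default_factory=CacheTier)
    pools: tuple = (ComputePool(phase="prefill"), ComputePool(phase="decode"))
    network: Fabric = field(default_factory=Fabric)

stack = ServingStack()
print(stack.pools[0].phase, "->", stack.pools[1].phase)
```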

2. Core Mechanisms and Dataflows

A canonical disaggregated serving workflow operates as follows (Zhong et al., 18 Jan 2024, Jiang et al., 11 Feb 2025); a schematic sketch follows the list:

  1. Request Admission: The scheduler receives a prompt and estimates its input (prefill) length and expected output token count.
  2. Prefill Execution: Request assigned to prefill pool; full prompt executed in parallel with customized parallelism (tensor/model/pipeline), producing initial token(s) and the KV cache for all layers.
  3. KV-cache Transfer: KV cache, typically tens to hundreds of MB, is serialized and transferred (over NVLink/RDMA/Ethernet) to the decode pool, either via collective or one-sided direct access (e.g., GPUDirect RDMA) (Chen et al., 13 Dec 2024, Qin et al., 24 Jun 2024).
  4. Decoding Execution: Decode pool iteratively runs small-batch, memory-bound autoregressive steps, streaming output tokens. Continuous batching and fine-grained scheduling optimize for time per output token (TPOT) (Li et al., 17 Jul 2024, Zhong et al., 18 Jan 2024).
  5. Cache Management: Disaggregated systems offload inactive or intermediate KV cache to secondary tiers (CPU RAM, SSD) (Qin et al., 24 Jun 2024, He et al., 15 Oct 2025).
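The sketch below traces steps 1–5 for a single request. The Scheduler, pool, and KV-store interfaces are hypothetical placeholders used only to make the dataflow explicit; they are not the API of any of the cited systems.

```python
import asyncio
from dataclasses import dataclass

# Schematic end-to-end flow of one disaggregated request.  All objects
# passed in (scheduler, pools, kv_store) are assumed stand-ins.

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

async def serve(request, scheduler, prefill_pool, decode_pool, kv_store):
    # 1. Admission: pick prefill/decode instances for this request.
    p_inst, d_inst = scheduler.admit(request)

    # 2. Prefill: run the full prompt in parallel, producing the first
    #    token and the per-layer KV cache.
    first_token, kv_cache = await prefill_pool.run(p_inst, request.prompt)

    # 3. KV-cache transfer: ship the cache to the decode instance
    #    (NVLink / RDMA / Ethernet in real deployments).
    handle = await kv_store.transfer(kv_cache, src=p_inst, dst=d_inst)

    # 4. Decode: memory-bound autoregressive steps, streaming tokens out.
    async for token in decode_pool.generate(d_inst, handle, first_token,
                                            request.max_new_tokens):
        yield token

    # 5. Cache management: demote the now-idle cache to a colder tier.
    await kv_store.offload(handle, tier="cpu_ram")
```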

For module-level disaggregation or MoE, further splitting is applied at the transformer layer, decoupling attention and feed-forward paths, which are realized across heterogeneous pools with tailored communication (e.g., ping-pong pipelining, expert-parallel routing) (Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025).
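As a toy illustration of the ping-pong idea (assuming only two microbatches and simulated step times), the snippet below overlaps the FFN/expert work of one microbatch with the attention work of the next, which is what keeps both disaggregated pools busy; run_attention and run_ffn are stand-ins for remote kernel launches, not real system calls.

```python
import asyncio

async def attention_step(mb, layer):
    await asyncio.sleep(0.001)      # stand-in for remote attention compute
    return (mb, layer, "attn_out")

async def ffn_step(attn_out):
    await asyncio.sleep(0.001)      # stand-in for remote FFN / expert compute
    return attn_out[:2] + ("ffn_out",)

async def ping_pong_layer(layer):
    # Half-step A: the attention pool works on microbatch 0.
    a0 = await attention_step(0, layer)
    # Half-step B: the FFN pool consumes microbatch 0 while the attention
    # pool, now free, starts microbatch 1 -- the two pools overlap.
    f0_task = asyncio.create_task(ffn_step(a0))
    a1 = await attention_step(1, layer)
    f1_task = asyncio.create_task(ffn_step(a1))
    return await asyncio.gather(f0_task, f1_task)

print(asyncio.run(ping_pong_layer(layer=0)))
```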

3. Scheduling, Load Balancing, and Autoscaling

Optimal efficiency in disaggregated settings requires fine-grained scheduling, continuous adaptation, and hardware-aware resource allocation:

  • Static and Adaptive Partitioning: Early systems (DistServe, vLLM-disaggregated) statically partitioned GPU pools for prefill and decode. Adaptive schedulers (Arrow, HeteroScale) react to real-time metrics (queue length, utilization, SLO attainment) to elastically allocate instances to roles and automatically rebalance under bursts or workload skew (Wu et al., 17 May 2025, Li et al., 27 Aug 2025).
  • Mathematical Models: Placement is formalized as a mixed-integer linear program (MILP) or as network flow optimization, maximizing throughput subject to GPU and network constraints:

\max_{A,t,s,f} \sum_{g,h} f_{g \to h}

Subject to per-GPU and per-link bandwidth/memory limits, and phase-specific SLOs (Jiang et al., 11 Feb 2025); a toy network-flow version is sketched after this list.

  • Cluster-Wide Scaling: Coordinated autoscaling (HeteroScale) uses a primary metric (e.g., decode TPS) for proportional control of joint prefill and decode pools, aligned to network topology to minimize data transfer bottlenecks and keep P/D ratios optimal (Li et al., 27 Aug 2025).
  • Module-level Load Balancing: For MoE or attention disaggregation, algorithms such as expert load balancing, ping-pong pipeline parallelism, or attention-head migration are used to allocate resources dynamically, reduce stragglers, and saturate device throughput (Zhu et al., 3 Apr 2025, He et al., 15 Oct 2025, Xiao et al., 4 Aug 2025).
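A toy network-flow relaxation of the placement problem above, written with the open-source PuLP modeler. The two-group topology, capacities, and variable names are illustrative assumptions; real formulations add integer placement variables, parallelism choices, and SLO terms.

```python
import pulp

# Route request flow f[(g, h)] from prefill group g to decode group h,
# maximizing total served flow subject to per-group compute capacity and
# per-link bandwidth caps.  All numbers are illustrative only.

prefill_groups = ["P0", "P1"]
decode_groups = ["D0", "D1"]
prefill_cap = {"P0": 100, "P1": 60}          # max prefill req/s per group
decode_cap = {"D0": 90, "D1": 70}            # max decode req/s per group
link_cap = {(g, h): 80 for g in prefill_groups for h in decode_groups}

prob = pulp.LpProblem("pd_placement", pulp.LpMaximize)
f = pulp.LpVariable.dicts(
    "flow", [(g, h) for g in prefill_groups for h in decode_groups], lowBound=0
)

# Objective: maximize total request flow across all prefill->decode edges.
prob += pulp.lpSum(f[(g, h)] for g in prefill_groups for h in decode_groups)

for g in prefill_groups:          # prefill compute limit per group
    prob += pulp.lpSum(f[(g, h)] for h in decode_groups) <= prefill_cap[g]
for h in decode_groups:           # decode compute limit per group
    prob += pulp.lpSum(f[(g, h)] for g in prefill_groups) <= decode_cap[h]
for edge, cap in link_cap.items():  # per-link KV-cache bandwidth limit
    prob += f[edge] <= cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({e: v.value() for e, v in f.items()}, "total =", pulp.value(prob.objective))
```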

4. Communication and Fault Tolerance

Disaggregation by definition introduces network overhead and necessitates optimized communication primitives:

  • High-Performance RDMA and Direct Memory Access: Custom communication layers (e.g., MegaScale-Infer's M2N, xDeepServe's XCCL, DéjàVuLib) eliminate GPU–CPU copies and expose tensor-centric, point-to-point and all-to-all primitives, realizing up to 4.2× throughput vs. NCCL for small messages and stable tail latencies as system scale grows (Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025, Strati et al., 4 Mar 2024).
  • KV Cache Strategies: Pull-based transfer (KVDirect) removes decode-side idling and reduces total latency compared to push mode; CPU/SSD-backed global KV stores enable cache-load-insensitive routing and dynamic prefix reuse (Chen et al., 13 Dec 2024, He et al., 15 Oct 2025, Qin et al., 24 Jun 2024); a conceptual sketch of the pull pattern follows this list.
  • Microbatch Swapping and Memory Tiering: KV cache for idle microbatches can be asynchronously swapped to host memory (as in DéjàVu), significantly reducing GPU RAM requirements (e.g., 2× RAM reduction, 1.8× batch size increase) (Strati et al., 4 Mar 2024, Qin et al., 24 Jun 2024).
  • Fault Tolerance: State replication (per-microbatch and per-token) between neighbor workers allows sub-second recovery from device failures, with recovery time T_{\mathrm{recover}} = T_{\mathrm{fetch\_replica}} + T_{\mathrm{replay}}, achieving 1.24× slowdown vs. 1.89× in non-replicated systems (Strati et al., 4 Mar 2024).
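The snippet below sketches the pull-based pattern only at a conceptual level: the decode side issues one-sided reads for published KV blocks and can begin work as blocks arrive. The rdma_read helper and the block-table layout are hypothetical stand-ins, not the KVDirect API.

```python
import concurrent.futures as futures

# Conceptual pull-based KV-cache transfer: the decode worker pulls cache
# blocks as soon as the prefill worker publishes their addresses, instead
# of the prefill worker pushing and the decode worker idling.

def rdma_read(remote_addr: int, num_bytes: int) -> bytes:
    # Placeholder for a GPUDirect RDMA one-sided read.
    return bytes(num_bytes)

def pull_kv_cache(block_table, block_bytes, max_inflight=8):
    """block_table maps layer index -> remote address published by prefill."""
    blocks = {}
    with futures.ThreadPoolExecutor(max_workers=max_inflight) as pool:
        futs = {pool.submit(rdma_read, addr, block_bytes): layer
                for layer, addr in block_table.items()}
        for fut in futures.as_completed(futs):
            blocks[futs[fut]] = fut.result()   # decode can start as blocks land
    return blocks

# Example: 32 layers, 4 MiB of KV per layer for this request (assumed sizes).
kv = pull_kv_cache({layer: 0x1000 * layer for layer in range(32)}, 4 << 20)
print(len(kv), "layers pulled")
```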

5. Heterogeneous and Modular Deployment

Disaggregated LLM serving unlocks previously unattainable heterogeneity and modularity in large clusters:

  • Heterogeneous Accelerator Support: Workload phases are scheduled to the hardware tier (GPU/CPU/NPU) best matched to their resource profile, e.g., high-TFLOPS devices for prefill and high-HBM-capacity devices for decode, with cost-performance optimized via joint search over parallelism settings and instance counts (Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025, Xiao et al., 4 Aug 2025).
  • Cross-vendor and Multi-generation Compatibility: Systems such as HexGen-2 and multi-vendor P-D deployments remove GPU-vendor lock-in by flattening and re-tiling tensor formats and by accommodating different parallelism degrees in KV-cache alignment and control (Jiang et al., 11 Feb 2025, Chen et al., 22 Sep 2025).
  • Unified, Programmable APIs: LLM microserving exposes sub-request-level REST endpoints (prep_recv, remote_send, start_generate) and a programmable router, unifying data- and model-parallel, phase-disaggregated, and prefix-migration patterns in a consistent interface (Jin et al., 17 Dec 2024); a hedged sketch of such a router follows this list.
  • Autoscaling and Fine-Grained Elasticity: Stateless instance pools (Arrow) and partially disaggregated rolling activation (EcoServe) permit efficient, SLO-aware scaling of capacity with superlinear gains in request serving rates over static architectures (Wu et al., 17 May 2025, Du et al., 25 Apr 2025).
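Below is a hedged sketch of how a programmable router might compose the microserving-style sub-request calls named above; the URLs, JSON fields, and response shapes are assumptions for illustration and do not reproduce the actual endpoint signatures in (Jin et al., 17 Dec 2024).

```python
import requests

# Illustrative router composing prep_recv / remote_send / start_generate
# across a prefill engine and a decode engine.  Payload and response
# fields ("desc", "prompt_len_hint", ...) are hypothetical.

def disaggregated_route(prompt, prefill_url, decode_url, max_tokens=128):
    # 1. Ask the decode engine to allocate KV space and return a receive
    #    descriptor for this request.
    recv = requests.post(f"{decode_url}/prep_recv",
                         json={"prompt_len_hint": len(prompt)}).json()

    # 2. Ask the prefill engine to run the prompt and stream the KV cache
    #    directly to the decode engine's descriptor.
    requests.post(f"{prefill_url}/remote_send",
                  json={"prompt": prompt, "recv_desc": recv["desc"]})

    # 3. Kick off decoding on the decode engine; tokens stream back.
    resp = requests.post(f"{decode_url}/start_generate",
                         json={"recv_desc": recv["desc"],
                               "max_tokens": max_tokens},
                         stream=True)
    for line in resp.iter_lines():
        if line:
            yield line.decode()
```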

6. Empirical Performance and Limitations

Disaggregated serving infrastructures consistently report large improvements in utilization, throughput, and SLO compliance, with headline results of up to 2× throughput gains and 30–40% cost savings over colocated baselines.

Limitations persist: fully disaggregated settings require very high-bandwidth networks to avoid cache-transfer bottlenecks (a minimum of roughly 40+ GB/s per node is recommended for SLO compliance) (Du et al., 25 Apr 2025, Zhong et al., 18 Jan 2024), and modular, cross-pool policies increase system complexity and interdependence. Systems such as EcoServe and semi-PD propose compromises between full and partial disaggregation to better accommodate commodity settings or cluster-level storage constraints (Du et al., 25 Apr 2025, Hong et al., 28 Apr 2025).
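A quick back-of-envelope check (assuming a 200 MB per-request KV cache, in the middle of the "tens to hundreds of MB" range quoted earlier) shows why per-node bandwidth in the tens of GB/s matters:

```python
# Time to move one request's KV cache at different per-node bandwidths.
# The 200 MB cache size is an assumed mid-range value.

kv_cache_bytes = 200e6                      # assumed per-request KV cache size
for gbps in (10, 40, 100):                  # per-node bandwidth in GB/s
    transfer_ms = kv_cache_bytes / (gbps * 1e9) * 1e3
    print(f"{gbps:>4} GB/s -> {transfer_ms:6.1f} ms per request")

#  10 GB/s -> 20.0 ms : a large fraction of a tight time-to-first-token budget
#  40 GB/s ->  5.0 ms : small relative to prefill time, consistent with the
#                       ~40 GB/s-per-node recommendation above
```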

7. Synthesis and Forward Directions

Disaggregated LLM serving infrastructures are now considered foundational to large-scale, high-throughput, and cost-sensitive generative AI deployment. They enable independent scaling of phase- or operator-specific compute, memory, and interconnect resources, transparent support for multi-vendor and multi-generational hardware, and robust, SLO-aware cluster operation. Key open areas for future development include reducing cache-transfer overheads on commodity networks, taming the added complexity of cross-pool scheduling and autoscaling policies, and further broadening heterogeneous hardware support.

With an expanding ecosystem of production-grade frameworks and research systems, and as both model and infrastructure complexity scale, disaggregation is poised to remain the dominant pattern for next-generation LLM inference services.
