Disaggregated LLM Serving Infrastructure

Updated 24 November 2025
  • Disaggregated LLM serving infrastructure is a modular approach that partitions inference stages (prefill and decode) across heterogeneous resources to meet specific compute and memory demands.
  • It leverages phase and compute–memory disaggregation along with optimized communication and adaptive scheduling to enhance throughput and reduce latency.
  • Empirical results demonstrate up to 2× throughput gains and 30–40% cost savings, validating its efficacy for scalable, production-grade AI deployments.

Disaggregated LLM serving infrastructure partitions the major computational and memory-intensive phases of LLM inference—such as prefill (prompt encoding) and decode (autoregressive token generation)—onto distinct machine pools, accelerators, and memory tiers. This architectural paradigm addresses the inherent resource heterogeneity in LLM workloads, unlocks high hardware utilization, and enables scalable, cost-effective deployment at production scale, as demonstrated by systems such as DistServe, HexGen-2, BanaServe, DéjàVu, Mooncake, MegaScale-Infer, and xDeepServe (Zhong et al., 18 Jan 2024, Jiang et al., 11 Feb 2025, He et al., 15 Oct 2025, Strati et al., 4 Mar 2024, Qin et al., 24 Jun 2024, Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025). The design combines optimized communication strategies, adaptive scheduling, and fault-tolerant mechanisms to meet stringent service-level objectives (SLOs) for both latency and throughput over large, distributed, and often heterogeneous accelerator clusters.

1. Architectural Principles and Taxonomy

Disaggregation in LLM serving takes several primary forms:

  • Phase disaggregation: prefill (prompt encoding) and decode (autoregressive token generation) run on separate instance pools sized to their distinct compute and memory profiles.
  • Compute–memory disaggregation: KV caches and activations are held in a dedicated cache tier (HBM, CPU RAM, SSD) decoupled from the compute engines that produce and consume them.
  • Module-level disaggregation: attention and feed-forward or MoE expert paths within a transformer layer are split across heterogeneous pools with tailored communication.

These approaches are architected via a multi-tier serving stack. Typical layers include (a) a parameter store for model weights, (b) a cache tier for activations, (c) distributed compute engines (grouped by phase), and (d) high-speed network fabrics (NVLink, InfiniBand, RoCE) coordinating phase transitions and cache transfer.
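For concreteness, the tier structure can be written down as a small deployment description; the class and field names below are illustrative stand-ins, not the configuration schema of any particular system.

```python
from dataclasses import dataclass, field

# Illustrative four-tier disaggregated serving stack.  All names and
# defaults are hypothetical; real systems define their own schemas.

@dataclass
class ParameterStore:              # (a) read-mostly model weights
    backend: str = "object-store"
    replication: int = 2

@dataclass
class CacheTier:                   # (b) KV-cache / activation tier, hot to cold
    media: tuple = ("HBM", "CPU-DRAM", "SSD")
    eviction_policy: str = "lru"

@dataclass
class ComputePool:                 # (c) phase-grouped compute engines
    phase: str                     # "prefill" or "decode"
    accelerators: int = 8
    tensor_parallel: int = 4

@dataclass
class Fabric:                      # (d) fabric for phase transitions / cache transfer
    kind: str = "RDMA"             # e.g. NVLink, InfiniBand, RoCE
    gbps_per_node: float = 400.0

@dataclass
class ServingStack:
    weights: ParameterStore = field(default_factory=ParameterStore)
    kv_cache: CacheTier = field(default_factory=CacheTier)
    pools: tuple = (ComputePool(phase="prefill"), ComputePool(phase="decode"))
    network: Fabric = field(default_factory=Fabric)

stack = ServingStack()
print(stack.pools[0].phase, "->", stack.pools[1].phase)
```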

2. Core Mechanisms and Dataflows

A canonical disaggregated serving workflow operates as follows (Zhong et al., 18 Jan 2024, Jiang et al., 11 Feb 2025); a schematic sketch follows the list:

  1. Request Admission: The scheduler receives a prompt and estimates its input (prefill) length and expected output token count.
  2. Prefill Execution: Request assigned to prefill pool; full prompt executed in parallel with customized parallelism (tensor/model/pipeline), producing initial token(s) and the KV cache for all layers.
  3. KV-cache Transfer: KV cache, typically tens to hundreds of MB, is serialized and transferred (over NVLink/RDMA/Ethernet) to the decode pool, either via collective or one-sided direct access (e.g., GPUDirect RDMA) (Chen et al., 13 Dec 2024, Qin et al., 24 Jun 2024).
  4. Decoding Execution: Decode pool iteratively runs small-batch, memory-bound autoregressive steps, streaming output tokens. Continuous batching and fine-grained scheduling optimize for time per output token (TPOT) (Li et al., 17 Jul 2024, Zhong et al., 18 Jan 2024).
  5. Cache Management: Disaggregated systems offload inactive or intermediate KV cache to secondary tiers (CPU RAM, SSD) (Qin et al., 24 Jun 2024, He et al., 15 Oct 2025).
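The sketch below traces steps 1–5 for a single request. The Scheduler, pool, and KV-store interfaces are hypothetical placeholders used only to make the dataflow explicit; they are not the API of any of the cited systems.

```python
import asyncio
from dataclasses import dataclass

# Schematic end-to-end flow of one disaggregated request.  All objects
# passed in (scheduler, pools, kv_store) are assumed stand-ins.

@dataclass
class Request:
    prompt: str
    max_new_tokens: int

async def serve(request, scheduler, prefill_pool, decode_pool, kv_store):
    # 1. Admission: pick prefill/decode instances for this request.
    p_inst, d_inst = scheduler.admit(request)

    # 2. Prefill: run the full prompt in parallel, producing the first
    #    token and the per-layer KV cache.
    first_token, kv_cache = await prefill_pool.run(p_inst, request.prompt)

    # 3. KV-cache transfer: ship the cache to the decode instance
    #    (NVLink / RDMA / Ethernet in real deployments).
    handle = await kv_store.transfer(kv_cache, src=p_inst, dst=d_inst)

    # 4. Decode: memory-bound autoregressive steps, streaming tokens out.
    async for token in decode_pool.generate(d_inst, handle, first_token,
                                            request.max_new_tokens):
        yield token

    # 5. Cache management: demote the now-idle cache to a colder tier.
    await kv_store.offload(handle, tier="cpu_ram")
```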

For module-level disaggregation or MoE, further splitting is applied at the transformer layer, decoupling attention and feed-forward paths, which are realized across heterogeneous pools with tailored communication (e.g., ping-pong pipelining, expert-parallel routing) (Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025).
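As a toy illustration of the ping-pong idea (assuming only two microbatches and simulated step times), the snippet below overlaps the FFN/expert work of one microbatch with the attention work of the next, which is what keeps both disaggregated pools busy; run_attention and run_ffn are stand-ins for remote kernel launches, not real system calls.

```python
import asyncio

async def attention_step(mb, layer):
    await asyncio.sleep(0.001)      # stand-in for remote attention compute
    return (mb, layer, "attn_out")

async def ffn_step(attn_out):
    await asyncio.sleep(0.001)      # stand-in for remote FFN / expert compute
    return attn_out[:2] + ("ffn_out",)

async def ping_pong_layer(layer):
    # Half-step A: the attention pool works on microbatch 0.
    a0 = await attention_step(0, layer)
    # Half-step B: the FFN pool consumes microbatch 0 while the attention
    # pool, now free, starts microbatch 1 -- the two pools overlap.
    f0_task = asyncio.create_task(ffn_step(a0))
    a1 = await attention_step(1, layer)
    f1_task = asyncio.create_task(ffn_step(a1))
    return await asyncio.gather(f0_task, f1_task)

print(asyncio.run(ping_pong_layer(layer=0)))
```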

3. Scheduling, Load Balancing, and Autoscaling

Optimal efficiency in disaggregated settings requires fine-grained scheduling, continuous adaptation, and hardware-aware resource allocation:

  • Static and Adaptive Partitioning: Early systems (DistServe, vLLM-disaggregated) statically partitioned GPU pools for prefill and decode. Adaptive schedulers (Arrow, HeteroScale) react to real-time metrics (queue length, utilization, SLO attainment) to elastically allocate instances to roles and automatically rebalance under bursts or workload skew (Wu et al., 17 May 2025, Li et al., 27 Aug 2025).
  • Mathematical Models: Placement is formalized as a mixed-integer linear program (MILP) or as network flow optimization, maximizing throughput subject to GPU and network constraints:

\max_{A,t,s,f} \sum_{g,h} f_{g \to h}

Subject to per-GPU and per-link bandwidth/memory limits, and phase-specific SLOs (Jiang et al., 11 Feb 2025); a toy network-flow version is sketched after this list.

  • Cluster-Wide Scaling: Coordinated autoscaling (HeteroScale) uses a primary metric (e.g., decode TPS) for proportional control of joint prefill and decode pools, aligned to network topology to minimize data transfer bottlenecks and keep P/D ratios optimal (Li et al., 27 Aug 2025).
  • Module-level Load Balancing: For MoE or attention disaggregation, algorithms such as expert load balancing, ping-pong pipeline parallelism, or attention-head migration are used to allocate resources dynamically, reduce stragglers, and saturate device throughput (Zhu et al., 3 Apr 2025, He et al., 15 Oct 2025, Xiao et al., 4 Aug 2025).
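A toy network-flow relaxation of the placement problem above, written with the open-source PuLP modeler. The two-group topology, capacities, and variable names are illustrative assumptions; real formulations add integer placement variables, parallelism choices, and SLO terms.

```python
import pulp

# Route request flow f[(g, h)] from prefill group g to decode group h,
# maximizing total served flow subject to per-group compute capacity and
# per-link bandwidth caps.  All numbers are illustrative only.

prefill_groups = ["P0", "P1"]
decode_groups = ["D0", "D1"]
prefill_cap = {"P0": 100, "P1": 60}          # max prefill req/s per group
decode_cap = {"D0": 90, "D1": 70}            # max decode req/s per group
link_cap = {(g, h): 80 for g in prefill_groups for h in decode_groups}

prob = pulp.LpProblem("pd_placement", pulp.LpMaximize)
f = pulp.LpVariable.dicts(
    "flow", [(g, h) for g in prefill_groups for h in decode_groups], lowBound=0
)

# Objective: maximize total request flow across all prefill->decode edges.
prob += pulp.lpSum(f[(g, h)] for g in prefill_groups for h in decode_groups)

for g in prefill_groups:          # prefill compute limit per group
    prob += pulp.lpSum(f[(g, h)] for h in decode_groups) <= prefill_cap[g]
for h in decode_groups:           # decode compute limit per group
    prob += pulp.lpSum(f[(g, h)] for g in prefill_groups) <= decode_cap[h]
for edge, cap in link_cap.items():  # per-link KV-cache bandwidth limit
    prob += f[edge] <= cap

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({e: v.value() for e, v in f.items()}, "total =", pulp.value(prob.objective))
```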

4. Communication and Fault Tolerance

Disaggregation by definition introduces network overhead and necessitates optimized communication primitives:

  • High-Performance RDMA and Direct Memory Access: Custom communication layers (e.g., MegaScale-Infer's M2N, xDeepServe's XCCL, DéjàVuLib) eliminate GPU–CPU copies and expose tensor-centric, point-to-point and all-to-all primitives, realizing up to 4.2× throughput vs. NCCL for small messages and stable tail latencies as system scale grows (Zhu et al., 3 Apr 2025, Xiao et al., 4 Aug 2025, Strati et al., 4 Mar 2024).
  • KV Cache Strategies: Pull-based transfer (KVDirect) removes decode-side idling and reduces total latency compared to push mode; CPU/SSD-backed global KV stores enable cache-load-insensitive routing and dynamic prefix reuse (Chen et al., 13 Dec 2024, He et al., 15 Oct 2025, Qin et al., 24 Jun 2024); a conceptual sketch of the pull pattern follows this list.
  • Microbatch Swapping and Memory Tiering: KV cache for idle microbatches can be asynchronously swapped to host memory (as in DéjàVu), significantly reducing GPU RAM requirements (e.g., 2× RAM reduction, 1.8× batch size increase) (Strati et al., 4 Mar 2024, Qin et al., 24 Jun 2024).
  • Fault Tolerance: State replication (per-microbatch and per-token) between neighbor workers allows sub-second recovery from device failures, with recovery time T_{\mathrm{recover}} = T_{\mathrm{fetch\_replica}} + T_{\mathrm{replay}}, achieving 1.24× slowdown vs. 1.89× in non-replicated systems (Strati et al., 4 Mar 2024).
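The snippet below sketches the pull-based pattern only at a conceptual level: the decode side issues one-sided reads for published KV blocks and can begin work as blocks arrive. The rdma_read helper and the block-table layout are hypothetical stand-ins, not the KVDirect API.

```python
import concurrent.futures as futures

# Conceptual pull-based KV-cache transfer: the decode worker pulls cache
# blocks as soon as the prefill worker publishes their addresses, instead
# of the prefill worker pushing and the decode worker idling.

def rdma_read(remote_addr: int, num_bytes: int) -> bytes:
    # Placeholder for a GPUDirect RDMA one-sided read.
    return bytes(num_bytes)

def pull_kv_cache(block_table, block_bytes, max_inflight=8):
    """block_table maps layer index -> remote address published by prefill."""
    blocks = {}
    with futures.ThreadPoolExecutor(max_workers=max_inflight) as pool:
        futs = {pool.submit(rdma_read, addr, block_bytes): layer
                for layer, addr in block_table.items()}
        for fut in futures.as_completed(futs):
            blocks[futs[fut]] = fut.result()   # decode can start as blocks land
    return blocks

# Example: 32 layers, 4 MiB of KV per layer for this request (assumed sizes).
kv = pull_kv_cache({layer: 0x1000 * layer for layer in range(32)}, 4 << 20)
print(len(kv), "layers pulled")
```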

5. Heterogeneous and Modular Deployment

Disaggregated LLM serving unlocks previously unattainable heterogeneity and modularity in large clusters:

  • Heterogeneous Accelerator Support: Workload phases are scheduled to the hardware tier (GPU/CPU/NPU) best matched to their resource profile, e.g., high-TFLOPS devices for prefill and high-HBM-capacity devices for decode, with cost-performance optimized via joint search over parallelism settings and instance counts (Chen et al., 22 Sep 2025, Jiang et al., 11 Feb 2025, Xiao et al., 4 Aug 2025).
  • Cross-vendor and Multi-generation Compatibility: Systems such as HexGen-2 and multi-vendor P-D deployments remove GPU-vendor lock-in by flattening and re-tiling tensor formats and by accommodating different parallelism degrees in KV-cache alignment and control (Jiang et al., 11 Feb 2025, Chen et al., 22 Sep 2025).
  • Unified, Programmable APIs: LLM microserving exposes sub-request-level REST endpoints (prep_recv, remote_send, start_generate) and a programmable router, unifying data- and model-parallel, phase-disaggregated, and prefix-migration patterns in a consistent interface (Jin et al., 17 Dec 2024); a hedged sketch of such a router follows this list.
  • Autoscaling and Fine-Grained Elasticity: Stateless instance pools (Arrow) and partially disaggregated rolling activation (EcoServe) permit efficient, SLO-aware scaling of capacity with superlinear gains in request serving rates over static architectures (Wu et al., 17 May 2025, Du et al., 25 Apr 2025).
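Below is a hedged sketch of how a programmable router might compose the microserving-style sub-request calls named above; the URLs, JSON fields, and response shapes are assumptions for illustration and do not reproduce the actual endpoint signatures in (Jin et al., 17 Dec 2024).

```python
import requests

# Illustrative router composing prep_recv / remote_send / start_generate
# across a prefill engine and a decode engine.  Payload and response
# fields ("desc", "prompt_len_hint", ...) are hypothetical.

def disaggregated_route(prompt, prefill_url, decode_url, max_tokens=128):
    # 1. Ask the decode engine to allocate KV space and return a receive
    #    descriptor for this request.
    recv = requests.post(f"{decode_url}/prep_recv",
                         json={"prompt_len_hint": len(prompt)}).json()

    # 2. Ask the prefill engine to run the prompt and stream the KV cache
    #    directly to the decode engine's descriptor.
    requests.post(f"{prefill_url}/remote_send",
                  json={"prompt": prompt, "recv_desc": recv["desc"]})

    # 3. Kick off decoding on the decode engine; tokens stream back.
    resp = requests.post(f"{decode_url}/start_generate",
                         json={"recv_desc": recv["desc"],
                               "max_tokens": max_tokens},
                         stream=True)
    for line in resp.iter_lines():
        if line:
            yield line.decode()
```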

6. Empirical Performance and Limitations

Disaggregated serving infrastructures consistently report large improvements in utilization, throughput, and SLO compliance, with headline results of up to 2× throughput gains and 30–40% cost savings over colocated baselines.

Limitations persist: fully disaggregated settings require very high-bandwidth networks to avoid cache-transfer bottlenecks (a minimum of roughly 40+ GB/s per node is recommended for SLO compliance) (Du et al., 25 Apr 2025, Zhong et al., 18 Jan 2024), and modular, cross-pool policies increase system complexity and interdependence. Systems such as EcoServe and semi-PD propose compromises between full and partial disaggregation to better accommodate commodity settings or cluster-level storage constraints (Du et al., 25 Apr 2025, Hong et al., 28 Apr 2025).
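A quick back-of-envelope check (assuming a 200 MB per-request KV cache, in the middle of the "tens to hundreds of MB" range quoted earlier) shows why per-node bandwidth in the tens of GB/s matters:

```python
# Time to move one request's KV cache at different per-node bandwidths.
# The 200 MB cache size is an assumed mid-range value.

kv_cache_bytes = 200e6                      # assumed per-request KV cache size
for gbps in (10, 40, 100):                  # per-node bandwidth in GB/s
    transfer_ms = kv_cache_bytes / (gbps * 1e9) * 1e3
    print(f"{gbps:>4} GB/s -> {transfer_ms:6.1f} ms per request")

#  10 GB/s -> 20.0 ms : a large fraction of a tight time-to-first-token budget
#  40 GB/s ->  5.0 ms : small relative to prefill time, consistent with the
#                       ~40 GB/s-per-node recommendation above
```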

7. Synthesis and Forward Directions

Disaggregated LLM serving infrastructures are now considered foundational to large-scale, high-throughput, and cost-sensitive generative AI deployment. They enable independent scaling of phase- or operator-specific compute, memory, and interconnect resources, transparent support for multi-vendor and multi-generational hardware, and robust, SLO-aware cluster operation. Key open areas for future development include reducing cache-transfer overheads on commodity networks, taming the added complexity of cross-pool scheduling and autoscaling policies, and further broadening heterogeneous hardware support.

With an expanding ecosystem of production-grade frameworks and research systems, and as both model and infrastructure complexity scale, disaggregation is poised to remain the dominant pattern for next-generation LLM inference services.
