Disaggregated LLM Inference
- Disaggregated LLM inference is a paradigm that splits the traditional monolithic pipeline into independent microservices (tokenization, embedding, prefill, decode) tailored for specific compute/memory tasks.
- The approach maximizes hardware utilization and minimizes tail latency by isolating phases, enabling advanced scheduling, batching, and dynamic resource allocation.
- Techniques such as RDMA-based KV transfers, pipeline parallelism, and sharded attention are employed to achieve notable throughput improvements and cost efficiency.
Disaggregated LLM inference is a systems-level paradigm that decomposes the canonical monolithic inference pipeline for transformer-based LLMs into physically and logically independent computational microservices, each specialized for a substage such as tokenization, embedding, prompt prefill (KV-cache construction), and autoregressive decode. This architecture contrasts sharply with traditional single-replica or multi-replica designs and is motivated by marked phase heterogeneity in compute/memory requirements, network characteristics, and latency sensitivity within the LLM inference workflow. The principal goal of disaggregation is to maximize hardware utilization, minimize tail latency, and optimize resource allocation for production LLM serving at scale (Li et al., 17 Jul 2024).
1. Architectural Principles and Patterns
The disaggregated inference pipeline decomposes into distinct microservices mapped to logical stages:
- Tokenization Service: A CPU-based microservice that converts input text into sequences of token IDs. Its latency is typically negligible, but it sits on the critical path as the first stage.
- Embedding Lookup Service: Runs on CPU or GPU depending on the model. Acts as a vector-gather RPC over the embedding table, with lookup latency often hidden by pipelining.
- Prefill Service (Prompt Encoding): Processes one-time prompt encoding over large batches on high-memory GPUs. Responsible for building the complete KV-cache for each request.
- KV-Cache Management: Specialized attention memory handlers (e.g., RingAttention, PagedAttention, vAttention) manage per-request key/value pages, shard state across devices, or support demand paging to optimize cache locality.
- Decode Service: Autoregressive token generation, mapped to low-latency GPUs or even CPUs, providing one-token-at-a-time output with token-level batching.
- Output Projection Service: Optionally isolated as a thin GEMM microservice.
Microservice-pipeline disaggregation (the prefill/decode split) is foundational, but further disaggregation includes per-layer/component fan-out (pipeline parallelism), expert-service routing in MoE models, and sharded attention-head execution (Li et al., 17 Jul 2024).
Example Disaggregated Pipeline Steps
| Substage | Typical Execution Target | Key Service Characteristics |
|---|---|---|
| Tokenization | CPU | Fixed-latency, handoff to embedding |
| Embedding Lookup | CPU/GPU | Batched RPCs, pipelined |
| Transformer Layer(s) | GPU cluster/node | Blockwise pipeline/fan-out/fan-in |
| KV-Cache Management | Specialized memory service | Paged, sharded, virtual-contiguous |
| Output Projection | GPU (thin GEMM service) | Often fused, but can be specialized |
| Decoding | Low-latency GPU | Token-granular, continuous batching |
Patterns such as pipeline parallelism, Mixture-of-Experts dynamic routing, fan-out/fan-in across attention heads or layers, and cluster-wide KV-cache sharding support further scale-out and fine-grained modularity (Li et al., 17 Jul 2024).
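To make the decomposition concrete, the following minimal sketch separates prefill and decode into distinct worker objects that exchange an explicit KV-cache handle. All class and method names (`PrefillWorker`, `DecodeWorker`, `KVCache`) are illustrative placeholders rather than APIs of any cited system; in a real deployment these calls would be RPCs and the cache would be transferred across devices or nodes.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative placeholders only; not APIs of any cited system.

@dataclass
class KVCache:
    """Per-request key/value state produced by prefill and consumed by decode."""
    request_id: str
    pages: List[list] = field(default_factory=list)   # one toy "page" per layer

class PrefillWorker:
    """One-shot prompt encoding, conceptually placed on a compute-optimized device."""
    def prefill(self, request_id: str, prompt_tokens: List[int]) -> KVCache:
        cache = KVCache(request_id)
        for layer in range(4):                          # stand-in for transformer layers
            cache.pages.append([(layer, t) for t in prompt_tokens])
        return cache

class DecodeWorker:
    """Autoregressive generation, conceptually placed on a latency-optimized device."""
    def decode(self, cache: KVCache, max_new_tokens: int) -> List[int]:
        out = []
        for step in range(max_new_tokens):
            next_token = (len(cache.pages[-1]) + step) % 32000   # dummy "model" step
            out.append(next_token)
            for page in cache.pages:                    # extend the KV cache per layer
                page.append(("gen", next_token))
        return out

# Disaggregated flow: in production the KVCache is what crosses the network
# (e.g., via RDMA) between the prefill fleet and the decode fleet.
kv = PrefillWorker().prefill("req-0", prompt_tokens=[101, 2023, 2003, 1037, 3231, 102])
print(DecodeWorker().decode(kv, max_new_tokens=8))
```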
2. Latency, Throughput, and Resource Optimization
Disaggregation introduces inter-service RPCs, network hops, and serialization boundaries, adding overhead at each substage. However, it enables substantial system-level gains:
- Throughput increases as microservices batch more aggressively over homogeneous work units (large prompt batches during prefill, small dynamic batches during decode).
- Tail Latency is dramatically reduced by phase isolation; for instance, TetriInfer reports up to 40% reduction in tail token-level latency using a two-level scheduling mechanism (Hu et al., 20 Jan 2024).
- Resource Utilization is enhanced by specialization: high-memory bandwidth or larger HBM GPUs are reserved for prefill, while low-latency, cost-efficient GPUs are assigned to decode (Li et al., 17 Jul 2024, Bournias et al., 8 Nov 2024).
- Cost-Efficiency is improved; Splitwise achieves up to 2× better GPU cost efficiency by specializing prefill on A100s and decode on T4s (Li et al., 17 Jul 2024).
A simple system-level latency model for a fully disaggregated inference request sums per-stage service times with the communication overheads incurred at each service boundary.
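A hedged reconstruction of such a model, with illustrative symbols that are not drawn from any single cited paper, is:

$$
T_{\text{total}} \approx T_{\text{tok}} + T_{\text{embed}} + T_{\text{prefill}} + T_{\text{KV-xfer}} + N_{\text{out}} \cdot T_{\text{decode}} + \sum_{i} T_{\text{RPC},i}
$$

where $T_{\text{KV-xfer}}$ is the KV-cache transfer time between the prefill and decode services, $N_{\text{out}}$ is the number of generated tokens, $T_{\text{decode}}$ is the per-token decode step time, and the final sum accumulates per-hop RPC and serialization overheads.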
Batching and scheduling are optimized under throughput and latency SLOs by trading off batch size and per-token computation/communication overheads (Li et al., 17 Jul 2024). Empirically, throughput and tail latency improvements are observed across production-like workloads (Li et al., 17 Jul 2024, Pan et al., 27 Jun 2025, Bournias et al., 8 Nov 2024).
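As a hedged numerical illustration of this batch-size trade-off (all constants below are invented placeholders, not measurements from the cited systems), per-request latency and aggregate decode throughput can be computed from the additive model above:

```python
# Illustrative only: every constant is a placeholder, not a measured value.

def decode_step_time(batch_size: int,
                     base_ms: float = 8.0,
                     per_seq_ms: float = 0.5) -> float:
    """Memory-bound decode step: roughly flat cost plus a small per-sequence term."""
    return base_ms + per_seq_ms * batch_size

def request_latency_ms(batch_size: int,
                       prefill_ms: float = 120.0,
                       kv_transfer_ms: float = 6.0,
                       new_tokens: int = 128) -> float:
    """End-to-end latency for one request under the simple additive model."""
    return prefill_ms + kv_transfer_ms + new_tokens * decode_step_time(batch_size)

def decode_throughput_tps(batch_size: int) -> float:
    """Aggregate decode throughput: tokens produced per second across the batch."""
    return batch_size * 1000.0 / decode_step_time(batch_size)

for b in (1, 8, 32, 128):
    print(f"batch={b:4d}  latency={request_latency_ms(b):8.1f} ms  "
          f"throughput={decode_throughput_tps(b):8.1f} tok/s")
```

Larger decode batches amortize the flat per-step cost and raise aggregate throughput, but lengthen each step and hence per-request latency, which is exactly the tension the throughput/latency SLOs constrain.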
3. Algorithmic Mechanisms and Practical Implementations
Several systems exemplify distinct algorithmic choices for disaggregated LLM inference:
- KVDirect: Implements distributed disaggregation by decoupling prefill and decode across nodes with a tensor-centric, RDMA-based KV cache transfer layer. Pull-based KV transfer allows decode workers to fetch cache data on-demand, reducing per-request latency by 55% compared to vLLM. Effective bandwidth utilization saturates the network link capacity, and decode-side GPU idle time is minimized (Chen et al., 13 Dec 2024).
- AcceLLM: Utilizes redundant KV-cache copies across paired accelerators to balance load, tolerating stragglers and smoothing tail time-between-tokens (TBT). The redundancy factor is set dynamically to fit available memory. Experimental results demonstrate up to 30% improvements in latency and efficiency, with near-ideal hardware utilization (Bournias et al., 8 Nov 2024).
- Harli: Addresses decode-phase underutilization by co-locating parameter-efficient finetuning tasks with inference decode, balancing memory and compute demands with unified memory management and QoS-constrained scheduling. Achieves up to 92% higher finetune throughput while maintaining strict decode latency constraints (Xu et al., 13 Nov 2025).
- TetriInfer: Enforces fixed-size, chunked prompt prefill and a two-level, resource-aware scheduler for prefill and decode assignment. Systematically partitions workloads, uses length-prediction buckets for anticipating decode resource needs, and reduces tail latency and overall resource cost (Hu et al., 20 Jan 2024).
- TD-Pipe: Temporally disaggregates pipeline parallelism, executing extended prefill and decode bursts to eliminate phase-switch bubbles. AI-based prefill, greedy memory simulation, inter-batch work stealing, and spatial-temporal intensity metrics yield throughput gains up to 2.7× over traditional pipeline approaches (Zhang et al., 12 Jun 2025).
These systems employ microservice orchestration via Kubernetes, workload-aware scheduling, pull- vs. push-mode KV transfer, dynamic batching, speculative decode, and both hardware-level (e.g., RDMA, NVLink) and software-level (e.g., paged attention) optimizations.
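The pull- vs. push-mode distinction can be sketched as follows. The registry and worker classes are hypothetical in-process stand-ins; systems such as KVDirect realize the pull with one-sided RDMA reads rather than local method calls.

```python
import queue
from typing import Dict, List, Tuple

class KVRegistry:
    """Toy stand-in for a metadata service mapping request_id -> (prefill node, KV handle)."""
    def __init__(self):
        self._entries: Dict[str, Tuple[str, list]] = {}
    def publish(self, request_id: str, node: str, kv_pages: list) -> None:
        self._entries[request_id] = (node, kv_pages)
    def lookup(self, request_id: str) -> Tuple[str, list]:
        return self._entries[request_id]

class PullModeDecoder:
    """Pull mode: the decode worker fetches KV pages on demand (cf. an RDMA read),
    so prefill never blocks on a slow or busy decoder."""
    def __init__(self, registry: KVRegistry):
        self.registry = registry
    def run(self, request_id: str) -> List[int]:
        _node, kv_pages = self.registry.lookup(request_id)   # "remote read" stand-in
        return [len(page) % 32000 for page in kv_pages]      # dummy decode over the cache

class PushModeDecoder:
    """Push mode: prefill pushes the KV payload into a decoder's inbox; the decoder drains it."""
    def __init__(self):
        self.inbox: "queue.Queue[Tuple[str, list]]" = queue.Queue()
    def run(self) -> List[int]:
        _request_id, kv_pages = self.inbox.get()
        return [len(page) % 32000 for page in kv_pages]

# Pull-mode flow: prefill publishes metadata; the decoder pulls when it has capacity.
registry = KVRegistry()
registry.publish("req-0", node="prefill-node-3", kv_pages=[[1, 2, 3], [4, 5]])
print(PullModeDecoder(registry).run("req-0"))

# Push-mode flow: prefill must pick a target decoder up front and push the payload.
push_decoder = PushModeDecoder()
push_decoder.inbox.put(("req-1", [[7, 8], [9]]))
print(push_decoder.run())
```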
4. Microarchitectural and Performance Analysis
Systematic GPU profiling and queueing analysis reveal the core rationale for disaggregation:
- Prefill is compute-bound: high GPU SM utilization (80–90%), high arithmetic intensity (a large FLOPs-per-byte ratio), large batched GEMMs, and good cache reuse.
- Decode is memory-bound: SM utilization falls to ≈30%, DRAM bandwidth saturates (200–300 GB/s), and the L2 hit rate drops below 40%, as per studies on Llama-3, Qwen2.5, and others (Wang et al., 1 Dec 2025).
- Network and memory architecture become decisive bottlenecks under disaggregation, especially as KV caches are transferred at multi-GB/s rates and require careful attention management (paging/sharding vs. monolithic residency) (Chen et al., 13 Dec 2024, Li et al., 17 Jul 2024).
- Disaggregated systems enable phase-aware placement, such as placing prefill on compute-optimized (FLOP-rich) GPUs and decode on memory-rich or latency-optimized GPUs (Li et al., 17 Jul 2024, Kumar et al., 16 Oct 2025).
- Energy consumption is decode-dominated; strategies such as output-projection quantization or cache locality grouping directly target the memory wall in decode (Wang et al., 1 Dec 2025).
The move to decoupled microservices also facilitates fine-grained fault isolation, enables elastic scaling per phase, and permits queueing-theoretic load-balancing algorithms (power-of-two-choices, cache-aware assignment) (Li et al., 17 Jul 2024, Pan et al., 27 Jun 2025).
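A back-of-the-envelope roofline estimate makes this compute-bound/memory-bound asymmetry concrete. The hardware constants below are rough placeholders for a datacenter-class GPU, and the model uses the common ~2 FLOPs-per-parameter-per-token approximation with weight traffic as the dominant byte count; the numbers are illustrative, not profiled values from the cited studies.

```python
# Rough roofline sketch; all hardware numbers are illustrative placeholders.

PEAK_TFLOPS = 300.0          # assumed dense compute peak (TFLOP/s)
HBM_BW_GBS = 2000.0          # assumed HBM bandwidth (GB/s)
RIDGE = PEAK_TFLOPS * 1e12 / (HBM_BW_GBS * 1e9)   # FLOPs/byte at the roofline ridge point

def arithmetic_intensity(params_b: float, tokens_per_pass: int,
                         bytes_per_param: int = 2) -> float:
    """FLOPs per byte of weight traffic for one forward pass over `tokens_per_pass` tokens.
    Assumes ~2 FLOPs per parameter per token and that weights are re-read each pass."""
    flops = 2.0 * params_b * 1e9 * tokens_per_pass
    bytes_moved = params_b * 1e9 * bytes_per_param
    return flops / bytes_moved

for phase, tokens in [("prefill (2048-token prompt)", 2048),
                      ("decode (1 token, batch 1)", 1),
                      ("decode (1 token, batch 32)", 32)]:
    ai = arithmetic_intensity(params_b=70.0, tokens_per_pass=tokens)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(f"{phase:30s}  AI ≈ {ai:7.1f} FLOPs/byte  ({bound}, ridge ≈ {RIDGE:.0f})")
```

Under these assumptions a long prompt pushes prefill well past the ridge point, while single-token decode steps sit far below it even with moderate batching, which is why phase-aware hardware placement pays off.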
5. Case Studies and Comparative Evaluation
Empirical results across public systems and benchmarks illustrate the concrete benefits of disaggregation:
- TetriInfer: Yields up to 97% reduction in time-to-first-token (TTFT) and 47% lower job completion time (JCT) on mixed prefill/decode workloads versus monolithic baselines, with 38% resource cost reduction (Hu et al., 20 Jan 2024).
- KVDirect: Demonstrates 55% latency reduction (P90) on ArXiv workloads, with KV transfer constituting only 0.5–1.1% of total latency. Pull-mode transfer reshapes decode GPU idling dynamics, enhancing queue discipline (Chen et al., 13 Dec 2024).
- AcceLLM: Achieves up to 30% higher throughput and roughly 3× (300%) lower tail TBT than Splitwise/vLLM, with memory overhead kept to ≈5 GB per instance (Bournias et al., 8 Nov 2024).
- TD-Pipe: 1.91–2.73× throughput increase over tensor/pipeline parallel approaches on PCIe clusters, by removing pipeline bubbles via temporally disaggregated switching (Zhang et al., 12 Jun 2025).
- Harli: Maintains <40 ms decode time-per-output-token (TPOT) even while co-locating PEFT jobs, with near-theoretical GPU occupancy (Xu et al., 13 Nov 2025).
Distinctive mechanisms such as per-layer KV streaming, unified memory allocation, and chunked prefill batching support these results. For long-context or high-throughput serving scenarios, sharded KV management (RingAttention/Infinite-LLM) and dynamic cache paging (vAttention/PagedAttention) enable cluster-scale scalability with minimal software overhead (Li et al., 17 Jul 2024).
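As one example of these mechanisms, the paged KV-cache idea can be sketched with a toy allocator: the cache is carved into fixed-size pages and each request holds a page table instead of a contiguous buffer, so memory is allocated and reclaimed at page granularity. The class below is schematic and assumes a flat host-memory pool; production systems (e.g., PagedAttention/vAttention) manage device memory and track per-layer, per-head blocks.

```python
from typing import Dict, List

class PagedKVPool:
    """Toy paged KV-cache allocator: fixed-size pages, per-request page tables."""
    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.page_tables: Dict[str, List[int]] = {}   # request_id -> physical page ids
        self.lengths: Dict[str, int] = {}             # tokens stored per request

    def append_token(self, request_id: str) -> int:
        """Reserve room for one more token's KV entries, allocating a page on a boundary."""
        used = self.lengths.get(request_id, 0)
        if used % self.page_size == 0:                # current page full (or first token)
            if not self.free_pages:
                raise MemoryError("KV pool exhausted; evict or preempt a request")
            self.page_tables.setdefault(request_id, []).append(self.free_pages.pop())
        self.lengths[request_id] = used + 1
        return self.page_tables[request_id][used // self.page_size]

    def release(self, request_id: str) -> None:
        """Return all pages of a finished request to the free list."""
        self.free_pages.extend(self.page_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

pool = PagedKVPool(num_pages=8, page_size=16)
for _ in range(40):                 # 40 tokens -> ceil(40/16) = 3 pages
    pool.append_token("req-0")
print(pool.page_tables["req-0"])    # three physical page ids from the free list
pool.release("req-0")
```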
6. Open Challenges and Future Directions
Despite their operational advantages, disaggregated LLM inference systems face significant unresolved problems:
- Network Overhead: High-volume KV cache transfers (often tens of MB/request) stress PCIe/NVLink fabrics; sustained scaling demands further protocol and hardware refinement (Pan et al., 27 Jun 2025).
- Adaptive Autoscaling: Jointly tracking SLOs and optimally adjusting prefill/decode fleet sizes remains unsolved at production scale; current practice relies on custom orchestration (Pan et al., 27 Jun 2025, Kumar et al., 16 Oct 2025).
- Fault Tolerance: Exactly-once semantics, stateful failover, and cache-consistency protocols are non-trivial when prefills and decoders fail independently (Pan et al., 27 Jun 2025).
- Cache Persistence and Sharing: Indexing and managing KV entries across multi-tenant workloads and requests challenge cache management primitives and raise safety/isolation considerations (Pan et al., 27 Jun 2025, Kumar et al., 16 Oct 2025).
- Phase Prediction and Load Estimation: Highly accurate output-length and resource-prediction models are critical for effective scheduling but generalize poorly across prompt and model heterogeneity (Li et al., 17 Jul 2024, Hu et al., 20 Jan 2024).
- Integration with Vertical Scaling and Serverless: Smooth migration between disaggregated, monolithic, and serverless deployments remains an ecosystem bottleneck (Pan et al., 27 Jun 2025).
Further optimization opportunities exist in hardware-software co-design for RDMA fabrics, speculative decode to overlap phases, and hierarchical multi-tier caching regimes for KV management at trillion-parameter scales.
Disaggregated LLM inference represents a paradigm shift in LLM serving, delivering substantial improvements in latency, throughput, and cost by matching the compute/memory/network characteristics of each inference substage to its optimal hardware/software microservice (Li et al., 17 Jul 2024, Chen et al., 13 Dec 2024, Bournias et al., 8 Nov 2024, Xu et al., 13 Nov 2025, Pan et al., 27 Jun 2025, Kumar et al., 16 Oct 2025, Wang et al., 1 Dec 2025). Modularity, fine-grained scheduling, dynamic load balancing, and resource-specialized allocation are now established as necessary foundations for efficient, scalable, production-grade LLM inference.