xDeepServe LLM Serving
- xDeepServe is a large language model serving system that disaggregates transformer components into independently scheduled modules for optimized inference.
- It employs innovative disaggregated execution models, including separate prefill–decode and MoE–attention strategies, to enhance load balancing and reduce latency.
- The system integrates advanced communication protocols and decentralized scheduling across NPUs to efficiently scale MoE architectures on supercomputing infrastructure.
xDeepServe is an LLM serving system for SuperPod-scale infrastructure developed by Huawei Cloud. It is architected to enable scalable, efficient, and reliable inference of contemporary and future LLMs, especially those leveraging large-scale Mixture-of-Experts (MoE) designs, on Ascend-based AI supercomputers such as the CloudMatrix384 SuperPod. Central to xDeepServe is its fully disaggregated "Transformerless" architecture, which isolates key transformer building blocks (attention, feedforward networks, and MoE experts) into independently scheduled computation modules, each mapped to dedicated Neural Processing Units (NPUs) and interconnected via high-bandwidth global shared memory. This approach addresses the challenges of resource scaling, load balancing, and synchronization inherent in deploying large MoE models on modern supercomputing fabrics (Xiao et al., 4 Aug 2025).
1. Disaggregated Architecture and Transformerless Design
xDeepServe employs a design in which transformer models are physically and logically decomposed into their constituent modules—specifically, the attention, feedforward (FFN), and MoE expert layers. Each module is dispatched to its own set of NPUs on the CloudMatrix384, exploiting the hardware’s hundreds of GB/s global shared memory interconnect. As a result, compute-bound and memory-bound phases are isolated and can be scaled or scheduled independently.
The architecture, termed "Transformerless," breaks from the conventional strategy of monolithic layerwise computation. The independence of each module allows, for example, the memory-intensive KV cache of attention modules to be managed on dedicated NPUs with large memory, while compute-intensive MoE expert layers are deployed across stateless NPUs optimized for throughput. The diagram in the source ("xDeepServe Architecture over CloudMatrix384 SuperPod") illustrates NPUs connected via global shared memory, each assigned to a specific model function.
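To make the module-level decomposition concrete, the following is a minimal sketch of how a placement of attention, FFN, and MoE-expert modules onto dedicated NPU groups might be expressed in configuration form. `ModuleKind`, `ModulePlacement`, and all layer counts, NPU IDs, and TP degrees are hypothetical illustrations, not the system's actual configuration format.

```python
from dataclasses import dataclass
from enum import Enum

class ModuleKind(Enum):
    ATTENTION = "attention"    # stateful: owns the KV cache, favors memory-rich NPUs
    DENSE_FFN = "dense_ffn"    # compute-bound dense feedforward layers
    MOE_EXPERT = "moe_expert"  # stateless expert shards, scaled for throughput

@dataclass
class ModulePlacement:
    kind: ModuleKind
    layers: range          # transformer layers handled by this module group
    npu_ids: list[int]     # dedicated NPUs within the SuperPod
    tensor_parallel: int   # TP degree chosen per module role

# Illustrative placement only: layer ranges, NPU counts, and TP degrees are invented.
placements = [
    ModulePlacement(ModuleKind.ATTENTION,  range(0, 61), npu_ids=list(range(0, 32)),   tensor_parallel=1),
    ModulePlacement(ModuleKind.DENSE_FFN,  range(0, 3),  npu_ids=list(range(32, 36)),  tensor_parallel=4),
    ModulePlacement(ModuleKind.MOE_EXPERT, range(3, 61), npu_ids=list(range(36, 164)), tensor_parallel=1),
]
```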
2. Disaggregated Execution Models
xDeepServe implements two principal execution strategies to maximize scalability and efficiency:
- Disaggregated Prefill–Decode: The compute-heavy “prefill” stage (which accommodates variable input lengths and dynamic graph execution) is separated from the memory-bound “decode” stage (with static, pre-compiled graphs). Differentiated tensor parallelism—e.g., TP=4 in prefill versus TP=1 in decode—aligns the hardware allocation to the phase-specific workload. Prefill NPUs generate KV cache blocks, passing only metadata and pointers to decode NPUs over the high-speed interconnect. Data movement between domains is mediated via a specialized DistFlow engine leveraging XCCL primitives.
- Disaggregated MoE–Attention: For large MoE models, MoE expert computation and attention (with its stateful KV cache) are offloaded to different NPU groups. Routing between attention and MoE occurs using bespoke all-to-all primitives (A2E and E2A). The “trampoline forward” mechanism aggregates data in a subset of expert NPUs, reducing fan-out overhead and distributing load before dispatching to all experts.
This paradigm allows modules to be independently scheduled, optimized, and scaled; contention between batch-sensitive (MoE) and sequence-sensitive (attention) operations is substantially mitigated.
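As an illustration of the prefill–decode handoff described above, the sketch below has the prefill side write KV blocks into global shared memory and ship only lightweight descriptors (addresses plus metadata) to the decode side. Every identifier here (`KVBlockDescriptor`, `run_attention_prefill`, `send_to_decode_group`) is a stand-in invented for the example, not part of the DistFlow or XCCL APIs.

```python
from dataclasses import dataclass

NUM_LAYERS = 61  # illustrative layer count

@dataclass
class KVBlockDescriptor:
    """Pointer-plus-metadata record for one KV-cache block resident in shared memory."""
    request_id: int
    layer: int
    block_addr: int   # location of the block in the global shared-memory address space
    num_tokens: int

def run_attention_prefill(layer: int, prompt_tokens: list[int]) -> list[tuple[int, int]]:
    """Stand-in for the TP=4 prefill kernels; pretend each layer emits one KV block."""
    return [(0x10_0000 * (layer + 1), len(prompt_tokens))]

def send_to_decode_group(descriptors: list[KVBlockDescriptor]) -> None:
    """Stand-in for the DistFlow-over-XCCL transfer: only metadata crosses the boundary."""
    pass

def prefill_and_handoff(request_id: int, prompt_tokens: list[int]) -> list[KVBlockDescriptor]:
    descriptors = []
    for layer in range(NUM_LAYERS):
        for addr, ntok in run_attention_prefill(layer, prompt_tokens):
            descriptors.append(KVBlockDescriptor(request_id, layer, addr, ntok))
    # The KV data itself never moves over the handoff path; decode NPUs later read the
    # blocks directly from shared memory using the descriptors received here.
    send_to_decode_group(descriptors)
    return descriptors
```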
3. Scalability, Performance, and Expert Load Balancing
Disaggregated modules mapped to dedicated NPUs ensure that compute and memory can be scaled independently; attention and MoE components are each optimized for their respective operational domains. System benchmarks in the source describe a deployment example for DeepSeek-R1/V3 that achieves 2400 tokens/s per chip (Ascend 910C), meets an average time-per-output-token (TPOT) of 50 ms, and sustains sub-100 ms per iteration even under global synchronization constraints.
A core challenge addressed is expert load balancing in the MoE layers, characterized by how tokens are distributed across experts. For each MoE layer and scheduling step, the system tracks the number of tokens routed to every expert and derives a load-imbalance metric from these counts.
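The exact expression is not reproduced in this summary; a plausible form, stated here purely as an assumption, is the ratio of the most heavily loaded expert to the mean expert load:

```latex
% Assumed notation: n_{\ell,t}^{(e)} is the number of tokens routed to expert e of
% layer \ell during scheduling step t, and E is the total number of experts.
\rho_{\ell,t} \;=\; \frac{\displaystyle\max_{e}\, n_{\ell,t}^{(e)}}
                         {\frac{1}{E}\sum_{e=1}^{E} n_{\ell,t}^{(e)}},
\qquad \rho_{\ell,t} \ge 1,\quad
\rho_{\ell,t} = 1 \iff \text{layer } \ell \text{ is perfectly balanced at step } t.
```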
Such a metric quantifies expert imbalance and guides token routing toward efficient utilization, minimizing tail latency across MoE dispatch and combine operations.
4. XCCL Communication Library and Memory-Semantic Primitives
A custom communication library, XCCL, underpins xDeepServe’s distributed execution. Designed for CloudMatrix384’s global shared memory, XCCL implements:
- Point-to-Point Primitives: Direct transmission of data (such as KV cache segments) between NPUs over the shared memory space.
- All-to-All Primitives: Expert parallelism operations—such as dispatch, combine, A2E, E2A—implemented via “memory-semantic” protocols.
- Memory Partitioning: Each NPU’s on-chip memory is subdivided into application data, managed data, and metadata areas. Metadata entries (32 bytes) encode eventID, chunkID, and tail pointers for each peer connection.
- Transfer Coordination: Dedicated hardware engines (e.g., MTE2/MTE3 and DMA) facilitate low-latency, bulk transfers. Protocol operation is event-driven: metadata updates trigger busy-polling for data availability and acknowledgments.
This architecture enables precise orchestration of data movement, minimizes dispatch latency (to microsecond levels), and matches the bandwidth and concurrency capabilities of CloudMatrix384.
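A rough sketch of the metadata-driven handshake described above follows. The 32-byte field layout (event ID, chunk ID, tail pointer, padding) and the polling deadline are assumptions made for the example; the source specifies only the entry size and field names.

```python
import struct
import time

# Assumed 32-byte layout for one XCCL metadata entry; the real field widths and
# ordering are not specified here, so this packing is illustrative only.
META_FMT = "<QQQQ"  # event_id, chunk_id, tail_ptr, reserved -> 4 x 8 bytes = 32 bytes
assert struct.calcsize(META_FMT) == 32

def write_meta(shared_mem: bytearray, offset: int,
               event_id: int, chunk_id: int, tail_ptr: int) -> None:
    """Sender side: publish the metadata entry after the bulk data has been written."""
    shared_mem[offset:offset + 32] = struct.pack(META_FMT, event_id, chunk_id, tail_ptr, 0)

def busy_poll(shared_mem: bytearray, offset: int, last_event: int,
              timeout_s: float = 1.0) -> tuple[int, int, int]:
    """Receiver side: spin on the metadata entry until a newer event_id appears."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        event_id, chunk_id, tail_ptr, _ = struct.unpack_from(META_FMT, shared_mem, offset)
        if event_id > last_event:       # new data published by the peer
            return event_id, chunk_id, tail_ptr
    raise TimeoutError("no new metadata entry observed")
```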
5. FlowServe Serving Engine and System-Level Optimizations
FlowServe, the serving engine at the core of xDeepServe, is redesigned to enable true disaggregation and highly parallel inference flows:
- Data Parallel (DP) Group Abstraction: Each group encapsulates tokenization, API parsing, SPMD execution, caching via the Relational Tensor Cache, and networking (via DistFlow over XCCL) in an isolated pipeline.
- Decentralized Scheduling: Distributed request scheduling and response handling remove global bottlenecks; a slowdown or fault in one DP group has no system-wide impact.
- Latency and Throughput Optimizations: Integration of proactive garbage collection, core pinning, persistent kernels for MoE operations, and expert load balancing via the EPLB method.
- Inference Accelerators: Techniques such as multi-token prediction (MTP) and INT8 quantization are natively supported, further increasing throughput and efficiency.
These architectural features collectively sustain high concurrency, robust fault isolation, and optimized resource allocation for LLM serving.
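The DP-group abstraction and decentralized scheduling can be sketched as below; `DPGroup` and its stage methods are hypothetical placeholders for the real pipeline components (tokenizer, Relational Tensor Cache, DistFlow-over-XCCL networking), shown only to illustrate that each group owns its own queue and scheduler.

```python
import queue
import threading
import time

class DPGroup:
    """One data-parallel group as a self-contained pipeline with its own scheduler,
    so a stall in this group cannot block any other group."""

    def __init__(self, group_id: int):
        self.group_id = group_id
        self.requests: queue.Queue = queue.Queue()  # per-group queue; no global scheduler

    def submit(self, prompt: str) -> None:
        self.requests.put(prompt)

    def serve_forever(self) -> None:
        while True:
            prompt = self.requests.get()
            tokens = self.tokenize(prompt)      # API parsing / tokenization stage
            output = self.spmd_execute(tokens)  # SPMD execution over the group's NPUs
            self.respond(output)                # response handling stays local to the group

    # Placeholder stages standing in for the real components.
    def tokenize(self, prompt: str) -> list[int]:
        return [ord(c) for c in prompt]

    def spmd_execute(self, tokens: list[int]) -> str:
        return f"group {self.group_id} handled {len(tokens)} tokens"

    def respond(self, output: str) -> None:
        print(output)

# Decentralized deployment: each group runs independently; a fault in one thread
# leaves the others serving.
groups = [DPGroup(i) for i in range(4)]
for g in groups:
    threading.Thread(target=g.serve_forever, daemon=True).start()
groups[0].submit("hello world")
time.sleep(0.1)
```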
6. Systemic Challenges and Solutions
xDeepServe directly addresses several challenges endemic to large-scale LLM serving:
- Synchronization and Interference: Disaggregation of attention, FFN, and MoE eliminates intra-layer blocking delays; “trampoline forward” and memory-semantics minimize synchronization bottlenecks.
- Load Imbalance: Real-time expert load metrics and XCCL’s memory-semantic primitives enable responsive token redistribution, limiting tail latency due to MoE imbalance.
- Resource Constraints and Scaling: Independent scaling of compute and memory domains permits tunable deployment footprints matching application SLAs.
- Reliability at Scale: Multi-tier heartbeat, link probing, and distributed scheduling frameworks ensure robust failure detection and fault recovery across hundreds of NPUs.
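As a rough illustration of the multi-tier failure-detection idea in the last item, the sketch below combines a fast intra-group heartbeat with a slower cluster-level link probe before declaring an NPU failed. The tier structure, timeout values, and function names are assumptions for the example, not the system's actual parameters.

```python
import time
from typing import Optional

# Illustrative deadlines only; the real thresholds are not given in the source.
FAST_TIMEOUT_S = 0.05   # intra-group heartbeat deadline
SLOW_TIMEOUT_S = 0.5    # cluster-level link-probe deadline

def npu_status(last_fast_beat: float, last_slow_probe: float,
               now: Optional[float] = None) -> str:
    """Classify an NPU from its most recent heartbeat and link-probe timestamps."""
    now = time.monotonic() if now is None else now
    fast_missed = now - last_fast_beat > FAST_TIMEOUT_S
    slow_missed = now - last_slow_probe > SLOW_TIMEOUT_S
    if fast_missed and slow_missed:
        return "failed"     # escalate: redistribute the group's in-flight requests
    if fast_missed or slow_missed:
        return "suspected"  # transient slowdown; keep probing before failover
    return "healthy"
```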
7. Future Directions
Anticipated research and engineering directions for xDeepServe include:
- Elimination of global synchronization points by evolving toward fully asynchronous, event-driven "dataflow serving."
- Decoupling of DP domains to minimize inference domain interference and further reduce collective delays.
- Extension and optimization of the XCCL protocol for even greater concurrency and lower latency as NPU cluster size grows.
- Continued innovation in quantization, prediction batching, and low-level inference accelerators to sustain performance under increasingly stringent SLAs.
- Advanced fault-tolerance and granular error recovery mechanisms to maintain system integrity despite intermittent hardware or network disruptions.
Summary Table: Core Features of xDeepServe on CloudMatrix384
| Feature | Description | Performance/Capability |
|---|---|---|
| Transformerless Disaggregation | Independent attention, FFN, and MoE modules | Modular scaling, low interference |
| XCCL Communication Library | Memory-semantic point-to-point and all-to-all primitives | μs-level dispatch, low overhead |
| FlowServe Engine | Decentralized DP-group scheduling and pipeline management | 2400 tokens/s/chip @ 50 ms TPOT |
| Disaggregated Prefill–Decode | Distinct compute/memory phase execution, TP matching | Efficient resource utilization |
| Disaggregated MoE–Attention | Separate NPUs for MoE experts and attention | Load balancing, minimal contention |
xDeepServe represents a domain-specific leap in scalable, modular model serving, leveraging hardware/network co-design and emerging memory-semantic communication protocols to support the next generation of large-scale, heterogeneous, and latency-sensitive LLM deployments (Xiao et al., 4 Aug 2025).