Disaggregated Expert Parallelism
- Disaggregated Expert Parallelism is a paradigm that separates attention and feed-forward computations onto distinct hardware groups to improve efficiency.
- It enables independent scaling and optimized scheduling of dense attention layers and sparse expert modules across heterogeneous resources.
- DEP leverages advanced communication protocols, micro-batching, and fine-grained pipelining to reduce latency and boost throughput in large-scale deployments.
Disaggregated Expert Parallelism (DEP) is a paradigm for scaling sparse Mixture-of-Experts (MoE) model inference and training across heterogeneous hardware by physically separating attention and expert (feed-forward) computations onto distinct sets of compute resources. DEP supersedes conventional expert parallelism by enabling the independent scaling, scheduling, and placement of dense and sparse submodules of Transformer models, thereby extracting greater hardware efficiency and flexibility in large-scale deployments (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025, Pan et al., 25 Dec 2025). This article details DEP’s architectural motivation, system models, mathematical foundations, state-of-the-art scheduling and communication protocols, and the algorithmic innovations and performance trade-offs that characterize modern DEP workloads.
1. Architectural Principles and Motivation
Conventional MoE Transformers embed expert parallelism (EP) by partitioning expert modules (FFNs) across devices and using a gating mechanism to route each token to its top-$k$ selected experts. In standard EP, all compute groups share both attention and expert computation, which induces suboptimal resource utilization: attention is memory-bound due to large KV caches, while FFNs are compute-bound but underutilized because their per-token batch size is small (Zhu et al., 3 Apr 2025). DEP addresses this by mapping attention modules (and KV caches) exclusively to “Attention Groups” (AGs) and distributing experts to dedicated “Expert Groups” (EGs) (Pan et al., 25 Dec 2025). This partitioning enables:
- Independent scaling of AG and EG according to workload and hardware constraints;
- Specialized hardware allocation: assigning attention to memory-optimized nodes, experts to compute-optimized nodes (e.g., fast HBM GPUs, CPUs, or accelerators);
- Reduced intra-group communication because each group is locally replicated or sharded as appropriate;
- Opportunity for fine-grained, load-balanced parallelism via batching, pipelining, and scheduling.
The resulting design allows for large global batch aggregation at the attention stage and flexible token-to-expert mapping, supporting both homogeneous and heterogeneous resource pools (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025).
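As a deliberately simplified illustration of this split, the sketch below models AG and EG pools that scale independently; the class and field names are hypothetical and not taken from any cited system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttentionGroup:
    """Memory-optimized nodes: dense layers, KV caches, and the router."""
    devices: List[str]                      # e.g. ["ag0:gpu0", "ag0:gpu1"]
    kv_cache_gb_per_device: float = 80.0    # HBM budget reserved for KV caches

@dataclass
class ExpertGroup:
    """Compute-optimized nodes: each hosts a shard of the expert FFNs."""
    devices: List[str]
    experts_per_device: int = 8             # expert shard size on each device

@dataclass
class DEPDeployment:
    """AG and EG scale independently; only routed activations cross the boundary."""
    attention_groups: List[AttentionGroup]
    expert_groups: List[ExpertGroup]

    def scale_experts(self, extra_devices: List[str]) -> None:
        # Experts can be rebalanced onto new compute-optimized nodes without
        # touching the attention replicas or their KV caches.
        self.expert_groups.append(ExpertGroup(devices=extra_devices))

deployment = DEPDeployment(
    attention_groups=[AttentionGroup(devices=["ag0:gpu0", "ag0:gpu1"])],
    expert_groups=[ExpertGroup(devices=["eg0:gpu0", "eg0:gpu1", "eg0:gpu2"])],
)
deployment.scale_experts(["eg1:gpu0"])      # EG grows; AG is untouched
```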
2. System-Level Organization and Execution Models
A DEP system is defined by the explicit physical disaggregation of attention and expert submodules and the associated communication model. The canonical DEP architecture includes:
- Attention/Router Cluster (AG): Stores all dense model layers, maintains sequence KV-caches, and issues token-to-expert routing decisions. AG is typically replicated for high-throughput serving (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025).
- Expert Cluster (EG): Hosts a partition of experts, each typically local to a single device, and applies expert-specific FFNs to incoming routed activations.
- Global Controller: Orchestrates cross-cluster scheduling, backpressure, and event management, ensuring smooth routing and execution.
- High-Performance Interconnect: Connects AG and EG, using GPUDirect RDMA, NVLink, or other mechanisms to provide low-latency, high-bandwidth activation transfer (Zhu et al., 3 Apr 2025).
Token inference proceeds by alternating (“ping-pong”) between AG and EG: attention layers in AG process incoming sequences, select and route tokens to their assigned experts in EG, experts process their batches, and results are returned for the next layer (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025). State-of-the-art systems (e.g., MegaScale-Infer, FinDEP) optimize for minimal transfer and maximize resource concurrency via micro-batching and fine-grained task partitioning (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025).
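The per-layer ping-pong loop can be sketched as follows; `attention_fn`, `gate_fn`, and `expert_fns` are hypothetical stand-ins for the AG-side kernels and the EG-side expert FFNs, and the dense NumPy scatter/gather here abstracts away the actual cross-cluster transfers.

```python
import numpy as np

def dep_moe_layer(hidden, attention_fn, gate_fn, expert_fns, top_k=2):
    """One DEP layer: attention + routing on the AG, expert FFNs on the EG.

    hidden      : (num_tokens, d_model) activations resident on the AG
    attention_fn: AG-side attention over the local KV cache
    gate_fn     : AG-side router producing (num_tokens, num_experts) scores
    expert_fns  : dict mapping expert_id -> FFN callable hosted on the EG
    """
    # --- AG side: dense attention and routing decision ---
    hidden = attention_fn(hidden)
    scores = gate_fn(hidden)
    topk = np.argsort(-scores, axis=1)[:, :top_k]       # chosen experts per token

    output = np.zeros_like(hidden)
    for expert_id, expert_fn in expert_fns.items():
        # --- A2E: scatter the tokens routed to this expert's device ---
        token_ids, _ = np.where(topk == expert_id)
        if token_ids.size == 0:
            continue
        # --- EG side: expert FFN applied to its routed micro-batch ---
        expert_out = expert_fn(hidden[token_ids])
        # --- E2A: gather gate-weighted expert outputs back to the AG ---
        weights = scores[token_ids, expert_id][:, None]
        np.add.at(output, token_ids, weights * expert_out)
    return output
```

In a real DEP system the scatter and gather are M2N RDMA transfers rather than local array indexing, and the per-expert work runs concurrently across the EG devices instead of in a Python loop.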
3. Mathematical Performance Models and Scheduling
DEP’s runtime throughput and latency are governed by both computation (local GEMM, FFN, gating) and communication (cross-cluster scatter/gather, all-to-all) costs. The critical path for a token through a DEP MoE layer is

$$T_{\text{layer}} = T_{\text{attn}} + T_{\text{A2E}} + T_{\text{expert}} + T_{\text{E2A}},$$

where

$$T_{\text{A2E}} = \frac{S_{\text{A2E}}}{B} + L_{\text{net}}, \qquad T_{\text{E2A}} = \frac{S_{\text{E2A}}}{B} + L_{\text{net}},$$

with $S_{\text{A2E}}$ / $S_{\text{E2A}}$ the routed activation byte volumes, $B$ the interconnect bandwidth, and $L_{\text{net}}$ the network latency (Feng et al., 5 Aug 2025).
A micro-batch pipeline with $m$ micro-batches of tokens in flight achieves steady-state throughput

$$\Theta = \frac{m}{T_{\text{stage}}},$$

where $T_{\text{stage}}$ is the stage latency, including both compute and communication.
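Under this simplified model the latency and throughput expressions can be evaluated directly; the numeric values below are illustrative placeholders, not measurements from the cited systems.

```python
def layer_latency_us(t_attn_us, t_expert_us, bytes_a2e, bytes_e2a,
                     bandwidth_gbps, net_latency_us):
    """Critical-path latency T_layer of one DEP MoE layer, in microseconds."""
    # 1 GB/s corresponds to 1e3 bytes per microsecond
    t_a2e = bytes_a2e / (bandwidth_gbps * 1e3) + net_latency_us
    t_e2a = bytes_e2a / (bandwidth_gbps * 1e3) + net_latency_us
    return t_attn_us + t_a2e + t_expert_us + t_e2a

def pipeline_throughput(m, t_stage_us):
    """Steady-state throughput (micro-batches per second) with m in flight."""
    return m / (t_stage_us * 1e-6)

# Illustrative numbers: 2 MB routed each way over a 50 GB/s link with 5 us latency.
t_layer = layer_latency_us(t_attn_us=120, t_expert_us=80,
                           bytes_a2e=2e6, bytes_e2a=2e6,
                           bandwidth_gbps=50, net_latency_us=5)
print(f"T_layer ~ {t_layer:.1f} us, "
      f"throughput ~ {pipeline_throughput(2, t_layer):.0f} micro-batches/s")
```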
Fine-grained scheduling maximizes resource utilization by partitioning both computation and inter-group transfers into micro-tasks. The FinDEP algorithm splits the AG-side batch dimension into $n_b$ chunks and each micro-batch’s token dimension into $n_t$ chunks per expert, yielding $n_b \times n_t$ mini-tasks per layer. The optimizer schedules these to maximize overlap, subject to device memory, dependency, and resource constraints (Pan et al., 25 Dec 2025).
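A minimal sketch of the two-dimensional split into mini-tasks follows; the chunk counts $n_b$ and $n_t$ and the task-tuple layout are illustrative assumptions, not FinDEP’s actual data structures.

```python
from itertools import product

def split_mini_tasks(batch_tokens, n_b, n_t):
    """Split the AG-side batch into n_b micro-batches and each micro-batch's
    per-expert token dimension into n_t chunks, yielding n_b * n_t mini-tasks
    per layer for the scheduler to overlap across the four stages."""
    tokens_per_micro_batch = batch_tokens // n_b
    tasks = []
    for mb, chunk in product(range(n_b), range(n_t)):
        tasks.append({
            "micro_batch": mb,                          # AG-side batch chunk
            "token_chunk": chunk,                       # per-expert token chunk
            "tokens": tokens_per_micro_batch // n_t,    # approximate chunk size
            "stages": ("ag_compute", "a2e_transfer", "eg_compute", "e2a_return"),
        })
    return tasks

tasks = split_mini_tasks(batch_tokens=8192, n_b=4, n_t=2)
print(len(tasks), "mini-tasks per layer")               # 8 = n_b * n_t
```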
4. Communication Protocols and Portability
DEP relies on efficient, architecture-agnostic communication primitives due to the non-co-located nature of AG and EG:
- M2N Communication Libraries: Replacing traditional NCCL all-to-all with sparse, direct GPU-to-GPU RDMA primitives, eliminating intermediate CPU copies and group sync overhead (Zhu et al., 3 Apr 2025).
- Control Channel Separation: As in UCCL-EP, only control commands (token routing) traverse PCIe to CPU proxies, which subsequently issue the appropriate GPUDirect RDMA data transfers, preserving high throughput while decoupling GPU/NIC integration (Mao et al., 22 Dec 2025).
- Ordering Guarantees and Backpressure: On unordered networks (e.g., AWS EFA), UCCL-EP uses sequence IDs in RDMA immediate data and receiver-side reordering buffers to enforce token delivery order (Mao et al., 22 Dec 2025).
- Portability: CPU-driven control enables hardware and vendor portability without GPU kernel changes—demonstrated by robust performance on both NVIDIA and AMD GPUs over EFA and Broadcom NICs (Mao et al., 22 Dec 2025).
The adoption of token-level batching and flow-controlled FIFOs further reduces sender congestion and supports elastic scaling in large deployments.
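The sequence-ID reordering described above can be sketched as a small receiver-side buffer; this is a toy illustration under stated assumptions, not the UCCL-EP RDMA receive path.

```python
class ReorderBuffer:
    """Delivers token chunks in sequence-ID order even when the underlying
    network (e.g. an unordered fabric) delivers packets out of order."""

    def __init__(self):
        self.expected = 0     # next sequence ID to deliver
        self.pending = {}     # out-of-order chunks keyed by sequence ID

    def receive(self, seq_id, payload):
        """Buffer an arriving chunk and return everything now deliverable in order."""
        self.pending[seq_id] = payload
        ready = []
        while self.expected in self.pending:
            ready.append(self.pending.pop(self.expected))
            self.expected += 1
        return ready

buf = ReorderBuffer()
print(buf.receive(1, "chunk-1"))   # []                        -- held back
print(buf.receive(0, "chunk-0"))   # ['chunk-0', 'chunk-1']    -- in-order release
```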
Representative Communication Performance Table
| Platform / NIC | Mode | Dispatch (μs) | Combine (μs) | Speedup |
|---|---|---|---|---|
| NVIDIA H200 + EFA | LL | 220 | 65 | 2.1× |
| NVIDIA H100 + InfiniBand | HT | 85 | 25 | ≈1.0× |
| AMD MI300X + Thor | HT | 105 | 30 | ≈1.0× |
Key: LL = low-latency, HT = high-throughput; speedup is versus the prior best baseline on each platform (Mao et al., 22 Dec 2025).
5. Routing Algorithms, Load Balancing, and Specialization
Classic MoE routing induces expert load imbalance and substantial redundant communication in DEP due to indiscriminate token routing. Advanced strategies such as Collaboration-Constrained Routing (C2R) enforce co-activation specialization: for each token, after selecting the top-1 expert by gating, subsequent experts are picked only from a specialized Top-T group, defined per expert via co-activation statistics (Zhang et al., 2 Apr 2025).
C2R’s procedure (see the sketch after this list):
- Profile a co-activation matrix $C$ from a calibration corpus, where $C_{ij}$ counts how often experts $i$ and $j$ are selected together.
- For each expert $e$, define its group $G(e)$ as the Top-$T$ experts by co-activation with $e$.
- For each token $x$, select the top-1 expert $e_1 = \arg\max_e g_e(x)$ by the gate, then pick the remaining $k-1$ experts only from $G(e_1)$.
- Only route tokens to devices hosting $\{e_1\} \cup G(e_1)$.
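A minimal sketch of the C2R selection rule, assuming a precomputed co-activation count matrix and per-token gating scores; the function names and toy data are illustrative.

```python
import numpy as np

def build_groups(coactivation, T):
    """For each expert e, its Top-T most frequently co-activated experts."""
    groups = {}
    for e in range(coactivation.shape[0]):
        order = np.argsort(-coactivation[e])
        groups[e] = [int(j) for j in order if j != e][:T]
    return groups

def c2r_route(gate_scores, groups, k=2):
    """Pick the top-1 expert freely, then the remaining k-1 only from its group."""
    e1 = int(np.argmax(gate_scores))
    rest = sorted(groups[e1], key=lambda e: -gate_scores[e])[:k - 1]
    return [e1] + rest

# Toy example: 4 experts, co-activation counts from a calibration corpus.
coact = np.array([[0, 9, 1, 2],
                  [9, 0, 3, 1],
                  [1, 3, 0, 7],
                  [2, 1, 7, 0]])
groups = build_groups(coact, T=2)                      # expert 0 -> [1, 3]
print(c2r_route(np.array([0.5, 0.1, 0.3, 0.1]), groups, k=2))   # [0, 1]
```

Restricting the remaining selections to $G(e_1)$ is what allows co-activated experts to be co-located, which is the source of the communication savings reported above.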
C2R achieves up to a 30% reduction in all-to-all communication, a 20–30% reduction in wall-clock time over baselines, improved accuracy (+0.33%–0.51% across LLaMA-MoE and Qwen), and better expert load balancing (Gini coefficient decreased from 0.42 to 0.28) (Zhang et al., 2 Apr 2025).
A plausible implication is that group-wise specialization reduces noisy collaboration, further curbing memory traffic and straggler effects in DEP.
6. Pipeline Parallelism, Scheduling, and Latency Hiding
To compensate for increased communication in DEP, state-of-the-art systems employ multi-stage pipelining:
- Ping-Pong Pipelining: Batch partitions are alternately processed by AG then EG, with overlapping micro-batch compute and network transfer to hide communication time. With $m$ micro-batches, full overlap is achievable if $(m-1)\,T_{\text{comp}} \ge T_{\text{comm}}$, where $T_{\text{comp}}$ is the per-micro-batch compute time, $T_{\text{comm}} = S/B + L_{\text{net}}$ is the per-micro-batch transfer time, and $L_{\text{net}}$ is the network latency (Zhu et al., 3 Apr 2025); see the sketch after this list.
- Fine-Grained Task Splitting: FinDEP’s two-dimensional split enables overlapping AG computation, A2E transfer, EG compute, and E2A return across mini-tasks, fully utilizing hardware at both ends (Pan et al., 25 Dec 2025).
- Dynamic Scheduling: Real-time solvers select the optimal partition $(n_b, n_t)$ and mini-task schedule to maximize throughput, subject to memory and bottleneck constraints; solver overhead is sub-second even for large systems (Pan et al., 25 Dec 2025).
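The helper below illustrates the overlap condition from the ping-pong item, choosing the smallest number of micro-batches $m$ that hides per-micro-batch communication behind compute; it is a sketch under the simplified latency model above, not the MegaScale-Infer scheduler itself.

```python
def min_micro_batches(t_comp_us, bytes_per_mb, bandwidth_gbps, net_latency_us,
                      max_m=16):
    """Smallest m such that (m - 1) * T_comp >= T_comm for one micro-batch.

    t_comp_us   : per-micro-batch compute time on either group (microseconds)
    bytes_per_mb: routed activation bytes per micro-batch
    """
    t_comm = bytes_per_mb / (bandwidth_gbps * 1e3) + net_latency_us
    for m in range(2, max_m + 1):
        if (m - 1) * t_comp_us >= t_comm:
            return m, t_comm
    return None, t_comm   # communication cannot be hidden within max_m

m, t_comm = min_micro_batches(t_comp_us=60, bytes_per_mb=4e6,
                              bandwidth_gbps=100, net_latency_us=5)
print(m, round(t_comm, 1))   # 2 micro-batches hide 45.0 us of communication
```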
Table: Throughput Results (tokens/s, normalized)
| System | A6000 (Qwen3) | H20 × 32 (DeepSeek) | H20 × 32 (Qwen3) |
|---|---|---|---|
| PPPipe | 21.4k | 120.8k | 61.6k |
| FinDEP | 34.6k | 132.1k | 76.5k |
FinDEP achieves up to a 1.61× speedup over PPPipe at extreme sequence lengths and up to 1.24× on large 32-GPU clusters (Pan et al., 25 Dec 2025).
7. Practical Guidelines, Trade-offs, and Deployment
DEP unlocks new performance frontiers but brings its own constraints:
- Shard size: 4–8 experts per shard minimizes load imbalance; larger shards risk stragglers (Feng et al., 5 Aug 2025).
- Batching: Micro-batch size must match link queue depth to hide start-up delays; oversized batches overflow GPU memory (Feng et al., 5 Aug 2025).
- Pipeline Depth: Optimize number of stages to avoid single-cluster bottlenecks; two-stage pipelines suffice for most DEP topologies (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025).
- Network Sizing: Provision bandwidth to keep communication below 20% of critical-path latency (Feng et al., 5 Aug 2025, Zhu et al., 3 Apr 2025).
- Expert co-location: Group experts with high historical co-activation and place them on proximate hardware to further limit cross-domain traffic (Zhang et al., 2 Apr 2025).
- Portability: Favor control-plane communication (CPU proxy) over tight GPU-to-NIC coupling for broad hardware support (Mao et al., 22 Dec 2025).
Trade-offs include the risk of bandwidth saturation at high expert counts, memory pressure at large micro-batch degrees, and diminishing returns from pipelining at network bottlenecks. Real-world deployments must balance accuracy (expert specialization), efficiency, and cost by tuning disaggregation granularity and communication parameters.
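The guidelines above can be collected into a simple configuration sanity check; the class, fields, and thresholds mirror the bullet list and are otherwise hypothetical, not part of any cited system.

```python
from dataclasses import dataclass

@dataclass
class DEPConfig:
    experts_per_shard: int          # guideline: 4-8 experts per shard
    micro_batch_tokens: int
    link_queue_depth_tokens: int
    pipeline_stages: int            # two stages suffice for most topologies
    comm_us: float                  # per-layer communication on the critical path
    critical_path_us: float

    def check(self):
        """Flag settings that violate the deployment guidelines above."""
        warnings = []
        if not 4 <= self.experts_per_shard <= 8:
            warnings.append("shard size outside the 4-8 expert guideline")
        if self.micro_batch_tokens > self.link_queue_depth_tokens:
            warnings.append("micro-batch exceeds link queue depth")
        if self.comm_us > 0.2 * self.critical_path_us:
            warnings.append("communication exceeds 20% of critical-path latency")
        return warnings

cfg = DEPConfig(experts_per_shard=6, micro_batch_tokens=4096,
                link_queue_depth_tokens=2048, pipeline_stages=2,
                comm_us=70.0, critical_path_us=290.0)
print(cfg.check())   # ['micro-batch exceeds link queue depth',
                     #  'communication exceeds 20% of critical-path latency']
```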
References
- "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design" (Zhang et al., 2 Apr 2025)
- "Frontier: Simulating the Next Generation of LLM Inference Systems" (Feng et al., 5 Aug 2025)
- "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" (Zhu et al., 3 Apr 2025)
- "Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism" (Pan et al., 25 Dec 2025)
- "UCCL-EP: Portable Expert-Parallel Communication" (Mao et al., 22 Dec 2025)
- "A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training" (Singh et al., 2023)