Disaggregated Expert Parallelism

Updated 1 January 2026
  • Disaggregated Expert Parallelism is a paradigm that separates attention and feed-forward computations onto distinct hardware groups to improve efficiency.
  • It enables independent scaling and optimized scheduling of dense attention layers and sparse expert modules across heterogeneous resources.
  • DEP leverages advanced communication protocols, micro-batching, and fine-grained pipelining to reduce latency and boost throughput in large-scale deployments.

Disaggregated Expert Parallelism (DEP) is a paradigm for scaling sparse Mixture-of-Experts (MoE) model inference and training across heterogeneous hardware by physically separating attention and expert (feed-forward) computations onto distinct sets of compute resources. DEP supersedes conventional expert parallelism by enabling the independent scaling, scheduling, and placement of dense and sparse submodules of Transformer models, thereby extracting greater hardware efficiency and flexibility in large-scale deployments (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025, Pan et al., 25 Dec 2025). This article details DEP’s architectural motivation, system models, mathematical foundations, state-of-the-art scheduling and communication protocols, and the algorithmic innovations and performance trade-offs that characterize modern DEP workloads.

1. Architectural Principles and Motivation

Conventional MoE Transformers embed expert parallelism (EP) by partitioning expert modules (FFNs) across devices and using a gating mechanism to route each token to its top-$k$ selected experts. In standard EP, all compute groups share both attention and expert computation, which induces suboptimal resource utilization: attention is memory-bound due to large KV caches, while FFNs are compute-bound but underutilized because their per-token batch size is small (Zhu et al., 3 Apr 2025). DEP addresses this by mapping attention modules (and KV caches) exclusively to “Attention Groups” (AGs) and distributing experts to dedicated “Expert Groups” (EGs) (Pan et al., 25 Dec 2025). This partitioning enables:

  • Independent scaling of AG and EG according to workload and hardware constraints;
  • Specialized hardware allocation: assigning attention to memory-optimized nodes, experts to compute-optimized nodes (e.g., fast HBM GPUs, CPUs, or accelerators);
  • Reduced intra-group communication because each group is locally replicated or sharded as appropriate;
  • Opportunity for fine-grained, load-balanced parallelism via batching, pipelining, and scheduling.

The resulting design allows for large global batch aggregation at the attention stage and flexible token-to-expert mapping, supporting both homogeneous and heterogeneous resource pools (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025).
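
To make the disaggregated placement concrete, the sketch below describes a DEP deployment as plain Python dataclasses. The field names (devices, expert_to_device, interconnect, and so on) are illustrative placeholders and not the configuration schema of any cited system.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class AttentionGroup:
    """Memory-optimized devices holding dense layers, KV caches, and the router."""
    devices: List[str]                 # e.g. ["ag0:gpu0", "ag0:gpu1"]
    kv_cache_bytes_per_token: int      # drives memory-bound sizing of the group
    replicas: int = 1                  # AGs are typically replicated for throughput

@dataclass
class ExpertGroup:
    """Compute-optimized devices hosting a partition of the experts."""
    devices: List[str]
    expert_to_device: Dict[int, str]   # expert id -> hosting device

@dataclass
class DEPDeployment:
    """Attention and expert groups scale independently, linked by a fast interconnect."""
    attention_groups: List[AttentionGroup]
    expert_groups: List[ExpertGroup]
    interconnect: str = "gpudirect-rdma"
```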

2. System-Level Organization and Execution Models

A DEP system is defined by the explicit physical disaggregation of attention and expert submodules and the associated communication model. The canonical DEP architecture includes:

  • Attention/Router Cluster (AG): Stores all dense model layers, maintains sequence KV-caches, and issues token-to-expert routing decisions. AG is typically replicated for high-throughput serving (Zhu et al., 3 Apr 2025, Feng et al., 5 Aug 2025).
  • Expert Cluster (EG): Hosts a partition of experts, each typically local to a single device, and applies expert-specific FFNs to incoming routed activations.
  • Global Controller: Orchestrates cross-cluster scheduling, backpressure, and event management, ensuring smooth routing and execution.
  • High-Performance Interconnect: Connects AG and EG, using GPUDirect RDMA, NVLink, or other mechanisms to provide low-latency, high-bandwidth activation transfer (Zhu et al., 3 Apr 2025).

Token inference proceeds by alternating (“ping-pong”) between AG and EG: attention layers in AG process incoming sequences, select and route tokens to their assigned experts in EG, experts process their batches, and results are returned for the next layer (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025). State-of-the-art systems (e.g., MegaScale-Infer, FinDEP) optimize for minimal transfer and maximize resource concurrency via micro-batching and fine-grained task partitioning (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025).
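
The alternating execution pattern can be summarized in a few lines of Python. The sketch below assumes hypothetical attend/route/ffn/combine methods on the group objects and an expert-id-to-group mapping; it illustrates the ping-pong data flow, not the actual kernels or transport of MegaScale-Infer or FinDEP.

```python
from typing import Dict, List

def dep_moe_layer(tokens, attention_group, expert_groups):
    """One DEP MoE layer: attention and routing on the AG, expert FFNs on the EGs."""
    # 1. Dense attention and gating run entirely on the attention group.
    hidden = attention_group.attend(tokens)
    routing = attention_group.route(hidden)           # per-token list of expert ids

    # 2. Dispatch (A2E): scatter routed activations toward the experts' host groups.
    per_expert: Dict[int, List] = {}
    for h, experts in zip(hidden, routing):
        for e in experts:
            per_expert.setdefault(e, []).append(h)

    # 3. Expert FFNs run in parallel on the expert groups (expert id -> hosting group).
    outputs = {e: expert_groups[e].ffn(e, batch) for e, batch in per_expert.items()}

    # 4. Combine (E2A): gather expert outputs back to the attention group, which then
    #    proceeds to the next layer, continuing the "ping-pong" pattern.
    return attention_group.combine(hidden, routing, outputs)
```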

3. Mathematical Performance Models and Scheduling

DEP’s runtime throughput and latency are governed by both computation (local GEMM, FFN, gating) and communication (cross-cluster scatter/gather, all-to-all) costs. The critical path for a token through a DEP MoE layer is:

$L = L_{\text{comp}} + L_{\text{comm}}$

where

  • $L_{\text{comp}} = T_{\text{attn}} + T_{\text{ffn}} + T_{\text{gate}} + \max_i T_{\text{exp},i}$, where $T_{\text{exp},i}$ is the time spent by expert $i$ (the slowest expert dominates);
  • $L_{\text{comm}} = \delta + \frac{R}{B} + \delta + \frac{R'}{B}$, with $R$ and $R'$ the routed activation byte volumes (dispatch and combine, respectively), $B$ the interconnect bandwidth, and $\delta$ the network latency (Feng et al., 5 Aug 2025). A direct transcription of this model is sketched below.
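
The latency model above transcribes directly into a small helper; all numeric values in the example call are placeholders for illustration only.

```python
def dep_layer_latency(t_attn, t_ffn, t_gate, t_experts,
                      r_bytes, r_prime_bytes, bandwidth, delta):
    """Critical-path latency L = L_comp + L_comm of one DEP MoE layer (model above)."""
    l_comp = t_attn + t_ffn + t_gate + max(t_experts)   # the slowest expert dominates
    l_comm = delta + r_bytes / bandwidth + delta + r_prime_bytes / bandwidth
    return l_comp + l_comm

# Placeholder numbers (seconds, bytes, bytes/s) purely for illustration:
print(dep_layer_latency(t_attn=1.2e-3, t_ffn=0.4e-3, t_gate=0.05e-3,
                        t_experts=[0.60e-3, 0.70e-3, 0.55e-3],
                        r_bytes=8e6, r_prime_bytes=8e6,
                        bandwidth=50e9, delta=5e-6))
```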

A micro-batch pipeline with $m$ in-flight tokens achieves steady-state throughput

$T_{\text{tokens/s}} \approx \frac{m}{\max_s L_s}$

where $L_s$ is the latency of stage $s$, including both compute and communication.

Fine-grained scheduling maximizes resource utilization by partitioning both computation and inter-group transfers into micro-tasks. The FinDEP algorithm splits the AG-side batch dimension into $r_1$ chunks and each micro-batch’s token dimension into $r_2$ chunks per expert, yielding $r_1 \times r_2$ mini-tasks per layer. The optimization schedules these to maximize overlap, subject to device memory, dependency, and resource constraints (Pan et al., 25 Dec 2025).
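
A toy illustration of this two-dimensional split follows: it brute-forces the chunk counts $(r_1, r_2)$ against an invented per-mini-task cost model and ranks them with a standard fill/drain pipeline estimate consistent with the steady-state throughput formula above. It is a simplified stand-in for FinDEP's scheduler, not a reproduction of it.

```python
def pipelined_makespan(n_tasks, stage_latencies):
    """Classic pipeline estimate: fill/drain once, then the slowest stage repeats."""
    return sum(stage_latencies) + (n_tasks - 1) * max(stage_latencies)

def search_split_factors(batch_tokens, stage_latency_fn, max_chunks=8):
    """Brute-force the (r1, r2) chunk counts; a toy stand-in for a real scheduler."""
    best = None
    for r1 in range(1, max_chunks + 1):          # chunks along the AG-side batch dim
        for r2 in range(1, max_chunks + 1):      # chunks along each micro-batch's tokens
            n = r1 * r2
            stages = stage_latency_fn(batch_tokens / n)   # per-mini-task stage costs
            tput = batch_tokens / pipelined_makespan(n, stages)
            if best is None or tput > best[0]:
                best = (tput, r1, r2)
    return best

def costs(tokens_per_task):
    """Invented stage costs [AG compute, A2E, EG compute, E2A] in seconds per mini-task."""
    return [1e-7 * tokens_per_task + 2e-5, 4e-8 * tokens_per_task + 1e-5,
            2e-7 * tokens_per_task + 2e-5, 4e-8 * tokens_per_task + 1e-5]

print(search_split_factors(batch_tokens=8192, stage_latency_fn=costs))
# -> best split around r1 = 3, r2 = 3 under this invented cost model
```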

4. Communication Protocols and Portability

DEP relies on efficient, architecture-agnostic communication primitives due to the non-co-located nature of AG and EG:

  • M2N Communication Libraries: Replacing traditional NCCL all-to-all with sparse, direct GPU-to-GPU RDMA primitives, eliminating intermediate CPU copies and group sync overhead (Zhu et al., 3 Apr 2025).
  • Control Channel Separation: As in UCCL-EP, only control commands (token routing) traverse PCIe to CPU proxies, which subsequently issue the appropriate GPUDirect RDMA data transfers, preserving high throughput while decoupling GPU/NIC integration (Mao et al., 22 Dec 2025).
  • Ordering Guarantees and Backpressure: On unordered networks (e.g., AWS EFA), UCCL-EP uses sequence IDs in RDMA immediate data and receiver-side reordering buffers to enforce token delivery order (Mao et al., 22 Dec 2025). A generic sketch of such a buffer follows this list.
  • Portability: CPU-driven control enables hardware and vendor portability without GPU kernel changes—demonstrated by robust performance on both NVIDIA and AMD GPUs over EFA and Broadcom NICs (Mao et al., 22 Dec 2025).
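
The following is a minimal, transport-agnostic sketch of a receiver-side reordering buffer keyed by sequence ID. It illustrates the idea only; it is not UCCL-EP's actual data structure and omits backpressure and flow control.

```python
class ReorderBuffer:
    """Deliver payloads in sequence-ID order even if the network reorders arrivals."""

    def __init__(self):
        self.next_seq = 0    # next sequence ID eligible for delivery
        self.pending = {}    # out-of-order arrivals parked by sequence ID

    def on_receive(self, seq_id, payload):
        """Park an arrival, then release every payload that is now in order."""
        self.pending[seq_id] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

buf = ReorderBuffer()
assert buf.on_receive(1, "tok-b") == []                    # arrived early, parked
assert buf.on_receive(0, "tok-a") == ["tok-a", "tok-b"]    # gap filled, both released
```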

The adoption of token-level batching and flow-controlled FIFOs further reduces sender congestion and supports elastic scaling in large deployments.

Representative Communication Performance Table

Platform / NIC       | Mode | Dispatch (μs) | Combine (μs) | Speedup
NVIDIA H200 + EFA    | LL   | 220           | 65           | 2.1×
NVIDIA H100 + Infini | HT   | 85            | 25           | ≈1.0×
AMD MI300X + Thor    | HT   | 105           | 30           | ≈1.0×

Key: LL = low-latency, HT = high-throughput; speedup is versus prior best baseline on platform (Mao et al., 22 Dec 2025).

5. Routing Algorithms, Load Balancing, and Specialization

Classic MoE routing induces expert load imbalance and substantial redundant communication in DEP due to indiscriminate token routing. Advanced strategies such as Collaboration-Constrained Routing (C2R) enforce co-activation specialization: for each token, after selecting the top-1 expert by gating, subsequent experts are picked only from a specialized Top-T group, defined per expert via co-activation statistics (Zhang et al., 2 Apr 2025).

C2R’s procedure, with a minimal code sketch after the list:

  1. Profile a co-activation matrix $C \in \mathbb{N}^{N \times N}$ from a calibration corpus.
  2. For each expert $i$, define the group $G(i)$ as its Top-$T$ co-activated experts.
  3. For token $x$, select $e_1 = \arg\max_i g_i(x)$, then choose the remaining $K-1$ experts from $G(e_1)$.
  4. Route tokens only to devices hosting $G(e_1)$.
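
A minimal NumPy sketch of this selection rule is shown below. The group construction and gating inputs are simplified and the function names are placeholders; see Zhang et al. (2 Apr 2025) for the full method.

```python
import numpy as np

def build_groups(coactivation, top_t):
    """G(i): the Top-T experts most frequently co-activated with expert i."""
    groups = {}
    for i in range(coactivation.shape[0]):
        scores = coactivation[i].astype(float).copy()
        scores[i] = -np.inf                        # an expert is not its own partner
        groups[i] = set(np.argsort(scores)[::-1][:top_t])
    return groups

def c2r_route(gate_scores, groups, k):
    """Top-1 expert is chosen freely; the remaining k-1 come only from G(e1)."""
    e1 = int(np.argmax(gate_scores))
    candidates = sorted(groups[e1], key=lambda e: gate_scores[e], reverse=True)
    return [e1] + [int(e) for e in candidates[: k - 1]]

# Toy example: 4 experts, co-activation counts C, Top-T = 2 partners, k = 2 routing.
C = np.array([[0, 5, 1, 0],
              [5, 0, 2, 1],
              [1, 2, 0, 4],
              [0, 1, 4, 0]])
groups = build_groups(C, top_t=2)
print(c2r_route(np.array([0.1, 0.6, 0.2, 0.1]), groups, k=2))   # -> [1, 2]
```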

C2R reduces all-to-all communication by up to 30%, cuts wall-clock time by 20–30% over baselines, improves accuracy by 0.33%–0.51% across LLaMA-MoE and Qwen, and improves expert load balance (Gini coefficient reduced from 0.42 to 0.28) (Zhang et al., 2 Apr 2025).

A plausible implication is that group-wise specialization reduces noisy collaboration, further curbing memory traffic and straggler effects in DEP.

6. Pipeline Parallelism, Scheduling, and Latency Hiding

To compensate for increased communication in DEP, state-of-the-art systems employ multi-stage pipelining:

  • Ping-Pong Pipelining: Batch partitions are alternately processed by AG then EG, with overlapping micro-batch compute and network transfer to hide communication time. With $m$ micro-batches, full overlap is achievable if

$m \geq 2(1 + T_c/T_f)$

where $T_f = \max(T_a, T_e)$ is the slower of the attention and expert compute times and $T_c$ is the network transfer latency (Zhu et al., 3 Apr 2025); a numeric sketch of this condition follows the list.

  • Fine-Grained Task Splitting: FinDEP’s two-dimensional split enables overlapping AG computation, A2E transfer, EG compute, and E2A return across $r_1 \times r_2$ mini-tasks, fully utilizing hardware at both ends (Pan et al., 25 Dec 2025).
  • Dynamic Scheduling: Real-time solvers select optimal $(r_1, r_2, m_a, m_e)$ to maximize throughput, subject to memory and bottleneck constraints; solver overhead is sub-second even for large systems.
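
As a numeric illustration of the overlap condition above, the helper below computes the smallest admissible micro-batch count $m$; the timing values in the example are invented.

```python
import math

def min_microbatches_for_overlap(t_attn, t_expert, t_comm):
    """Smallest integer m with m >= 2 * (1 + T_c / T_f), per the condition above."""
    t_f = max(t_attn, t_expert)     # T_f: the slower of the two compute stages
    return math.ceil(2 * (1 + t_comm / t_f))

# Invented timings: 1.0 ms attention, 0.8 ms expert compute, 0.3 ms network transfer.
print(min_microbatches_for_overlap(1.0e-3, 0.8e-3, 0.3e-3))    # -> 3
```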

Table: Throughput Results (tokens/s, normalized)

System | A6000 (Qwen3) | H20 × 32 (DeepSeek) | H20 × 32 (Qwen3)
PPPipe | 21.4k         | 120.8k              | 61.6k
FinDEP | 34.6k         | 132.1k              | 76.5k

FinDEP delivers up to 1.61× speedup at extreme sequence lengths and 1.24× on large 32-GPU clusters (Pan et al., 25 Dec 2025).

7. Practical Guidelines, Trade-offs, and Deployment

DEP unlocks new performance frontiers but also brings its own constraints.

Trade-offs include the risk of bandwidth saturation at high expert counts, memory pressure at large micro-batch degrees, and diminishing returns from pipelining at network bottlenecks. Real-world deployments must balance accuracy (expert specialization), efficiency, and cost by tuning disaggregation granularity and communication parameters.
