Disaggregated Expert Parallelism in MoE Systems

Updated 2 March 2026
  • Disaggregated Expert Parallelism is a design paradigm that decouples MoE layers into independent components, allowing targeted scheduling and scalable deployment.
  • It leverages dynamic expert routing, operator mapping optimization, and pipelined micro-batching to reduce communication bottlenecks and improve throughput.
  • DEP enhances performance by balancing memory, compute, and bandwidth constraints, resulting in significant speedups under heterogeneous, distributed settings.

Disaggregated Expert Parallelism (DEP) denotes a class of system and algorithmic techniques developed to decouple and independently orchestrate the deployment and scheduling of "expert" parameters and compute operators in Mixture-of-Experts (MoE) architectures across distributed hardware. Originally motivated by the communication, memory, and utilization bottlenecks of conventional expert parallelism as MoE models and cluster scales outgrew single data centers, DEP spans a continuum of design points—including attention-FFN disaggregation, spatial expert partitioning, dynamic expert routing, hybrid parallel strategies, and cross-hardware scheduling. These approaches generalize standard expert parallelism by exposing structured control over expert-to-device mapping, communication patterns, and task overlap, thereby enabling scalable MoE systems under memory, bandwidth, and hardware heterogeneity constraints.

1. System Architectures and Problem Setting

DEP frameworks address the decomposability of MoE layers into attention (dense) and expert (sparse FFN) submodules, leveraging the fact that only a small subset of experts is activated per token (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025). This decomposition allows for disparate parallel and memory strategies for each module:

  • Attention group (AG): Holds all self-attention and gating operators, often replicated for bandwidth-bound KV cache workloads.
  • Expert group (EG): Holds FFN experts, statically or dynamically partitioned across distributed GPU or accelerator nodes.
  • Inter-group communication: Realized as directed collectives (A2E, E2A), all-to-all, or many-to-N (M2N) RPC protocols.

In training, DEP mitigates the cost of expert-parallel all-to-all communication (which grows combinatorially with device and expert counts) by exploiting MoE sparsity, adaptive module placement (e.g., HybridEP's expert domain partitioning (Yang et al., 22 Oct 2025)), and compression of expert migrations. For inference, variants—such as MegaScale-Infer and FinDEP—disaggregate attention and experts for independent scaling and task overlap to maximize throughput, especially under memory-bound and network-limited scenarios (Zhu et al., 3 Apr 2025, Pan et al., 25 Dec 2025).
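
The attention-to-expert (A2E) dispatch implied by this split can be sketched in a few lines. The following is a minimal illustration, assuming a plain top-k gate and a static expert-to-device table; the function names (`top_k_route`, `build_dispatch`) and the data layout are hypothetical and are not taken from any of the cited systems.

```python
from collections import defaultdict

def top_k_route(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]

def build_dispatch(token_scores, k, expert_to_device):
    """Group token ids by the expert-group device that must receive them.

    token_scores[t][e] is the gate score of expert e for token t.
    expert_to_device[e] is the device holding expert e (assumed static here).
    """
    dispatch = defaultdict(set)
    for tok, scores in enumerate(token_scores):
        for expert in top_k_route(scores, k):
            # A token is sent once per destination device, even if several
            # of its selected experts live on that device.
            dispatch[expert_to_device[expert]].add(tok)
    return {dev: sorted(toks) for dev, toks in dispatch.items()}

if __name__ == "__main__":
    scores = [[0.1, 0.7, 0.2, 0.0],    # token 0 selects experts 1 and 2
              [0.5, 0.4, 0.05, 0.05]]  # token 1 selects experts 0 and 1
    print(build_dispatch(scores, k=2, expert_to_device=[0, 0, 1, 1]))
```

Because only k of the experts are selected per token, each attention node typically contacts a subset of the expert devices, which is the sparsity that disaggregated designs try to exploit.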

2. Communication Models and Scheduling Strategies

The communication and computation model in DEP follows from the specialized patterns induced by MoE routing:

  • All-to-All (A2A): Each device must route activations/tokens to the devices holding the selected experts. Standard EP incurs high traffic, which is a key bottleneck in cross-data center deployments (Yang et al., 22 Oct 2025, Liu et al., 10 Feb 2026).
  • All-Gather (AG, not to be confused with the attention group above) / Expert migration: HybridEP and similar algorithms introduce "expert migration," moving expert parameters instead of token data. The transfer is modeled as a one-shot all-gather collective that is often cheaper under large domains and sparse activations.
  • Hybrid or M2N Communication: To reduce GPU-idle time, MegaScale-Infer’s M2N protocol and ping-pong pipeline parallelism slice batches into micro-batches and coordinate attention-expert traffic to achieve full overlap of compute and communication (Zhu et al., 3 Apr 2025).

Scheduling within DEP systems often involves mixed-integer programming (MIP), ILP, or custom greedy algorithms to balance device compute, memory, and communication, as in HAP (Lin et al., 26 Aug 2025), HD-MoE (Huang et al., 11 Sep 2025), and operator-level planners (She et al., 12 Mar 2025). Fine-grained pipelining—partitioning each atomic task (attention, expert, communication)—is central to algorithms like FinDEP (Pan et al., 25 Dec 2025), which support shared experts and two-dimensional micro-batching for maximal GPU and bandwidth utilization.
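
To make the benefit of micro-batch overlap concrete, the sketch below computes the finish time of an in-order pipeline whose stages are separate resources (attention compute, A2E transfer, expert compute, E2A transfer). It is a generic flow-shop recurrence under the assumption that every micro-batch takes the same time on a given stage; it is not the scheduler of MegaScale-Infer or FinDEP, and the stage times are made-up numbers.

```python
def pipeline_makespan(stage_times, num_microbatches):
    """Finish time of the last micro-batch in an in-order pipeline.

    stage_times[s] is the time one micro-batch spends on stage s. Each
    stage is a distinct resource that serves one micro-batch at a time.
    """
    free_at = [0.0] * len(stage_times)  # when each stage next becomes free
    for _ in range(num_microbatches):
        ready = 0.0  # when this micro-batch has left the previous stage
        for s, duration in enumerate(stage_times):
            start = max(ready, free_at[s])
            free_at[s] = start + duration
            ready = free_at[s]
    return free_at[-1]

def serial_time(stage_times, num_microbatches):
    """Baseline with no overlap: every stage waits for the previous one."""
    return num_microbatches * sum(stage_times)

if __name__ == "__main__":
    # Hypothetical per-micro-batch times: attention, A2E, expert, E2A.
    stages = [2.0, 1.0, 2.0, 1.0]
    print(pipeline_makespan(stages, 4), serial_time(stages, 4))
```

With these illustrative numbers the overlapped schedule finishes in half the serial time, because the slowest stage stays busy while the others run in its shadow.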

3. Communication-Reducing and Placement Techniques

DEP research has produced multiple strategies to minimize aggregate communication volume, transmission frequency, and latency:

  • Dynamic domain carving: HybridEP computes a fractional parameter p ∈ [0,1] that determines whether data or expert migration is optimal, based on closed-form cost modeling of per-iteration computation and communication. Expert domains are adaptively sized; within a domain, all-gather (AG) suffices, while A2A is used between domains (Yang et al., 22 Oct 2025).
  • Collaboration-constrained expert routing: C2R restricts the set of experts activatable per token by learning and then constraining the routing to specialized "cliques," significantly reducing the number of device-to-device transfers per token (Zhang et al., 2 Apr 2025).
  • Operator mapping optimization: Automatic planners (She et al., 12 Mar 2025) assign expert/gate/combine operations to devices for minimal makespan, subject to memory and bandwidth constraints. The formulations natively handle arbitrary, branch-rich MoE topologies.
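
The data-versus-expert migration trade-off can be illustrated with a simple per-device traffic comparison. This is only a sketch under stated assumptions (uniform routing, one dispatch and one combine transfer per token-expert pair, no compression, no latency term); it is not HybridEP's closed-form cost model, and all function and parameter names are hypothetical.

```python
def a2a_bytes(tokens, top_k, hidden, dtype_bytes, domain_size):
    """Bytes one device sends for dispatch plus combine.

    Assumes uniform routing, so a fraction (D - 1) / D of the selected
    experts are remote.
    """
    remote_fraction = (domain_size - 1) / domain_size
    return 2 * tokens * top_k * hidden * dtype_bytes * remote_fraction

def allgather_bytes(experts_per_device, expert_param_bytes, domain_size):
    """Bytes one device receives to hold every expert in its domain."""
    return (domain_size - 1) * experts_per_device * expert_param_bytes

def cheaper_strategy(tokens, top_k, hidden, dtype_bytes, domain_size,
                     experts_per_device, expert_param_bytes):
    """Pick the option with the smaller per-device traffic in this model."""
    data_cost = a2a_bytes(tokens, top_k, hidden, dtype_bytes, domain_size)
    expert_cost = allgather_bytes(experts_per_device, expert_param_bytes,
                                  domain_size)
    return "migrate_experts" if expert_cost < data_cost else "move_tokens"

if __name__ == "__main__":
    expert_bytes = 3 * 4096 * 1024 * 2  # hypothetical small FFN expert, 16-bit
    for tokens in (512, 8192):
        print(tokens, cheaper_strategy(tokens, 2, 4096, 2, 4, 1, expert_bytes))
```

In this toy model, moving tokens wins for small per-device batches and moving experts wins once activation traffic exceeds the size of the parameters to gather; a real planner would also account for link bandwidth, latency, and compute overlap.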

Table: Selected Communication Minimization Mechanisms

Approach | Principle | Reduction Mechanism
HybridEP (Yang et al., 22 Oct 2025) | Dynamic expert domain sizing | Mix A2A with AG, SR compression
C2R (Zhang et al., 2 Apr 2025) | Routing specialization | Clique-based expert grouping
MegaScale-Infer (Zhu et al., 3 Apr 2025) | Ping-pong micro-batching | M2N direct GPU-GPU comm.

These methods yield substantial reported gains. HybridEP achieves up to 5.6× faster iteration times than conventional MoE training on 32 GPUs across 4 data centers (Yang et al., 22 Oct 2025). C2R reduces all-to-all volume by 20–30% and increases throughput by up to 29.3% on Qwen MoE (Zhang et al., 2 Apr 2025). MegaScale-Infer outperforms the next-best MoE serving system by up to 1.9× and delivers 4.2–9.9× lower per-transfer latency than NCCL (Zhu et al., 3 Apr 2025).

4. Scheduling Optimization and Task Overlap

Fine-grained scheduling forms the core of modern DEP, as bottlenecks shift from compute to communication with sparsely activated large models on multi-node hardware:

  • FinDEP (Pan et al., 25 Dec 2025): Schedules a 2D pipeline by splitting both the attention-group (AG) and expert-group (EG) workloads into sub-batches, modeling them with parameterized α–β cost models, and solving an overlap-maximization problem under precedence and resource-exclusion constraints. This approach achieves up to 1.61× throughput gains relative to prior coarse ping-pong scheduling.
  • HAP (Lin et al., 26 Aug 2025): Uses an ILP model over a hierarchical search space of parallel plans (EP, TP, DP hybrids per module and per phase—with switching costs). For models such as Mixtral and Qwen-series, it achieves 1.68–1.77× speedups over baseline TP by dynamically choosing optimal module-parallelism per workload phase.
  • Operator-level MIP planning (She et al., 12 Mar 2025): Encodes all gate, expert, and combine nodes in a large MIP, solving for device and channel assignment, operator start/stop, and memory schedules, minimizing pipeline bubbles and balancing compute and memory footprint.
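
An α–β cost model charges each task a fixed startup cost α plus a per-token cost β. The sketch below uses that model to pick a micro-batch count: finer splitting improves overlap but pays α more often. It assumes identical micro-batches flowing through a linear chain of stages, for which the makespan is the sum of the stage times plus (m − 1) times the slowest stage. The coefficients are invented, and the exhaustive sweep stands in for the optimization solved by systems such as FinDEP rather than reproducing it.

```python
def stage_time(alpha, beta, tokens):
    """alpha-beta cost: fixed startup cost plus per-token cost."""
    return alpha + beta * tokens

def makespan(stages, batch_tokens, num_microbatches):
    """Pipeline finish time for identical micro-batches.

    stages is a list of (alpha, beta) pairs, one per pipeline stage.
    """
    tokens = batch_tokens / num_microbatches
    per_stage = [stage_time(a, b, tokens) for a, b in stages]
    return sum(per_stage) + (num_microbatches - 1) * max(per_stage)

def best_microbatch_count(stages, batch_tokens, candidates):
    """Sweep candidate counts and return the one with the lowest makespan."""
    return min(candidates, key=lambda m: makespan(stages, batch_tokens, m))

if __name__ == "__main__":
    # Hypothetical (alpha, beta) per stage: attention, A2E, expert, E2A.
    stages = [(10.0, 0.1), (5.0, 0.05), (10.0, 0.1), (5.0, 0.05)]
    for m in (1, 2, 4, 8, 16):
        print(m, round(makespan(stages, 1024, m), 1))
    print("best:", best_microbatch_count(stages, 1024, [1, 2, 4, 8, 16]))
```

The sweep shows the expected U-shaped curve: too few micro-batches leave stages idle, while too many let the fixed α costs dominate.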

5. Hardware Portability and System Implementations

Disaggregation necessitates portability and efficient interoperation across heterogeneous server, NIC, and accelerator substrates. UCCL-EP (Mao et al., 22 Dec 2025) exemplifies this by:

  • Replacing GPU-initiated RDMA (which is vendor-locked and difficult to port across GPU/NIC pairs) with a GPU–CPU control channel and CPU proxy threads driving GPUDirect RDMA.
  • Emulating advanced ordering semantics (partial fences, sequencing) with RDMA immediate data, supporting unordered transports like AWS EFA.
  • Achieving 2.1× higher dispatch/collective throughput than the best prior system on commodity EFA, and a 45% end-to-end throughput improvement on AMD+Broadcom clusters.

System deployment tradeoffs include CPU proxy scaling, token packing strategies for small message sizes, and flow control under incast or scaling out to hundreds of nodes.

6. Performance Boundaries and Model-Hardware Codesign

The efficiency of DEP depends crucially on hardware topology, interconnect bandwidth, and model structure:

  • Attention-FFN Disaggregation (AFD): Extended roofline analysis (Liu et al., 10 Feb 2026) shows that on commodity clusters, increasing the FFN node count enters a "dead zone," where rising operator utilization is counteracted by a collapse in active time due to bandwidth ceilings. AFD is less efficient than classical EP except on superpod-class hardware (e.g., GB200/GB300) or for models with coarse experts and low TopK sparsity, where AFD can attain 65–70% hardware FLOPs utilization (HFU), surpassing EP.
  • Near-memory and heterogeneous hardware: HD-MoE (Huang et al., 11 Sep 2025) solves a hybrid mapping via LP+Bayesian optimization, then overlays online scheduling for predicted expert hotspots, supporting fine-grained expert placement on NMP with up to 1.8× speedup over TP and 1.5× over EP.
  • Task overlap and micro-batching: MegaScale-Infer and FinDEP demonstrate that communication-compute overlap, combined with communication-efficient collectives, enables high utilization even under microsecond-scale network latencies.
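
The bandwidth-ceiling argument can be illustrated with a roofline-style bound. The sketch below takes the classic roofline (peak compute versus memory bandwidth times arithmetic intensity) and adds one extra ceiling for the interconnect. That third ceiling is an assumption made here for illustration and is not the extended roofline model of Liu et al.; all numbers are hypothetical and in arbitrary consistent units.

```python
def attainable_flops(peak_flops, mem_bw, flops_per_mem_byte,
                     net_bw, flops_per_net_byte):
    """Upper bound on sustained FLOP/s under three ceilings.

    mem_bw * flops_per_mem_byte is the classic memory roofline.
    net_bw * flops_per_net_byte is an assumed analogous interconnect ceiling.
    """
    return min(peak_flops,
               mem_bw * flops_per_mem_byte,
               net_bw * flops_per_net_byte)

def utilization(peak_flops, mem_bw, flops_per_mem_byte,
                net_bw, flops_per_net_byte):
    """Fraction of peak compute that the tightest ceiling allows."""
    bound = attainable_flops(peak_flops, mem_bw, flops_per_mem_byte,
                             net_bw, flops_per_net_byte)
    return bound / peak_flops

if __name__ == "__main__":
    # Hypothetical FFN node: the network ceiling is the binding one.
    print(utilization(peak_flops=1000.0, mem_bw=10.0, flops_per_mem_byte=200.0,
                      net_bw=1.0, flops_per_net_byte=500.0))
```

In this toy setting, raising compute or memory bandwidth does not help once the interconnect ceiling binds, which matches the qualitative point that faster fabrics or coarser experts (more FLOPs per transferred byte) are needed before disaggregation pays off.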

7. Model Quality, Scalability, and Trade-Offs

Disaggregated expert parallelism achieves significant end-to-end speedups, larger supported models per hardware budget, and lower communication cost at scale, without degrading model quality and in some cases while improving it:

  • Empirical results: ScMoE achieves up to 1.49× training and 1.82× inference speedup over standard top-2 MoE, while retaining or even exceeding baseline accuracy (Cai et al., 2024). DeepSpeed-TED supports 4–8× larger base models and 26% speedup on a 40B MoE (Singh et al., 2023).
  • Trade-offs: More aggressive disaggregation can incur load imbalance, straggler effects, or higher memory footprints, necessitating dynamic or probabilistic expert placement and communication-aware scheduling.
  • Generality: DEP architectures, schedulers, and communication models extend to arbitrary MoE, multi-branch, and hybrid attention architectures, as well as to new domains (e.g., multi-modal models, PIM/FPGA hardware), by reformulating cost models and constraints.

Disaggregated expert parallelism has become a leading approach for scalable MoE training and deployment in cross-data-center, heterogeneous, and massive-scale settings, supported by a growing body of algorithmic and systems work targeting memory, compute, and communication bottlenecks. The landscape includes closed-form cost-based domain partitioning (HybridEP), operator-level mixed-integer planning, dynamic expert routing specialization (C2R), and pipelined, communication-optimized system frameworks (MegaScale-Infer, UCCL-EP), all reporting strong empirical gains across state-of-the-art MoE models and cluster configurations (Yang et al., 22 Oct 2025, Zhu et al., 3 Apr 2025, Zhang et al., 2 Apr 2025, Pan et al., 25 Dec 2025, Mao et al., 22 Dec 2025, Singh et al., 2023).
