
Expert Parallelism: Scaling Sparse MoEs

Updated 5 February 2026
  • Expert Parallelism is a distributed computation paradigm that assigns each MoE expert to specific devices, enabling scalable and memory-efficient sparse LLMs.
  • The approach orchestrates dispatch, compute, and gather phases through irregular, token-driven all-to-all communications, optimizing performance despite inherent overheads.
  • Challenges such as communication bottlenecks and load imbalance drive innovations like hybrid parallelism and dynamic load balancing to improve system efficiency.

Expert Parallelism (EP) is a distributed computation paradigm central to scaling Mixture-of-Experts (MoE) neural models—particularly sparse MoE LLMs—beyond single-device memory and bandwidth limits. EP refers to allocating each expert (parameter tensor or sub-network) to one or more devices, so that at runtime, only tokens routed to those experts are communicated to and processed by the owning device(s). The approach enables massive parameter and computation scaling while keeping per-device memory usage manageable. EP is now a mainstay of both GPU/NPU-based MoE system design and the broader field of communication-efficient, sparse-activation distributed deep learning.

1. Formal Model of Expert Parallelism

Let $E$ denote the number of experts, $D$ the number of devices, and $B$ the batch size. In EP, a device-assignment function $f:\{0,\ldots,E-1\} \to \{0,\ldots,D-1\}$ specifies expert placement (without splitting the experts themselves). The indicator $P_{ic}=1$ iff expert $i$ is hosted on device $c$, else $0$. Each input token is routed to its top-$k$ experts by a learned router, and sparse all-to-all communication gathers the required token activations onto the appropriate devices for computation.

The computational load on device cc is

$$L_{\text{comp},c}^{\text{EP}} = \sum_{i=0}^{E-1} P_{ic} \, f_i \, B \, (2h \cdot IS)$$

where $f_i$ is the (possibly non-uniform) relative activation frequency of expert $i$, and $h$ and $IS$ are the model's hidden and expert intermediate dimensions. The total compute time is that of the straggler node:

$$t_{\text{comp}}^{\text{EP}} = \max_{c=0,\ldots,D-1} \frac{L_{\text{comp},c}^{\text{EP}}}{\operatorname{comp}}$$

with $\operatorname{comp}$ the device throughput (Huang et al., 11 Sep 2025).
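The straggler-bound compute model above can be sketched in a few lines of Python (an illustrative toy, not from any cited paper's code; the shapes and the $2h \cdot IS$ per-token FLOP count follow the notation of the formulas, while `placement` and `freq` are made-up inputs):

```python
# Sketch of the EP compute-load model (illustrative only).
# placement[i]: device hosting expert i; freq[i]: activation frequency f_i.
def ep_compute_time(placement, freq, B, h, IS, comp, D):
    """Straggler-bound compute time: max over devices of summed load / throughput."""
    load = [0.0] * D
    for i, c in enumerate(placement):
        load[c] += freq[i] * B * (2 * h * IS)  # expert i's contribution to L_comp,c
    return max(load) / comp                    # t_comp = max_c L_comp,c / comp

# Example: 4 experts on 2 devices, skewed popularity (expert 0 is "hot").
placement = [0, 0, 1, 1]
freq = [0.5, 0.2, 0.2, 0.1]
t = ep_compute_time(placement, freq, B=1024, h=4096, IS=14336, comp=1e12, D=2)
```

Because device 0 hosts experts carrying 70% of the routed tokens, it alone determines `t` — the other device finishes early and idles.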

The core communication pattern is an irregular, batch-dependent all-to-all: each device owning expert $i$ must receive all tokens routed to $i$, then (optionally) send processed outputs back to token-origin devices. The communication cost for EP per device is dominated by

$$\hat{t}_{\text{comm}}^{\text{EP}} = \frac{4Bh}{\operatorname{BW}} \cdot \max_{c} \sum_{g \in G} \left( \prod_{i \in g} \lceil P_{ic} \rceil \right) f_g$$

where $\operatorname{BW}$ is the link bandwidth, $G$ is the set of expert groups (one group of size $e$ per token), and $f_g$ is the joint activation frequency for group $g$ (Huang et al., 11 Sep 2025).
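A literal transcription of this bound (illustrative only; the expert groups, frequencies, placement, and parameter values below are toy inputs, not measurements):

```python
from math import ceil

# Literal transcription of the per-device EP communication bound (illustrative).
# groups: one tuple of expert ids per token; group_freq[g]: joint frequency f_g;
# placement[i]: device hosting expert i; BW: link bandwidth.
def ep_comm_time(groups, group_freq, placement, B, h, BW, D):
    worst = 0.0
    for c in range(D):
        total = 0.0
        for g, f_g in zip(groups, group_freq):
            # prod_{i in g} ceil(P_ic): 1 only if every expert in the group is on c
            indicator = 1
            for i in g:
                indicator *= ceil(1 if placement[i] == c else 0)
            total += indicator * f_g
        worst = max(worst, total)
    return 4 * B * h / BW * worst

# Example: experts 0,1 on device 0; experts 2,3 on device 1.
t_comm = ep_comm_time(groups=[(0, 1), (0, 2), (2, 3)],
                      group_freq=[0.5, 0.3, 0.2],
                      placement=[0, 0, 1, 1], B=8, h=4, BW=1.0, D=2)
```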

2. Key Workflow: Dispatch, Compute, Gather

The canonical EP pipeline (in MoE inference and training):

  1. Router step: Each input token computes its top-$k$ expert indices (typically local, batch-parallel).
  2. Dispatch (all-to-all): Tokens slated for each expert are sent to the corresponding device in an irregular, sparse all-to-all collective.
  3. Grouped GEMM: Each device processes all received token-activations using its locally stored expert parameters.
  4. Gather (all-to-all): Expert outputs are sent back to token-origin devices for further aggregation or the next model step.
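The four steps above can be sketched as a single-process toy (pure Python; the "devices" are just dict buckets, each "expert" is a stand-in scaling function, and the uniform combine weights are an assumption — real systems use learned router weights and sparse all-to-all collectives):

```python
# Toy single-process sketch of the dispatch -> grouped compute -> gather pipeline.
# Expert i lives on device i % D; expert i's "computation" is scaling by (i + 1).
def moe_forward(tokens, scores, k, E, D):
    # 1. Router: top-k expert indices per token (from precomputed scores).
    topk = [sorted(range(E), key=lambda e: -s[e])[:k] for s in scores]
    # 2. Dispatch: bucket (token_idx, expert, value) triples by owning device.
    inbox = {c: [] for c in range(D)}
    for t, experts in enumerate(topk):
        for e in experts:
            inbox[e % D].append((t, e, tokens[t]))
    # 3. Grouped compute + 4. Gather: each device applies its local experts,
    # results are summed back per token (uniform 1/k combine weights here).
    out = [0.0] * len(tokens)
    for c in range(D):
        for t, e, x in inbox[c]:
            out[t] += (e + 1) * x / k
    return out
```

For example, `moe_forward([1.0, 2.0], [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]], k=2, E=3, D=2)` routes token 0 to experts {0, 2} and token 1 to experts {1, 2}.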

The all-to-all communication is the critical scalability bottleneck: both in bandwidth ($O(Bkd)$ messages) and in the irregularity of load distribution ("hot" experts/devices induce stragglers). (Huang et al., 11 Sep 2025, Cui et al., 4 Feb 2026, Tang et al., 29 Oct 2025, Li et al., 6 Mar 2025)

3. Communication and Utilization Bottlenecks

EP excels at minimizing per-node memory usage since each device only retains its assigned expert weights. However, it exhibits distinctive limitations on real hardware:

  • Communication bottleneck: The two sparse all-to-all operations (dispatch and gather) carry bandwidth costs scaling as $O(k \cdot N \cdot d)$ per layer, linearly in the number of activated experts $k$ (Cui et al., 4 Feb 2026). Under mesh or inter-node topologies, the non-local traffic patterns can saturate links, particularly as $D$ grows. Empirically, all-to-all overhead can consume over half of inference latency in large-scale MoE models (Li et al., 6 Mar 2025, Tang et al., 29 Oct 2025).
  • Load imbalance / stragglers: Because expert popularity (activation frequency $f_i$) is highly skewed and can burst between iterations, devices hosting "hot" experts may receive disproportionate loads—sometimes with more than 50% of tokens routed to a single expert (Huang et al., 11 Sep 2025). This induces under-utilization (other devices idle) and tail latency, sometimes dropping net device utilization below 50% (Nguyen et al., 23 Jan 2026).
  • Dynamic routing instability: Each batch yields different token→expert assignments, causing per-batch variability in device utilization and communication patterns (Huang et al., 11 Sep 2025, Nguyen et al., 23 Jan 2026).
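The imbalance effect can be made concrete with a toy simulation (illustrative only; the Zipf-like routing weights and the utilization metric are assumptions of this sketch, not from the cited papers):

```python
import random

# Illustrative simulation: skewed routing concentrates tokens on the device
# hosting the hot expert, leaving the remaining devices mostly idle.
def utilization(assignments, D):
    """Mean device load divided by the straggler's load (1.0 = perfect balance)."""
    load = [0] * D
    for c in assignments:
        load[c] += 1
    return (sum(load) / D) / max(load)

random.seed(0)
E, D, n_tokens = 8, 8, 10_000
# Zipf-like skew: expert i is activated with weight 2^-i; expert i sits on device i.
skewed = [random.choices(range(E), weights=[2 ** -i for i in range(E)])[0] % D
          for _ in range(n_tokens)]
balanced = [t % D for t in range(n_tokens)]
```

Here `utilization(balanced, D)` is 1.0, while the skewed assignment (roughly half of all tokens hitting expert 0) drives utilization to about a quarter.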

These limitations motivate both architectural and algorithmic innovations in EP implementations.

4. Design Variants and Algorithmic Extensions

Contemporary research addresses EP's scalability and performance through variants and add-ons:

  • Hybrid Parallelism (TP–EP/HybridEP/HD-MoE): Hot experts (high $f_i$) are split across devices as in tensor parallelism (TP) to mitigate "straggler" bottlenecks, while cold experts remain pure EP. HD-MoE (LP + BO algorithm) optimizes a continuous partitioning, then applies topology-aware mapping (Huang et al., 11 Sep 2025). HybridEP further integrates expert/data migration guided by a stream model, switching between token-sending (A2A) and expert-shipping (AG) depending on bandwidth constraints (Yang et al., 22 Oct 2025).
  • Dynamic Routing/Load Balancing: Approaches like Least-Loaded Expert Parallelism (LLEP) (Nguyen et al., 23 Jan 2026) detect extreme routing skew at runtime and actively spill expert loads/chunks to less utilized devices, aiming for near-perfect load balance within per-device memory constraints. The METRO algorithm in (Yu et al., 10 Dec 2025) routes tokens to minimize the number of activated experts per GPU, optimizing memory traffic for memory-bound decode phases.
  • Topology- and Scheduling-Aware Mapping: ER-Mapping entwines attention TP groups and MoE expert placement to reduce mean hop count and link congestion on wafer-scale meshes, while the Non-Invasive Balancer (NI-Balancer) migrates experts over "cold links" in a way that hides migration cost, improving both computation and communication latencies (Tang et al., 29 Oct 2025).
  • Communication Layer Abstractions: UCCL-EP (Mao et al., 22 Dec 2025) replaces tight GPU–NIC RDMA coupling with portable, CPU-mediated high-throughput dispatch, achieving comparable performance and higher portability across GPU/NIC hardware.
  • Speculative/Pre-scheduled Communication: Speculative MoE (Li et al., 6 Mar 2025) leverages token–expert popularity profiles to pre-shuffle tokens and pre-group experts, increasing the fraction of local activations and thereby reducing all-to-all volume by up to 66%.
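As a flavor of the load-balancing idea, here is a greedy sketch in the spirit of least-loaded routing (not the actual LLEP or METRO algorithm; `replicas`, which maps each expert to the hypothetical devices holding a copy of it, is an assumption of this sketch):

```python
# Greedy least-loaded routing sketch: send each expert call to the least-loaded
# device among those that host a replica of that expert.
def least_loaded_route(calls, replicas, D):
    """calls: list of expert ids; replicas[e]: devices holding a copy of expert e."""
    load = [0] * D
    chosen = []
    for e in calls:
        c = min(replicas[e], key=lambda d: load[d])  # least-loaded eligible device
        load[c] += 1
        chosen.append(c)
    return chosen, load
```

With expert 0 replicated on devices 0 and 1, a burst of calls to expert 0 alternates between the two replicas instead of piling onto one device.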

5. Empirical Performance and Comparative Results

| Strategy | Key Bottleneck Addressed | Noted Speedup | Memory Scaling | Reference |
| --- | --- | --- | --- | --- |
| EP (baseline) | Memory per device | — | $O(1/D)$ | (Nguyen et al., 23 Jan 2026; Huang et al., 11 Sep 2025) |
| HD-MoE (hybrid) | Stragglers, comm. congestion | 1.1×–1.5× over EP | Improved with split | (Huang et al., 11 Sep 2025) |
| HybridEP | Cross-DC comm. bottleneck | Up to 5.6× over EP | — | (Yang et al., 22 Oct 2025) |
| LLEP | Routing imbalance | Up to 5× over EP | 4×–5× lower peak | (Nguyen et al., 23 Jan 2026) |
| ER-Mapping/NI-Balancer | Mesh comm. hops; migration | +39% MoE capacity | — | (Tang et al., 29 Oct 2025) |
| Speculative MoE | Comm. volume per layer | 1.7×–4.3× over EP | — | (Li et al., 6 Mar 2025) |
| METRO | Decode memory traffic | Up to 4.11× | Fewer experts activated | (Yu et al., 10 Dec 2025) |
| Multi-Head HP | $k$-scaling, determinism | 1.61× over MoE EP | Flat in $k$ | (Cui et al., 4 Feb 2026) |

Experimental results consistently demonstrate significant speedups (often 1.2–5×) over pure EP in latency, throughput, or peak memory, conditional on the communication or load-balancing technique and hardware topology.

6. Practical Implementation and Portability

EP systems expose practical challenges in efficient collective implementation:

  • Fine-grained, irregular collectives: Small message sizes, data-dependent routing, and tight per-token ordering semantics force deviations from conventional batched collectives. Systems such as DeepEP (Mao et al., 22 Dec 2025) pioneered GPU-initiated, token-level RDMA, later generalized by UCCL-EP into a CPU-proxy model for greater hardware agnosticism.
  • Scheduling for memory hierarchy: In decode or memory-bound regimes, performance is sensitive not to tokens/GPU, but to activated experts/GPU, due to HBM read bandwidth. Algorithms that minimize activated expert count (e.g., METRO) provide up to 22% latency reductions (Yu et al., 10 Dec 2025).
  • Mapping and migration: On modern mesh-based clusters, traffic-aware mapping (ER-Mapping), dynamic expert migration (NI-Balancer), and speculative scheduling (Speculative MoE) help maintain balanced network utilization and hide expensive data movement (Tang et al., 29 Oct 2025, Li et al., 6 Mar 2025).
  • Hybrid parallelism: Systems such as HD-MoE automatically partition "hot" experts (splitting via TP) and "cold" experts (whole via EP), implemented by solving a linear program followed by Bayesian optimization to match topology (Huang et al., 11 Sep 2025).
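To make the hot/cold split concrete, here is a minimal threshold heuristic (the real HD-MoE solves a linear program followed by Bayesian optimization; the mean-frequency threshold below is only an assumed sketch):

```python
# Toy hot/cold partition heuristic (sketch; not HD-MoE's actual LP + BO solver).
def hot_cold_partition(freq, threshold=None):
    """Experts above the mean activation frequency get TP-split; the rest stay EP."""
    if threshold is None:
        threshold = sum(freq) / len(freq)
    hot = [i for i, f in enumerate(freq) if f > threshold]
    cold = [i for i, f in enumerate(freq) if f <= threshold]
    return hot, cold

# e.g. freq = [0.55, 0.2, 0.15, 0.1] -> expert 0 is split, the rest stay whole
```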

7. Applications and Generalizations

EP is the standard parallelization scheme for training and inference of large, sparse MoE LLMs, including Mixtral, DeepSeek, Qwen3, and multi-decoder systems. It is compatible with 3D-parallel frameworks that combine data, tensor, and expert parallelism (e.g., DeepSpeed-MoE, Megatron, SGLang) (Li et al., 6 Mar 2025). Recent works explore extensibility to cross-DC training (HybridEP), wafer-scale devices (MoEntwine/ER-Mapping), and hardware-portable RDMA protocols (UCCL-EP). EP is also foundational for benchmarking the efficiency of alternative parallelization schemes, such as Head Parallel or domain-based expert routing (Cui et al., 4 Feb 2026).

In summary, Expert Parallelism enables the efficient scaling of sparse MoE networks across modern accelerator hardware by focusing on per-expert weight locality and dynamic, token-driven computation, but faces inherent challenges in communication overhead and load imbalance. These are the subject of extensive, ongoing systems and algorithmic innovation across hardware and software stacks (Huang et al., 11 Sep 2025, Mao et al., 22 Dec 2025, Yu et al., 10 Dec 2025, Nguyen et al., 23 Jan 2026, Cui et al., 4 Feb 2026, Li et al., 6 Mar 2025, Tang et al., 29 Oct 2025, Yang et al., 22 Oct 2025).
