Expert Parallelism: Scaling Mixture-of-Experts
- Expert Parallelism is a strategy that distributes experts in Mixture-of-Experts layers across multiple devices to optimize computational and memory resources.
- It addresses communication bottlenecks by optimizing all-to-all transfers through techniques like ER-Mapping and specialized scheduling, reducing latency and load imbalance.
- Recent innovations such as NI-Balancer, LLEP, and hybrid parallel frameworks enhance speed, reduce memory use, and adapt dynamically to hardware and routing constraints.
Expert Parallelism (EP) is a foundational parallelization strategy for scaling Mixture-of-Experts (MoE) architectures, which underpin contemporary LLMs. EP distributes the substantial parameter and compute load of the MoE layers—where only a subset of experts is activated per token—across many devices, permitting resource-efficient scaling to hundreds or thousands of experts. While this approach enables remarkable memory scalability and computational throughput, its efficient implementation is fundamentally constrained by the costs and patterns of all-to-all communication, especially as hardware topologies and routing dynamics evolve. Recent research has produced a diverse toolkit to overcome the classical bottlenecks of EP, introducing optimized mapping, scheduling, hybridization with other forms of parallelism, novel communication primitives, and load-balancing mechanisms.
1. Formal Definition and Core Mechanism
Expert Parallelism distributes the individual experts of a Mixture-of-Experts (MoE) layer across a set of devices. Let $E$ be the total number of experts and $D$ the number of devices. Each device typically houses $E/D$ experts (often $E/D = 1$ for maximum parallelism). During forward computation, each input token is routed (via a gating network) to its assigned experts (usually top-$k$, with $k \ll E$), necessitating the transfer of token activations to the respective devices, followed by local processing and return of partial outputs.
Mathematically, for device $d$ holding $n_d$ experts with per-expert parameter size $s$, per-device memory use is constrained by $M_d \geq n_d \cdot s$. As $E$ and $D$ scale, the expert-to-device ratio $E/D$ governs the degree of parallelism and per-device memory demand. Near $E/D = 1$, each device stores one expert, yielding maximal parallelism and minimal memory stress per device (Tang et al., 29 Oct 2025).
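The memory bound above can be illustrated with a toy calculation (the function name, expert counts, and parameter sizes below are hypothetical, chosen only to make the arithmetic concrete):

```python
def per_device_expert_memory(num_experts: int, num_devices: int,
                             params_per_expert: int,
                             bytes_per_param: int = 2) -> int:
    """Lower bound on per-device expert-weight memory (bytes), assuming
    experts are spread as evenly as possible across devices."""
    experts_per_device = -(-num_experts // num_devices)  # ceiling division
    return experts_per_device * params_per_expert * bytes_per_param

# Example: 64 experts over 8 devices, 100M params/expert, fp16 weights
# -> each device holds 8 experts.
mem = per_device_expert_memory(64, 8, 100_000_000)
```

At $E/D = 1$ (e.g., 8 experts on 8 devices) the bound drops to a single expert's footprint per device, matching the maximal-parallelism case described above.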
The EP workflow involves:
- Routing: The gating network selects the top-$k$ experts for each token.
- Dispatch: All-to-all (A2A) collective communicates token activations to the relevant devices.
- Local Compute: Each device processes its local experts on received tokens.
- Combine: A2A gathers and reorders results into original token sequence.
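The four steps above can be sketched as a toy, single-process simulation (the function name, the one-expert-per-device layout, and scalar "tokens" are illustrative assumptions, not an actual distributed implementation):

```python
def moe_ep_forward(tokens, num_devices, gate, expert_fns, top_k=1):
    """Toy simulation of the EP workflow: route -> dispatch (A2A) ->
    local expert compute -> combine (A2A back). One expert per device;
    expert_fns[d] is the expert hosted on device d."""
    # 1. Routing: gate picks top-k expert ids per token.
    routes = [gate(tok)[:top_k] for tok in tokens]
    # 2. Dispatch: group (token_index, token) by destination device.
    inbox = {d: [] for d in range(num_devices)}
    for i, (tok, experts) in enumerate(zip(tokens, routes)):
        for e in experts:
            inbox[e].append((i, tok))
    # 3. Local compute: each "device" runs its expert on received tokens.
    partial = {d: [(i, expert_fns[d](tok)) for i, tok in inbox[d]]
               for d in range(num_devices)}
    # 4. Combine: sum partial outputs back into original token order.
    out = [0.0] * len(tokens)
    for d in range(num_devices):
        for i, y in partial[d]:
            out[i] += y
    return out
```

In a real system, steps 2 and 4 are the two all-to-all collectives whose cost the next section analyzes.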
2. Communication Bottlenecks and Topological Constraints
A principal limitation of EP arises from the necessity of all-to-all communication both before and after expert computation. In a $D$-device cluster, with per-device data volume $V$, link bandwidth $B$, per-hop latency $\tau$, and average hop count $\bar{h}$, the cost for A2A is:

$$T_{\mathrm{A2A}} \approx \bar{h}\,\tau + \frac{V}{B}$$
As $D$ increases, both $\bar{h}$ and $V$ typically increase, with $V$ growing in proportion to the number of tokens routed and the number of experts activated per token. This scaling induces superlinear increases in latency, especially prevalent in clusters with non-fully-connected topologies (Tang et al., 29 Oct 2025). On 2D mesh networks (e.g., wafer-scale chips), A2A traffic congests central links and yields imbalanced link utilization, while local all-reduce for attention layers remains low-latency due to short ring paths. This dichotomy generates a "pipeline" of highly asymmetric communication costs within a layer (Tang et al., 29 Oct 2025).
3. Mapping and Scheduling Optimizations
3.1. Entwined Ring Mapping (ER-Mapping)
To alleviate mesh topology bottlenecks, ER-Mapping co-designs the logical mapping of attention and MoE layers:
- Devices are windowed into blocks along a mesh dimension, where each block includes a complete set of TP and EP group representatives.
- Full Token Domains (FTDs) become compact and disjoint, reducing hop counts for A2A and eliminating FTD overlap.
- On the evaluated 2D mesh, ER-Mapping substantially reduces average A2A communication latency, with measured hop counts roughly halved compared to baseline mappings (Tang et al., 29 Oct 2025).
3.2. Non-Invasive Balancer (NI-Balancer)
NI-Balancer overlays expert migration onto "cold" network links—those not on the current communication critical path—thereby hiding migration-induced latency. It uses a load-aware, topology-sensitive assignment algorithm that splits migrations into local ("within FTD") and global ("between FTDs") steps, pipelining these with ongoing computation. In practice this yields a 54% reduction in computation time and a 22% reduction in communication time for MoE operations on wafer-scale chips (Tang et al., 29 Oct 2025).
3.3. Automated Hybridization
Hybrid parallel frameworks (e.g., HD-MoE) dynamically optimize device-expert mappings using linear programming, periodically updating placements to minimize compute/comm tail latency. Combined offline and online strategies ("Node Balance + Link Balance") adapt to nonstationary routing statistics and hardware traffic congestion (Huang et al., 11 Sep 2025).
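To make the placement problem concrete, here is a greedy least-loaded stand-in for the mapping step (this is a simplification for illustration, not HD-MoE's actual linear-programming formulation; the function name and inputs are hypothetical):

```python
def place_experts(expert_loads, num_devices):
    """Greedy longest-processing-time placement: assign each expert,
    heaviest first, to the currently least-loaded device. Returns a
    device id per expert and the resulting per-device loads."""
    order = sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e])
    device_load = [0.0] * num_devices
    assignment = {}
    for e in order:
        d = min(range(num_devices), key=lambda i: device_load[i])
        assignment[e] = d
        device_load[d] += expert_loads[e]
    return assignment, device_load
```

An LP or periodically re-solved variant of this assignment is what lets such frameworks track nonstationary routing statistics rather than a fixed mapping.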
4. Load Balancing and Dynamic Routing
The effectiveness of EP is compromised under skewed or bursty expert activation—common in both post-training and narrow-domain inference. Classical EP assumes balanced routing; however, in realistic deployments, a small set of experts can become statistically "hot," funnelling disproportionate token traffic and inducing out-of-memory (OOM) failures or severe straggler effects (Nguyen et al., 23 Jan 2026).
4.1. Least-Loaded Expert Parallelism (LLEP)
LLEP introduces dynamic token and expert re-allocation at inference (or fine-tuning) time:
- Token workloads for overloaded experts are partitioned and redistributed to less-utilized devices.
- Corresponding segments of expert weights are transiently migrated and recombined to support distributed computation.
- LLEP maintains per-device memory within strict bounds and synchronizes device completion with minimal tail latency.
- Empirical results show significant speedups and reductions in peak memory use vs. standard EP under adversarial token routing (Nguyen et al., 23 Jan 2026).
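The redistribution idea can be sketched as follows (a hypothetical simplification of LLEP-style balancing: tokens beyond a device's capacity spill to the least-loaded devices, where in the real system the corresponding expert-weight segments would be migrated alongside them):

```python
def rebalance_tokens(token_counts, capacity):
    """Cap each device at `capacity` tokens and spill the overflow,
    one token at a time, to the currently least-loaded device."""
    overflow = []
    loads = []
    for d, n in enumerate(token_counts):
        keep = min(n, capacity)
        loads.append(keep)
        if n > keep:
            overflow.append((d, n - keep))
    # Spill excess tokens to the least-loaded devices.
    for src, extra in overflow:
        for _ in range(extra):
            dst = min(range(len(loads)), key=lambda i: loads[i])
            loads[dst] += 1
    return loads
```

With a hot expert holding 10 tokens against a capacity of 4, the overflow is absorbed evenly by the idle devices, which is precisely the straggler/OOM scenario the section describes.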
4.2. Fully Sharded and Adaptive EP
Innovations such as Fully Sharded Expert Parallelism (FSEP, as in LAER-MoE) enable on-the-fly expert re-layout by sharding expert weights across all devices and reconstructing only active experts on demand. Device-level planners, leveraging lightweight combinatorial optimization, ensure per-device token and compute loads remain balanced per iteration, while fine-grained scheduling overlaps comm/comp to minimize critical-path time (Liu et al., 12 Feb 2026).
5. Hybrid, Specialized, and Communication-efficient Schemes
5.1. HybridEP and Cross-domain Extensions
HybridEP dynamically mixes token communication (all-to-all, A2A) and expert migration (all-gather, AG), guided by an analytic stream-based model, to minimize communication overheads under constrained cross-datacenter bandwidth (Yang et al., 22 Oct 2025). It partitions devices into domains, leverages parameter compression in expert transfer, and orchestrates asynchronous migration. Empirical results show up to 5.6× training speedup with no degradation in model quality in cross-DC scenarios.
5.2. Specialized Routing and Clustering
C2R (Collaboration-Constrained Routing) constrains possible expert groupings per token, leveraging empirically measured expert collaboration matrices to co-locate often-coactivated experts and minimize multi-device redundancy. This technique reduces all-to-all time by 20–30% and delivers up to 25% wall-clock speedup without accuracy cost (Zhang et al., 2 Apr 2025). Similarly, speculative token/expert grouping (s-MoE) leverages offline routing statistics to pre-schedule tokens and experts, boosting "locality" and trimming unnecessary inter-device traffic (Li et al., 6 Mar 2025).
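A greedy co-location pass over a measured coactivation matrix conveys the intuition (this is an illustrative simplification, not the actual C2R algorithm; the function name and matrix are hypothetical):

```python
def colocate_experts(coactivation, group_size):
    """Greedy grouping: seed each group with the most-connected unplaced
    expert, then add the experts that co-activate with the group most
    often, so frequently paired experts land on the same device."""
    n = len(coactivation)
    unplaced = set(range(n))
    groups = []
    while unplaced:
        seed = max(unplaced,
                   key=lambda e: sum(coactivation[e][j] for j in unplaced))
        group = [seed]
        unplaced.discard(seed)
        while len(group) < group_size and unplaced:
            nxt = max(unplaced,
                      key=lambda e: sum(coactivation[g][e] for g in group))
            group.append(nxt)
            unplaced.discard(nxt)
        groups.append(sorted(group))
    return groups
```

Experts that are routinely coactivated end up in the same group (device), so a token hitting both incurs one dispatch instead of two, which is the source of the reported A2A savings.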
6. Hardware-aware, Pipeline, and Communication Layer Advances
Modern EP implementations tightly integrate with hardware topology. On wafer-scale chips, entwined mapping and migration schedules are tailored to mesh layouts (Tang et al., 29 Oct 2025). On clusters with hierarchical fabric, communication layers (e.g., UCCL-EP) mediate high-volume, fine-grained A2A through CPU proxies, delivering DeepEP-level throughput across heterogeneous GPU/NIC hardware, with up to 2.1× throughput improvements relative to conventional collectives (Mao et al., 22 Dec 2025).
Pipeline-based strategies, as in EPS-MoE, dynamically select efficient GEMM kernels per expert-load and pipeline A2A collectives with computation, approaching linear scaling in "prefill" throughput and achieving up to 28% per-layer speedup over baseline approaches (Qian et al., 2024). Disaggregated expert parallelism (MegaScale-Infer) decouples attention and expert modules, applies distinct parallelism, and leverages efficient many-to-many (M2N) communication for high throughput inference serving (Zhu et al., 3 Apr 2025).
7. Comparative Performance and Outlook
The effectiveness and relative benefits of EP and its descendants depend on workload patterns, batch size, model sparsity, expert-to-device ratios, and hardware constraints.
| Approach | Key Mechanism | Relative Throughput Gains | Typical Use-case |
|---|---|---|---|
| Classical EP | All-to-all dispatch/combine | Baseline; up to 1.5× slower vs. HD-MoE (Huang et al., 11 Sep 2025) | Standard distributed MoE, balanced routing |
| ER-Mapping + NI-Bal. | Topology-aware mapping/balance | Up to 62% latency, 54% compute, 22% comm. reduction (Tang et al., 29 Oct 2025) | WSC mesh, memory-bound scaling |
| LLEP | Dynamic workload migration | Substantial speedup under skew (Nguyen et al., 23 Jan 2026) | Skewed routing, domain adaptation |
| FSEP (LAER-MoE) | Sharded experts, re-layout | 1.7× end-to-end speedup (Liu et al., 12 Feb 2026) | Large-scale, dynamic/gated MoE |
| HybridEP | Mix A2A and AG comm. | Up to 5.6× speedup cross-DC (Yang et al., 22 Oct 2025) | Cross-datacenter, bandwidth-limited |
| C2R | Collab.-constrained groups | 20–30% A2A reduction, +0.5% acc. (Zhang et al., 2 Apr 2025) | Communication-bound inference, high expert count |
| S-MoE (speculative) | Routing prediction, co-group | Up to 75% comm. saving, 1.7–2.4× throughput (Li et al., 6 Mar 2025) | High-throughput, batched inference |
| EPS-MoE (pipeline) | Kernel select, overlap, split | Up to 28% per-layer speedup (Qian et al., 2024) | High-throughput inference, heterogeneous workloads |
A plausible implication is that the ongoing evolution of hardware fabric (e.g., wafer-scale chips, CXL composable systems), combined with increasingly dynamic expert selection in MoE models, will further intertwine comm/comp co-design, necessitating algorithmic flexibility in EP realization, ranging from fine-grained sharding and mapping to adaptive communication and advanced routing strategies. Outstanding questions include optimal adaptation to changing mesh topologies, the translation of these principles from inference to MoE training scenarios, and full automation of mapping, partitioning, and kernel selection for arbitrary workload/hardware pairs (Tang et al., 29 Oct 2025).
References
- "MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference" (Tang et al., 29 Oct 2025)
- "HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing" (Huang et al., 11 Sep 2025)
- "Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts" (Nguyen et al., 23 Jan 2026)
- "LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training" (Liu et al., 12 Feb 2026)
- "HybridEP: Scaling Expert Parallelism to Cross-Datacenter Scenario via Hybrid Expert/Data Transmission" (Yang et al., 22 Oct 2025)
- "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design" (Zhang et al., 2 Apr 2025)
- "Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling" (Li et al., 6 Mar 2025)
- "EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference" (Qian et al., 2024)
- "UCCL-EP: Portable Expert-Parallel Communication" (Mao et al., 22 Dec 2025)
- "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" (Zhu et al., 3 Apr 2025)