Sparse Expert Parallelism in MoE Architectures

Updated 30 December 2025
  • Sparse expert parallelism is defined by dynamically selecting a subset of experts per token in MoE architectures, enabling sublinear per-token computation.
  • It utilizes various parallelism strategies—tensor, expert, context, data, and pipeline—to optimize resource utilization and balance compute, communication, and memory costs.
  • Empirical analyses reveal significant performance improvements, including up to 3× communication savings and high model FLOPS utilization across thousands of GPUs.

Sparse Expert Parallelism refers to distributed training and inference techniques for Mixture of Experts (MoE) architectures, in which only a subset of a large pool of expert subnetworks ("experts") is activated per token or sample. This design enables scaling neural models to unprecedented parameter regimes while keeping per-token computation and memory cost sublinear in the number of parameters. Achieving high efficiency on modern hardware, especially across thousands of GPUs or accelerators, requires advanced parallelization strategies that balance compute, communication, and memory. These challenges are addressed by various forms of expert, tensor, context, data, and pipeline parallelism, as well as hybrid and dynamic mappings such as "MoE Parallel Folding" (Liu et al., 21 Apr 2025; Chen et al., 2023; Huang et al., 11 Sep 2025; Zeng et al., 12 Oct 2025).

1. Fundamental Principles and Parallelism Dimensions

Sparse expert parallelism is defined by the dynamic selection of a small number k ≪ E of experts (from a total of E) for each input token, often determined by a learned router. Sparse MoE layers are interleaved with standard "dense" (non-expert) layers. The core challenge is devising parallelization schemes that maximize resource utilization given the mismatch between the parallelism dimensions required for dense versus MoE layers.
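As a minimal illustration of top-k routing, the sketch below (in PyTorch; the function and tensor names are illustrative rather than taken from any particular framework) selects k experts per token with a learned linear router and renormalizes the gate weights over the selected experts:

import torch
import torch.nn.functional as F

def topk_route(x, router_weight, k):
    # x:             [num_tokens, hidden]  token representations
    # router_weight: [hidden, E]           learned router projection
    logits = x @ router_weight                      # [num_tokens, E]
    topk_logits, topk_idx = torch.topk(logits, k, dim=-1)
    gates = F.softmax(topk_logits, dim=-1)          # renormalize over the k selected experts
    return topk_idx, gates                          # expert ids and gate weights per token

# Example: 8 tokens, hidden size 16, E = 4 experts, k = 2 active experts per token
x = torch.randn(8, 16)
w = torch.randn(16, 4)
expert_ids, gate_weights = topk_route(x, w, k=2)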

Key parallelism dimensions are:

  • Tensor Parallelism (TP): Splits large weight matrices along the output (or input) dimension across TP ranks.
  • Expert Parallelism (EP): Shards the E experts across EP ranks; each rank processes only its local subset of tokens routed to its experts.
  • Context Parallelism (CP): Splits the sequence length ("context") across CP ranks, each handling a subsequence.
  • Data Parallelism (DP): Replicates model parameters across DP ranks, each processing a micro-batch; gradients are aggregated.
  • Pipeline Parallelism (PP): Segments model layers into PP pipeline stages, potentially allowing overlapping of compute and communication.

Distinct parallel configurations are possible for dense and MoE layers: world_size = TP × CP × DP × PP = ETP × EP × DP × PP, where ETP is "expert-tensor parallelism," a form of TP applied within each expert (Liu et al., 21 Apr 2025).
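A small configuration sketch makes this constraint concrete; ParallelConfig is a hypothetical helper for illustration, not a Megatron-Core API:

from dataclasses import dataclass

@dataclass
class ParallelConfig:
    tp: int = 1   # tensor parallel size (ETP when used for MoE layers)
    cp: int = 1   # context parallel size (1 for MoE layers)
    ep: int = 1   # expert parallel size (1 for dense layers)
    dp: int = 1   # data parallel size
    pp: int = 1   # pipeline parallel size

    def world_size(self) -> int:
        return self.tp * self.cp * self.ep * self.dp * self.pp

# Dense (attention) layers: TP x CP x DP x PP; MoE layers: ETP x EP x DP x PP.
dense = ParallelConfig(tp=4, cp=2, dp=8, pp=2)   # 4 * 2 * 8 * 2 = 128 ranks
moe   = ParallelConfig(tp=1, ep=8, dp=8, pp=2)   # 1 * 8 * 8 * 2 = 128 ranks
assert dense.world_size() == moe.world_size() == 128
assert dense.dp == moe.dp and dense.pp == moe.pp  # folding (Section 2) couples only DP and PP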

2. MoE Parallel Folding: Definition and Mechanisms

MoE Parallel Folding refers to the practice of decoupling the parallel configuration of dense (e.g., attention) layers from that of MoE layers, with minimal coupling constraints (typically only the DP and PP dimensions are shared). This allows attention layers to use (TP, CP, DP, PP) and MoE layers to use (ETP, EP, DP, PP), independently optimizing communication and compute for each layer type.

Formally, let:

  • G_attn = TP × CP × DP × PP
  • G_moe = ETP × EP × DP × PP

with PP_attn = PP_moe and DP_attn = DP_moe. The remaining parallel dimensions are disjoint and can be tuned independently (Liu et al., 21 Apr 2025).

A central enabler is a token-level dispatcher, which orchestrates the routing of tokens to expert processes, token permutation, the communication collectives (e.g., All2All, AllGather, ReduceScatter), and the restoration of outputs to their original sequence positions (Liu et al., 21 Apr 2025). The dispatcher supports both "token-dropless" operation (no tokens are dropped) and "token-dropping" operation (tokens are dropped when an expert exceeds its fixed capacity). Token-dropless operation guarantees a deterministic mapping across all groupings, while sub-sequence dropping uses local logits, incurring negligible extra overhead and no adverse effect on convergence (Liu et al., 21 Apr 2025).
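The permutation and restoration steps at the heart of such a dispatcher can be sketched as follows (illustrative PyTorch with k = 1 routing for brevity; this is not the Megatron-Core implementation). Tokens are sorted so that each expert's tokens are contiguous, the per-expert counts become the split sizes for the variable-size collectives, and the inverse permutation restores the original order:

import torch

def permute_by_expert(tokens, expert_idx, num_experts):
    # tokens:     [num_tokens, hidden]
    # expert_idx: [num_tokens] expert id chosen for each token
    _, sort_order = torch.sort(expert_idx, stable=True)
    permuted = tokens[sort_order]                               # tokens grouped by target expert
    counts = torch.bincount(expert_idx, minlength=num_experts)  # split sizes for variable-size All2All
    return permuted, counts, sort_order

def inverse_permute(permuted, sort_order):
    restored = torch.empty_like(permuted)
    restored[sort_order] = permuted                             # scatter back to original positions
    return restored

tokens = torch.randn(6, 8)
expert_idx = torch.tensor([2, 0, 1, 2, 0, 1])
grouped, counts, order = permute_by_expert(tokens, expert_idx, num_experts=4)
assert torch.equal(inverse_permute(grouped, order), tokens)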

The folding approach is foundational to the methods discussed below, including HD-MoE, PPMoE, and HiLoMoE.

3. Hybrid and Dynamic Parallel Mapping Algorithms

Advanced expert parallelism necessitates hybrid partitioning schemes, especially for non-uniform expert activation ("hot"/"cold" experts, variable batch sizes) and hardware topologies with heterogeneity in communication or compute.

HD-MoE introduces an offline hybrid mapping algorithm, formalized as a linear program (LP), to optimally allocate each expert's compute fraction P_{e,d} across devices, balancing compute loads and minimizing communication costs. Constraints ensure that each expert is fully mapped (for every e, Σ_d P_{e,d} = 1), respect device capacity, and model inter-expert traffic (Huang et al., 11 Sep 2025). Expert matrices can be split across devices ("folded") along the intermediate-size (IS) dimension, so "hot" experts use TP (P_{e,d} > 0 across several devices d), while rarely activated experts use pure EP (P_{e,d} ∈ {0, 1}).
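A heavily simplified version of such an allocation LP can be written with scipy; the loads, capacities, and communication costs below are made-up illustrations, and the actual HD-MoE formulation additionally models inter-expert traffic and the NMP topology (Huang et al., 11 Sep 2025):

import numpy as np
from scipy.optimize import linprog

E, D = 4, 2                                     # experts, devices
load = np.array([8.0, 4.0, 2.0, 1.0])           # expected token load per expert ("hot" to "cold")
cap = np.array([8.0, 8.0])                      # per-device compute capacity
comm = np.array([[0.2, 0.5],                    # illustrative per-unit communication cost c[e, d]
                 [0.1, 0.3],
                 [0.4, 0.2],
                 [0.3, 0.1]])

c = comm.flatten()                              # objective: minimize total communication cost

A_eq = np.zeros((E, E * D))                     # each expert fully mapped: sum_d P[e, d] = 1
for e in range(E):
    A_eq[e, e * D:(e + 1) * D] = 1.0
b_eq = np.ones(E)

A_ub = np.zeros((D, E * D))                     # capacity: sum_e load[e] * P[e, d] <= cap[d]
for d in range(D):
    for e in range(E):
        A_ub[d, e * D + d] = load[e]
b_ub = cap

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
P = res.x.reshape(E, D)   # fractional rows => TP-style splitting ("hot"); one-hot rows => pure EP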

During inference, an online dynamic scheduler uses "priority detection" (predicting next-layer expert activity, pre-broadcasting weights, and dynamically routing tokens) to further adapt to runtime workload imbalances. Communication volumes are analyzed in detail and empirically shown to be significantly lower than with pure EP or TP (up to 2×–3× communication savings), with compute utilization close to optimal (Huang et al., 11 Sep 2025).

In PPMoE, local index slicing combined with an inner-node all-reduce replaces the global All2All, confining expert communication within nodes and integrating seamlessly with pipeline parallelism (Chen et al., 2023).
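A conceptual sketch of this pattern, under assumed semantics rather than the paper's actual implementation, might look as follows: each rank in a node owns a disjoint subset of experts, computes outputs only for the tokens routed to those experts, and an inner-node all-reduce sums the partial outputs so that every token receives its expert's result without any cross-node All2All:

import torch
import torch.distributed as dist

def ppmoe_expert_forward(tokens, expert_idx, local_experts, node_group):
    # tokens:        [num_tokens, hidden] tokens held locally (they never leave this rank)
    # expert_idx:    [num_tokens] expert id per token (k = 1 for brevity)
    # local_experts: dict mapping expert id -> nn.Module owned by this rank
    out = torch.zeros_like(tokens)
    for eid, expert in local_experts.items():
        mask = expert_idx == eid                 # local index slicing: tokens for this expert
        if mask.any():
            out[mask] = expert(tokens[mask])
    # Sum partial outputs across the ranks of this node; the other ranks contribute the
    # results of their own experts, so every token's output ends up filled in.
    dist.all_reduce(out, op=dist.ReduceOp.SUM, group=node_group)
    return out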

4. Empirical Performance Analyses

Empirical results validate the effectiveness of sparse expert parallelism and folding schemes across domains and hardware configurations.

  • Megatron-Core MoE Parallel Folding: Achieves up to 49.3% Model FLOPS Utilization (MFU) for Mixtral 8x22B and 39.0% for Qwen2-57B-A14B on H100 GPUs, outperforming baselines (FSDP, TP+EP) by 2–4 MFU points. It scales stably to more than 1000 GPUs with minimal MFU degradation (<5%) and maintains high MFU at long context (e.g., 42.9% at 128K sequence length) (Liu et al., 21 Apr 2025).
  • Pipeline MoE: Delivers ≳1.75× the throughput of conventional MoE (DPMoE) and 90% of the per-GPU throughput of a dense backbone 20× smaller, by avoiding inter-node all-to-all and confining MoE communication to fast inner-node all-reduce (Chen et al., 2023).
  • HD-MoE: Demonstrates 1.1×–1.8× speedup over TP, 1.1×–1.5× over EP, and 1.0×–1.4× over compute-balanced hybrid strategies. Node balancing achieves a 2× reduction in compute tail latency; dynamic placement with pre-broadcast further boosts speedup (e.g., 1.25× for 5 experts) (Huang et al., 11 Sep 2025).
  • HiLoMoE: On CTR benchmarks, delivers an average AUC uplift of 0.2 percentage points and an 18.5% reduction in FLOPs over non-MoE baselines, with linear parameter growth and consistent improvements over flat MoE and standard LoRA (Zeng et al., 12 Oct 2025).

5. Implementation Strategies and Dispatcher Algorithms

Implementations require careful orchestration of distributed process groups and collective communication patterns. In Megatron-Core, two separate sets of process groups are initialized (sketched after the list):

  • attention_groups: {TP, CP, DP, PP}
  • moe_groups: {ETP, EP, DP, PP}
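A minimal sketch of how such decoupled group sets might be constructed with torch.distributed is shown below; the 8-rank layout and group boundaries are hypothetical, and Megatron-Core's actual initializer is more general:

import torch.distributed as dist

def make_groups(rank_lists):
    # Create one process group per rank list and return the group containing this rank.
    # dist.new_group must be called collectively by all ranks, in the same order.
    my_rank, my_group = dist.get_rank(), None
    for ranks in rank_lists:
        g = dist.new_group(ranks=ranks)
        if my_rank in ranks:
            my_group = g
    return my_group

# Assumes dist.init_process_group(...) has already been called on all 8 ranks.
# Attention layers: TP=2, CP=2, DP=2, PP=1; MoE layers: ETP=1, EP=4, DP=2, PP=1.
attn_tp_group   = make_groups([[0, 1], [2, 3], [4, 5], [6, 7]])
attn_cp_group   = make_groups([[0, 2], [1, 3], [4, 6], [5, 7]])
moe_ep_group    = make_groups([[0, 1, 2, 3], [4, 5, 6, 7]])
shared_dp_group = make_groups([[0, 4], [1, 5], [2, 6], [3, 7]])  # DP (and PP) shared by both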

Layer code uses the appropriate group per layer type for all relevant collectives (AllReduce, ReduceScatter, All2All, etc.). The token dispatcher, critical for MoE layers, executes the following core algorithm (Liu et al., 21 Apr 2025):

def MoE_Layer_Forward(X_seq):
    routing_logits = Router(X_seq)                              # [batch_local, seq_local, E]
    topk_experts = TopK(routing_logits, K)                      # K experts per token
    permuted_tokens, perms = PermuteToGroups(X_seq, topk_experts)       # group tokens by target expert
    tokens_for_my_experts = All2All_V(permuted_tokens, EP_group)        # exchange tokens across EP ranks
    full_expert_tokens = AllGather_V(tokens_for_my_experts, ETP_group)  # gather activations across expert-TP ranks
    out_expert = Local_FFN(full_expert_tokens)                  # run the local experts' FFNs
    reduced = ReduceScatter_V(out_expert, ETP_group)            # reduce partial outputs across expert-TP ranks
    returned = All2All_V(reduced, EP_group)                     # send results back to the source EP ranks
    x_restored = InversePermute(returned, perms)                # restore original token order
    return x_restored
The backward pass swaps the AllGather and ReduceScatter collectives. The dispatcher supports dynamic tensor shapes and handles both deterministic and sub-sequence token dropping.

For FP8 precision, folding can enable up to a 1.3× speedup over BF16, with an additional 1.1× speedup due to folding itself (Liu et al., 21 Apr 2025).

6. Architectures Beyond Standard MoE: Hierarchical and Hybrid Folding

Variants such as HiLoMoE (Zeng et al., 12 Oct 2025) extend sparse expert folding to domains like CTR prediction by stacking multiple MoE layers, each parameterized as rank-1 LoRA-style updates. Key innovations include:

  • Routing based on prior layer scores rather than outputs, enabling all MoE layers to execute concurrently.
  • Folding all expert updates into a single fused matrix multiplication, replacing deep sequential MoE execution with a one-shot parallel operation (see the sketch after this list).
  • A three-stage training schedule that stabilizes and diversifies experts.
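To illustrate the folding idea, the sketch below shows how E rank-1 LoRA-style experts can be combined into a single fused computation; the shapes and gating are illustrative and do not reproduce HiLoMoE's exact formulation (Zeng et al., 12 Oct 2025):

import torch

def fused_rank1_moe(x, W0, A, B, gates):
    # x:     [num_tokens, d_in]
    # W0:    [d_in, d_out]   shared base weight
    # A:     [d_in, E]       per-expert rank-1 down-projection vectors
    # B:     [E, d_out]      per-expert rank-1 up-projection vectors
    # gates: [num_tokens, E] router scores (zero for unselected experts)
    # Equivalent to summing gates[:, e] * (x @ A[:, e])[:, None] * B[e] over experts e,
    # but computed with two matmuls instead of a loop over experts or layers.
    return x @ W0 + ((x @ A) * gates) @ B

# Example: 4 tokens, d_in = 8, d_out = 6, E = 3 experts
x = torch.randn(4, 8)
W0 = torch.randn(8, 6)
A, B = torch.randn(8, 3), torch.randn(3, 6)
gates = torch.softmax(torch.randn(4, 3), dim=-1)
y = fused_rank1_moe(x, W0, A, B, gates)   # [4, 6]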

Hybrid folding is also critical in hardware-specific deployments, such as HD-MoE's use of Near-Memory Processing accelerators, where dynamic mapping of expert weights and communication-efficient token routing optimize throughput and link utilization (Huang et al., 11 Sep 2025).

7. Tradeoffs, Limitations, and Future Directions

Sparse expert parallelism and MoE parallel folding unlock substantial hardware efficiency gains but introduce new complexity:

  • Group management and dispatcher implementations must handle dynamic tensor shapes and variable routing per batch.
  • Fully automated offline and online hybrid mapping (as in HD-MoE) can be hardware-specific and require careful tuning and hardware profiling.
  • Strict token-dropless operation simplifies reproducibility, but capacity-dropping remains essential for large, nonuniform sequence processing.
  • Variants such as hierarchical MoE (e.g., HiLoMoE) offer high efficiency but may require new architectural and optimization paradigms.

A plausible implication is that principled folding and hybrid parallelism will remain essential as models scale, hardware topologies diversify, and new application domains demand novel expert routing schemes.


References:

  • "MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core" (Liu et al., 21 Apr 2025)
  • "Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism" (Chen et al., 2023)
  • "HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing" (Huang et al., 11 Sep 2025)
  • "Hierarchical LoRA MoE for Efficient CTR Model Scaling" (Zeng et al., 12 Oct 2025)
