MegaScale-MoE: Scalable Trillion-Parameter Systems

Updated 24 February 2026
  • MegaScale-MoE is an advanced system that employs multi-dimensional parallelism, specialized expert routing, and microarchitectural optimizations to train and serve trillion-parameter models efficiently.
  • It integrates tensor, expert, context, data, and pipeline parallelism to decouple model scale from per-token computation, dramatically reducing communication overhead.
  • Empirical evaluations show significant throughput improvements and high Model FLOPs Utilization on cutting-edge GPU clusters, underlining its practicality for large-scale deployments.

MegaScale-MoE denotes both the architectural and system-level innovations enabling large-scale, communication-efficient training and inference of Mixture-of-Experts (MoE) models at the trillion-parameter regime. MoE architectures leverage conditional execution to activate only a subset of specialized feed-forward networks (experts) per input token, thus decoupling model scale from per-token computational costs. The MegaScale-MoE system incorporates multidimensional parallelism, hybrid communication-computation overlap, fine-grained dispatcher mechanisms, and microarchitectural optimizations to enable high training and inference efficiency on advanced GPU clusters, with demonstrated scalability to tens of thousands of devices and high utilization on H100/Hopper hardware (Liu et al., 21 Apr 2025, Jin et al., 16 May 2025, Kim et al., 2021).

1. Parallelism Strategies for MegaScale MoE Training

MegaScale-MoE systems are defined by the orchestration of multiple orthogonal axes of parallelism:

  • Tensor Parallelism (TP): Decomposes large matrix multiplications (e.g., attention or MLP) along their output/input columns, sharding weights/activations across TP devices to reduce per-GPU memory. Requires intra-layer all-reduce/all-gather for assembling complete outputs.
  • Expert Parallelism (EP): Distributes experts across GPUs; each device hosts a subset of the total experts, and only receives tokens routed to its experts via All-to-All collectives. For MoE feed-forward layers, only the selected tokens traverse the network, minimizing compute-to-parameter ratio.
  • Context Parallelism (CP): Slices long input contexts (sequences) across multiple GPUs, mitigating activation memory explosion for extreme context windows (e.g., 128K tokens). Sharded keys/values require distributed communication for attention, typically via ring collectives.
  • Data Parallelism (DP): Conventional batch splitting across DP workers; model weights are synchronized, typically by all-reduce, to ensure consistent parameter updates.
  • Pipeline Parallelism (PP): Segmenting the model’s layers into pipeline stages, with pipelined micro-batch execution to overlap communication and computation.

This 5-dimensional scheme is necessary to fully exploit the heterogeneous scaling behavior of MoE Transformer models. MoE layers exhibit extremely low compute/parameter density, necessitating EP for expert distribution, DP for optimizer scaling, TP/CP for layer/context scaling, and PP for deep models (Liu et al., 21 Apr 2025, Kim et al., 2021).
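
One constraint the five axes above must satisfy is that the attention-side degrees exactly tile the cluster, while the folded EP degree fits inside the attention domain (see Section 2). The sketch below is illustrative, not the MegaScale-MoE implementation; the function name and the divisibility rule for the folded domain are assumptions.

```python
# Minimal sketch: validate a hybrid 5D parallel configuration.
# P_EP is "folded" inside the attention axes, so it does not multiply
# into the world size independently (an assumed simplification).

def check_parallel_config(world_size, p_tp, p_cp, p_dp, p_pp, p_ep):
    """Validate a hybrid parallel layout; names are illustrative."""
    if p_tp * p_cp * p_dp * p_pp != world_size:
        raise ValueError("attention axes must exactly tile the cluster")
    # EP reuses ranks from the CP x DP domain, so it must divide that product.
    if (p_cp * p_dp) % p_ep != 0:
        raise ValueError("P_EP must divide the folded CP*DP domain")
    return True

# Example: 1,024 GPUs as TP=8, CP=2, DP=16, PP=4, with EP=8 folded intra-node.
check_parallel_config(1024, p_tp=8, p_cp=2, p_dp=16, p_pp=4, p_ep=8)
```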

2. MoE Parallel Folding and Group Formation

The MoE Parallel Folding principle recognizes the divergence in optimal parallel group assignments for attention (dense) and MoE (sparse) layers. Rather than enforcing a single group structure, MegaScale-MoE employs:

  • Distinct Group Mappings: Attention groups are constructed over (TP, CP, DP, PP), while MoE layer groups are defined over (TP, EP, DP, PP), allowing $P_{EP}$ and $P_{CP}$ to differ (e.g., $P_{EP}$ can be tuned smaller to fit within a single node, reducing communication).
  • Folding Mechanism: EP is folded inside subsets of the attention axes to ensure the token routing All-to-All operates within tightly coupled (high bandwidth) device groups.

The group assignment functions are:

$\mathrm{grp}_{\rm attn}(r) = \bigl(r \bmod P_{TP},\; \lfloor r/P_{TP}\rfloor \bmod P_{CP},\; \lfloor r/(P_{TP}P_{CP})\rfloor \bmod P_{DP},\; \lfloor r/(P_{TP}P_{CP}P_{DP})\rfloor \bmod P_{PP}\bigr)$

$\mathrm{grp}_{\rm moe}(r) = \bigl(r \bmod P_{TP},\; \lfloor r/P_{TP}\rfloor \bmod P_{EP},\; \dots\bigr)$

FP16/BF16 communication cost between experts is minimized by confining $P_{EP}$ to single-node domains (e.g., $P_{EP} \leq 8$ on NVLink), yielding a per-layer All-to-All cost of only 10–15% of runtime, compared to 50–70% for large cross-node collectives. This strategy is central to achieving high Model FLOPs Utilization (MFU) in production (Liu et al., 21 Apr 2025).
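
The two group-assignment formulas above can be transcribed directly: each global rank maps to a coordinate tuple, and ranks sharing all but one coordinate form the collective group for that axis. This is a sketch; function and parameter names are illustrative.

```python
# Direct transcription of the rank-to-coordinate mapping formulas.

def grp_attn(r, p_tp, p_cp, p_dp, p_pp):
    """Attention-layer coordinates (tp, cp, dp, pp) for global rank r."""
    return (r % p_tp,
            (r // p_tp) % p_cp,
            (r // (p_tp * p_cp)) % p_dp,
            (r // (p_tp * p_cp * p_dp)) % p_pp)

def grp_moe(r, p_tp, p_ep, p_dp, p_pp):
    """Same layout with EP replacing CP; P_EP and P_CP may differ (folding)."""
    return (r % p_tp,
            (r // p_tp) % p_ep,
            (r // (p_tp * p_ep)) % p_dp,
            (r // (p_tp * p_ep * p_dp)) % p_pp)

# Example: rank 13 in a TP=4, CP=2, DP=2, PP=2 layout.
print(grp_attn(13, 4, 2, 2, 2))  # → (1, 1, 1, 0)
```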

3. Token Dispatcher and Routing Workflow

The MoE token dispatcher is a core component bridging attention output and MoE expert input via precise, highly parallelized communication steps. Its operation, for each rank, is:

  1. Compute router logits and select Top-K expert indices per token.
  2. Permute tokens by expert assignment into contiguous buffers.
  3. Execute All-to-All-V across the EP group (tokens → experts).
  4. AllGather-V within the expert tensor-parallel (ETP) group to replicate tokens along tensor-parallel slices.
  5. Apply local expert feed-forward networks (no further cross-device communication).
  6. ReduceScatter-V in ETP to redistribute output shards.
  7. All-to-All-V to return outputs to original attention ranks.
  8. Un-permute to restore global token order.

Token-dropping enforces a per-expert capacity factor (with sub-sequence dropping as the default to reduce overhead), while dropless routing imposes no capacity limit, so no tokens are discarded. The backward pass mirrors these steps with reversed collective roles (Liu et al., 21 Apr 2025).
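
Steps 1–2 and 8 of the dispatcher (routing, permutation into expert-contiguous buffers, and un-permutation) can be sketched single-process with NumPy; the collectives in steps 3–7 are elided, and function names are illustrative, not the system's API.

```python
import numpy as np

def dispatch(tokens, router_logits, top_k=2):
    """Group tokens by their Top-K expert assignments (steps 1-2)."""
    # Step 1: Top-K expert indices per token (highest logits first).
    expert_idx = np.argsort(-router_logits, axis=1)[:, :top_k]  # [T, K]
    # Replicate each token once per selected expert.
    flat_experts = expert_idx.reshape(-1)                       # [T*K]
    flat_tokens = np.repeat(tokens, top_k, axis=0)              # [T*K, D]
    # Step 2: stable sort so each expert's tokens are contiguous.
    perm = np.argsort(flat_experts, kind="stable")
    return flat_tokens[perm], flat_experts[perm], perm

def combine(expert_out, perm, num_tokens, top_k=2):
    """Step 8: undo the permutation and merge the K expert outputs."""
    unperm = np.empty_like(expert_out)
    unperm[perm] = expert_out
    return unperm.reshape(num_tokens, top_k, -1).mean(axis=1)

tokens = np.arange(8.0).reshape(4, 2)   # 4 tokens, hidden size 2
logits = np.random.randn(4, 4)          # 4 experts
buf, experts, perm = dispatch(tokens, logits)
out = combine(buf, perm, num_tokens=4)  # identity "experts" here
assert np.allclose(out, tokens)         # round-trip restores token order
```

In the real system the permuted buffer is what feeds the All-to-All-V in step 3, so expert-contiguity directly determines the send counts per peer.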

4. Communication-Compute Overlap and Compression

MegaScale-MoE addresses the bandwidth-compute disparity in modern accelerators via:

  • Holistic operator (macro-kernel) fusion: Decomposing layers into fine-grained CUDA kernel DAGs and overlapping communication with computation across CUDA streams, exploiting selective activation rematerialization to halve activation memory at negligible MFU loss.
  • Tile-based intra-operator fusion: Partitioning computations into tiles, receiving remote data asynchronously, and consuming tiles as soon as they arrive to pipeline communication and compute.
  • Communication compression: Employing BF16 and FP8 quantization for gradients/activations, leveraging All-to-All with local FP32 accumulation, and dynamic compression strategies (per-token, per-channel) to halve communication volume while maintaining convergence curves indistinguishable from BF16 (Jin et al., 16 May 2025).
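
Per-token quantization of the kind used to halve All-to-All volume can be sketched as follows; int8 stands in for FP8 here, and the function names and scale handling are assumptions for illustration.

```python
import numpy as np

def quantize_per_token(x):
    """x: [tokens, hidden] float32 -> int8 payload + per-token scales."""
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)  # avoid division by zero
    q = np.clip(np.rint(x / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q, scales):
    # Receiver accumulates in full precision (FP32) after dequantization.
    return q.astype(np.float32) * scales

x = np.random.randn(16, 64).astype(np.float32)
q, s = quantize_per_token(x)
err = np.abs(dequantize(q, s) - x).max()
assert err < np.abs(x).max() / 100  # within ~1% of dynamic range
```

Sending one byte per element plus a single FP32 scale per token is what halves the wire volume relative to BF16.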

5. Empirical Performance and Scaling Behavior

The combination of 5D hybrid parallelism and communication-optimized sublayer orchestration yields significant scaling efficiency:

  • Training Efficiency: MegaScale-MoE achieves up to 1.88× throughput improvement over Megatron-LM, with 1.41M tokens/s throughput on a 352B-parameter model (1,440 H800 GPUs), and near-linear scaling up to the largest tested GPU counts (Jin et al., 16 May 2025, Liu et al., 21 Apr 2025).
  • Model FLOPs Utilization: 49.3% MFU (Mixtral 8×22B, 128 GPUs) and 44.9% at 1,024 GPUs; for the fine-grained Qwen2-57B-A14B, 39.0% at 64 GPUs and 33.4% at 1,024 GPUs. Context scaling to 128K tokens sees a moderate MFU drop (Mixtral: 47.6% → 42.9%) (Liu et al., 21 Apr 2025).
  • Communication Overhead: Keeping EP intra-node reduces All-to-All to 10–15% of layer time. ETP collectives are lower cost, especially with small TP.
  • Inference Scaling: MegaScale-Infer disaggregates attention and FFN onto separate device sets and uses ping-pong microbatching to hide up to 90% of token dispatch latency, delivering up to 1.90× per-GPU throughput over TensorRT-LLM and 3.24× throughput-per-dollar for heterogeneous deployment (Zhu et al., 3 Apr 2025).
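
MFU figures like those above are derived by dividing achieved FLOP/s by the cluster's peak. The sketch below uses the standard 6 × N_active FLOPs-per-token training approximation; the example numbers are hypothetical, not values from the cited papers.

```python
def mfu(tokens_per_s, active_params, num_gpus, peak_flops_per_gpu):
    """Fraction of the cluster's peak FLOP/s actually achieved."""
    achieved_flops = 6 * active_params * tokens_per_s  # fwd + bwd per token
    return achieved_flops / (num_gpus * peak_flops_per_gpu)

# Hypothetical: 1M tokens/s, 40B active params, 1,024 GPUs at 989 TFLOP/s peak.
utilization = mfu(1.0e6, 40e9, 1024, 989e12)
```

Note that only the *active* parameter count enters the numerator; this is why MoE models report strong tokens/s even when MFU appears modest.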

6. Architectural Guidelines and Design Laws

Multiple empirical and analytical studies underpin the following guidelines for MegaScale-MoE configuration:

  • Efficiency Leverage (EL): Defined as $EL = C_{\mathrm{dense}}/C_{\mathrm{MoE}}$ (the ratio of FLOPs required to reach equivalent loss), EL scales as a separable power law in activation ratio $A$, compute budget $C$, and expert granularity $G$ (Tian et al., 23 Jul 2025).
  • Optimal Sparsity and Granularity: For maximal EL, use $A \lesssim 4\%$ (typically 2–4%), $G \approx 10$–$12$, and Top-K routing with $E_a \approx 12$ activated experts. Under good routing load balance, EL exceeds 7× at $A \approx 3\%$.
  • Fine-Grained MoE: Empirical evidence strongly supports granularity $G \approx 8$ (many small experts), improving convergence speed and accuracy versus standard (coarse) MoE at equivalent or lower active parameter count (Krajewski et al., 3 Jun 2025).
  • Load-Balancing Losses: Auxiliary load and importance regularization are universally included to prevent expert collapse or overload.
  • Shared Experts: Inclusion of a minimal number of always-activated shared experts reduces redundancy and increases knowledge sharing, as validated by DeepSeekMoE (Dai et al., 2024).
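
The separable power-law form of EL can be written down directly. The prefactor and exponents below are placeholders, not fitted values from the cited study; the sketch only illustrates the functional form and its qualitative behavior.

```python
def efficiency_leverage(a_ratio, compute, granularity,
                        k=1.0, alpha=-0.5, beta=0.05, gamma=0.2):
    """EL = k * A^alpha * C^beta * G^gamma (all coefficients hypothetical)."""
    return k * (a_ratio ** alpha) * (compute ** beta) * (granularity ** gamma)

# With a negative exponent on A, sparser models yield higher leverage:
assert efficiency_leverage(0.03, 1e21, 10) > efficiency_leverage(0.25, 1e21, 10)
```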

7. Implementation Practices and Engineering Insights

Key implementation guidelines, as consolidated from multiple MegaScale-MoE systems, include:

  • Parallel Group Construction: Explicitly instantiate separate attention and MoE collective groups in distributed frameworks (e.g., Megatron-Core) (Liu et al., 21 Apr 2025).
  • Dispatcher Tuning: Use sub-sequence dropping, minimal ETP size, and overlap collectives with compute for performance. Co-locate EP within smallest high-bandwidth domain for efficient token exchanges.
  • Resource Allocation: Configure batch, sequence, and expert-to-device mapping to saturate compute with minimal communication. Employ mixed precision and activation checkpointing aggressively.
  • Practical Hardware: Deployment on H100 or Hopper-class GPUs with high-bandwidth NVLink/NVSwitch interconnects achieves the highest utilization and scaling. Non-expert parameters may be quantized to INT4/INT8 for further reduction in memory and inference cost (Kim et al., 2022).
  • Robustness: Best practices include stochastic gating noise for router smoothing, per-expert gradient clipping, straggler detection (per-expert timing), and checkpoint sharding for distributed storage (He et al., 2021).
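
Per-expert timing for straggler detection, one of the robustness practices above, reduces to comparing each expert's forward latency against the group median. The function name, threshold, and simulated experts below are illustrative assumptions.

```python
import time

def detect_stragglers(expert_fns, batch, slowdown=3.0):
    """Return indices of experts whose forward time exceeds slowdown x median."""
    timings = []
    for fn in expert_fns:
        t0 = time.perf_counter()
        fn(batch)
        timings.append(time.perf_counter() - t0)
    median = sorted(timings)[len(timings) // 2]
    return [i for i, t in enumerate(timings) if t > slowdown * median]

# Simulated experts: three healthy, one degraded.
fast = lambda x: time.sleep(0.01)
slow = lambda x: time.sleep(0.08)
print(detect_stragglers([fast, fast, slow, fast], batch=None))  # → [2]
```

In production the same comparison would run over per-expert kernel timings collected across ranks, feeding checkpoint-and-reschedule logic rather than a local print.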

Table: Major Parallelism Axes in MegaScale-MoE Systems

| Axis | Primary Role | Communication Pattern |
|------|--------------|-----------------------|
| Tensor Parallelism | Large matrix ops | All-Reduce / All-Gather |
| Expert Parallelism | Expert sharding | All-to-All / All-Gather within group |
| Context Parallelism | Sequence sharding | Ring All-Reduce / All-Gather |
| Data Parallelism | Batch sharding | All-Reduce (global) |
| Pipeline Parallelism | Layer partition | Point-to-point send/recv of microbatches |

These practices and models are directly validated in the cited works, establishing the blueprints for practical, communication-efficient, and production-ready MegaScale-MoE implementations (Liu et al., 21 Apr 2025, Jin et al., 16 May 2025, Zhu et al., 3 Apr 2025, Krajewski et al., 3 Jun 2025, Dai et al., 2024, Kim et al., 2021).
