ER-Mapping: Efficient MoE Inference Design
- ER-Mapping is a network-aware design that entwines attention rings with MoE all-to-all domains to reduce communication hops and latency.
- It minimizes multi-hop transfers and congestion by optimally mapping attention all-reduce operations and MoE full token domains within a mesh topology.
- Experimental results show up to 62% reduction in communication latency and a 37% improvement in throughput, confirming its efficiency for large-scale MoE models.
Entwined Ring Mapping (ER-Mapping) is a network-aware placement and communication co-design for expert-parallel Mixture-of-Experts (MoE) inference on mesh-connected wafer-scale chips (WSCs). Addressing the fundamental bottleneck of all-to-all communication in MoE architectures, ER-Mapping jointly optimizes the mapping of attention all-reduce rings and MoE all-to-all domains (Full Token Domains, FTDs) to minimize multi-hop transfers and balance link utilization, achieving substantial reductions in communication latency and congestion. The approach exploits the complementary pattern of “hot” and “cold” links across attention and MoE phases, directly enabling efficient expert migration through the Non-Invasive Balancer (NI-Balancer) (Tang et al., 29 Oct 2025).
1. Motivation and Context
Expert-parallel MoE inference involves dispatching input tokens to expert parameters scattered across devices and collecting results via all-to-all communication. On commodity GPU clusters, restrictive inter-node bandwidth and high communication overheads severely impede scalability. Wafer-scale chips integrate hundreds of compute dies via a high-bandwidth, low-latency two-dimensional mesh, promising one-expert-per-device mapping (E/D ratio down to 1) for large-scale MoE models. However, the mesh topology imposes multi-hop paths for all-to-all exchanges, leading to localized congestion, especially at the center of the wafer, and limiting achievable throughput.
In contrast, attention layers use all-reduce operations (reduce-scatter and all-gather) that can be efficiently mapped onto local rings. The primary insight of ER-Mapping is that careful placement—“entwining” attention rings with MoE FTDs—can preserve low-latency attention all-reduce while drastically reducing MoE all-to-all hop count and mesh congestion. By accepting a modest increase in attention all-reduce path length (which is typically masked by computation), a significant latency drop in MoE all-to-all can be obtained (Tang et al., 29 Oct 2025).
2. Formal Model and Objectives
The WSC consists of $D = X \times Y$ devices arranged in an $X \times Y$ mesh. Each MoE layer is parallelized into:
- tensor-parallel (TP) groups for attention, all of the same size,
- $E$ experts for MoE, ideally $E = D$ (one expert per device).
A mapping is described by a device-to-attention-ring assignment and a partitioning of the devices into FTDs of equal size.
Communication latency for a message of volume $V$ over a link is:
$$T(V) = \alpha + \frac{V}{B}$$
where $B$ is per-link bandwidth and $\alpha$ is link startup latency.
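As a rough illustration, the per-link cost can be sketched as a startup term plus a serialization term. The function names, the linear form, and the store-and-forward treatment of multi-hop paths are assumptions made for this sketch; the source does not confirm how multi-hop transfers are costed.

```python
def link_latency(volume_bytes: float, bandwidth_bytes_per_s: float, startup_s: float) -> float:
    """Assumed linear model: startup latency plus serialization time on one link."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return startup_s + volume_bytes / bandwidth_bytes_per_s


def path_latency(volume_bytes: float, bandwidth_bytes_per_s: float,
                 startup_s: float, hops: int) -> float:
    """Assumption: every hop pays the full per-link cost (store-and-forward)."""
    return hops * link_latency(volume_bytes, bandwidth_bytes_per_s, startup_s)
```

Under this model, halving the hop count of a transfer halves its latency, which is why the mapping focuses on shortening all-to-all paths.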
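The cost objectives are:
- $L_{\mathrm{AR}}$: total all-reduce latency across rings,
- $L_{\mathrm{A2A}}$: total all-to-all latency across FTDs.
The joint objective:
$$\min \; w_{\mathrm{AR}}\, L_{\mathrm{AR}} + w_{\mathrm{A2A}}\, L_{\mathrm{A2A}}$$
with $w_{\mathrm{A2A}} > w_{\mathrm{AR}}$ to prioritize MoE all-to-all minimization.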
Subject to the constraints:
- Each device participates in exactly one attention ring (TP group),
- Each device belongs to exactly one FTD (all-to-all group),
- FTD size is fixed,
- Load balancing: per-device compute stays close to the average.
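The structural constraints can be checked mechanically. The sketch below is a minimal validator under assumed data structures: `ring_of` and `ftd_of` are hypothetical dictionaries from device to ring and FTD index, so the "exactly one" conditions reduce to every device having an entry. Load balancing is not checked here.

```python
from collections import Counter


def check_mapping(devices, ring_of, ftd_of, ftd_size):
    """Return a list of violated constraints (empty if the mapping is valid)."""
    problems = []
    for d in devices:
        # A dict holds one value per key, so membership means "exactly one".
        if d not in ring_of:
            problems.append(f"device {d} has no attention ring")
        if d not in ftd_of:
            problems.append(f"device {d} has no FTD")
    # Every FTD must contain the same, fixed number of devices.
    sizes = Counter(ftd_of[d] for d in devices if d in ftd_of)
    for ftd, count in sorted(sizes.items()):
        if count != ftd_size:
            problems.append(f"FTD {ftd} has {count} devices, expected {ftd_size}")
    return problems
```

For a single 2×2 block where each device belongs to a different ring and all four share FTD 0, `check_mapping` returns an empty list when `ftd_size` is 4.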
3. ER-Mapping Construction and Algorithmic Workflow
ER-Mapping seeks two interlocking mappings:
- Local grouping of TP groups to minimize the spatial extent of FTDs,
- Overlapping (entwined) construction of attention rings such that communication domains are compact and have minimal intersection.
Data Structures and Definitions:
- mesh[X][Y]: grid of device IDs,
- group_id[x][y]: TP group of the device at (x, y),
- FTD_id[x][y]: FTD index of the device at (x, y),
- ring_neighbors[group]: ordered devices in each attention ring.
Pseudo-code for even mesh dimensions X and Y with 2×2 FTDs (generalization indicated):
```
for i in [0 .. X/2-1], j in [0 .. Y/2-1]:
    // define the 2x2 FTD at block (i, j); its four devices share one FTD index
    base_x = 2*i, base_y = 2*j
    for dx in {0, 1}, dy in {0, 1}:
        FTD_id[base_x+dx][base_y+dy] = i*(Y/2) + j
for each FTD f:
    // place one device from each TP group in a checkerboard
    assign group_id[x][y] for every device (x, y) in f
for each TP group g:
    // ring links span two hops and may cross FTDs
    ring_neighbors[g] = loop across devices of group g
return group_id, FTD_id, ring_neighbors
```
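To see why compact FTDs matter, the sketch below compares the mean pairwise hop count of a 2×2 block with a spread-out group of the same size. It assumes hop count equals Manhattan distance (dimension-ordered routing on a mesh); the stride-2 placement is a hypothetical counterexample, not the baseline used in the source.

```python
from itertools import combinations


def manhattan(a, b):
    """Hop count between two mesh coordinates under dimension-ordered routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mean_pairwise_hops(devices):
    """Average hop count over all unordered device pairs in one all-to-all domain."""
    pairs = list(combinations(devices, 2))
    return sum(manhattan(a, b) for a, b in pairs) / len(pairs)


compact_ftd = [(0, 0), (0, 1), (1, 0), (1, 1)]  # 2x2 block
spread_ftd = [(0, 0), (0, 2), (2, 0), (2, 2)]   # hypothetical stride-2 placement

print(mean_pairwise_hops(compact_ftd))  # 8 hops over 6 pairs
print(mean_pairwise_hops(spread_ftd))   # 16 hops over 6 pairs
```

The spread placement doubles the average hop count of the all-to-all domain, so every exchange also occupies twice as many links.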
4. Complementary Link Usage and Non-invasive Balancer
Attention all-reduce and MoE all-to-all patterns utilize disjoint mesh links. During attention computation, only the links forming the small entwined attention rings are active (“hot”); the remainder are idle (“cold”). During MoE phases, local FTD blocks and inter-FTD links become hot, and rings are cold.
This complementary link activity enables “opportunistic” utilization for expert state migration. The NI-Balancer divides migration into small, non-blocking substeps, each using available cold links—performing local migrations inside FTDs when attention rings are inactive and using global links when MoE communication is idle. This pattern ensures migration overhead is hidden from critical path communication (Tang et al., 29 Oct 2025).
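The idea of fitting migration substeps into cold links can be sketched as a simple greedy assignment. This is an illustration only, not the NI-Balancer's actual algorithm: the function name, the link labels, and the representation of each phase as a set of hot links are all assumptions.

```python
def schedule_substeps(substeps, hot_links_by_phase, phases):
    """Assign each migration substep to the first phase in which none of its
    links are hot. Substeps that never fit are mapped to None."""
    plan = {}
    for name, links in substeps.items():
        plan[name] = next(
            (p for p in phases if not set(links) & set(hot_links_by_phase[p])),
            None,
        )
    return plan


# Hypothetical link labels: "r1" is a ring link, "f1" is an FTD-internal link.
hot = {"attention": {"r1"}, "moe": {"f1"}}
moves = {"local_move": ["f1"], "global_move": ["r1"], "mixed_move": ["r1", "f1"]}
print(schedule_substeps(moves, hot, ["attention", "moe"]))
```

A substep that needs both link classes at once never finds a cold window in this sketch, which is one reason to split migrations into small substeps that each touch a single class of link.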
5. Performance Evaluation
Measured on an WSC:
- Baseline (corner-based mapping) all-to-all latency per MoE layer: 4.8 ms,
- ER-Mapping (with 2×2 FTDs, non-overlapping) reduces all-to-all to 1.8 ms (62% reduction),
- Combined attention + all-to-all communication: up to 62% reduction,
- Qwen3-235B MoE, EP=128: end-to-end throughput, baseline 1.0 tokens/ms, ER-Mapping 1.37 tokens/ms (+37%),
- Versus DGX-based clusters (4x8 GPUs): 73% higher throughput for token counts ≥ 256,
- Multi-wafer (4 WSC) deployment with hierarchical ER-Mapping sustains 54–62% speedup across expert/tensor-parallel configurations.
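The headline percentages follow directly from the reported latencies and throughputs. The helper names below are illustrative; the input values are the ones listed above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative decrease, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before


def percent_gain(before: float, after: float) -> float:
    """Relative increase, in percent, from `before` to `after`."""
    return 100.0 * (after - before) / before


# All-to-all latency per MoE layer: 4.8 ms -> 1.8 ms
print(f"{percent_reduction(4.8, 1.8):.1f}% lower all-to-all latency")
# End-to-end throughput: 1.0 -> 1.37 tokens/ms
print(f"{percent_gain(1.0, 1.37):.1f}% higher throughput")
```

The first figure works out to 62.5%, which matches the quoted 62% reduction.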
6. Deployment and Implementation Guidelines
Communication Primitives: Use two-hop ring all-reduce (reduce-scatter, all-gather) for attention via entwined rings, and an all-to-all among the nodes of each FTD using precomputed routing.
Asynchrony: Use three hardware streams per device:
- attention-stream: overlaps ring all-reduce and compute,
- MoE-stream: overlaps FTD all-to-all and MoE compute,
- migration-stream: fires expert migrations during link idle periods.
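The three-stream arrangement can be mimicked on the host with three single-worker queues: work within one stream runs in submission order, while different streams proceed independently. This is only an analogy built on Python threads; the class name and stream labels are assumptions, and real hardware streams are managed by the device runtime.

```python
from concurrent.futures import ThreadPoolExecutor


class DeviceStreams:
    """Three FIFO work queues standing in for hardware streams (assumption)."""

    def __init__(self):
        # One worker per pool keeps tasks of the same stream in submission order.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=1)
            for name in ("attention", "moe", "migration")
        }

    def submit(self, stream, fn, *args):
        """Queue fn(*args) on the named stream and return its Future."""
        return self._pools[stream].submit(fn, *args)

    def close(self):
        """Wait for all queued work on every stream to finish."""
        for pool in self._pools.values():
            pool.shutdown(wait=True)


log = []
streams = DeviceStreams()
for step in range(3):
    streams.submit("attention", log.append, ("attention", step))
    streams.submit("moe", log.append, ("moe", step))
    streams.submit("migration", log.append, ("migration", step))
streams.close()
```

After `close()`, entries from different streams may interleave in `log`, but the steps of each individual stream appear in order.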
Parameter Settings:
- FTD size should equal the number of TP groups (one device from each group per FTD),
- Mesh axes must be multiples of the FTD block side length (2 for 2×2 FTDs),
- For irregular meshes or sizes that do not divide evenly, use block padding or hierarchical mapping (intra-chip then inter-chip).
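Block padding amounts to rounding each mesh axis up to the next multiple of the block side. The helper below is a minimal sketch; the function name is hypothetical, and how padded (absent) devices are treated by the mapping is left open.

```python
def padded_mesh_dims(x: int, y: int, block: int) -> tuple[int, int]:
    """Round each mesh axis up to the next multiple of the block side length."""
    if block <= 0:
        raise ValueError("block side must be positive")

    def round_up(n: int) -> int:
        # Ceiling division, then scale back to a multiple of the block side.
        return -(-n // block) * block

    return round_up(x), round_up(y)
```

For example, a 7×8 mesh with 2×2 blocks pads to 8×8, while an 8×8 mesh is left unchanged.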
Limitations:
- Slight increase in attention all-reduce path length (rings ≈ 2× longer)—offset by compute,
- Mesh/fabric size must permit block partitioning,
- Load-balancer parameters (migration threshold, no-migration interval) require tuning to avoid excessive migration.
7. Summary and Significance
ER-Mapping addresses the core challenge of scalable, efficient MoE inference on mesh-based WSCs by tightly entwining the placement and communication patterns of attention and expert-parallel domains into compact neighborhoods. This restructuring halves typical hop counts, alleviates central congestion, and realizes up to 62% reduction in communication overhead. Associated mechanisms such as NI-Balancer further support efficient expert migration without disrupting primary compute or communication. Experimental evidence shows up to 39% higher per-device throughput relative to NVL72 supernodes, confirming the scalability and efficiency of the approach for next-generation large-scale LLMs on wafer-scale mesh topologies (Tang et al., 29 Oct 2025).