HD-MoE: 3D Co-Design for MoE LLMs
- HD-MoE is a co-design framework combining advanced 3D hardware integration with dynamic scheduling to optimize MoE inference and training in LLMs.
- It employs vertically stacked HBM, reconfigurable systolic arrays, and adaptive mapping to efficiently handle irregular GEMV/GEMM ratios and memory constraints.
- Dynamic fusion scheduling and selective expert placement in HD-MoE achieve up to 2× lower latency, increased throughput, and significant energy savings.
HD-MoE refers to a family of hardware–software co-design solutions and parallel mapping strategies that optimize Mixture-of-Experts (MoE) inference and training for LLMs on memory-centric accelerator systems. In contemporary literature, the term covers both 3D vertically integrated hardware architectures for energy-efficient, low-latency expert serving and hybrid, dynamic parallelization algorithms that maximize utilization and throughput on 3D near-memory processing platforms. These approaches target the main bottlenecks of MoE deployment (highly irregular workloads, unpredictable GEMV/GEMM ratios, memory-bound computation, and distributed dataflow congestion) by tightly coupling accelerator microarchitecture design with runtime-aware expert placement and scheduling (Huang et al., 25 Jul 2025, Huang et al., 11 Sep 2025).
1. Principles and Motivations of HD-MoE
HD-MoE is motivated by three interacting challenges inherent to fine-grained MoE models in LLM serving:
- Irregular GEMV/GEMM Ratios: Inference with MoE models incurs variable arithmetic loads, with shifts between large general matrix-matrix multiplications (GEMM, e.g., prefill and some MoE phases) and many parallelized matrix-vector multiplications (GEMV, e.g., decode and sparse expert routing). Static hardware allocation leads to severe underutilization when the GEMV/GEMM ratio fluctuates.
- Memory and Data Movement Constraints: MoE models demand high DRAM bandwidth and low-latency expert fetches, since expert parameters dominate model size and dispatch patterns are dynamic.
- Suboptimal Pipeline and Compute Overlap: Classical hardware and scheduling strategies for MoE LLMs cannot fuse attention layer computation with expert invocations, resulting in pipeline bubbles and increased queuing latency.
By vertically integrating compute with HBM and co-optimizing parallelism and expert placement, HD-MoE achieves high efficiency under these demanding conditions (Huang et al., 25 Jul 2025, Huang et al., 11 Sep 2025).
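The gap between the GEMV and GEMM regimes can be made concrete with arithmetic intensity (FLOPs per byte moved). The following is a minimal sketch, not taken from the cited papers: the layer shape, the 2-byte element width, and the function name `arithmetic_intensity` are assumptions chosen only for illustration.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumes each operand is read once and the result is written once,
    which ignores caching and tiling effects.
    """
    flops = 2 * m * k * n                               # one multiply + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C
    return flops / traffic


# Hypothetical expert weight shape (K x N), for illustration only.
K, N = 2048, 1024

decode = arithmetic_intensity(1, K, N)     # single token: a GEMV
prefill = arithmetic_intensity(256, K, N)  # 256 tokens batched: a GEMM

print(f"GEMV intensity: {decode:.2f} FLOP/byte")
print(f"GEMM intensity: {prefill:.2f} FLOP/byte")
```

Under these assumptions the GEMV case performs roughly one FLOP per byte and is bound by memory bandwidth, while the batched GEMM case reuses each weight across many tokens and is bound by compute. Hardware sized for one regime is therefore poorly matched to the other.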
2. 3D Heterogeneous Integration and Microarchitecture
HD-MoE systems center on vertically stacked 3D architectures:
- Physical Structure: Multiple tiers consist of HBM DRAM dies (top), an HBM logic die with V-Cache SRAM (middle), and a 7 nm compute die (bottom) hosting reconfigurable systolic arrays and global buffers.
- Interconnect: Dense through-silicon-via (TSV) arrays (>100,000 TSV/mm²) and bumpless hybrid bonds provide >960 GB/s bandwidth per HBM stack, with direct logic-memory coupling that eliminates off-chip serializer-deserializer (SerDes) and NoC router power and latency.
- Energy and Latency Advantages:
  - DRAM access energy is approximately halved compared to 2.5D interposer designs.
  - On-die round-trip latency drops by 30–50 ns; router area shrinks ~10%.
This physical integration enables all-on-die memory bandwidth scaling with model/array size while maintaining a minimized energy envelope for expert fetches (Huang et al., 25 Jul 2025).
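To see why per-stack bandwidth matters for expert fetches, a lower bound on the time to stream one expert's weights can be estimated as bytes divided by bandwidth. This is a rough sketch: the 960 GB/s figure comes from the text above, while the expert size, the 2-byte weight width, and the helper name `expert_fetch_time_us` are hypothetical.

```python
def expert_fetch_time_us(num_params: int, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Lower-bound time (microseconds) to stream one expert's weights from HBM.

    Ignores row activation, refresh, and contention, so real fetches take longer.
    """
    total_bytes = num_params * bytes_per_param
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e6


# Hypothetical expert: three 2048 x 1024 projection matrices stored in 16 bits.
params = 3 * 2048 * 1024
t = expert_fetch_time_us(params, bytes_per_param=2, bandwidth_gb_s=960.0)
print(f"{t:.1f} us per expert fetch at 960 GB/s")
```

Because the estimate scales linearly with bytes moved, any reduction in weight precision or any increase in usable bandwidth translates directly into shorter expert fetches.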
3. Systolic Arrays and Adaptive Compute Allocation
HD-MoE employs dynamic, 3D-stacked systolic arrays that adapt to workload composition:
- Reconfiguration Between GEMM and GEMV: The array can switch at runtime between full GEMM tiles (weight- or input-stationary) and many concurrent GEMVs with efficient V-Cache reuse. This keeps array utilization above 0.8 for sparse or decode-heavy schedules and lets it approach 0.95 under pure GEMM.
- Pipeline Fill Efficiency: Reconfiguration and TSV-enabled broadcasting shorten the pipeline fill time.
- High-Throughput Microarchitecture: Core designs (e.g., 16×16 PE arrays at 1 GHz) reach 512 GOP/s per array, with multi-level stacking facilitating aggregate throughput growth.
This architecture maintains >90% array utilization and >95% HBM bandwidth for realistic MoE model and batch scaling regimes, compared to <50% for non-adaptive or 2.5D baselines (Huang et al., 25 Jul 2025).
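The 512 GOP/s figure follows from counting each multiply-accumulate as two operations, and utilization then scales the delivered throughput. A minimal sketch of that arithmetic, where the two-operations-per-MAC convention and the helper names are assumptions:

```python
def peak_gops(rows: int, cols: int, clock_ghz: float, ops_per_mac: int = 2) -> float:
    """Peak throughput in GOP/s for a rows x cols array with one MAC per PE per cycle."""
    return rows * cols * ops_per_mac * clock_ghz


def effective_gops(peak: float, utilization: float) -> float:
    """Delivered throughput when only a fraction of PEs do useful work."""
    return peak * utilization


peak = peak_gops(16, 16, clock_ghz=1.0)
print(f"peak: {peak:.0f} GOP/s")

# Utilization levels mentioned in this section: a sub-50% baseline, 0.8, and 0.95.
for u in (0.5, 0.8, 0.95):
    print(f"utilization {u:.2f}: {effective_gops(peak, u):.1f} GOP/s")
```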
4. Fusion Scheduling and HBM Access Optimization
HD-MoE co-designs hardware and runtime scheduling:
- Hardware-Resource-Aware Operation Fusion Scheduler (HR-OFS): An on-die scheduler predicts per-token and per-layer arithmetic intensity and aggressively fuses attention computation (e.g., QKV generation) with MoE expert invocations. This overlap reduces pipeline stalls, cutting latency by up to 65 ms per 256-token prefill (a ∼40% reduction in p99 latency).
- Score-Aware HBM Reduction:
  - Even-Odd Expert Placement: High-score experts (those whose gating score exceeds a threshold) are fetched in full precision; the others are demoted to FP8 and mapped to separate DRAM rows. Offline profiling determines per-layer exponent ranges for the precision downgrade.
  - Bandwidth Efficiency: This scheme yields a 1.35×–1.44× reduction in total DRAM bandwidth for expert weight accesses, matching the top-K gating distribution.
These two mechanisms maintain high hardware efficiency and minimize off-chip communication (Huang et al., 25 Jul 2025).
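The score-aware reduction can be sketched as a per-expert choice of fetch width. This is an illustrative model rather than the authors' implementation: it assumes full precision means 16-bit weights, and the threshold value, gating scores, and function names are hypothetical.

```python
from typing import Sequence


def expert_fetch_bytes(gate_scores: Sequence[float], params_per_expert: int,
                       threshold: float, hi_bytes: int = 2, lo_bytes: int = 1) -> int:
    """Bytes fetched when experts at or above `threshold` use hi_bytes per weight
    and all other selected experts use lo_bytes (the FP8 path)."""
    total = 0
    for score in gate_scores:
        width = hi_bytes if score >= threshold else lo_bytes
        total += params_per_expert * width
    return total


def bandwidth_reduction(gate_scores: Sequence[float], params_per_expert: int,
                        threshold: float) -> float:
    """Ratio of all-16-bit traffic to mixed-precision traffic (assumed 16-bit baseline)."""
    baseline = len(gate_scores) * params_per_expert * 2
    return baseline / expert_fetch_bytes(gate_scores, params_per_expert, threshold)


# Hypothetical top-8 gating scores for one token.
scores = [0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.06, 0.04]
ratio = bandwidth_reduction(scores, params_per_expert=1_000_000, threshold=0.10)
print(f"bandwidth reduction: {ratio:.2f}x")
```

In this model the reduction depends only on the fraction of selected experts that stay at full precision, so a more skewed gating distribution (fewer high-score experts) yields a larger saving.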
5. Hybrid and Dynamic Parallelism Mapping
HD-MoE develops a parallel mapping algorithm explicitly for heterogeneous 3D near-memory processing (NMP) systems:
- Parallelism Types:
  - Tensor Parallelism (TP) achieves computational balance by splitting weights, but incurs expensive, persistent all-reduce operations.
  - Expert Parallelism (EP) reduces communication but can suffer from severe compute imbalance if expert activation frequencies are skewed.
  - Hybrid TP–EP strategies statically assign “hot” experts to TP and “cold” experts to EP, but cannot adapt to runtime variations.
- HD-MoE Mapping Algorithm:
  - Stage 1: A linear-programming (LP) based node-level placement solves for the expert-to-node allocation, jointly optimizing computation load and communication volume based on expert activation frequencies.
  - Stage 2: Bayesian optimization (BO) assigns logical groups to physical 2D mesh locations, flattening NoC congestion and communication hot spots.
- Online Dynamic Scheduling:
  - Predict next-layer expert hotness and pre-broadcast hot expert weights to candidate nodes.
  - Dynamically dispatch tokens only to nodes that hold all required experts, choosing the destination with the lowest compute load.
This two-stage, adaptive mapping and scheduling delivers up to 1.8× speedup over TP and 1.4× over static hybrid parallelism, with the dynamic scheduler further adding 1.15–1.25× speedup in realistic batch traces (Huang et al., 11 Sep 2025).
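The online dispatch rule described above (send a token only to a node that holds all of its required experts, and prefer the least-loaded such node) can be sketched as follows. This is a simplified illustration, not the paper's scheduler: the data structures, the unit per-token cost, and the `None` fallback when no node qualifies are assumptions.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical state: which experts each node currently holds, and its load.
node_experts: Dict[int, Set[int]] = {0: {0, 1, 2}, 1: {2, 3}, 2: {0, 1, 2, 3}}
node_load: Dict[int, float] = {0: 5.0, 1: 1.0, 2: 3.0}


def prebroadcast(hot_experts: Set[int], targets: Iterable[int]) -> None:
    """Replicate predicted-hot expert weights onto candidate nodes."""
    for node in targets:
        node_experts[node] |= hot_experts


def dispatch_token(required: Set[int], cost: float = 1.0) -> Optional[int]:
    """Pick the least-loaded node that holds every required expert.

    Returns None when no node qualifies; the fallback policy is left open here.
    """
    candidates = [n for n, held in node_experts.items() if required <= held]
    if not candidates:
        return None
    best = min(candidates, key=lambda n: node_load[n])
    node_load[best] += cost  # account for the newly assigned work
    return best


first = dispatch_token({0, 2})   # nodes 0 and 2 qualify
second = dispatch_token({2, 3})  # nodes 1 and 2 qualify
third = dispatch_token({4})      # no node holds expert 4 yet
prebroadcast({4}, targets=[1])   # replicate expert 4 onto node 1
fourth = dispatch_token({4})
print(first, second, third, fourth)
```

Pre-broadcasting enlarges the candidate set for tokens that need hot experts, which is what gives the load-based choice room to balance work across nodes.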
6. Empirical Results and Comparative Performance
HD-MoE has been benchmarked across leading MoE LLMs (OLMoE-1B-7B, DeepSeek-V2-Lite, Qwen1.5-MoE-A2.7B). Values are normalized to the NeuPIM baseline:
| Metric | NeuPIM (2.5D) | Duplex (2.5D) | HD-MoE (3D) | Speedup vs. NeuPIM | Speedup vs. Duplex |
|---|---|---|---|---|---|
| 99th-pct latency (TBT_p99) | 1.00× | 0.90× | 0.50× | 2.0× | 1.8× |
| Throughput (tokens/sec) | 1.00× | 1.25× | 1.80× | 1.8× | 1.44× |
| Energy per token | 1.00× | 0.60× | 0.25× | 4.0× | 2.4× |
Empirically, the HD-MoE hardware–software co-design simultaneously lowers p99 latency (1.8–2×), raises throughput (1.44–1.8×), and cuts energy per token (2.4–4×) relative to state-of-the-art 2.5D designs (Huang et al., 25 Jul 2025, Huang et al., 11 Sep 2025).
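The speedup columns in the table follow directly from the normalized values: for latency and energy (lower is better) the speedup is baseline divided by HD-MoE, and for throughput (higher is better) it is HD-MoE divided by baseline. A short sketch that recomputes them, with dictionary keys and the function name chosen here for illustration:

```python
# Normalized values copied from the table above.
results = {
    "latency_p99":      {"NeuPIM": 1.00, "Duplex": 0.90, "HD-MoE": 0.50},
    "throughput":       {"NeuPIM": 1.00, "Duplex": 1.25, "HD-MoE": 1.80},
    "energy_per_token": {"NeuPIM": 1.00, "Duplex": 0.60, "HD-MoE": 0.25},
}
HIGHER_IS_BETTER = {"throughput"}


def speedup(metric: str, baseline: str, target: str = "HD-MoE") -> float:
    """Improvement factor of `target` over `baseline` for the given metric."""
    base, new = results[metric][baseline], results[metric][target]
    return new / base if metric in HIGHER_IS_BETTER else base / new


for metric in results:
    for baseline in ("NeuPIM", "Duplex"):
        print(f"{metric:>16} vs {baseline}: {speedup(metric, baseline):.2f}x")
```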
7. Generalization and Design Guidelines
All fundamental HD-MoE design principles (TSV-dense logic-memory integration, morphable GEMV/GEMM arrays with caching, fusion scheduling, and hybrid parallelism) generalize to multiple classes of transformer models:
- They apply similarly to coarse-grained MoEs, encoder–decoder transformers, and long-sequence transformers, all of which exhibit memory vs. compute imbalances, dynamic routing, and pipeline stalls.
- Realizing comparable gains in future MoE accelerators requires tight vertical integration, dynamic mapping of experts to compute and memory, and hardware/software schedulers that exploit the actual gating distributions and pipeline states (Huang et al., 25 Jul 2025).
A plausible implication is that as scale and workload heterogeneity increase, further gains may derive from increasingly fine-grained, runtime-adaptive fusion and predictive expert-to-tile mapping, extending HD-MoE principles to backward (training) passes and distributed NMP fabrics.
References:
- (Huang et al., 25 Jul 2025) "A3D-MoE: Acceleration of LLMs with Mixture of Experts via 3D Heterogeneous Integration"
- (Huang et al., 11 Sep 2025) "HD-MoE: Hybrid and Dynamic Parallelism for Mixture-of-Expert LLMs with 3D Near-Memory Processing"