LLEP: Efficient Load Balancing in MoE Models
- LLEP is an inference-time framework for Mixture-of-Experts models that mitigates imbalanced GPU loads by reallocating tokens without altering trained parameters.
- It utilizes a Least-Loaded Assignment procedure to dynamically redistribute tokens, ensuring bounded compute and memory overhead during inference.
- Empirical evaluations show throughput improvements of up to 6.1× and significant peak memory reductions, enabling efficient deployment of large-scale MoE architectures.
Least-Loaded Expert Parallelism (LLEP) is an inference- and post-training-time framework for Mixture-of-Experts (MoE) transformer models that addresses expert load imbalance under distributed expert parallelism (EP). Unlike train-time routing-balancing methods, LLEP operates without modifying either the router or the trained expert parameters. Its objective is to minimize global inference latency and reduce peak memory across multi-GPU clusters by re-routing and re-allocating workloads from overloaded to underutilized devices, thereby enabling high-throughput deployment of heavily imbalanced MoE architectures (Nguyen et al., 23 Jan 2026).
1. Load Imbalance in Mixture-of-Experts and Limits of Traditional EP
Mixture-of-Experts models route input tokens to a subset (top-K) of a large pool of N feed-forward experts, assigning each token to up to K experts based on learned gating scores. Standard Expert Parallelism partitions the N experts evenly across P devices and assumes each expert receives approximately equal load.
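This routing and load-counting step can be sketched as follows (a minimal, single-process illustration; the function names are assumptions, not the paper's implementation):

```python
def topk_route(gate_scores, K):
    # Rank experts by gating score for a single token; return the top-K expert ids.
    ranked = sorted(range(len(gate_scores)), key=lambda e: -gate_scores[e])
    return ranked[:K]

def expert_loads(batch_scores, K, N):
    # ℓ_i: total number of tokens routed to expert i across the batch.
    loads = [0] * N
    for scores in batch_scores:
        for e in topk_route(scores, K):
            loads[e] += 1
    return loads
```

When gating scores are skewed, the resulting per-expert loads ℓ_i are skewed in exactly the way the next paragraph describes.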
However, both empirical analysis and recent theoretical work document that, after pre-training—even with strong auxiliary balancing losses—expert utilization can remain highly skewed: global loads ℓ_i (the total count of tokens routed to expert i) can vary by orders of magnitude. Under such imbalanced routing, static EP assignments cause some GPUs to become compute-bound or exceed memory constraints, severely bottlenecking system throughput or even triggering out-of-memory failures during post-training or inference.
LLEP is motivated by the observation that, at inference time, traditional balancing constraints cannot be re-enforced without retraining, but practical deployments require adaptively addressing hardware bottlenecks induced by imbalance (Nguyen et al., 23 Jan 2026).
2. LLEP Algorithmic Structure and Assignment Logic
LLEP operates by analyzing, prior to each MoE layer dispatch, the per-expert global token loads ℓ = [ℓ_0, …, ℓ_{N−1}]. It then computes the maximum load-to-average ratio ρ = max_i ℓ_i / ℓ̄, where ℓ̄ = (1/N) Σ_i ℓ_i. If ρ is below a tunable threshold τ (default τ ≈ 1), LLEP reverts to standard EP. Otherwise, load is reassigned by the Least-Loaded Assignment (LLA) procedure:
- Assign each expert's tokens to its native device up to a per-GPU workload capacity C = α · (Σ_i ℓ_i)/P, with α ≥ 1 a slack factor.
- If any expert's assigned batch would send a GPU over C, spill excess tokens (in contiguous chunks of ≥ m tokens, where m ensures GEMM efficiency) to the least-loaded available GPUs, possibly requiring peer-to-peer transfer of expert weights.
This logic is implemented by sorting ℓ in descending order, successively assigning token slices for each expert to the device with maximal remaining headroom, and tracking both native and spilled token assignments and weight transfer plans. The scheduling process is agnostic to router scores and leaves all inference computations and model semantics unchanged (Nguyen et al., 23 Jan 2026).
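The LLA steps above can be sketched as follows (a simplified, single-process illustration under the assumption that N is divisible by P; the actual implementation additionally plans weight transfers and operates on device-resident tensors):

```python
def least_loaded_assignment(loads, P, alpha=1.2, m=512):
    """Greedy LLA sketch: keep each expert's tokens on its native GPU up to
    capacity C = alpha * total / P, then spill the excess, in chunks of at
    least m tokens, to the currently least-loaded GPU."""
    N = len(loads)
    experts_per_gpu = N // P          # assumes N divisible by P
    C = alpha * sum(loads) / P        # per-GPU workload capacity
    gpu_load = [0.0] * P
    plan = []                         # (expert_id, gpu_id, n_tokens) slices
    # Process experts in descending load order.
    for e in sorted(range(N), key=lambda i: -loads[i]):
        native = e // experts_per_gpu
        remaining = loads[e]
        # Native portion: no weight transfer needed.
        keep = min(remaining, max(0.0, C - gpu_load[native]))
        if keep > 0:
            plan.append((e, native, keep))
            gpu_load[native] += keep
            remaining -= keep
        # Spilled portions: require shipping expert e's weights to the target.
        while remaining > 0:
            target = min(range(P), key=lambda g: gpu_load[g])
            headroom = max(0.0, C - gpu_load[target])
            chunk = min(remaining, max(m, headroom))  # keep chunks GEMM-friendly
            plan.append((e, target, chunk))
            gpu_load[target] += chunk
            remaining -= chunk
    return plan, gpu_load
```

Under an extreme skew such as loads = [950, 10, 10, 10, 5, 5, 5, 5] with P = 4, the sketch splits expert 0's tokens across all four GPUs while conserving the total token count and keeping every GPU within roughly C plus one chunk.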
3. Complexity, Memory and Latency Considerations
LLEP provides strong theoretical guarantees:
- Each GPU processes at most C = α · (Σ_i ℓ_i)/P tokens; total GEMM and memory cost per GPU is immediately bounded as O(C · D · H), where D is the model dimension and H the expert hidden dimension.
- Peak memory per GPU is O(C · D) for dispatched activations, plus a bounded set of foreign expert weights and temporary buffers.
- Assignment is performed in O(N log N) due to expert sorting; the overall algorithmic overhead (including LLA and token reassignment) is negligible compared to GEMM compute for reasonable batch sizes and cluster scales.
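To make these bounds concrete, a back-of-the-envelope calculation with illustrative values (the specific numbers, the three-matrix expert shape, and fp16 precision are assumptions, not figures from the paper):

```python
# Illustrative per-GPU bound calculation under LLEP's capacity cap.
T, K, P = 16384, 8, 32            # tokens in batch, experts per token, GPUs
alpha = 1.2                       # capacity slack factor
D, H = 7168, 2048                 # model dim, expert hidden dim (DeepSeek-V3-like)

cap = alpha * T * K / P           # max token-slots any single GPU processes
gemm_flops = cap * 2 * 3 * D * H  # three expert projections, 2 FLOPs per MAC
act_bytes = cap * D * 2           # fp16 activations dispatched to that GPU

print(f"cap={cap:.0f} tokens, GEMM={gemm_flops/1e12:.2f} TFLOP, "
      f"acts={act_bytes/2**20:.0f} MiB")
```

The point of the calculation is that both quantities scale linearly in α and inversely in P, so the slack factor directly trades balance quality against the worst-case per-GPU cost.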
LLEP relies on efficient NCCL all-to-all collectives for token and affinity dispatch, and peer-to-peer GPU memory copy (cudaMemcpyPeer) for weight transfers. Communication and compute are overlapped where possible to minimize latency impact (Nguyen et al., 23 Jan 2026).
4. Empirical Results and Comparative Performance
Controlled experiments on models including GPT-OSS-120B (N=128, D=H=2880, K=4), DeepSeek-V3 (N=256, D=7168, H=2048, K=8), and Kimi-K2 (N=384, D=7168, H=2048, K=8) demonstrate:
- Under extreme imbalance (e.g., 95% of all tokens funneled to a single expert):
- LLEP yields 4.8×–6.1× throughput speedup compared to standard EP.
- Peak per-layer GPU memory is reduced by up to 5×, eliminating out-of-memory failures (Nguyen et al., 23 Jan 2026).
- Full-model throughput, e.g. in Megatron-Math inference, shows a 2.2× speedup on the 20B model and 1.9× on the 120B model; in fine-tuning, convergence occurs ~1.25× faster. End-to-end accuracy is preserved exactly.
- Ablations indicate that LLEP's speedup grows with batch size B, with model width, and with the degree of imbalance.
LLEP is parameterized by α (slack, typically 1.1–1.5), m (GEMM chunk size, 512–2048 tokens), and τ (imbalance trigger threshold, 1.0–1.2), each of which can be tuned for hardware and workload.
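These tunables and the τ-based trigger can be grouped into a small configuration object (a hedged sketch; the class and method names are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class LLEPConfig:
    alpha: float = 1.2   # per-GPU capacity slack (typical range 1.1-1.5)
    m: int = 1024        # minimum spill chunk, in tokens (512-2048)
    tau: float = 1.1     # imbalance ratio that triggers LLA (1.0-1.2)

    def should_rebalance(self, loads):
        # Invoke LLA only when the max load-to-average ratio exceeds tau;
        # otherwise fall back to standard EP dispatch.
        mean = sum(loads) / len(loads)
        return mean > 0 and max(loads) / mean > self.tau
```

Exposing the three knobs in one place makes per-cluster tuning (e.g. larger m on GPUs with higher GEMM efficiency thresholds) straightforward.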
5. Integration and Deployment in MoE Systems
The deployment of LLEP in existing MoE+EP codebases (e.g., DeepSpeed, FSDP) requires:
- Accumulating global per-expert loads immediately after routing.
- Broadcasting ℓ to all devices and running LLA/LLAS locally for assignment and weight plans.
- Replacing the EP dispatch step with LLEP’s spill-aware dispatch mechanism and triggering expert weight transfers.
- Handling forward-pass and backpropagation for both native and foreign experts, including correct accumulation and return of gradients.
- Maintaining a per-GPU cache of foreign expert weights, with appropriate memory eviction strategy (Nguyen et al., 23 Jan 2026).
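The last integration step, the per-GPU foreign-weight cache, can be sketched as follows. LRU is an assumed eviction policy; the source specifies only a bounded cache with eviction, and in a real system `fetch_fn` would wrap a peer-to-peer copy such as cudaMemcpyPeer:

```python
from collections import OrderedDict

class ForeignExpertCache:
    """Bounded per-GPU cache of peer-transferred expert weights (LRU sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()      # expert_id -> weights

    def get(self, expert_id, fetch_fn):
        if expert_id in self._cache:     # hit: refresh recency, no transfer
            self._cache.move_to_end(expert_id)
            return self._cache[expert_id]
        weights = fetch_fn(expert_id)    # miss: trigger a weight transfer
        self._cache[expert_id] = weights
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least-recently-used entry
        return weights
```

Keeping the cache bounded preserves LLEP's peak-memory guarantee even when many distinct experts spill onto the same device over successive batches.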
Recommended best practices include preferring within-node token/expert spillover in multi-node clusters to minimize inter-node bandwidth overhead, tuning α to trade weight-transfer overhead against load balance, and choosing m large enough to sustain GEMM throughput.
6. Comparative Analysis: LLEP vs. BIP-based and Loss-Based Balancing
LLEP is fundamentally an inference-time, throughput-optimizing dispatch and scheduling method, not a routing algorithm. Its primary goal is hardware efficiency under fixed MoE routing. By contrast, BIP-Based Balancing and related train-time methods (Loss-Free, Loss-Controlled) modify or regularize the router to achieve balanced token-to-expert assignment, typically by solving or approximating load-constrained optimization problems and maintaining dual variables during training (Sun, 21 Feb 2025).
- LLEP’s routing is “greedy” with respect to hardware load only, not router affinity scores. This can starve high-performing experts under certain pathological imbalance.
- BIP-Based Balancing is score-aware and optimizes total gating affinity subject to load bounds, achieving near-perfect balance at each step, but incurs additional inference overhead and has been empirically validated only in simulation at pre-training time (Sun, 21 Feb 2025).
- LLEP is computationally lightweight, scales with O(N log N), and can be integrated without retraining or modifying router/expert parameters, making it suitable for serving pre-trained models with arbitrary imbalance patterns.
For deployments where model utility is tied to precise routing, hybrid deployment patterns are possible (e.g., BIP-based balance during early pre-training and LLEP in post-training inference), acknowledging that LLEP introduces no gating-score regularization or balancing loss by design (Sun, 21 Feb 2025, Nguyen et al., 23 Jan 2026).
7. Limitations and Areas for Further Research
LLEP does not address imbalance at train time or improve expert specialization; it responds to fixed routing tables and can only rebalance token and weight placement at the hardware dispatch level. Its effectiveness assumes efficient implementation of all-to-all collectives and fast intra-node weight transfers. Peak memory is theoretically bounded, but in highly overparameterized settings, transient spikes may require tuning or runtime adaptation.
The method has been demonstrated in simulation and large-scale inference/fine-tuning runs, but characterizing its behavior under non-LLM workloads and integrating with dynamic or adaptive routing schemes remains an open problem.
LLEP provides an extensible framework for large, imbalanced MoE model deployment, enabling high-throughput, memory-efficient inference without sacrificing model semantics, as validated by Nguyen et al. (Nguyen et al., 23 Jan 2026).