
Expert Choice Routing in MoE Models

Updated 20 November 2025
  • The paper introduces expert choice routing, where experts select tokens to process, ensuring balanced load allocation and enhanced training efficiency.
  • It employs a three-step process—score computation, expert-side token selection, and dispatch—facilitating dynamic, token-adaptive computation without auxiliary balancing losses.
  • Recent advances integrate continuous rerouting, null experts, and binary decisions to reduce FLOPs and improve performance across language, vision, and generative modalities.

Mixture-of-Experts (MoE) models with expert choice routing constitute a class of large-scale neural architectures that enable vast increases in parameter count and model capacity by sparsely activating a subset of independent “expert” subnetworks for each input. Unlike conventional top-$k$ token-to-expert routing, expert choice algorithms select the set of tokens to be processed by each expert, inherently balancing computational load and allowing each token to be processed by a variable number of experts. This flexible approach has yielded improvements in both training efficiency and downstream performance across a range of modalities, including language and vision-language tasks. Recent advances further extend MoE systems through continuous and data-free rerouting at inference, dynamic token-dependent routing, and deployment in state-of-the-art multimodal and generative settings.

1. Principles of Mixture-of-Experts Routing

Traditional MoE models employ a gating network that, for each input token $x \in \mathbb{R}^{d_{\mathrm{in}}}$, computes affinity scores for $N$ experts via linear projection,

$$z = x W_g \in \mathbb{R}^N,$$

and then applies a softmax and top-$k$ selection:

$$G(x)_i = \frac{\exp(z_i)\,\mathbf{1}\{z_i \in \text{TopK}(z, k)\}}{\sum_{j \in \text{TopK}(z, k)} \exp(z_j)}.$$

Each token is dispatched to its top-$k$ experts, resulting in fixed per-token compute graphs and a variable expert load. This regime can produce considerable load imbalance, necessitating auxiliary balancing losses and resulting in over- or under-utilization of specific experts (Zhou et al., 2022).
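As a concrete reference point, here is a minimal NumPy sketch of this conventional token-choice gating (shapes and names follow the formulas above; this is illustrative, not any particular library's implementation):

```python
import numpy as np

def topk_gating(x, W_g, k):
    """Gate matrix G of shape (n, N); zero outside each token's top-k experts."""
    z = x @ W_g                                   # (n, N) affinity logits
    topk_idx = np.argsort(z, axis=1)[:, -k:]      # each token's top-k expert indices
    mask = np.zeros_like(z, dtype=bool)
    np.put_along_axis(mask, topk_idx, True, axis=1)
    z_masked = np.where(mask, z, -np.inf)         # softmax restricted to the top-k
    e = np.exp(z_masked - z_masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
G = topk_gating(rng.standard_normal((8, 16)), rng.standard_normal((16, 4)), k=2)
print((G > 0).sum(axis=1))  # exactly k experts per token: fixed per-token compute
print((G > 0).sum(axis=0))  # per-expert token counts: can be badly imbalanced
```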

Expert choice routing inverts the paradigm: experts select their top-$C$ preferred tokens, with capacity $C$ often set proportionally to the global batch size and the number of experts. For each expert, the associated affinity vector is sorted over the batch to select the $C$ highest-scoring tokens,

$$J_i = \text{arg\,top}_C\,\{\, g_{ij} : j = 1, \ldots, n \,\},$$

where $j$ indexes the $n$ tokens in the batch.

A binary assignment matrix $A_{ij}$ records the routing decisions. This yields strong theoretical guarantees of per-expert load balance and, by construction, allows a variable number of experts per token. Optionally, entropy-regularized or LP-based variants can cap the maximum number of experts per token for explicit control (Zhou et al., 2022).
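A minimal NumPy sketch of the selection step (a schematic whose names $g$, $C$, $A$ mirror the text, not the reference implementation of Zhou et al.):

```python
import numpy as np

def expert_choice(g, C):
    """g: (E, n) affinities of E experts over n tokens; returns boolean A of shape (E, n)."""
    J = np.argsort(g, axis=1)[:, -C:]   # J[i]: indices of expert i's top-C tokens
    A = np.zeros_like(g, dtype=bool)
    np.put_along_axis(A, J, True, axis=1)
    return A

rng = np.random.default_rng(0)
g = rng.random((4, 32))                 # 4 experts, 32 tokens in the batch
A = expert_choice(g, C=8)
print(A.sum(axis=1))                    # [8 8 8 8]: perfect per-expert load balance
print(np.bincount(A.sum(axis=0)))       # per-token expert count varies (may be 0)
```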

2. The Expert Choice Method: Formulation and Algorithms

Expert choice routing proceeds via three main algorithmic steps (Zhou et al., 2022):

  1. Score Computation: For each token, compute gating scores $g_{ij}$ via a learnable router.
  2. Expert-Side Selection: Each expert $i$ sorts its affinities $g_{i1}, \ldots, g_{in}$ over the batch and selects its top-$C$ tokens to process.
  3. Dispatch and Combination: Each selected token is processed by all experts that selected it, with their outputs summed or combined by affinity.

This structure supports seamless implementation in both feed-forward and attention-based architectures. In the context of diffusion transformers or multimodal models, expert choice is performed over feature maps or patch representations, e.g., via softmaxed affinities per spatial location, with expert capacities scaled by token count and a capacity factor $f_c$ (Sun et al., 2 Oct 2024).
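For instance, under the common convention $C = f_c \cdot n / E$ (following Zhou et al., 2022), a batch of $n = 4096$ tokens routed across $E = 32$ experts with $f_c = 2$ gives each expert a budget of $C = 256$ tokens.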

A key attribute is the elimination of explicit load-balancing losses, since each expert is mandated to process a fixed token budget per batch. However, the number of experts processing each token is variable and under gating control—informative or uncertain tokens can receive more compute, while redundant or background tokens may get none.
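Putting dispatch and combination together, a schematic NumPy sketch (toy tanh experts and an explicit Python loop for clarity; real systems use batched scatter/gather kernels):

```python
import numpy as np

def dispatch_and_combine(x, g, A, experts):
    """x: (n, d) tokens; g: (E, n) affinities; A: (E, n) boolean assignment;
    experts: E callables mapping (m, d) -> (m, d)."""
    y = np.zeros_like(x)
    for i, expert in enumerate(experts):
        idx = np.nonzero(A[i])[0]                       # tokens expert i selected
        if idx.size:
            y[idx] += g[i, idx, None] * expert(x[idx])  # affinity-weighted combine
    return y                                            # tokens no expert chose stay zero

rng = np.random.default_rng(0)
n, d, E, C = 32, 16, 4, 8
x, g = rng.standard_normal((n, d)), rng.random((E, n))
A = np.zeros((E, n), dtype=bool)                        # expert choice: top-C per expert
np.put_along_axis(A, np.argsort(g, axis=1)[:, -C:], True, axis=1)
Ws = [rng.standard_normal((d, d)) for _ in range(E)]
y = dispatch_and_combine(x, g, A, [lambda h, W=W: np.tanh(h @ W) for W in Ws])
print(y.shape)                                          # (32, 16)
```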

3. Advances in Adaptive, Online, and Dynamic Expert Routing

Recent advances extend expert choice and related routing strategies beyond static mechanisms, enabling online, data-free adaptation and token-dependent computation.

Continuous rerouting (Su et al., 16 Oct 2025) introduces a two-phase online adaptation for test-time deployment:

  • In a prefill phase, router logit offsets $\Delta z^{(l)}$ in selected layers are adaptively updated via a self-supervised loss on the current context,

$$\mathcal{L}(\{\Delta z^{(l)}\}) = -\sum_{i=1}^{t-1} \log P\big(x_{i+1} \mid x_{1:i}, \{z^{(l)} + \Delta z^{(l)}\}\big),$$

where only the router’s additive parameters $\Delta z^{(l)}$ are updated, with all other weights frozen.

  • In a steered generation phase, the model uses the modified router to generate a fixed number of tokens before the adaptation loop repeats.

Lightweight additive router updates, norm clipping, and selective adaptation only of high-confidence layers prevent overfitting and catastrophic changes. This approach is entirely data-free, leveraging only the model’s own predictions to optimize routing, and achieves both increased robustness to distribution shift and improved downstream metrics (e.g., +5.5% pass@1 on HumanEval for OLMoE).
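The sketch below illustrates the prefill-phase adaptation loop in PyTorch, using a tiny dense-gated stand-in for the MoE model. All names and hyperparameters here (the toy model, the 4 adaptation steps, the unit norm clip) are illustrative assumptions, not the settings of Su et al.:

```python
import torch
import torch.nn.functional as F

class ToyMoELM(torch.nn.Module):
    """Tiny dense-gated stand-in for an MoE LM, for illustration only."""
    def __init__(self, vocab=50, d=32, E=4):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d)
        self.router = torch.nn.Linear(d, E, bias=False)
        self.experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(E)])
        self.head = torch.nn.Linear(d, vocab)

    def forward(self, ids, delta_z):
        h = self.emb(ids)                                    # (t, d)
        gates = F.softmax(self.router(h) + delta_z, dim=-1)  # router sees z + Δz
        h = h + sum(gates[:, i:i + 1] * ex(h) for i, ex in enumerate(self.experts))
        return self.head(h)                                  # (t, vocab) next-token logits

model = ToyMoELM()
for p in model.parameters():
    p.requires_grad_(False)                    # all model weights stay frozen
delta_z = torch.zeros(4, requires_grad=True)   # additive router offset: the only trainable state
context = torch.randint(0, 50, (24,))          # current-context token ids
opt = torch.optim.Adam([delta_z], lr=1e-2)

for _ in range(4):                             # a few prefill-phase adaptation steps
    opt.zero_grad()
    logits = model(context, delta_z)
    # Self-supervised NLL on the context: predict token i+1 from tokens 1..i.
    loss = F.cross_entropy(logits[:-1], context[1:])
    loss.backward()
    opt.step()
    with torch.no_grad():                      # norm clipping guards against drift
        norm = delta_z.norm()
        if norm > 1.0:
            delta_z.mul_(1.0 / norm)
```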

Token-adaptive routing with null experts (AdaMoE) (Zeng et al., 19 Jun 2024) simplifies expert choice via the inclusion of $m$ “null experts” that perform an identity or zero mapping and consume zero FLOPs. The router now selects $k+m$ experts, of which a variable number per input are true experts. The average FLOPs per token decreases while accuracy often improves, as shown on Mixtral-8x7B: a 14.5% FLOPs reduction alongside a +1.69% accuracy increase on ARC-Challenge. A modified load-balancing objective ensures robust usage of null experts without introducing instability.
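A few lines of NumPy make the selection rule concrete (a schematic only; in AdaMoE it is the modified load-balancing objective that drives average true-expert usage, and hence FLOPs, down):

```python
import numpy as np

rng = np.random.default_rng(0)
n, E, m, k = 16, 8, 2, 2
z = rng.standard_normal((n, E + m))           # router logits over true + null experts
chosen = np.argsort(z, axis=1)[:, -(k + m):]  # each token keeps its top-(k+m) slots
true_per_token = (chosen < E).sum(axis=1)     # slots with index >= E are null: zero FLOPs
print(true_per_token)                         # varies per token, under gating control
```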

Dynamic expert routing in multimodal models (e.g., RoE) (Wu et al., 19 Jul 2024) introduces routing tokens and per-layer routers for binary keep-vs-adapter decisions. A structural-sparsity regularizer aligns network utilization with example complexity, yielding 15–25% faster inference in vision-language tasks with no accuracy loss.

4. Empirical Performance and Impact

Empirical studies demonstrate the efficiency and performance gain of expert choice and adaptive expert routing methods:

| Model | Dataset / Metric | Baseline | Expert Choice / Adaptive Routing | Overhead |
|---|---|---|---|---|
| OLMoE | HumanEval pass@1 | 28.66% | 34.17% (+5.5%) (Su et al., 16 Oct 2025) | 1.3× FLOPs |
| DeepSeek-V2-Lite | HumanEval pass@1 | 50.60% | 54.26% (+3.6%) | – |
| Mixtral-8x7B AdaMoE | ARC-C accuracy / avg. expert load | 87.46% / 2.00 | 89.15% / 1.67 (+1.69%, −0.33 load) | ~0 |
| EC-DIT-3XL-32E | GenEval (image generation) | 68.92% | 70.91% (+1.99%) (Sun et al., 2 Oct 2024) | 1.2–1.3× |
| RoE-LLaVA | 5 VL benchmarks | 60.6% | 63.8% (+3.3%) (Wu et al., 19 Jul 2024) | Speed ↑ |

Continuous rerouting introduces measurable expert-pathway shifts (deep-layer edit distances >0.3), decreased router entropy (from 0.369 to 0.284), and magnified probability mass on task-relevant experts. EC-DIT demonstrates perfect per-expert load balance and interpretable token-level compute allocation heatmaps, which reveal adaptive allocation to content-rich or ambiguous spatial regions (Sun et al., 2 Oct 2024).

5. Computational, Practical, and Theoretical Considerations

Expert choice routing achieves per-expert load balance without auxiliary loss terms, theoretically maximizing hardware utilization and accelerating convergence (∼1.1× faster vs. top-1 routing (Zhou et al., 2022)). Nevertheless, it requires global top-$C$ selection per expert over the batch or sequence, which, while comparable in complexity to per-token top-$k$, can pose challenges for autoregressive inference where only the current token is available.

Mechanisms like AdaMoE avoid these implementation bottlenecks and remain compatible with causal decoding by leveraging null experts with minimal parameter overhead (Zeng et al., 19 Jun 2024). Adapter-based and binary routing as in RoE require only small architectural modifications and enable structural sparsity without altering the dense backbone (Wu et al., 19 Jul 2024).

Assignment of a token to zero experts is possible if its affinities are low relative to the rest of the batch; practical variants may add coverage guarantees to prevent this. Per-token or per-expert capacity control is supported in LP-regularized expert choice formulations.
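One illustrative way to add such a guarantee (a hypothetical patch, not a mechanism from the cited papers; note that it slightly perturbs exact per-expert balance):

```python
import numpy as np

def expert_choice_with_coverage(g, C):
    """g: (E, n) affinities; returns boolean A of shape (E, n)."""
    A = np.zeros_like(g, dtype=bool)
    np.put_along_axis(A, np.argsort(g, axis=1)[:, -C:], True, axis=1)
    orphans = np.nonzero(~A.any(axis=0))[0]          # tokens no expert selected
    A[g[:, orphans].argmax(axis=0), orphans] = True  # best expert takes each orphan
    return A

rng = np.random.default_rng(0)
A = expert_choice_with_coverage(rng.random((4, 32)), C=4)
print(A.any(axis=0).all())   # True: every token is processed at least once
```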

Key hyperparameters include (collected in the illustrative sketch after this list):

  • Number of experts $E$ and capacity factor $f_c$
  • Router learning rate and regularization for trainable routers (online adaptation)
  • Null-expert count $m$ (AdaMoE) and adapter size $c$ (RoE)
  • Update interval and norm clipping for safe continuous rerouting
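These knobs, gathered into one illustrative config object (values are placeholders, not recommendations from the cited papers):

```python
from dataclasses import dataclass

@dataclass
class RoutingConfig:
    num_experts: int = 32         # E
    capacity_factor: float = 2.0  # f_c: scales each expert's token budget C
    router_lr: float = 1e-2       # learning rate for online router adaptation
    router_weight_decay: float = 0.0
    num_null_experts: int = 2     # m, AdaMoE-style null experts
    adapter_dim: int = 8          # c, RoE-style adapter size
    adapt_interval: int = 64      # tokens generated between rerouting updates
    delta_max_norm: float = 1.0   # norm clip for additive router offsets

cfg = RoutingConfig()
print(cfg)
```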

6. Extensions, Limitations, and Future Directions

Limitations of expert choice routing include the intricacies of batched global selection for sequence-wise or autoregressive models, and the need for careful tuning of adaptation parameters to avoid overfitting, particularly in continuous rerouting contexts. Nevertheless, these advances in MoE routing deliver both increased expressivity and real-world efficiency at scale.

7. Summary and Synthesis

Mixture-of-Experts models with expert choice routing and its adaptive variants constitute a highly effective paradigm for scaling network capacity and compute allocation. By shifting from fixed, per-token expert assignments to globally balanced, token-adaptive, and dynamically reconfigurable routing, these architectures deliver superior convergence, robust performance under distribution shift, and interpretability via explicit compute allocation. Innovations such as continuous rerouting, token-adaptivity through null experts, and structural sparsity in deep multimodal stacks further generalize the reach of MoE models. As empirical results attest, expert choice routing and its dynamic successors now underpin state-of-the-art approaches across language, vision, and generative modeling (Zhou et al., 2022, Zeng et al., 19 Jun 2024, Sun et al., 2 Oct 2024, Wu et al., 19 Jul 2024, Su et al., 16 Oct 2025).
