
Token-Based Mixture-of-Experts Models

Updated 25 February 2026
  • Token-based Mixture-of-Experts models are neural architectures that route each input token to a sparse, specialized subset of expert networks.
  • They employ top-K gating and auxiliary load-balancing losses to optimize performance while maintaining efficient computation.
  • Variants like MoGE, MoE++, and AdaMoE further enhance scalability, hardware utilization, and routing stability in complex systems.

Token-Based Mixture-of-Experts (MoE) Models

Token-based Mixture-of-Experts (MoE) models are a family of neural architectures that assign each input token in a sequence to a sparsely activated subset of expert neural networks. By activating only a small fraction of possible experts per token—commonly using a routing mechanism over a large expert pool—these models achieve a substantial scaling of parameter count and expressivity while maintaining constant or only modestly increased computation per token. Token-level routing dramatically improves efficiency, parallelism, and specialization in large models, underpinning many state-of-the-art systems in natural language processing and beyond.

1. Architectural Principles and Routing Mechanisms

Token-based MoE models replace the conventional dense feed-forward sublayer of a transformer block with a set of $N$ expert networks $\{E_i\}$, typically multilayer perceptrons or specialized submodules. Each token’s hidden state $h$ is processed by a router, usually a trainable linear projection $W \in \mathbb{R}^{d \times N}$, producing routing scores $z = W^\top h \in \mathbb{R}^N$ (Tang et al., 27 May 2025; Kang et al., 26 May 2025).

The standard top-$K$ sparse gating protocol is as follows:

  • Apply a top-$K$ operation to $z$, keeping the largest $K$ entries and setting the remaining entries to $-\infty$, producing $z'$.
  • Normalize using softmax: $G(h) = \mathrm{Softmax}(z')$.
  • The MoE output per token is $y = \sum_{i=1}^{N} G(h)_i\, E_i(h)$, where only $K$ elements of $G(h)$ are nonzero.

This per-token, sparse activation leads to conditional computation: a large pool of experts is available, but only a small, token-dependent subset is executed per forward pass.
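The gating protocol above can be sketched in a few lines of NumPy. This is a minimal illustration of generic top-$K$ gating, not the implementation of any particular paper; `top_k_gate` and `moe_output` are illustrative names.

```python
import numpy as np

def top_k_gate(z, k):
    """Top-K sparse gating: keep the K largest router scores,
    set the rest to -inf, then softmax over the masked scores."""
    z = np.asarray(z, dtype=float)
    masked = np.full_like(z, -np.inf)
    top = np.argsort(z)[-k:]              # indices of the K largest scores
    masked[top] = z[top]
    e = np.exp(masked - masked[top].max())  # exp(-inf) = 0 for masked entries
    return e / e.sum()

def moe_output(h, experts, W, k):
    """y = sum_i G(h)_i * E_i(h), where only K gate weights are nonzero,
    so only K of the expert callables are actually evaluated."""
    z = W.T @ h                           # routing scores z = W^T h
    g = top_k_gate(z, k)
    return sum(g[i] * experts[i](h) for i in np.nonzero(g)[0])
```

With $K=1$ the output reduces to the single highest-scoring expert's output; larger $K$ blends the selected experts by their renormalized gate weights.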

Enhancements and Variants

  • Mixture of Grouped Experts (MoGE): Experts are partitioned into $G$ device-mapped groups to enforce that each token activates exactly $k$ experts per group. This guarantees perfect inter-device load balance and removes the need for load-balancing heuristics (Tang et al., 27 May 2025).
  • MoE++: Introduces zero-computation experts (e.g., discard, identity, constant transform) alongside standard FFN experts, allowing variable, token-driven compute while routing simple tokens through zero-cost operations (Jin et al., 2024).
  • MaskMoE: Assigns a token-specific, fixed mask over accessible experts based on frequency, ensuring that rare tokens always route to the same expert for robust representation, whereas frequent tokens see diverse experts (Su et al., 2024).
  • AdaMoE: Augments the expert pool with “null” experts (zero computation), increasing the top-$k$ value. Tokens adaptively select a variable number of true and null experts, reducing computational load for easy tokens (Zeng et al., 2024).
  • Other Specialized Routing: Alternatives such as dynamic token-aware routers with hypernetworks (Jing et al., 28 May 2025), similarity/attention-aware coupling (Nguyen et al., 1 May 2025), and retrieval-augmented routing (Lyu et al., 5 Jan 2026) increase flexibility and robustness.

2. Load Balancing, Specialization, and Routing Stability

One central technical challenge of token-based MoE is ensuring effective utilization and specialization of experts:

  • Expert Overload and Stragglers: In naïve top-$k$ gating, some experts are disproportionately selected, causing device imbalance and throughput bottlenecks. MoGE solves this by group-based routing constraints that guarantee per-device workload equality (Imbalance Score $= 0$) (Tang et al., 27 May 2025).
  • Auxiliary Losses: Standard models impose auxiliary load-balancing losses to enforce even expert usage. For a batch of $T$ tokens, let $f_i$ be the fraction of tokens routed to expert $i$ and $P_i$ the mean gate probability for expert $i$. The auxiliary loss $L_{\mathrm{LB}} = N \sum_i f_i P_i$ is added to the main objective (Kang et al., 26 May 2025).
  • Specialization and Saturation: Trace analyses of FLAME-MoE and similar systems show that, during training, experts rapidly specialize (high assignment purity for token classes) and routing behavior stabilizes early (Kang et al., 26 May 2025).
  • Routing Stability: Standard MoE routers make independent token-routing decisions, exposing models to routing fluctuations that impair robustness. Coupling token decisions via similarity-aware or attention-aware terms reduces entropy and increases stability (Nguyen et al., 1 May 2025).
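The auxiliary balancing loss above is straightforward to compute from a batch of gate probabilities. A minimal sketch, assuming $f_i$ is measured from top-1 assignments; `load_balance_loss` is an illustrative name.

```python
import numpy as np

def load_balance_loss(gate_probs, top1_assignments, num_experts):
    """L_LB = N * sum_i f_i * P_i, where f_i is the fraction of tokens
    whose top-1 expert is i and P_i is the mean gate probability that
    the router assigns to expert i across the batch."""
    f = np.bincount(top1_assignments, minlength=num_experts) / len(top1_assignments)
    P = gate_probs.mean(axis=0)
    return num_experts * float(f @ P)
```

The loss is minimized (value 1) under perfectly uniform routing and grows as routing collapses onto few experts, reaching $N$ when a single expert receives every token, which is what makes it a usable balancing penalty.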

3. Efficiency, Scaling Behavior, and System-Level Optimizations

The ability to scale representational capacity at constant or sub-linear compute is a cornerstone of MoE effectiveness.

Sparsity and Parameter Utilization:

  • For $N$ experts and $K \ll N$ activations per token, total parameter count can reach tens of billions (e.g., Pangu Pro MoE: 72B total, 16B activated per token; 22% activation) (Tang et al., 27 May 2025).
  • Relative to dense models that compute all parameters per token, token-based MoEs achieve up to 78% savings in inference compute (Tang et al., 27 May 2025).
  • Advanced token-based methods such as AdaMoE (Zeng et al., 2024) and MoE++ (Jin et al., 2024) further reduce per-token FLOPs by allowing adaptive or zero-computation routes.
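The sparsity figures quoted above are mutually consistent, as a quick worked check shows:

```python
# Activation-ratio arithmetic for the Pangu Pro MoE figures cited above.
total_params = 72e9     # total parameters
active_params = 16e9    # parameters activated per token
activation_ratio = active_params / total_params  # ~0.222 -> the quoted ~22%
compute_savings = 1 - activation_ratio           # ~0.778 -> the quoted ~78%
```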

Throughput and Hardware Optimization:

  • MoGE delivers perfect device utilization; system-level advances, such as H$^2$P hybrid parallelism and communication overlap/fusion, yield up to 203% throughput gains over dense architectures (Tang et al., 27 May 2025).
  • FLAME-MoE and similar models demonstrate that token-based MoEs deliver consistent accuracy improvements (up to 3.4 points) over dense baselines at identical total FLOPs but note that infrastructure and communication overheads remain a limiting factor for scaling (Kang et al., 26 May 2025).

Specialized Hardware Synergy:

  • Pangu Pro MoE's design is tightly coupled to Ascend NPU system architecture, achieving speculation-driven decode rates of up to 1528 tokens/s per card and a prefill throughput exceeding that of comparable dense LLMs (Tang et al., 27 May 2025).

4. Token-Based MoE Extensions: Multimodal, Continual, and Infinite Experts

Multimodal Routing:

  • EvoMoE introduces expert evolution (diversified expert initialization from a single FFN seed) and dynamic token-aware routing via hypernetworks conditional on token modality, enabling superior performance on multi-modal LLMs and robust expert specialization (Jing et al., 28 May 2025).

Continuous and Infinite Experts:

  • $\infty$-MoE generalizes discrete token-based MoE to a continuous expert space: for each token, the router samples continuous masks, selecting arbitrary neuron subsets and achieving effectively infinite expert capacity with Bayesian-like parameter selection. This allows runtime-tunable speed/accuracy trade-offs and stable accuracy at high expert counts (Takashiro et al., 25 Jan 2026).

Knowledge Transfer and Hybridization:

  • HyperMoE distributes knowledge from non-selected experts to each token via token-specific hypernetwork-generated modules, ensuring richer representations without breaking top-$k$ sparsity constraints (Zhao et al., 2024).

Cross-Example Aggregation:

  • The Mixture-of-Tokens (MoT) reformulation aggregates tokens across different sequences for each expert, enabling fully continuous, cross-example expert mixtures and efficient scaling while maintaining compatibility with autoregressive (causal) inference (Antoniak et al., 2023).

5. Practical Frameworks and Industrial Implementations

  • Production-scale MoE (Pangu Pro, Ascend): 72B parameters, 16B activated per token, group-based routing achieving an Imbalance Score of $0$, and an optimized software and kernel stack for NPUs, including speculative decoding, quantization, and hybrid parallelism (Tang et al., 27 May 2025).
  • FLAME-MoE: A public, end-to-end research platform offering detailed control and transparency over sparse MoE LLMs, including routing diagnostics, co-activation analysis, and confirmed early specialization (Kang et al., 26 May 2025).
  • MixtureKit: Generalizes token-based MoE research with three strategies (Traditional MoE, BTX, BTS), fine-grained per-token routing (branch routers per FFN projection), StitchLayer hybridization, and diagnostic visualization tools for per-token routing patterns (Chamma et al., 13 Dec 2025).
| Framework | Routing Mode | Special Features |
| --- | --- | --- |
| Pangu Pro MoE | MoGE, group-constrained | Ascend NPU, hybrid parallelism, speculative decoding |
| FLAME-MoE | Top-K token routing + shared experts | Expert specialization, open logs |
| MixtureKit | Per-token (BTX/BTS) | Fine-grained routing, visualization interface |

6. Limitations, Challenges, and Future Directions

Challenges in token-based MoE modeling include:

  • Expert Collapse and Load Imbalance: Naïve top-$k$ policies are prone to expert overload and specialization collapse; recent designs address this via group-constrained routing (Tang et al., 27 May 2025), auxiliary balance losses (Kang et al., 26 May 2025), and token-specific masking (Su et al., 2024).
  • Routing Robustness and Non-Stationarity: Standard tokenwise independence leads to unstable routing trajectories; similarity-coupled and attention-coupled routers offer decreased entropy and increased robustness (Nguyen et al., 1 May 2025).
  • Hardware and Systems Bottlenecks: Efficient all-to-all communication, kernel fusion, and expert-aware quantization are active areas of optimization (Tang et al., 27 May 2025, Kang et al., 26 May 2025).
  • Dynamic, Adaptive Compute: Emerging methods feature token-adaptive or continuous expert selection, zero-computation paths, and runtime-tunable compute/accuracy trade-offs (Jin et al., 2024, Zeng et al., 2024, Takashiro et al., 25 Jan 2026).
  • Theory and Model Selection: Bayesian frameworks such as HS-MoE furnish uncertainty quantification and principled model selection for expert counts, but are not yet dominant in large-scale LLM practice (Polson et al., 14 Jan 2026).

Developments are anticipated in adaptive expert architectures, hardware-aligned MoE designs, robust routing for OOD generalization, and hybrid architectures bridging discrete and continuous expert selection.


