Token-Based Mixture-of-Experts Models
- Token-based MoE models are sparse neural architectures that dynamically route each input token to a select group of expert sub-networks per layer.
- They employ advanced routing algorithms such as fixed top-K, AdaMoE, and sequence-level strategies to balance load and enhance performance across various applications.
- These models achieve significant scaling and efficiency improvements while mitigating issues like representation collapse through specialized regularization and load-balancing techniques.
Token-based mixture-of-experts (MoE) models are a class of sparse neural architectures where each input token is dynamically routed to a selected subset of specialized expert sub-networks at each layer or sub-layer of a larger model. By selecting only a small fraction of all available experts per token and per layer, these models achieve significant increases in parameter capacity with only modest growth in computational or memory requirements. Token-based routing enables fine-grained conditional computation and specialization, and has become foundational in large-scale language, vision, and multi-modal models.
1. Token Routing Principles in Mixture-of-Experts Architectures
At the core of token-based MoE is a routing (gating) network that determines, for each input token representation $x \in \mathbb{R}^d$, a sparse set of expert indices to activate. Classical architectures utilize a linear gating function $s = W_g x \in \mathbb{R}^N$ followed by a softmax over the $N$ experts, top-$K$ sparsification, and optional renormalization:

$$g_i(x) = \frac{\exp(s_i)}{\sum_{j \in \mathcal{T}(x)} \exp(s_j)} \ \text{if } i \in \mathcal{T}(x), \qquad g_i(x) = 0 \ \text{otherwise}, \qquad \mathcal{T}(x) = \operatorname{TopK}(s, K).$$

The model output for a token is then a weighted sum over the selected experts:

$$y = \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_i(x),$$

where the $E_i(\cdot)$ are feed-forward “experts” and the gating weights satisfy $g_i(x) \ge 0$ and $\sum_i g_i(x) = 1$.
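A minimal PyTorch sketch of this top-$K$ routing scheme is given below; the class, module, and parameter names are illustrative assumptions rather than code from any of the cited papers. Each token computes gating scores, keeps the top-$K$, renormalizes them with a softmax, and sums the selected experts' outputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal token-level top-K MoE layer (illustrative sketch)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, k: int):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # linear gating function W_g
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- tokens are routed independently of one another
        scores = self.gate(x)                                   # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)     # top-K sparsification
        weights = F.softmax(topk_scores, dim=-1)                # renormalize over selected experts

        y = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = topk_idx[:, slot], weights[:, slot:slot + 1]
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                  # only run experts that received tokens
                    y[mask] += w[mask] * expert(x[mask])
        return y

# Example: 16 tokens, 8 experts, 2 active experts per token
layer = TopKMoELayer(d_model=64, d_hidden=256, n_experts=8, k=2)
out = layer(torch.randn(16, 64))
```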
Recent advances have produced extensions and variants that adaptively vary the number of experts per token (Zeng et al., 2024), incorporate “null” or inactive experts, or allow continuous, fully differentiable mixtures across tokens, as in Mixture-of-Tokens (MoT) (Antoniak et al., 2023). Some designs enable per-position or even per-sub-token routing via multi-head or multi-split mechanisms (Wu et al., 2024).
2. Token Routing Algorithms: Variants and Enhancements
Modern MoE models for language, vision, and recommendation tasks implement a range of token-level routing algorithms:
- Fixed top-$K$ token routing: Each token independently selects the $K$ experts with the highest gating scores (Antoine et al., 2024, Roy et al., 3 Jul 2025). This design does not account for variable token complexity or resource constraints.
- Adaptive routing with null experts (AdaMoE): By introducing a set of “null” experts with zero FLOPs cost, AdaMoE enables each token to select a variable number of true experts, while a load-balancing loss encourages efficient coverage (Zeng et al., 2024).
- Sequence-level top-$K$ (SeqTopK): Sequence-level routing budgets the total number of expert activations across an entire sequence, allowing difficult tokens to use more experts and easy tokens fewer under the same global budget. Implementation requires minimal code changes but yields improved load balancing and accuracy in high-sparsity regimes (Wen et al., 9 Nov 2025); a minimal sketch appears after this list.
- Mask-based routing (MaskMoE): MaskMoE statically fixes a subset of allowed expert targets per token type, based on frequency (e.g., rare tokens use a single fixed expert, frequent tokens retain dynamic routing among several) (Su et al., 2024).
- Similarity/Attention-aware routing: These algorithms augment token routing with information from similarity graphs or attention matrices to stabilize expert assignments and improve robustness (Nguyen et al., 1 May 2025).
- Dynamic expert allocation in recommendation: In the MTmixAtt architecture, heterogeneous input features are mapped to tokens via learned soft clustering, and each token is routed to differentiated shared and scenario-specific experts (Qi et al., 17 Oct 2025).
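To illustrate the per-token versus sequence-level budgeting contrast referenced above (SeqTopK), the sketch below assumes one simple implementation: the routing mask is built either from the top-$K$ scores of each token or from the top $S \cdot K$ scores pooled over all $S$ tokens of a sequence. It is not the exact code of Wen et al. (9 Nov 2025).

```python
import torch
import torch.nn.functional as F

def per_token_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Standard routing: every token activates exactly k experts."""
    idx = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(-1, idx, torch.ones_like(idx, dtype=torch.bool))
    return mask

def sequence_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Sequence-level routing: the whole sequence shares a budget of
    seq_len * k activations, so individual tokens may use more or fewer experts."""
    seq_len, n_experts = scores.shape
    budget = seq_len * k
    flat = scores.reshape(-1)                       # pool scores across tokens and experts
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[flat.topk(budget).indices] = True
    return mask.reshape(seq_len, n_experts)

# Both schemes activate the same total number of experts per sequence,
# but the sequence-level mask can concentrate activations on hard tokens.
scores = F.softmax(torch.randn(32, 8), dim=-1)      # 32 tokens, 8 experts
assert per_token_topk(scores, k=2).sum() == sequence_topk(scores, k=2).sum()
```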
3. Optimization, Specialization, and Iterative Expert Communication
Expert specialization and stable feature reuse are primary benefits of token-based MoEs:
- Dynamic multi-step selection (Chain-of-Experts, CoE): CoE introduces a novel form of intra-layer expert chaining, where token representations are iteratively routed through independent routers and experts multiple times within a single layer. This provides an additional depth-like dimension for model scaling: with $K$ experts selected at each of $C$ sequential steps, the number of possible expert combination paths grows exponentially in $C$, vastly increasing representational diversity (Wang et al., 23 Jun 2025). A minimal sketch of this chaining appears after this list.
- Specialization by linguistic features: Analysis of routing traces in large LLMs reveals robust expert specialization by part-of-speech categories and other linguistic factors, even without explicit supervision. Simple linear probes can recover 80–89% POS classification accuracy from token routing paths alone (Antoine et al., 2024).
- Gradient-based conflict mitigation: In multitask and multimodal settings, optimizing the router to route tokens with conflicting gradients to different experts (solving token gradient conflict, STGC) increases expert specialization and empirical performance (Yang et al., 2024).
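The intra-layer chaining described in the first bullet above can be sketched as follows. This is a hypothetical minimal structure in the spirit of CoE, not the authors' implementation: the same pool of experts is reused for $C$ sequential routing steps, each step having its own router, with a residual connection carrying the representation between steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Illustrative intra-layer expert chaining: C routing steps per layer,
    each step with its own router over a shared pool of experts."""

    def __init__(self, d_model: int, n_experts: int, k: int, n_steps: int):
        super().__init__()
        self.k = k
        self.routers = nn.ModuleList(
            nn.Linear(d_model, n_experts, bias=False) for _ in range(n_steps)
        )
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x
        for router in self.routers:                       # C sequential routing steps
            scores, idx = router(h).topk(self.k, dim=-1)
            weights = F.softmax(scores, dim=-1)
            step_out = torch.zeros_like(h)
            for slot in range(self.k):
                w = weights[:, slot:slot + 1]
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        step_out[mask] += w[mask] * expert(h[mask])
            h = h + step_out                              # residual connection between steps
        return h

layer = ChainOfExpertsLayer(d_model=64, n_experts=8, k=2, n_steps=2)
out = layer(torch.randn(16, 64))
```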
A unifying observation is that deeper or iterative routing (e.g., CoE) and similarity/attention-based algorithms combine better expert specialization with stabilized, consistent routing patterns, reducing the risk of late-stage routing instability or representation collapse (Nguyen et al., 1 May 2025, Wu et al., 2024).
4. Representational Collapse, Load Balancing, and Routing Stability
Token-based MoEs are susceptible to representation collapse and routing imbalances if not carefully regularized:
- Representation collapse: When the routing network concentrates assignments on a small subset of experts or confines the hidden states to the span of expert centroids, model expressivity is diminished (Chi et al., 2022). Approaches such as hyperspherical routing (X-MoE), stochastic learning against expert collapse (S2MoE), or uncertainty-aware training alleviate this by increasing the effective rank of token–expert interactions and diversifying the gradient flows (Chi et al., 2022, Do et al., 29 Mar 2025).
- Load balancing: Auxiliary losses encourage even token distribution among experts, minimizing “dead” or oversubscribed experts, and can be extended to handle variable-width (adaptive) routing as in AdaMoE and MaskMoE (Zeng et al., 2024, Su et al., 2024); a sketch of a standard auxiliary loss follows this list.
- Routing fluctuation: Routing instabilities, where token-to-expert assignments change late in training, are mitigated by attention/similarity-aware graphs, explicit entropy minimization, and memory of routing paths (Nguyen et al., 1 May 2025). Metrics such as fluctuation rate and gating entropy quantify these phenomena.
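To make the load-balancing objective concrete, the sketch below computes a widely used auxiliary loss: the product of each expert's fraction of routed tokens and its mean router probability, summed over experts and scaled by the expert count. This Switch-Transformer-style formulation is an assumed common variant, not the exact loss of any single paper cited above.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits: torch.Tensor, expert_index: torch.Tensor, n_experts: int) -> torch.Tensor:
    """Common auxiliary load-balancing loss (illustrative sketch, top-1 routing case).

    gate_logits:  (n_tokens, n_experts) raw router scores
    expert_index: (n_tokens,) index of the expert each token was dispatched to
    """
    probs = F.softmax(gate_logits, dim=-1)                       # router probabilities per token
    # f_i: fraction of tokens dispatched to expert i
    f = F.one_hot(expert_index, n_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    p = probs.mean(dim=0)
    # The loss is minimized when both distributions are uniform over experts
    return n_experts * torch.sum(f * p)

logits = torch.randn(128, 8)                                      # 128 tokens, 8 experts
aux = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)
```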
5. Engineering, Systems, and Applications
Practical deployment of token-based MoEs introduces challenges and design opportunities:
- Scaling compute and memory: Techniques for expert selection and memory-efficient caching (ExpertFlow) allow large sparse MoEs to run on a single GPU by batching, prefetching, and predicting expert needs, with I/O-optimized scheduling for dynamic workload balancing (He et al., 2024).
- Distributed and edge scenarios: Resource- and queue-aware routing (Stable-MoE) distributes token batches across heterogeneous edge devices, maximizing throughput and consistency under resource constraints using Lyapunov-based drift-plus-penalty optimization (Shi et al., 7 Dec 2025).
- Behavior steering and interpretability: MoTE demonstrates that in MoEs with large expert sets, targeted manipulation of a small subset of experts at inference can induce desired behavioral changes, such as reducing refusals or controlling language use, without retraining. This provides a pathway to interpretable and steerable LLMs by modular expert activation (Dahlke et al., 16 Feb 2025).
- Fine-grained and hierarchical routing: BTX-style architectures (MixtureKit) support multiple token routers per FFN, enabling even greater specialization at sub-layer or projection level, which is beneficial in multilingual and code-switched contexts (Chamma et al., 13 Dec 2025). Multi-head MoE layers split each token into sub-units, distributing them to different experts and recombining the results, dramatically increasing expert utilization (Wu et al., 2024).
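The multi-head splitting mechanism from the last bullet can be sketched as follows. The class and its interface are hypothetical, intended only to illustrate splitting each token into sub-tokens along the feature dimension, routing each sub-token independently through any token-level MoE, and merging the results back into a single token.

```python
import torch
import torch.nn as nn

class MultiHeadSplit(nn.Module):
    """Illustrative multi-head token splitting for MoE routing: split each token
    into n_heads sub-tokens, route each independently, then merge the outputs."""

    def __init__(self, d_model: int, n_heads: int, moe_layer: nn.Module):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_sub = n_heads, d_model // n_heads
        self.moe = moe_layer                      # any token-level MoE operating on d_sub-dim inputs
        self.merge = nn.Linear(d_model, d_model)  # recombine sub-token outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n_tokens, d_model = x.shape
        sub = x.reshape(n_tokens * self.n_heads, self.d_sub)   # each sub-token is routed on its own
        out = self.moe(sub)
        return self.merge(out.reshape(n_tokens, d_model))

# Usage with the TopKMoELayer sketched earlier (operating on the sub-token width):
# mh = MultiHeadSplit(d_model=64, n_heads=4, moe_layer=TopKMoELayer(16, 64, n_experts=8, k=2))
```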
6. Empirical Results and Future Directions
Token-based MoE models consistently outperform dense and non-adaptive sparse baselines across language modeling, multilingual, vision-language, and industrial recommendation tasks:
- Gains of up to 8–17 points in accuracy, or corresponding reductions in perplexity, have been documented under adaptive, sequence-level, or multi-step routing (Wang et al., 23 Jun 2025, Wen et al., 9 Nov 2025, Do et al., 29 Mar 2025).
- Efficient token routing enables memory and FLOPs reductions of 15–42% while preserving or improving predictive accuracy (Zeng et al., 2024, He et al., 2024).
- Load balancing and stability-oriented designs yield orders-of-magnitude improvements in throughput and robustness when scaling to massive expert sets or deploying on constrained or distributed infrastructures (He et al., 2024, Shi et al., 7 Dec 2025).
Future work will likely address adaptive, context-driven and hybrid routing strategies; automated expert budget allocation and dynamic resource usage; integration of interpretable and controllable expert mechanisms; as well as formalization of the role of token routing in model scaling, knowledge transfer, and emergent specialization.
Key Papers Referenced:
- "Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models" (Wang et al., 23 Jun 2025)
- "AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts LLMs" (Zeng et al., 2024)
- "Part-Of-Speech Sensitivity of Routers in Mixture of Experts Models" (Antoine et al., 2024)
- "Improving Routing in Sparse Mixture of Experts with Graph of Tokens" (Nguyen et al., 1 May 2025)
- "MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts" (Su et al., 2024)
- "Route Experts by Sequence, not by Token" (Wen et al., 9 Nov 2025)
- "Mixture of Tunable Experts -- Behavior Modification of DeepSeek-R1 at Inference Time" (Dahlke et al., 16 Feb 2025)
- "Stable-MoE: Lyapunov-based Token Routing for Distributed Mixture-of-Experts Training over Edge Networks" (Shi et al., 7 Dec 2025)
- "Multi-Head Mixture-of-Experts" (Wu et al., 2024)