AdaMoE: Adaptive Routing with Null Experts
- The paper introduces AdaMoE, a token-adaptive routing mechanism with null experts that selects true experts based on each token's complexity.
- AdaMoE reduces computational cost, cutting FLOPs by up to 15% while preserving or improving downstream performance across benchmarks.
- AdaMoE integrates seamlessly into pretrained MoE-LLMs with minimal architectural changes, facilitating effective fine-tuning and load balancing.
Adaptive Routing with Null Experts (AdaMoE) is a method for token-adaptive routing in Mixture-of-Experts (MoE) architectures, designed to dynamically allocate computational resources across tokens in LLMs. AdaMoE introduces "null experts," which allow each token to select a variable number of true experts based on its individual complexity, enabling efficient utilization of model capacity without increasing FLOPs. The mechanism applies to both pretraining and fine-tuning of MoE-based LLMs, and requires only minimal changes to standard MoE layers, ensuring compatibility with existing infrastructure. Empirical evaluation demonstrates that AdaMoE achieves substantial reductions in average expert load (FLOPs) while improving or matching downstream performance across a diverse set of benchmarks (Zeng et al., 2024).
1. Architectural Design
AdaMoE is based on the conventional sparse MoE layer, comprising $n$ "true" experts (e.g., feed-forward networks, LoRA modules) and a router network that selects the top-$k$ experts per token. The AdaMoE modification supplements the expert set with $m$ "null" experts, which are identity or zero mappings incurring zero FLOPs. The router output dimension is thus extended from $n$ to $n+m$. Routing fan-out is increased accordingly: each token selects $k$ of the $n+m$ experts, allowing it to be dispatched to a mixture of true and null experts.
The selection process becomes token-adaptive: tokens requiring more complex transformations select more true experts (fewer nulls), while simpler tokens route to more null experts, thereby skipping unnecessary computation. Despite this structural adaptation, the layer interface and call signature remain unchanged—only the expert selection mechanism is modified.
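The token-adaptive behavior described above can be illustrated with a toy routing example. The `true_expert_counts` helper, the sizes, and the random logits below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def true_expert_counts(scores, n_true, k):
    """How many *true* experts each token uses under top-k routing over the
    extended pool of n_true true + m null experts (columns >= n_true are null).
    Illustrative sketch only."""
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k selected experts
    return (topk < n_true).sum(axis=-1)          # per-token count of true experts

# Hypothetical router logits for 16 tokens over 8 true + 4 null experts.
rng = np.random.default_rng(1)
scores = rng.standard_normal((16, 12))
counts = true_expert_counts(scores, n_true=8, k=4)   # varies between 0 and 4 per token
```

Tokens whose top-$k$ slots are dominated by null experts perform correspondingly less computation.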
2. Gating Formulation and Routing Dynamics
Given an input token representation $x \in \mathbb{R}^d$, the router computes gating scores via a learned matrix $W \in \mathbb{R}^{d \times (n+m)}$, yielding pre-softmax scores $s = xW \in \mathbb{R}^{n+m}$. The $k$ largest entries in $s$ for each token are retained, with all others set to $-\infty$ through a TopK operation. Following this, a sparse softmax yields $g \in \mathbb{R}^{n+m}$, ensuring only $k$ nonzero values per token.
Indices $1, \dots, n$ are assigned to true experts; $n+1, \dots, n+m$ correspond to null experts. The token output is computed as
$$y = \sum_{i=1}^{n+m} g_i \, E_i(x),$$
where $E_i(x) = 0$ for null experts, implying zero computational cost. To align output magnitudes with standard MoE, AdaMoE optionally renormalizes over just the true experts in the selected top-$k$ set $\mathcal{T}$:
$$\tilde{g}_i = \frac{g_i}{\sum_{j \in \mathcal{T},\, j \le n} g_j}, \qquad i \in \mathcal{T},\ i \le n,$$
leaving $\tilde{g}_i = 0$ for null experts. This normalization ensures output consistency across expert selection patterns (Zeng et al., 2024).
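A minimal NumPy sketch of this gating computation, assuming hypothetical sizes; the `adamoe_gate` name and the TopK-via-sort implementation are illustrative, not the paper's code:

```python
import numpy as np

def adamoe_gate(x, W, n_true, k, renormalize=True):
    """Sketch of AdaMoE gating for a batch of tokens.

    x: (T, d) token representations; W: (d, n_true + m) router matrix,
    where columns >= n_true correspond to null experts.
    """
    scores = x @ W                                   # (T, n_true + m) pre-softmax scores
    # Retain the k largest scores per token; mask the rest to -inf before softmax.
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    g = np.exp(masked - masked.max(axis=-1, keepdims=True))
    g /= g.sum(axis=-1, keepdims=True)               # sparse softmax: k nonzeros per token
    if renormalize:
        # Renormalize over the selected *true* experts; null-expert weights drop to 0.
        true_mass = g[:, :n_true].sum(axis=-1, keepdims=True)
        g = np.where(np.arange(W.shape[1]) < n_true,
                     g / np.maximum(true_mass, 1e-9), 0.0)
    return g
```

With `renormalize=True`, the weights of the selected true experts sum to one for any token that selects at least one true expert, regardless of how many null experts it also selected.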
3. Load-Balancing Objective with Null Experts
To prevent degenerate routing (underutilization of true or null experts), AdaMoE employs a modified load-balancing loss. In standard MoE with $n$ experts, the loss is
$$\mathcal{L}_{\mathrm{balance}} = n \sum_{i=1}^{n} f_i P_i,$$
where $f_i$ denotes the batchwise frequency of expert $i$ being selected, and $P_i$ is the mean pre-routing softmax probability assigned to expert $i$. In AdaMoE, given the equivalence among null experts, their loads are aggregated into a single virtual expert. The revised loss is
$$\mathcal{L}_{\mathrm{balance}} = (n+1)\left(\sum_{i=1}^{n} f_i P_i + f_{\mathrm{null}} P_{\mathrm{null}}\right),$$
with
$$f_{\mathrm{null}} = \sum_{j=n+1}^{n+m} f_j, \qquad P_{\mathrm{null}} = \sum_{j=n+1}^{n+m} P_j.$$
This encourages proportional usage of true and null experts, directly controlling average FLOPs and expert load without forcing an artificial balance among equivalent null experts.
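The aggregated loss can be sketched as follows. This is a hedged reading of the formulation above; the exact coefficients in the paper's implementation may differ:

```python
import numpy as np

def adamoe_balance_loss(gate_probs, selected_mask, n_true):
    """Modified load-balancing loss: the m null experts share one aggregated load.

    gate_probs: (T, n_true + m) dense softmax over all router logits (pre-TopK);
    selected_mask: (T, n_true + m) 0/1 indicators of which experts each token selected.
    """
    f = selected_mask.mean(axis=0)   # f_i: batchwise selection frequency of expert i
    P = gate_probs.mean(axis=0)      # P_i: mean routing probability of expert i
    # Aggregate the (equivalent) null experts into a single virtual expert.
    f_agg = np.concatenate([f[:n_true], [f[n_true:].sum()]])
    P_agg = np.concatenate([P[:n_true], [P[n_true:].sum()]])
    return (n_true + 1) * float(np.sum(f_agg * P_agg))
```

Because all null experts contribute to one aggregate term, the loss does not penalize uneven usage among them, only the overall split between computation and skipping.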
4. Token Routing and Inference Procedure
The AdaMoE forward pass is summarized by the following sequence:
- Compute routing scores via matrix multiplication: $s = xW$.
- Mask all but the top-$k$ routing scores for each token.
- Apply softmax over masked scores to obtain dispatch weights.
- For each true expert, perform the forward computation weighted by the routing probability; null experts are omitted as they contribute zero.
- During training, compute the batchwise load-balancing loss using the aggregated expert selection statistics.
Null expert selection adapts at the token level, allowing downstream tasks to selectively allocate model capacity based on input complexity.
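The forward sequence above, with null experts skipped entirely, might look like the following sketch. The `adamoe_forward` name and the expert callables are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def adamoe_forward(x, W, experts, k):
    """Forward-pass sketch: dispatch each token only to its selected true experts.

    experts: list of callables for the n true experts; router columns beyond
    len(experts) are null experts and are simply skipped (zero FLOPs).
    """
    n_true = len(experts)
    scores = x @ W                                   # routing scores
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf) # mask all but top-k
    g = np.exp(masked - masked.max(axis=-1, keepdims=True))
    g /= g.sum(axis=-1, keepdims=True)               # softmax over masked scores

    y = np.zeros_like(x)
    for i in range(n_true):                          # null experts (i >= n_true) never run
        hit = g[:, i] > 0
        if hit.any():
            y[hit] += g[hit, i:i+1] * experts[i](x[hit])
    return y
```

Only tokens that actually selected a given true expert are forwarded to it, so a token routed mostly to null experts incurs proportionally less computation.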
5. Integration with Pretrained MoE-LLMs
AdaMoE is designed for seamless integration into existing pretrained MoE-LLMs. For example, adapting a Mixtral-8×7B model involves:
- Expanding the router output dimension from $n$ to $n+m$ (e.g., by introducing an additional LoRA "gate2" module for the null experts, initialized by copying/repeating the original gating parameters).
- Selecting hyperparameters such as the number of null experts $m$ and the total number of selected experts per token (e.g., $2$ or $4$).
- Scheduling the load-balancing loss weight: a higher coefficient in the initial epoch enforces load uniformity, followed by a lower coefficient that permits adaptive expert selection.
- Employing quantized LoRA (4-bit QLoRA), with rank-$8$ adapters on relevant submodules and a correspondingly small learning rate.
- Fine-tuning with approximately $1000$ samples per task for $2$ epochs.
This approach retains compatibility with auto-regressive modeling and incurs no future token peeking.
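The first step, expanding the router by copying and repeating the original gating parameters, can be approximated by a plain weight extension. This is a sketch only; the paper attaches a LoRA "gate2" module rather than directly extending the weight matrix as done here:

```python
import numpy as np

def extend_router(W, m_null, seed=0):
    """Append m_null null-expert columns to a pretrained router matrix W (d, n).

    New columns are initialized by repeating randomly chosen original gating
    columns with small noise, so null experts start with comparable logits.
    """
    rng = np.random.default_rng(seed)
    d, n = W.shape
    src_cols = rng.integers(0, n, size=m_null)            # repeat original columns
    new = W[:, src_cols] + 0.01 * rng.standard_normal((d, m_null))
    return np.concatenate([W, new], axis=1)               # (d, n + m_null)
```

The original $n$ gating columns are left untouched, so the pretrained routing behavior over true experts is preserved at initialization.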
6. Empirical Evaluation and Ablation Analysis
AdaMoE was evaluated on both Mixtral-8×7B (MoE-LLM) and Llama2-7B with Mo-LoRA experts. Downstream benchmarks included semantic understanding (RTE, CoLA) and commonsense reasoning tasks (ScienceQA, CommonsenseQA, OpenBookQA, WinoGrande, HellaSwag, PIQA, SocialIQA, ARC-Challenge). Metrics were accuracy and average expert load (measured in FLOPs).
Key empirical results for Mixtral-8×7B with AdaMoE (selected tasks):
| Task | Accuracy (vanilla) | Accuracy (AdaMoE) | Expert Load (vanilla) | Expert Load (AdaMoE) |
|---|---|---|---|---|
| ARC-Challenge | 87.46 | 89.15 | 2.00 | 1.67 |
| WinoGrande | 80.43 | 81.93 | 2.00 | 1.66 |
| HellaSwag | 84.10 | 85.50 | 2.00 | 1.68 |
| PIQA | 90.48 | 90.32 | 2.00 | 1.59 |
| SocialIQA | 76.36 | 76.97 | 2.00 | 1.63 |
On average across six tasks, AdaMoE reduced FLOPs by approximately $15\%$ with improved or maintained accuracy. LoRA-based AdaMoE on Llama2-7B further demonstrated consistent outperformance relative to vanilla Mo-LoRA under similar or reduced computational budgets. For example, on the RTE dataset, AdaMoE exceeded the baseline accuracy while operating at a lower average expert load.
Ablation studies revealed:
- Shared load balancing among null experts (aggregating their loads as in the revised loss) yields a consistent accuracy improvement over balancing each null expert individually.
- Annealing the load-loss coefficient from high to low fosters initial expert utilization balance without sacrificing final performance.
- Renormalizing routing weights over true experts rather than over all selected experts provides a consistent mean accuracy gain.
- An excessive number of null experts combined with a low selection budget sharply reduces expert load but can degrade accuracy, while moderate settings maintain favorable trade-offs.
7. Comparative Merits and Practical Considerations
AdaMoE achieves token-level adaptive sparsity without complex auxiliary routing networks or future-seeing mechanisms. It enables granular control over computational costs per input, leveraging joint optimization of expert selection and load balancing. Empirically, AdaMoE yields $10$–$15\%$ average FLOPs reductions at identical or higher downstream accuracy relative to standard MoE LLMs.
AdaMoE thus provides an efficient and easily adoptable mechanism for token-adaptive compute allocation in production-scale LLMs, with minimal impact on compatibility and performance (Zeng et al., 2024).