
AdaMoE: Adaptive Routing with Null Experts

Updated 25 February 2026
  • The paper introduces AdaMoE, a token-adaptive routing mechanism with null experts that selects true experts based on each token's complexity.
  • AdaMoE reduces computational cost, cutting average FLOPs by roughly 15% while preserving or improving downstream performance across benchmarks.
  • AdaMoE integrates seamlessly into pretrained MoE-LLMs with minimal architectural changes, facilitating effective fine-tuning and load balancing.

Adaptive Routing with Null Experts (AdaMoE) is a method for token-adaptive routing in Mixture-of-Experts (MoE) architectures, designed to dynamically allocate computational resources across tokens in LLMs. AdaMoE introduces "null experts," which allow each token to select a variable number of true experts based on its individual complexity, enabling efficient utilization of model capacity without increasing FLOPs. The mechanism applies to both pretraining and fine-tuning of MoE-based LLMs and requires only minimal changes to standard MoE layers, ensuring compatibility with existing infrastructure. Empirical evaluation demonstrates that AdaMoE achieves substantial reductions in average expert load (FLOPs) while matching or improving downstream performance across a diverse set of benchmarks (Zeng et al., 2024).

1. Architectural Design

AdaMoE is based on the conventional sparse MoE layer, comprising $n$ "true" experts (e.g., feed-forward networks, LoRA modules) and a router network that selects the top-$k$ experts per token. The AdaMoE modification supplements the expert set with $m$ "null" experts, which are identity or zero mappings incurring zero FLOPs. The router output dimension is thus extended to $n+m$. Routing is performed by increasing the fan-out to $k' = k + m$, allowing each token to be dispatched to a mixture of true and null experts.

The selection process becomes token-adaptive: tokens requiring more complex transformations select more true experts (fewer nulls), while simpler tokens route to more null experts, thereby skipping unnecessary computation. Despite this structural adaptation, the layer interface and call signature remain unchanged—only the expert selection mechanism is modified.

2. Gating Formulation and Routing Dynamics

Given input token representations $x \in \mathbb{R}^d$, the router computes gating scores via a learned matrix $W_g \in \mathbb{R}^{d \times (n+m)}$, yielding pre-softmax scores $z = x W_g$. The $k'$ largest entries in $z$ for each token are retained, with all others set to $-\infty$ through a TopK operation. Following this, a sparse softmax yields $p = \mathrm{Softmax}(\mathrm{TopK}(z, k')) \in \mathbb{R}^{n+m}$, ensuring only $k'$ nonzero values per token.
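This gating step can be sketched in NumPy (a toy, unbatched version for a single token; the function name `adamoe_gate` and the shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def adamoe_gate(x, W_g, k_prime):
    """Sparse AdaMoE gating over n true + m null experts.
    x: token representation, shape (d,); W_g: router weights, shape (d, n+m)."""
    z = x @ W_g                                   # pre-softmax scores, (n+m,)
    masked = np.full_like(z, -np.inf)             # TopK: mask all but k' scores
    top = np.argpartition(z, -k_prime)[-k_prime:]
    masked[top] = z[top]
    e = np.exp(masked - z[top].max())             # exp(-inf) = 0 for masked entries
    return e / e.sum()                            # exactly k' nonzero weights

rng = np.random.default_rng(0)
d, n, m, k_prime = 16, 8, 8, 3
p = adamoe_gate(rng.normal(size=d), rng.normal(size=(d, n + m)), k_prime)
```

Subtracting the maximum retained score before exponentiating is the usual numerical-stability trick; masked entries map to $\exp(-\infty) = 0$, giving the sparse softmax directly.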

Indices $1,\ldots,n$ are assigned to true experts; $n+1,\ldots,n+m$ correspond to null experts. Token output is computed as

$$y = \sum_{i=1}^{n+m} p_i E_i(x),$$

where $E_i(x) = 0$ for null experts, implying zero computational cost. To align output magnitudes with standard MoE, AdaMoE optionally renormalizes $p$ over just the true experts in the selected top-$k'$ set $S_{\text{true}} = \{\, i \leq n \mid i \in \text{top-}k' \,\}$:

$$p_i \leftarrow \frac{\exp(z_i)}{\sum_{j \in S_{\text{true}}} \exp(z_j)} \quad \text{for } i \in S_{\text{true}},$$

leaving $p_j = 0$ for null experts. This normalization ensures output consistency across expert selection patterns (Zeng et al., 2024).
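The optional renormalization can be illustrated as follows (a sketch; `renormalize_true` is a hypothetical helper and the toy scores are made up):

```python
import numpy as np

def renormalize_true(p, z, n):
    """Renormalize dispatch weights over the selected *true* experts only
    (indices i < n with p_i > 0); null experts keep weight zero."""
    sel = [i for i in range(n) if p[i] > 0]
    q = np.zeros_like(p)
    if sel:
        w = np.exp(z[sel] - z[sel].max())
        q[sel] = w / w.sum()
    return q

# toy layer: n=4 true + m=2 null experts, k'=3 selected (experts 0, 2, 5)
z = np.array([2.0, -1.0, 1.0, -2.0, 0.0, 1.5])
p = np.array([0.55, 0.0, 0.20, 0.0, 0.0, 0.25])   # expert 5 is a null expert
q = renormalize_true(p, z, n=4)
```

Here the weight that originally went to null expert 5 is redistributed across true experts 0 and 2, so the output magnitude matches a token that selected only true experts.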

3. Load-Balancing Objective with Null Experts

To prevent degenerate routing (underutilization of true or null experts), AdaMoE employs a modified load-balancing loss. In standard MoE, the loss is

$$\ell_{\text{load}} = \alpha\, n \sum_{i=1}^{n} f_i P_i,$$

where $f_i$ denotes the batchwise frequency with which expert $i$ is selected, and $P_i$ is its mean pre-routing softmax probability. In AdaMoE, given the equivalence among null experts, their loads are aggregated. The revised loss is

$$\ell_{\text{null}} = \alpha\,(n+m) \sum_{i=1}^{n+m} \tilde{f}_i P_i$$

with

$$\tilde{f}_i = \begin{cases} f_i, & i \leq n \\ \dfrac{1}{m} \sum_{j=n+1}^{n+m} f_j, & i > n \end{cases}$$

This encourages proportional usage of true and null experts, directly controlling average FLOPs and expert load without forcing an artificial balance among equivalent null experts.
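The aggregated loss is straightforward to compute from batch statistics. A minimal sketch (the function name and the value of $\alpha$ are illustrative; the toy frequencies are made up):

```python
import numpy as np

def null_balance_loss(f, P, n, m, alpha=0.01):
    """AdaMoE load-balancing loss with aggregated null-expert loads.
    f: batchwise selection frequency per expert, shape (n+m,)
    P: mean pre-routing softmax probability per expert, shape (n+m,)"""
    f_tilde = f.astype(float).copy()
    f_tilde[n:] = f[n:].mean()        # replace each null load by their average
    return alpha * (n + m) * float(np.dot(f_tilde, P))

# toy statistics: 2 true experts, 2 null experts
f = np.array([0.3, 0.3, 0.1, 0.3])
P = np.array([0.3, 0.3, 0.1, 0.3])
loss = null_balance_loss(f, P, n=2, m=2)
```

Averaging the null-expert frequencies means the loss only penalizes the total null load, not how it is split among the interchangeable null experts.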

4. Token Routing and Inference Procedure

The AdaMoE forward pass is summarized by the following sequence:

  1. Compute routing scores via matrix multiplication: $Z = X W_g$.
  2. Mask all but the top-$k'$ routing scores for each token.
  3. Apply softmax over the masked scores to obtain dispatch weights.
  4. For each true expert, perform the forward computation weighted by the routing probability; null experts are omitted as they contribute zero.
  5. During training, compute the batchwise load-balancing loss $\ell_{\text{null}}$ using the aggregated expert selection statistics.

Null expert selection adapts at the token level, allowing downstream tasks to selectively allocate model capacity based on input complexity.
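The steps above can be assembled into a toy forward pass (a NumPy sketch with a per-token loop; the optional true-expert renormalization is omitted, and all names and shapes are illustrative):

```python
import numpy as np

def adamoe_forward(X, W_g, experts, k_prime):
    """Token-adaptive MoE forward pass.
    X: (T, d) token representations; experts: the n true experts as callables;
    W_g has n + m columns, the last m belonging to null experts."""
    n = len(experts)
    Z = X @ W_g                                            # step 1: routing scores
    Y = np.zeros_like(X)
    for t in range(X.shape[0]):
        top = np.argpartition(Z[t], -k_prime)[-k_prime:]   # step 2: top-k' mask
        w = np.exp(Z[t][top] - Z[t][top].max())
        w /= w.sum()                                       # step 3: sparse softmax
        for i, wi in zip(top, w):
            if i < n:                                      # step 4: nulls cost 0 FLOPs
                Y[t] += wi * experts[i](X[t])
    return Y

rng = np.random.default_rng(1)
d, n, m, k_prime = 8, 4, 4, 3
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n)]
X = rng.normal(size=(5, d))
Y = adamoe_forward(X, rng.normal(size=(d, n + m)), experts, k_prime)
```

Tokens whose top-$k'$ selection lands mostly on null-expert indices trigger few or no true-expert calls, which is exactly where the FLOPs savings come from.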

5. Integration with Pretrained MoE-LLMs

AdaMoE is designed for seamless integration into existing pretrained MoE-LLMs. For example, adapting a Mixtral-8×7B model involves:

  • Expanding the router output dimension from $8$ to $8+m$ (e.g., by introducing an additional LoRA "gate2" module for null experts, initialized by copying/repeating the original gating parameters).
  • Selecting hyperparameters such as the number of null experts $m$ (e.g., $m=8$) and the total number of selected experts per token $k'$ (e.g., $k'=3$ or $4$).
  • Scheduling the load-balancing loss: a higher weight $\alpha_1$ in the initial epoch enforces load uniformity, followed by a lower $\alpha_2$ for adaptive expert selection.
  • Employing quantized LoRA (4-bit QLoRA), with rank-$8$ adapters on relevant submodules and a learning rate of $5 \times 10^{-5}$.
  • Applying fine-tuning with approximately $1000$ samples per task for $2$ epochs.

This approach retains compatibility with auto-regressive modeling and involves no peeking at future tokens.
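As a toy illustration of the router expansion, the gating matrix can be widened by reusing the pretrained columns for the new null experts (shapes and the copy-initialization are assumptions based on the description above; the paper's "gate2" LoRA module is not reproduced here):

```python
import numpy as np

d, n, m = 32, 8, 8                    # Mixtral-style: 8 true experts, 8 nulls
rng = np.random.default_rng(2)
W_g = rng.normal(size=(d, n))         # stand-in for the pretrained router weights
# widen the router from n to n + m outputs; null-expert columns start as
# copies of the true-expert columns, so initial routing scores are unchanged
W_g_ext = np.concatenate([W_g, W_g.copy()], axis=1)
```

Starting the null-expert columns as copies keeps the pretrained routing distribution intact at initialization; fine-tuning then learns when to shift mass onto the null experts.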

6. Empirical Evaluation and Ablation Analysis

AdaMoE was evaluated on both Mixtral-8×7B (MoE-LLM) and Llama2-7B with Mo-LoRA experts. Downstream benchmarks included semantic understanding (RTE, CoLA) and commonsense reasoning tasks (ScienceQA, CommonsenseQA, OpenBookQA, WinoGrande, HellaSwag, PIQA, SocialIQA, ARC-Challenge). Metrics were accuracy and average expert load (measured in FLOPs).

Key empirical results for Mixtral-8×7B with AdaMoE ($m=8$, $k'=3$):

| Task | Accuracy (vanilla) | Accuracy (AdaMoE) | Expert Load (vanilla) | Expert Load (AdaMoE) |
|---|---|---|---|---|
| ARC-Challenge | 87.46 | 89.15 | 2.00 | 1.67 |
| WinoGrande | 80.43 | 81.93 | 2.00 | 1.66 |
| HellaSwag | 84.10 | 85.50 | 2.00 | 1.68 |
| PIQA | 90.48 | 90.32 | 2.00 | 1.59 |
| SocialIQA | 76.36 | 76.97 | 2.00 | 1.63 |

On average across six tasks, AdaMoE reduced FLOPs by approximately $15.2\%$ with improved or maintained accuracy. LoRA-based AdaMoE on Llama2-7B further demonstrated consistent outperformance relative to vanilla Mo-LoRA under similar or reduced computational budgets. For example, on the RTE dataset, the baseline accuracy/load was $65.1\%$/$1.00$, while AdaMoE with $m=5$, $k'=2$ achieved $67.0\%$/$0.44$.
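The table's load figures translate directly into per-task FLOPs savings relative to vanilla top-2 routing (load 2.00). Note that the table lists five of the six tasks behind the 15.2% average, so the mean below differs slightly:

```python
# per-task expert-load reduction vs. vanilla Mixtral top-2 routing (load 2.00)
loads = {"ARC-Challenge": 1.67, "WinoGrande": 1.66, "HellaSwag": 1.68,
         "PIQA": 1.59, "SocialIQA": 1.63}
reductions = {task: (2.00 - load) / 2.00 for task, load in loads.items()}
mean_reduction = sum(reductions.values()) / len(reductions)   # ~17.7% over these five
```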

Ablation studies revealed:

  • Shared load balancing among null experts (as in $\ell_{\text{null}}$) yields up to $+10\%$ accuracy improvement over balancing each null expert individually.
  • Annealing the load-loss coefficient from high to low fosters balanced expert utilization early in training without sacrificing final performance.
  • Renormalizing routing weights over true experts rather than all selected experts provides a mean $+1\%$ accuracy gain.
  • Excessive null experts (e.g., $m=32$) with low $k'$ sharply reduce expert load but can degrade accuracy, while moderate settings maintain favorable trade-offs.

7. Comparative Merits and Practical Considerations

AdaMoE achieves token-level adaptive sparsity without complex auxiliary routing networks or future-seeing mechanisms. It enables granular control over computational cost per input, leveraging joint optimization of expert selection and load balancing. Empirically, AdaMoE yields $10$–$20\%$ average FLOPs reductions at identical or higher downstream accuracy relative to standard MoE LLMs.

AdaMoE thus provides an efficient and easily adoptable mechanism for token-adaptive compute allocation in production-scale LLMs, with minimal impact on compatibility and performance (Zeng et al., 2024).

References

Zeng et al. (2024). AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models.
