AdaMoE: Adaptive Routing with Null Experts
- The paper introduces AdaMoE, a token-adaptive routing mechanism with null experts that selects true experts based on each token's complexity.
- AdaMoE reduces computational cost, cutting FLOPs by up to 15% while preserving or improving downstream performance across benchmarks.
- AdaMoE integrates seamlessly into pretrained MoE-LLMs with minimal architectural changes, facilitating effective fine-tuning and load balancing.
Adaptive Routing with Null Experts (AdaMoE) is a method for token-adaptive routing in Mixture-of-Experts (MoE) architectures, designed to dynamically allocate computational resources across tokens in LLMs. AdaMoE introduces "null experts," which allow each token to select a variable number of true experts based on its individual complexity, enabling efficient utilization of model capacity without increasing FLOPs. The mechanism applies to both pretraining and fine-tuning of MoE-based LLMs, and requires only minimal changes to standard MoE layers, ensuring compatibility with existing infrastructure. Empirical evaluation demonstrates that AdaMoE achieves substantial reductions in average expert load (FLOPs) while improving or matching downstream performance across a diverse set of benchmarks (Zeng et al., 2024).
1. Architectural Design
AdaMoE is based on the conventional sparse MoE layer, comprising $n$ "true" experts (e.g., feed-forward networks, LoRA modules) and a router network that selects the top-$k$ experts per token. The AdaMoE modification supplements the expert set with $m$ "null" experts, which are identity or zero mappings incurring zero FLOPs. The router output dimension is thus extended from $n$ to $n+m$. Routing fan-out is increased accordingly: each token selects $k$ of the $n+m$ experts, allowing it to be dispatched to a mixture of true and null experts.
The selection process becomes token-adaptive: tokens requiring more complex transformations select more true experts (fewer nulls), while simpler tokens route to more null experts, thereby skipping unnecessary computation. Despite this structural adaptation, the layer interface and call signature remain unchanged—only the expert selection mechanism is modified.
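The token-adaptive behavior described above can be illustrated with a toy routing example. The `true_expert_counts` helper, the sizes, and the random logits below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def true_expert_counts(scores, n_true, k):
    """How many *true* experts each token uses under top-k routing over the
    extended pool of n_true true + m null experts (columns >= n_true are null).
    Illustrative sketch only."""
    topk = np.argsort(scores, axis=-1)[:, -k:]   # indices of the k selected experts
    return (topk < n_true).sum(axis=-1)          # per-token count of true experts

# Hypothetical router logits for 16 tokens over 8 true + 4 null experts.
rng = np.random.default_rng(1)
scores = rng.standard_normal((16, 12))
counts = true_expert_counts(scores, n_true=8, k=4)   # varies between 0 and 4 per token
```

Tokens whose top-$k$ slots are dominated by null experts perform correspondingly less computation.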
2. Gating Formulation and Routing Dynamics
Given an input token representation $x \in \mathbb{R}^d$, the router computes gating scores via a learned matrix $W \in \mathbb{R}^{d \times (n+m)}$, yielding pre-softmax scores $s = xW \in \mathbb{R}^{n+m}$. The $k$ largest entries in $s$ for each token are retained, with all others set to $-\infty$ through a TopK operation. Following this, a sparse softmax yields $g \in \mathbb{R}^{n+m}$, ensuring only $k$ nonzero values per token.
Indices $1, \dots, n$ are assigned to true experts; $n+1, \dots, n+m$ correspond to null experts. The token output is computed as
$$y = \sum_{i=1}^{n+m} g_i \, E_i(x),$$
where $E_i(x) = 0$ for null experts, implying zero computational cost. To align output magnitudes with standard MoE, AdaMoE optionally renormalizes over just the true experts in the selected top-$k$ set $\mathcal{T}$:
$$\tilde{g}_i = \frac{g_i}{\sum_{j \in \mathcal{T},\, j \le n} g_j}, \qquad i \in \mathcal{T},\ i \le n,$$
leaving $\tilde{g}_i = 0$ for null experts. This normalization ensures output consistency across expert selection patterns (Zeng et al., 2024).
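A minimal NumPy sketch of this gating computation, assuming hypothetical sizes; the `adamoe_gate` name and the TopK-via-sort implementation are illustrative, not the paper's code:

```python
import numpy as np

def adamoe_gate(x, W, n_true, k, renormalize=True):
    """Sketch of AdaMoE gating for a batch of tokens.

    x: (T, d) token representations; W: (d, n_true + m) router matrix,
    where columns >= n_true correspond to null experts.
    """
    scores = x @ W                                   # (T, n_true + m) pre-softmax scores
    # Retain the k largest scores per token; mask the rest to -inf before softmax.
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    g = np.exp(masked - masked.max(axis=-1, keepdims=True))
    g /= g.sum(axis=-1, keepdims=True)               # sparse softmax: k nonzeros per token
    if renormalize:
        # Renormalize over the selected *true* experts; null-expert weights drop to 0.
        true_mass = g[:, :n_true].sum(axis=-1, keepdims=True)
        g = np.where(np.arange(W.shape[1]) < n_true,
                     g / np.maximum(true_mass, 1e-9), 0.0)
    return g
```

With `renormalize=True`, the weights of the selected true experts sum to one for any token that selects at least one true expert, regardless of how many null experts it also selected.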
3. Load-Balancing Objective with Null Experts
To prevent degenerate routing (underutilization of true or null experts), AdaMoE employs a modified load-balancing loss. In standard MoE with $n$ experts, the loss is
$$\mathcal{L}_{\mathrm{balance}} = n \sum_{i=1}^{n} f_i P_i,$$
where $f_i$ denotes the batchwise frequency of expert $i$ being selected, and $P_i$ is the mean pre-routing softmax probability assigned to expert $i$. In AdaMoE, given the equivalence among null experts, their loads are aggregated into a single virtual expert. The revised loss is
$$\mathcal{L}_{\mathrm{balance}} = (n+1)\left(\sum_{i=1}^{n} f_i P_i + f_{\mathrm{null}} P_{\mathrm{null}}\right),$$
with
$$f_{\mathrm{null}} = \sum_{j=n+1}^{n+m} f_j, \qquad P_{\mathrm{null}} = \sum_{j=n+1}^{n+m} P_j.$$
This encourages proportional usage of true and null experts, directly controlling average FLOPs and expert load without forcing an artificial balance among equivalent null experts.
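The aggregated loss can be sketched as follows. This is a hedged reading of the formulation above; the exact coefficients in the paper's implementation may differ:

```python
import numpy as np

def adamoe_balance_loss(gate_probs, selected_mask, n_true):
    """Modified load-balancing loss: the m null experts share one aggregated load.

    gate_probs: (T, n_true + m) dense softmax over all router logits (pre-TopK);
    selected_mask: (T, n_true + m) 0/1 indicators of which experts each token selected.
    """
    f = selected_mask.mean(axis=0)   # f_i: batchwise selection frequency of expert i
    P = gate_probs.mean(axis=0)      # P_i: mean routing probability of expert i
    # Aggregate the (equivalent) null experts into a single virtual expert.
    f_agg = np.concatenate([f[:n_true], [f[n_true:].sum()]])
    P_agg = np.concatenate([P[:n_true], [P[n_true:].sum()]])
    return (n_true + 1) * float(np.sum(f_agg * P_agg))
```

Because all null experts contribute to one aggregate term, the loss does not penalize uneven usage among them, only the overall split between computation and skipping.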
4. Token Routing and Inference Procedure
The AdaMoE forward pass is summarized by the following sequence:
- Compute routing scores via matrix multiplication: $s = xW$.
- Mask all but the top-$k$ routing scores for each token.
- Apply softmax over masked scores to obtain dispatch weights.
- For each true expert, perform the forward computation weighted by the routing probability; null experts are omitted as they contribute zero.
- During training, compute the batchwise load-balancing loss using the aggregated expert selection statistics.
Null expert selection adapts at the token level, allowing downstream tasks to selectively allocate model capacity based on input complexity.
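The forward sequence above, with null experts skipped entirely, might look like the following sketch. The `adamoe_forward` name and the expert callables are illustrative assumptions, not the reference implementation:

```python
import numpy as np

def adamoe_forward(x, W, experts, k):
    """Forward-pass sketch: dispatch each token only to its selected true experts.

    experts: list of callables for the n true experts; router columns beyond
    len(experts) are null experts and are simply skipped (zero FLOPs).
    """
    n_true = len(experts)
    scores = x @ W                                   # routing scores
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf) # mask all but top-k
    g = np.exp(masked - masked.max(axis=-1, keepdims=True))
    g /= g.sum(axis=-1, keepdims=True)               # softmax over masked scores

    y = np.zeros_like(x)
    for i in range(n_true):                          # null experts (i >= n_true) never run
        hit = g[:, i] > 0
        if hit.any():
            y[hit] += g[hit, i:i+1] * experts[i](x[hit])
    return y
```

Only tokens that actually selected a given true expert are forwarded to it, so a token routed mostly to null experts incurs proportionally less computation.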
5. Integration with Pretrained MoE-LLMs
AdaMoE is designed for seamless integration into existing pretrained MoE-LLMs. For example, adapting a Mixtral-8×7B model involves:
- Expanding the router output dimension from $n$ to $n+m$ (e.g., by introducing an additional LoRA "gate2" module for the null experts, initialized by copying/repeating the original gating parameters).
- Selecting hyperparameters such as the number of null experts $m$ and the total number of selected experts per token (e.g., $2$ or $4$).
- Scheduling the load-balancing loss weight: a higher coefficient in the initial epoch enforces load uniformity, followed by a lower coefficient that permits adaptive expert selection.
- Employing quantized LoRA (4-bit QLoRA), with rank-$8$ adapters on relevant submodules and a correspondingly small learning rate.
- Fine-tuning with approximately $1000$ samples per task for $2$ epochs.
This approach retains compatibility with auto-regressive modeling and incurs no future token peeking.
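The first step, expanding the router by copying and repeating the original gating parameters, can be approximated by a plain weight extension. This is a sketch only; the paper attaches a LoRA "gate2" module rather than directly extending the weight matrix as done here:

```python
import numpy as np

def extend_router(W, m_null, seed=0):
    """Append m_null null-expert columns to a pretrained router matrix W (d, n).

    New columns are initialized by repeating randomly chosen original gating
    columns with small noise, so null experts start with comparable logits.
    """
    rng = np.random.default_rng(seed)
    d, n = W.shape
    src_cols = rng.integers(0, n, size=m_null)            # repeat original columns
    new = W[:, src_cols] + 0.01 * rng.standard_normal((d, m_null))
    return np.concatenate([W, new], axis=1)               # (d, n + m_null)
```

The original $n$ gating columns are left untouched, so the pretrained routing behavior over true experts is preserved at initialization.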
6. Empirical Evaluation and Ablation Analysis
AdaMoE was evaluated on both Mixtral-8×7B (MoE-LLM) and Llama2-7B with Mo-LoRA experts. Downstream benchmarks included semantic understanding (RTE, CoLA) and commonsense reasoning tasks (ScienceQA, CommonsenseQA, OpenBookQA, WinoGrande, HellaSwag, PIQA, SocialIQA, ARC-Challenge). Metrics were accuracy and average expert load (measured in FLOPs).
Key empirical results for Mixtral-8×7B with AdaMoE (selected tasks):
| Task | Accuracy (vanilla) | Accuracy (AdaMoE) | Expert Load (vanilla) | Expert Load (AdaMoE) |
|---|---|---|---|---|
| ARC-Challenge | 87.46 | 89.15 | 2.00 | 1.67 |
| WinoGrande | 80.43 | 81.93 | 2.00 | 1.66 |
| HellaSwag | 84.10 | 85.50 | 2.00 | 1.68 |
| PIQA | 90.48 | 90.32 | 2.00 | 1.59 |
| SocialIQA | 76.36 | 76.97 | 2.00 | 1.63 |
On average across six tasks, AdaMoE reduced FLOPs by approximately $15\%$ with improved or maintained accuracy. LoRA-based AdaMoE on Llama2-7B further demonstrated consistent outperformance relative to vanilla Mo-LoRA under similar or reduced computational budgets. For example, on the RTE dataset, AdaMoE exceeded the baseline accuracy while operating at a lower average expert load.
Ablation studies revealed:
- Shared load balancing among null experts (aggregating their loads as in the revised loss) yields a consistent accuracy improvement over balancing each null expert individually.
- Annealing the load-loss coefficient from high to low fosters initial expert utilization balance without sacrificing final performance.
- Renormalizing routing weights over true experts rather than over all selected experts provides a consistent mean accuracy gain.
- An excessive number of null experts combined with a low selection budget sharply reduces expert load but can degrade accuracy, while moderate settings maintain favorable trade-offs.
7. Comparative Merits and Practical Considerations
AdaMoE achieves token-level adaptive sparsity without complex auxiliary routing networks or future-seeing mechanisms. It enables granular control over computational costs per input, leveraging joint optimization of expert selection and load balancing. Empirically, AdaMoE yields $10$–$15\%$ average FLOPs reductions at identical or higher downstream accuracy relative to standard MoE LLMs.
AdaMoE thus provides an efficient and easily adoptable mechanism for token-adaptive compute allocation in production-scale LLMs, with minimal impact on compatibility and performance (Zeng et al., 2024).