AdaMoE: Adaptive Mixture of Experts
- AdaMoE is a dynamic mixture-of-experts framework that adapts expert participation per input, enabling context-aware specialization beyond fixed top-k routing.
- It employs mechanisms like null experts, decoupled routers, scale adapters, and statistical blending to optimize performance while reducing computational overhead.
- AdaMoE demonstrates improved generalization and efficiency in applications ranging from large language models and vision-language-action tasks to time series forecasting and CTR prediction.
AdaMoE (Adaptive Mixture of Experts) encompasses a family of architectures and routing strategies that extend the conventional Mixture-of-Experts (MoE) paradigm by introducing adaptive, dynamic, or contextually decoupled expert selection and weighting. Originating independently in several domains—including LLMs, real-time vision-language-action (VLA) decision policies, online concept drift adaptation, and spectral time series forecasting—the unifying principle of AdaMoE is to replace static, globally uniform expert selection with mechanisms that tailor expert participation to the input context or task requirements. This is realized via a range of architectural and algorithmic innovations across recent works (Shen et al., 16 Oct 2025, Zeng et al., 2024, Liu et al., 2022, Liu et al., 2024, Ni et al., 29 Nov 2025).
1. Core Motivation and Generalized Problem Statement
Traditional MoE architectures, including those adopted in Transformer-based LLMs, VLA policies, or sequential prediction tasks, generally apply a fixed “top-k” routing: each token or input activates exactly k out of N experts, selected by a router (typically a softmax layer or learned gating). This rigid scheme introduces two key limitations:
- Lack of input-adaptivity: All tokens, regardless of complexity or informativeness, consume identical computational resources, leading to underutilization for simple inputs and insufficient expressivity for rare or ambiguous cases.
- Monopolistic expert utilization: The joint role of routers in both selection and weighting can create “winner-takes-all” or expert collapse phenomena, especially when load balancing and specialization objectives conflict.
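The fixed scheme that these limitations stem from can be sketched in a few lines. This is an illustrative NumPy sketch of generic top-k routing, not the implementation of any cited paper; all names (`topk_moe_forward`, `router_w`) are ours:

```python
import numpy as np

def topk_moe_forward(x, router_w, experts, k=2):
    """Fixed top-k MoE layer: every token activates exactly k experts.

    x: (d,) input token; router_w: (num_experts, d) router weights;
    experts: list of callables mapping (d,) -> (d,).
    """
    logits = router_w @ x                      # (num_experts,) router logits
    e = np.exp(logits - logits.max())
    probs = e / e.sum()                        # softmax scores
    topk = np.argsort(probs)[-k:]              # indices of the k largest scores
    weights = probs[topk] / probs[topk].sum()  # renormalize over selected experts
    return sum(w * experts[i](x) for w, i in zip(weights, topk))
```

Note that `k` is a global constant here: every token, simple or hard, pays for exactly `k` expert evaluations — the rigidity the adaptive variants below relax.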
AdaMoE frameworks were devised to introduce context-sensitive flexibility—allowing the architecture to modulate expert participation (either the set, count, or blending coefficients of experts) per token, per task phase, or per spectral band. This adaptive specialization is designed to achieve three objectives:
- Maximize model expressivity without proportional increase in per-input compute.
- Balance rapid adaptation (plasticity) and generalization (stability), especially under non-stationary or open-world regimes.
- Enable transferability and parameter efficiency by reusing pretrained backbones or adapters, with negligible training/inference overhead.
2. Representative AdaMoE Architectures and Mechanisms
The AdaMoE concept encompasses multiple concrete realizations, each optimizing for its domain-specific constraints. The following table summarizes the core variants:
| Variant | Routing Mechanism | Application Domain |
|---|---|---|
| AdaMoE-VLA (Shen et al., 16 Oct 2025) | Decoupled router + scale adapter (top-k) | Vision-Language-Action learning |
| AdaMoE-LLM (Zeng et al., 2024) | Token-adaptive routing via null experts | Token-wise sparse LLMs |
| AdaMoE-CTR (Liu et al., 2022) | Statistically-blended expert weights | CTR/concept drift |
| AdaMoLE (Liu et al., 2024) | Adaptive thresholded LoRA gating | Fine-tuning LLMs/PEFT |
| Ada-MoGE (Ni et al., 29 Nov 2025) | Frequency-stats-driven expert activation | Time series (spectral) forecasting |
AdaMoE for VLA Models
In “Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning” (Shen et al., 16 Oct 2025), AdaMoE replaces dense feedforward layers in the action expert module with sparsely-activated MoE blocks. Each MoE block contains:
- One shared expert $E_{\text{shared}}$, always active
- $N$ routed experts $E_1, \dots, E_N$
- Two distinct routing streams:
  - Router $R$: computes softmax scores for load-balanced top-$k$ expert selection
  - Scale Adapter $A$: outputs additive “scale” components, modulating expert contribution weights independently of selection
Formally, for input $x$ (notation schematic):
- Scores: $s = \mathrm{softmax}(R(x))$, scales: $a = A(x)$
- Top-$k$ expert indices: $\mathcal{K} = \mathrm{TopK}(s, k)$
- Final expert weights: $w_i = s_i + a_i$ for $i \in \mathcal{K}$
- Output: $y = E_{\text{shared}}(x) + \sum_{i \in \mathcal{K}} w_i\, E_i(x)$
This decoupling separates task-driven specialization (via the scale adapter $A$) from load balancing (via the router $R$), addressing expert monopolization and improving collaborative capacity usage.
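The decoupled two-stream block can be sketched as follows. This is a minimal NumPy illustration assuming the additive score-plus-scale combination described above; names and the exact parameterization are our assumptions, not the paper's code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adamoe_vla_block(x, W_router, W_scale, shared_expert, routed_experts, k=2):
    """Decoupled selection/weighting: the router decides WHICH experts run
    (load-balanced top-k), the scale adapter decides HOW MUCH each selected
    expert contributes. Sketch only; the additive combination of score and
    scale is an assumption from the prose, not the exact formulation."""
    scores = softmax(W_router @ x)          # selection stream (softmax scores)
    scales = W_scale @ x                    # independent weighting stream
    topk = np.argsort(scores)[-k:]          # load-balanced top-k selection
    weights = scores[topk] + scales[topk]   # score modulated by additive scale
    out = shared_expert(x)                  # shared expert is always active
    for w, i in zip(weights, topk):
        out = out + w * routed_experts[i](x)
    return out
```

Because `W_scale` never influences which experts are chosen, load-balancing pressure on the router cannot flatten the *contribution* weights — the separation the text describes.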
AdaMoE with Token-Adaptive Routing and Null Experts
AdaMoE as presented for LLMs (“AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts LLMs” (Zeng et al., 2024)) extends fixed top-$k$ routing by introducing “null” (zero-cost) experts. The router still selects a fixed number of slots per token, drawn from the combined pool of true and null experts; only true experts execute computation, while null experts provide free “slots” that allow tokens to decline expert use when unnecessary.
- Per-token, the number of true experts activated is adaptive—optimization is regulated by an auxiliary load-balance loss that targets uniform average null expert usage.
- This mechanism reduces average compute (by up to 14.5% FLOPs on ARC-Challenge with Mixtral-8x7B finetuning) and increases performance (by 1.69 pp in accuracy) (Zeng et al., 2024).
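The null-expert mechanism can be sketched directly on top of standard top-k routing. A minimal NumPy sketch, assuming null experts are simply extra router slots whose output is zero; function and variable names are illustrative:

```python
import numpy as np

def adamoe_null_forward(x, W_router, true_experts, num_null=4, k=2):
    """Token-adaptive routing via null experts: the router scores
    len(true_experts) + num_null slots and still takes a fixed top-k,
    but slots won by null experts cost nothing, so the number of true
    experts actually executed varies per token."""
    logits = W_router @ x                     # (n_true + num_null,) slot logits
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    topk = np.argsort(probs)[-k:]
    n_true = len(true_experts)
    out = np.zeros_like(x)
    executed = 0
    for i in topk:
        if i < n_true:                        # slots i >= n_true are free no-ops
            out = out + probs[i] * true_experts[i](x)
            executed += 1
    return out, executed                      # executed <= k, adaptive per token
```

An easy token thus routes mostly to null slots (near-zero compute), while a hard token fills all `k` slots with true experts; the auxiliary load-balance loss keeps average null usage near a target rate.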
Statistical Weighting under Concept Drift
In CTR streaming environments with rapid distributional shift, AdaMoE (Liu et al., 2022) employs a bank of MLP experts, combined not by parametric routers but by a statistically-constructed, closed-form weight vector:
- At each time step (mini-batch), expert outputs are blended with a weight vector $w_t$ computed in closed form via normalized “correctness” tracking and exponentially decayed averaging of each expert's recent predictive accuracy.
This statistical adaptation achieves online AUC gains (up to +0.23% vs. next-best, +1.8% online CTR in production) with minimal compute and memory (Liu et al., 2022).
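The non-parametric update can be sketched in a few lines. This is a generic exponential-decay correctness tracker that illustrates the idea; the paper's exact statistic and normalization may differ, and all names here are ours:

```python
import numpy as np

def update_blend_weights(correct_scores, new_correctness, decay=0.9):
    """Non-parametric expert blending under concept drift: maintain an
    exponentially decayed 'correctness' statistic per expert, then
    normalize it into a closed-form blend weight vector. No learned
    router, so adaptation is immediate and nearly free."""
    correct_scores = decay * correct_scores + (1.0 - decay) * new_correctness
    weights = correct_scores / correct_scores.sum()
    return correct_scores, weights
```

Under drift, an expert that recently predicted well sees its statistic (and hence its blend weight) rise within a few mini-batches, while stale experts decay away — with only one vector of running state per expert bank.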
Adaptive LoRA Expert Mixtures (AdaMoLE)
AdaMoLE (Liu et al., 2024), while presented as adaptive mixtures of LoRA experts, is structurally an AdaMoE variant. In each adapted Transformer layer:
- A context-dependent threshold function $\tau(x)$ (parameterized via a small linear network) determines which of the $N$ LoRA experts to activate: an expert contributes only if its gate score exceeds $\tau(x)$.
The final output (for base weights $W_0$ and LoRA updates $\Delta W_i = B_i A_i$) takes the schematic form $h = W_0 x + \sum_{i:\, g_i(x) > \tau(x)} g_i(x)\, B_i A_i x$.
AdaMoLE achieves gains of up to 2.5–14 pp over single LoRA or static top-$k$ MoE on several reasoning/NLP benchmarks, demonstrating the advantage of dynamically tuning expert participation per context (Liu et al., 2024).
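A minimal sketch of the thresholded LoRA mixture follows. The sigmoid threshold head and the 1/N scaling are illustrative assumptions (chosen so the threshold is comparable to softmax gate mass), not AdaMoLE's exact parameterization:

```python
import numpy as np

def adamole_layer(x, W0, lora_pairs, W_gate, w_thresh):
    """Adaptive-threshold mixture of LoRA experts: a softmax gate scores
    each expert, a tiny linear + sigmoid head produces a per-input
    threshold, and only experts whose gate exceeds the threshold
    contribute their low-rank update."""
    n = len(lora_pairs)
    logits = W_gate @ x
    e = np.exp(logits - logits.max())
    gates = e / e.sum()                          # softmax gate scores, sum to 1
    tau = 1.0 / (1.0 + np.exp(-(w_thresh @ x)))  # sigmoid -> (0, 1)
    tau = tau / n                                # keep below uniform gate mass 1/n
    out = W0 @ x                                 # frozen base projection
    for gate, (A, B) in zip(gates, lora_pairs):
        if gate > tau:                           # per-input dynamic expert mask
            out = out + gate * (B @ (A @ x))     # LoRA update: dW = B A
    return out
```

Compared with a static top-$k$ over LoRA experts, the number of active experts here floats with the input: confident, peaked gates activate few experts, while flat gates near the threshold activate many.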
Spectral Band Allocation in Time Series
Ada-MoGE (Ni et al., 29 Nov 2025) adapts the number and identity of Gaussian band-pass “experts” per input based on the spectral distribution extracted from each batch via FFT. A lightweight MLP parses frequency- and variable-wise energy vectors to gate experts/filters, with learned band locations and soft/hard attention mechanisms. This allows precise balancing between information retention and noise suppression, yielding state-of-the-art results across 6 long-horizon multivariate forecasting benchmarks, with only 0.2M parameters (Ni et al., 29 Nov 2025).
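The core idea — gating band-pass experts by where the input's spectral energy sits — can be sketched without the learned components. This NumPy sketch uses a fixed softmax over band-energy overlap in place of Ada-MoGE's learned MLP gate; band centers, widths, and temperature are all illustrative:

```python
import numpy as np

def spectral_expert_gates(series, band_centers, band_width=0.05, temperature=0.1):
    """Frequency-statistics-driven gating: compute the FFT energy spectrum
    of an input window and softly activate Gaussian band-pass 'experts'
    whose bands overlap high-energy regions. Stand-in for the learned
    MLP gate in the actual model."""
    spec = np.abs(np.fft.rfft(series)) ** 2      # energy per frequency bin
    freqs = np.fft.rfftfreq(len(series))         # normalized freqs in [0, 0.5]
    spec = spec / spec.sum()
    # energy captured by each expert's Gaussian band
    energies = np.array([
        np.sum(spec * np.exp(-0.5 * ((freqs - c) / band_width) ** 2))
        for c in band_centers
    ])
    gates = np.exp(energies / temperature)
    return gates / gates.sum()                   # soft attention over band experts
```

For a signal dominated by one frequency, nearly all gate mass lands on the expert whose band covers it, so off-band experts (and the noise in their bands) are effectively suppressed.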
3. Mathematical and Algorithmic Formulation
The following table outlines key AdaMoE variants and their gating/combination mechanisms:
| AdaMoE Variant | Gating/Selection Function | Combination/Output |
|---|---|---|
| VLA (Shen et al., 16 Oct 2025) | Softmax router scores plus additive scale adapter, top-$k$ | Shared expert output plus scale-weighted sum of selected routed experts |
| Null Expert LLM (Zeng et al., 2024) | Top-$k$ masking + softmax over true and null experts | Weighted sum over selected true experts only (null slots are no-ops) |
| CTR Drift (Liu et al., 2022) | Closed-form weighted blending, exponential decay | Convex combination of expert outputs under the statistical weight vector |
| AdaMoLE (Liu et al., 2024) | Softmax gate with learned per-input threshold $\tau(x)$ | Base projection plus thresholded sum of gated LoRA updates |
| Ada-MoGE (Ni et al., 29 Nov 2025) | Spectral MLP over FFT energy statistics | Gated sum of Gaussian band-pass expert outputs |
Key innovations:
- Decoupled selection/weighting: Allows load balancing and specialization to be managed by separate networks (Shen et al., 16 Oct 2025).
- Null-expert/no-op inflation: Decouples maximal versus actual expert activity per token (Zeng et al., 2024).
- Statistical/blending gate (non-learned): Enables fully non-parametric temporal adaptation to stream drift (Liu et al., 2022).
- Threshold-based dynamic masking: Controls expert sparsity through adaptive per-input gating (Liu et al., 2024).
- Spectral statistics for selection: Allocates experts based on non-stationary frequency content (Ni et al., 29 Nov 2025).
4. Empirical Performance and Benchmark Results
AdaMoE variants consistently outperform classic MoE and other adaptive models across domains. Key results include:
- Vision-Language-Action policy learning (Shen et al., 16 Oct 2025):
- LIBERO: +1.8% (avg.) over dense baseline on fused vision/language-action suites
- RoboTwin v2: +9.3% (average success rate)
- Real robotic dual-arm tasks: +21.5% average improvement in success rate
- LLMs with Token-Adaptive Routing (Zeng et al., 2024):
- ARC-Challenge (Mixtral-8x7B): +1.69% accuracy, –14.5% FFN FLOPs
- Across WinoGrande, HellaSwag, PIQA, SIQA, OpenBookQA, and ARC-C: ∼15% less compute than fixed top-$k$ MoE, with matched or surpassed accuracy
- Online CTR Streaming (Liu et al., 2022):
- Industrial dataset: AUC = 0.7597 (vs. 0.7580 best baseline)
- Avazu: AUC up by 0.23%
- In production: +1.8% CTR, +1.89% eCPM
- PEFT for LLMs (AdaMoLE) (Liu et al., 2024):
- CommonsenseQA: 78.71% (AdaMoLE) vs. 76.25% (LoRA)
- SuperGLUE tasks: AdaMoLE outperforms LoRA by 1–14 pp
- Time Series Forecasting (Ni et al., 29 Nov 2025):
- ETTh1/ETTh2/ETTm1/ETTm2/ECL/Weather/Solar: Ada-MoGE achieves 51 first-place rankings (across 6 datasets, 4 horizons each), with only 0.2M parameters and up to 5× fewer FLOPs than Transformer baselines
5. Computational and Practical Considerations
AdaMoE architectures emphasize parameter and computational efficiency:
- Dynamic FLOP savings: Adaptive expert activation (via null experts or dynamic thresholds) achieves non-uniform per-token inference cost reductions, with minimal router overhead.
- Minimal code changes: Most AdaMoE formulations require only marginal modifications to standard MoE layers or adapters.
- Hyperparameter sensitivity: Key parameters include the number of experts, null expert fraction, load-balance loss strength, and decay rate for statistical adaptation; these must be tuned to balance sparsity and expressivity.
- Transfer and deployment: AdaMoE frameworks have been successfully integrated into production, serving billions of requests (CTR) and direct real-world policy deployments (robot manipulation).
6. Limitations and Open Challenges
While AdaMoE yields strong empirical gains, certain open issues require further investigation:
- Expert collapse and router stability: Complete collapse to a subset of experts can forfeit intended specialization, especially when auxiliary balancing terms are not sufficiently tuned (Shen et al., 16 Oct 2025).
- Hyperparameter tuning: The effectiveness of adaptive routing is sensitive to the choice of $k$, the number of null experts, and penalty weights, necessitating non-trivial grid searches per task/hardware (Zeng et al., 2024).
- Limited exploration in full pre-training: Most AdaMoE innovations have been validated in fine-tuning or adaptation regimes, rather than end-to-end pre-training of large models (Zeng et al., 2024, Liu et al., 2024).
- Extension to multi-task/continual learning: Mechanisms for out-of-distribution expert selection, budget-aware or hierarchical routing, and long-range temporal gating are under-explored (Shen et al., 16 Oct 2025, Ni et al., 29 Nov 2025).
- Theoretical analysis: While closed-form or statistically optimized weight updates have been derived (for CTR drift (Liu et al., 2022)), broader guarantees on load balancing, specialization, and transferability remain open.
7. Future Directions
Emerging frontiers for AdaMoE research include:
- Generalization to new PEFT and multi-modal adapters: Adapting AdaMoE principles (thresholding, null experts, decoupling) to broader adapter types, including retrieval, prefix-tuning, or memory-augmented modules (Liu et al., 2024).
- Resource-adaptive expert scheduling: Formulating adaptive routing as a constrained optimization over compute/memory/FLOPs budgets.
- Hierarchical and reinforcement-learned routing: Multi-stage or hierarchical gating, as well as reinforcement/bayesian learned thresholds for per-input expert selection (Liu et al., 2024).
- Lifelong and non-stationary regimes: Extending AdaMoE’s dynamic selection mechanisms to continual learning, meta-learning, or explicit concept drift via temporal recurrence or statistics-aware routers (Liu et al., 2022, Ni et al., 29 Nov 2025).
- Hybrid frequency/context statistics: Combining spectral statistics with content-aware routing for hybrid domains (e.g., multimodal event sequence modeling) (Ni et al., 29 Nov 2025).
AdaMoE, as independently realized across neural sequence modeling, decision policy learning, time series forecasting, and online drift adaptation, presents a general framework for context-adaptive, resource-efficient specialization in large models—anchored in precise, decoupled, and often plug-and-play routing innovations.