Dynamic Top-p MoE Routing
- Dynamic Top-p MoE is a routing paradigm for sparse Mixture-of-Experts that adaptively adjusts expert activation based on model confidence and token complexity.
- It employs a dynamic Top-p threshold controlled by a Proportional-Integral (PI) feedback mechanism to maintain a strict compute budget, combined with per-layer logit normalization for layer-wise adaptivity.
- The method outperforms traditional Top-K and fixed Top-p routing, offering improved resource utilization and enhanced performance in large-scale language and vision models.
Dynamic Top-p Mixture-of-Experts (DTop-p MoE) is a routing paradigm for sparse Mixture-of-Experts (MoE) neural architectures that adaptively determines the number of activated experts per input based on model confidence, with mechanisms for precise sparsity control and layer-wise adaptivity. DTop-p MoE unifies computational efficiency, per-token flexibility, and performance, systematically outperforming conventional Top-K and fixed Top-p routing. It has become a reference solution for scalable model pretraining, large-scale LLMs, and adaptive compute allocation in vision and diffusion transformers (Huang et al., 2024, Jin et al., 16 Dec 2025).
1. Background and Motivations
Standard Top-K routing in MoE systems dispatches every token to exactly $K$ experts per layer, resulting in fixed computational patterns regardless of token or task difficulty. While this approach offers predictable FLOPs and is hardware-efficient, it is suboptimal for tokens that are either simple (underutilizing capacity) or challenging (potentially under-resourced).
Fixed-threshold Top-p MoE, in which the smallest set of experts whose cumulative router probability exceeds a threshold is activated, offers per-token adaptivity. However, it cannot enforce a strict budget on average expert utilization and exhibits high sensitivity to the choice of $p$, often overshooting desired computational resources as router entropy evolves through training (Jin et al., 16 Dec 2025).
Dynamic Top-p MoE addresses these limitations by coupling Top-p routing with adaptive threshold control, exact budget guarantees, and robust handling of per-token difficulty. This results in (1) precise compute usage, (2) token- and layer-wise allocative flexibility, and (3) stable and robust model behavior across training and inference regimes.
2. Dynamic Expert Selection Mechanism
The DTop-p MoE selection mechanism replaces the fixed-$K$ rule with a confidence-based dynamic criterion. Let $x$ denote a token's input representation. The router projects $x$ to routing logits $z \in \mathbb{R}^{E}$ (one per expert), from which a softmax yields expert probabilities:
$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{E} \exp(z_j)}, \qquad i = 1, \dots, E.$$
Experts are ranked by $p_i$ in descending order, and the minimum number $k$ is selected such that the cumulative sum of the top-ranked probabilities satisfies $\sum_{i=1}^{k} p_{(i)} \ge p$, where $p_{(1)} \ge p_{(2)} \ge \cdots$ denote the sorted probabilities and $p$ is the current threshold. If the maximum probability $p_{(1)}$ already exceeds $p$, a single expert is used. The resulting set $S$ of top-ranked experts are the activated experts, with gating weights $g_i = p_i / \sum_{j \in S} p_j$ for $i \in S$ and $g_i = 0$ otherwise.
The resulting MoE layer output is
$$y = \sum_{i \in S} g_i\, E_i(x),$$
where $E_i(x)$ denotes the output of expert $i$.
This method naturally dispatches more experts to "hard" examples (i.e., input tokens with low maximum router confidence and high task complexity) and fewer to "easy" ones, enabling adaptive allocation of capacity without wasting resources (Huang et al., 2024).
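A minimal sketch of this selection rule, written in PyTorch-style Python with hypothetical function and variable names (not the authors' reference implementation), is:

```python
import torch

def dtopp_select(logits: torch.Tensor, p: float):
    """Dynamic Top-p expert selection for a batch of tokens.

    logits : [num_tokens, num_experts] raw (or normalized) router logits.
    p      : current Top-p threshold, tuned externally (e.g., by a PI controller).
    Returns a 0/1 expert mask and renormalized gating weights,
    both of shape [num_tokens, num_experts].
    """
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)

    # Keep the smallest prefix of experts whose cumulative probability
    # reaches p; shifting the cumulative sum by one position guarantees the
    # top-1 expert is always kept, even when its probability alone exceeds p.
    cum_excl = torch.cumsum(sorted_probs, dim=-1) - sorted_probs
    keep_sorted = (cum_excl < p).to(probs.dtype)

    # Scatter the keep flags back to the original expert ordering.
    keep = torch.zeros_like(probs).scatter(-1, sorted_idx, keep_sorted)

    # Renormalize the selected experts' probabilities into gating weights.
    gates = probs * keep
    gates = gates / gates.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return keep, gates
```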
3. Threshold Control and Sparsity Management
To resolve the challenge of matching average expert activation to a strict global compute budget, DTop-p MoE incorporates a Proportional-Integral (PI) feedback controller that dynamically tunes the Top-p threshold $p$ to maintain a target sparsity ratio $\rho^{\ast} = K_{\text{target}}/E$ (where $K_{\text{target}}$ is the target expert count and $E$ the total number of experts).
Specifically, after each batch, the running average of activated experts per token $\bar{K}$ and the corresponding sparsity $\hat{\rho} = \bar{K}/E$ are computed. The controller adjusts the threshold as
$$p \leftarrow p + k_P\, e + k_I \sum_{t} e_t, \qquad e = \rho^{\ast} - \hat{\rho},$$
with $k_P$ and $k_I$ as controller gains tuned to eliminate bias and balance reactivity versus stability. The threshold is then clipped to a feasible range to maintain valid selection. This loop enforces a precise average expert count, even as routing statistics evolve with model and data scale (Jin et al., 16 Dec 2025).
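A minimal sketch of this threshold update, assuming the error is measured in normalized (sparsity) terms and using illustrative gains and clipping bounds rather than the paper's values:

```python
def update_threshold(p: float, err_int: float, mean_active: float,
                     k_target: float, k_p: float = 0.05, k_i: float = 0.005):
    """One PI step on the Top-p threshold (illustrative gains, not the paper's).

    If the measured mean number of active experts per token falls below the
    target, the threshold rises so more experts clear the cumulative-probability
    cutoff, and vice versa; the integral term removes steady-state bias as
    routing statistics drift during training.
    """
    err = (k_target - mean_active) / k_target   # error in normalized (sparsity) terms
    err_int += err                               # integral state accumulates past errors
    p = p + k_p * err + k_i * err_int
    p = min(max(p, 1e-3), 1.0 - 1e-3)            # clip to keep the threshold feasible
    return p, err_int
```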
4. Dynamic Routing Normalization and Layer-wise Adaptation
A single global Top-p threshold does not account for varying entropy and scaling properties of routing logits across model layers. DTop-p MoE addresses this by normalizing the raw logits $z^{(\ell)}$ at each layer $\ell$,
$$\tilde{z}^{(\ell)} = \tau^{(\ell)} \cdot \mathrm{Norm}\!\left(z^{(\ell)}\right),$$
where $\tau^{(\ell)}$ is a learnable scalar applied per layer. This enables each layer to calibrate the "temperature" of its gate distributions, so that a single global $p$ can still realize diverse per-layer sparsity levels. Early layers typically develop sharper gate distributions (activating fewer experts), while deeper layers display higher entropy and allocate more experts, all under a strict aggregate sparsity constraint.
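As a sketch, assuming $\mathrm{Norm}(\cdot)$ standardizes the logits over the expert dimension (the specific normalization used is an assumption here), a per-layer router with a learnable scale could look like:

```python
import torch
import torch.nn as nn

class LayerScaledRouter(nn.Module):
    """Router with per-layer logit normalization and a learnable scale.

    The learnable scalar `tau` acts as an inverse temperature: each layer can
    sharpen or flatten its gate distribution, so a single global Top-p
    threshold still yields different per-layer expert counts.
    """
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)
        self.tau = nn.Parameter(torch.ones(1))   # per-layer learnable scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(x)                          # raw routing logits
        # Assumed normalization: zero mean, unit variance over the expert dim.
        z = (z - z.mean(dim=-1, keepdim=True)) / (z.std(dim=-1, keepdim=True) + 1e-6)
        return torch.softmax(self.tau * z, dim=-1)  # calibrated gate probabilities
```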
5. Algorithmic Workflow
A typical batch-wise forward and threshold-update procedure is as follows:
- For each token and layer:
- Project the token representation to logits $z$, normalize and scale the logits, and compute expert probabilities via softmax.
- Sort the probabilities and select the minimal set $S$ whose cumulative probability meets or exceeds the current threshold $p$.
- Experts in $S$ are activated with normalized gating weights.
- After all tokens and layers:
- Compute the batch-average activated experts per token $\bar{K}$, the sparsity $\hat{\rho} = \bar{K}/E$, and the error $e = \rho^{\ast} - \hat{\rho}$.
- Update the threshold $p$ via the PI rule.
- Backpropagation proceeds through routing and expert weights, except for $p$, which is not differentiated.
Pseudocode for this procedure appears in (Jin et al., 16 Dec 2025).
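For illustration only, the sketches above can be combined into a self-contained toy loop on random logits; the shapes, gains, and random-logit scale are arbitrary assumptions, and the actual procedure operates on router outputs inside the training loop:

```python
import torch

torch.manual_seed(0)
num_experts, k_target = 64, 8      # e.g. a 64E8A configuration
p, err_int = 0.5, 0.0              # initial threshold and integral state

for step in range(200):
    # Stand-in for one batch of router logits (1024 tokens).
    logits = torch.randn(1024, num_experts) * 2.0
    keep, gates = dtopp_select(logits, p)
    mean_active = keep.sum(dim=-1).mean().item()
    # Threshold update happens after the batch, outside autograd.
    p, err_int = update_threshold(p, err_int, mean_active, k_target)

print(f"final threshold p = {p:.3f}, avg experts/token = {mean_active:.2f}")
```

How quickly the measured activation settles at the target depends on the controller gains; the values above are illustrative, not tuned.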
6. Empirical Results and Scaling Analysis
DTop-p MoE demonstrates consistent improvements over Top-K and fixed-Top-p across model sizes (0.4B–13.6B parameters), datasets (RedPajama, DCLM-Baseline, large-scale vision data), and tasks (PIQA, HellaSwag, ARC, BBH, MMLU, DiT denoising) (Huang et al., 2024, Jin et al., 16 Dec 2025). On a 3.5B-parameter, 24-layer Transformer with 16 experts per layer, DTop-p achieves average zero-shot accuracy gains over Top-2 routing while activating fewer parameters, with the largest improvements on tasks requiring advanced reasoning (BBH).
Sparsity remains tightly controlled: with a target of 8 experts (64E8A MoE), DTop-p maintains approximately 8 active experts per token throughout training. In contrast, fixed-threshold Top-p routing can overshoot and activate up to 12 experts.
Scaling analyses reveal:
- Layer-wise allocation: Shallow layers converge to 1–2 experts per token; deep layers to 12 or more. Top-K and fixed-p routing cannot replicate this.
- Expert granularity: DTop-p's gains over Top-K increase as the number of experts grows, with the largest margins at the 128E16A configuration.
- Model and data regimes: Larger models and datasets amplify performance margins.
- Resource precision: DTop-p achieves stable compute use, preventing unexpected FLOPs drift and resource spikes.
7. Implications, Limitations, and Future Directions
DTop-p MoE demonstrates the advantage of integrating controller-based budget enforcement with per-token and per-layer adaptivity. This balance is unattainable with either Top-k or naive Top-p alone. The approach elucidates the hierarchical nature of expert allocation: early network layers benefit from wider ensembles for feature extraction, while later layers economize, reducing overfitting and unnecessary compute.
Limitations include the need to retune controller parameters for new architectures or modalities, and the absence of evaluation at extreme (trillion-scale) parameter and token regimes. Interactions with heterogeneous expert design, differentiable gating, or reinforcement-based routing are pending areas of study. The PI controller could be extended (e.g., to PID or adaptive Bayesian schemes) for faster threshold convergence and automatic hyperparameter selection (Jin et al., 16 Dec 2025).
A plausible implication is that future MoE research may formalize combinatory dynamic-routing architectures combining DTop-p with expert choice, nonuniform expert parameterization, or other advanced gating/topology methods. The mechanism's robust scaling properties position it as a standard for foundation model pretraining and deployment in both language and vision domains.
References:
- "Harder Tasks Need More Experts: Dynamic Routing in MoE Models" (Huang et al., 2024)
- "Sparsity-Controllable Dynamic Top-p MoE for Large Foundation Model Pre-training" (Jin et al., 16 Dec 2025)
- For comparison: "ToMoE: Converting Dense LLMs to Mixture-of-Experts through Dynamic Structural Pruning" (Gao et al., 25 Jan 2025)