
Dynamic-Capacity MoE Models

Updated 26 February 2026
  • Dynamic-Capacity MoE models are architectures that adjust the number and type of activated experts per token based on input complexity and resource constraints.
  • They utilize dynamic gating, capacity-aware token assignment, and heterogeneous expert pools to efficiently balance computational load with task demands.
  • These models achieve enhanced throughput, parameter efficiency, and adaptability across multimodal applications while reducing memory and latency costs.

Dynamic-Capacity MoE models constitute a class of Mixture-of-Experts (MoE) architectures in which the number and/or type of activated experts—hence, the effective capacity—varies dynamically depending on input tokens, resource constraints, task modality, runtime context, or distributional shifts. Unlike static-capacity MoE models, which allocate a fixed subset of experts per token and maintain homogeneous expert size, dynamic-capacity MoEs employ adaptive routing and expert management to more efficiently align computational resources with content complexity or system constraints. Techniques include dynamic gating (e.g., Top-p/Top-K with adaptive thresholds), capacity-aware expert assignment/dropping, heterogeneous experts, hierarchical/stratified routing, and resource-aware aggregation and offloading. This paradigm has been demonstrated to improve computational efficiency, parameter utilization, domain and task adaptability, and deployability in both cloud and edge environments, across language, vision, speech, and multimodal settings.

1. Core Principles and Definitions

Dynamic-capacity MoE refers to any sparsely activated expert framework in which the set or number of active experts per token (per batch or per time window) is not fixed a priori, but is computed on the fly via learned gating or system-level scheduling. Key variants include:

  • Token/Instance-Level Dynamic Routing: The router selects a variable number of experts for each token based on gating probabilities exceeding a Top-p threshold, as in Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025), UniMoE-Audio (Liu et al., 15 Oct 2025), DTop-p MoE (Jin et al., 16 Dec 2025), and Grove MoE (Wu et al., 11 Aug 2025).
  • Capacity-Adaptive Inference: The model enforces per-expert capacity constraints (e.g., a cap of γN̄ tokens per expert), using mechanisms such as token drop, reroute, or overflow discarding to mitigate imbalanced assignment and the "straggler effect" (He et al., 7 Mar 2025).
  • Device/Resource-Aware Capacity: Approaches like SwapMoE (Kong et al., 2023) and CoMoE (Li et al., 10 Aug 2025) dynamically limit the resident expert set in memory or adjust aggregation/offloading policies to satisfy real-time device or memory constraints.
  • Expert-Level Heterogeneity: Models such as Grove MoE (Wu et al., 11 Aug 2025) maintain variable expert sizes and share smaller "adjugate" experts within expert groups to induce per-token capacity allocation depending on the router's (often complexity-sensitive) scores.
  • Dynamic Specialization and Expansion: DynamicMoE (Kim, 24 Nov 2025) and DES-MoE (Li et al., 21 Sep 2025) permit growing or replicating experts in response to distributional shifts or new domains, while restricting fine-tuning to domain-relevant expert subsets.

With these mechanisms, dynamic-capacity MoEs decouple total model capacity from per-token compute, allowing models to scale up parameter counts while maintaining (or even reducing) the compute and memory cost for any given inference.

2. Mechanisms for Dynamic Expert Activation

Contemporary dynamic-capacity MoE models implement a range of expert routing and activation mechanisms:

  • Top-p (Cumulative Probability) Routing: The router softmax outputs are sorted, and the minimal set of experts whose cumulative gating probability exceeds p is selected. This allows for variable, token-dependent sparsity; the number of selected experts k_i may range from 1 (or 0, with the use of null experts) up to N (Li et al., 16 Nov 2025, Liu et al., 15 Oct 2025, Jin et al., 16 Dec 2025). Uni-MoE-2.0-Omni and UniMoE-Audio combine this with hybrid expert pools (routed, shared, null).
  • Controller-Based Adaptive Sparsity: DTop-p MoE (Jin et al., 16 Dec 2025) introduces a proportional-integral (PI) controller to adjust the global Top-p threshold p_t online, maintaining the average number of activated experts per token at a pre-specified target. Per-layer routing normalization (a θ_l-scaled softmax over standardized logits) allows each layer to calibrate its own effective sparsity.
  • Capacity-Aware Token Drop/Reroute: Capacity-Aware Inference (He et al., 7 Mar 2025) imposes hard caps C on expert loads, drops overflow tokens based on router scores, and recirculates dropped tokens to underutilized experts. This can enforce efficiency and load balance without retraining, improving throughput with minimal accuracy loss.
  • Top-k, Stratified, and Hierarchical Gating: Stratified MoE (Xu et al., 2023) employs L-level hierarchical gating with tokens progressing through strata until assigned to a stratum-specific expert, leading to variable computation depth and capacity per token.
  • Resource-Constrained Virtualization: SwapMoE (Kong et al., 2023) and CoMoE (Li et al., 10 Aug 2025) keep only a subset of the total experts — "virtual experts" (SwapMoE) or fused/aggregated experts (CoMoE) — dynamically resident in device memory, using activation histories or prediction to update this subset in sync with routing demands and device state.
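To make the first mechanism concrete, here is a minimal sketch of Top-p (cumulative-probability) routing in plain Python. It is a toy illustration of the general idea, not the implementation from any of the cited papers; the threshold value and expert count are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_experts(logits, p=0.7):
    """Return the minimal set of expert indices whose cumulative
    gating probability exceeds p -- sparsity varies per token."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    chosen, cum = [], 0.0
    for i in ranked:
        chosen.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return chosen

# A confident router activates one expert; a flat one activates several.
assert len(top_p_experts([8.0, 0.0, 0.0, 0.0], p=0.7)) == 1
assert len(top_p_experts([0.0, 0.0, 0.0, 0.0], p=0.7)) == 3
```

The key contrast with Top-k is visible in the two assertions: the same threshold p yields one active expert for a peaked router distribution and three for a uniform one.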

The following table summarizes key mechanisms and their control parameters:

| Model / Method | Dynamicity Axis | Control Variable(s) |
|---|---|---|
| Uni-MoE-2.0-Omni | Per-token, Top-p | Cumulative threshold p; null/shared experts |
| DTop-p MoE | Per-token, per-layer | PI-controlled p_t; per-layer scale θ_l |
| Capacity-Aware Inference | Per-expert, per-batch | Capacity cap γN̄; token drop/reroute |
| SwapMoE / CoMoE | Memory/resource | Resident experts m_l ≤ n_l; expert aggregation/offload |
| Grove MoE | Expert size, grouping | Router scores; group adjugate experts |
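The PI-controlled threshold adaptation can be sketched as a single controller step. This is an illustrative textbook PI update, not DTop-p's published update rule; the gains kp and ki are made-up values for the example.

```python
def pi_update(p_t, avg_activated, target, integral, kp=0.01, ki=0.001):
    """One PI-controller step on the global Top-p threshold: if tokens
    activate more experts than the target on average, lower p_t; if
    fewer, raise it. Gains kp/ki are illustrative, not tuned values."""
    error = target - avg_activated      # positive -> too few experts active
    integral += error
    p_t = min(max(p_t + kp * error + ki * integral, 0.0), 1.0)
    return p_t, integral

# A batch averaged 3.2 active experts against a target of 2.0, so the
# threshold is pushed down and later batches activate fewer experts.
p_new, _ = pi_update(p_t=0.8, avg_activated=3.2, target=2.0, integral=0.0)
assert p_new < 0.8
```

Running this loop once per batch keeps the realized average sparsity near the target even as routing statistics drift across tasks or training phases.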

3. Architectures and Model Designs

While the canonical MoE layer replaces the standard feed-forward network in Transformer-style models with a sparsely activated ensemble of experts, dynamic-capacity MoEs extend this basic construction via:

  • Hybrid Expert Pools: Many models now partition the experts in each MoE layer into (i) routed experts, which are dynamically activated; (ii) always-on shared experts, sized down to ensure universal capacity per token; and (iii) null experts, which represent a learnable (or parameter-free) zero-output computation slot for computation skipping (Li et al., 16 Nov 2025, Liu et al., 15 Oct 2025).
  • Hierarchical or Stratified Routing: SMoE (Xu et al., 2023) arranges experts into multiple strata, with higher strata offering more specialized or costly computation. The router traverses these levels, assigning more difficult tokens to deeper strata, thereby effecting a dynamic "depth" per token.
  • Heterogeneous/Sparse Parametrization: Grove MoE (Wu et al., 11 Aug 2025) explicitly conditions expert size on group membership and reuses partial computations (adjugate experts) to reduce incremental compute when multiple experts in the same group are activated.
  • Growth and Cloning: DynamicMoE (Kim, 24 Nov 2025) and DES-MoE (Li et al., 21 Sep 2025) expand the set of experts in response to detected distribution shifts (manually, via curriculum boundaries or via statistics). DES-MoE further isolates expert updates for each domain, constructing a domain-expert mapping matrix and duplicating multi-domain experts as necessary.
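The hybrid expert pool of the first bullet can be sketched as a forward pass that always applies the shared expert, weights the routed experts by their gate scores, and skips compute for null slots. The expert functions and gate format below are toy stand-ins, not any model's actual FFN blocks or router output.

```python
def hybrid_moe_forward(x, routed_experts, shared_expert, gate_weights):
    """Combine an always-on shared expert with dynamically routed experts.
    A gate entry with index None stands in for a null (zero-output)
    expert, so its slot costs no expert computation at all."""
    out = shared_expert(x)                      # universal per-token capacity
    for idx, w in gate_weights:
        if idx is None:                         # null expert: skip compute
            continue
        out += w * routed_experts[idx](x)
    return out

experts = [lambda x: 2 * x, lambda x: -x]       # toy "routed experts"
shared = lambda x: 0.5 * x                      # toy "shared expert"
# Token routed to expert 0 (weight 0.6) and to a null slot (weight 0.4).
y = hybrid_moe_forward(1.0, experts, shared, [(0, 0.6), (None, 0.4)])
assert y == 0.5 + 0.6 * 2.0
```

Routing probability mass assigned to null experts directly translates into skipped computation, which is how these designs realize per-token dynamic capacity.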

In deployment and resource-constrained settings, architectures integrate dynamic loading/swapping of expert weights, capacity-adaptive caching in GPU/CPU tiers, and prediction-guided scheduling (Kong et al., 2023, Li et al., 10 Aug 2025).

4. Training Paradigms and Capacity Calibration

Dynamic-capacity MoEs introduce several training-time challenges, especially for gradient flow through non-differentiable dynamic routing operators and for ensuring balanced expert utilization:

  • Gradient Estimation: Dynamic Top-p selection is non-differentiable. Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025) employs straight-through estimators augmented by Euler/Heun ODE solvers for unbiased backpropagation through Top-p gating masks.
  • Auxiliary Losses: Load-balancing losses (entropy maximization or mean-square balancing) are used across works to regularize the router toward uniform expert utilization and prevent "expert collapse" (Liu et al., 15 Oct 2025, Xu et al., 2023).
  • Curriculum and Multi-Phase Training: Multi-stage pipelines, e.g., UniMoE-Audio (Liu et al., 15 Oct 2025) and DES-MoE (Li et al., 21 Sep 2025), separate domain-specific pre-training of proto-experts (ensuring specialist competence), integration/warmup of MoE-specific modules (router, shared experts), and synergistic global fine-tuning with scheduled or adaptive freezing of weights. DES-MoE further leverages distillation from frozen routers and progressive gradient masking to focus expert updates.
  • Capacity Targeting and Control Loops: DTop-p MoE (Jin et al., 16 Dec 2025) employs a PI controller alongside per-layer normalization, dynamically adjusting the routing threshold per batch to maintain a prespecified sparsity target, thus providing robust and compute-efficient training independent of the variability in routing statistics across tasks, model scales, or training phases.
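A mean-square load-balancing penalty of the kind named above can be sketched in a few lines. This is a generic formulation on hard assignment counts, for illustration only; the cited works define their losses on soft router probabilities with model-specific coefficients.

```python
def balance_loss(assignment_counts):
    """Mean-square load-balancing penalty: the squared deviation of each
    expert's load fraction from the uniform 1/N share. Zero iff the
    assignment is perfectly balanced across experts."""
    n = len(assignment_counts)
    total = sum(assignment_counts)
    return sum((c / total - 1.0 / n) ** 2 for c in assignment_counts)

assert balance_loss([10, 10, 10, 10]) == 0.0          # balanced: no penalty
assert balance_loss([40, 0, 0, 0]) > balance_loss([20, 10, 5, 5])
```

Adding such a term to the training objective pushes the router away from degenerate "expert collapse" solutions in which a few experts absorb nearly all tokens.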

Quantitative results indicate that such calibration can yield best-in-class parameter efficiency: for instance, DTop-p MoE achieves ~1.9-point improvements in zero-shot accuracy over Top-k at matched compute (Jin et al., 16 Dec 2025), while SMoE achieves equivalent or better BLEU scores than much larger vanilla MoEs in machine translation (Xu et al., 2023).

5. Memory, Latency, and Deployment Considerations

Dynamic-capacity MoEs have enabled MoE inference in previously infeasible environments by supporting tunable parameter residency, adaptive expert aggregation, and intelligent offloading. Exemplary approaches include:

  • Virtual Expert Subsetting: SwapMoE (Kong et al., 2023) and CoMoE (Li et al., 10 Aug 2025) implement mechanisms for dynamically keeping only the most "important" experts — as scored by cumulative router output or predicted need — in high-speed device memory. The rest are swapped in/out asynchronously or fetched from disk/CPU.
  • Prefetching and Multitier Scheduling: CoMoE deploys next-layer expert prediction (a two-layer MLP) to anticipate future needs and trigger efficient prefetch, overlapping compute with I/O.
  • Joint Aggregation/Offload Optimization: CoMoE introduces an explicit multi-objective optimization to select among candidate expert aggregation (fusion) and offloading (prefetch, eviction) policies, subject to device memory and performance constraints and network volatility.
  • Resource Adaptation: At run-time, CoMoE can switch between different aggregated MoE models (with different numbers of retained experts per layer) in response to significant device or network resource changes, ensuring continued inference availability while maximizing throughput and accuracy.
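The virtual-expert idea can be sketched as a small residency manager that keeps only the most frequently routed experts "in memory" and evicts the least-used one when capacity is exceeded. This is a toy frequency-based cache to illustrate the mechanism; the actual SwapMoE/CoMoE policies use router-score importance, prediction, and asynchronous swapping.

```python
from collections import Counter

class ExpertCache:
    """Toy resident-expert manager: keep at most `capacity` experts in
    fast memory (modeled by membership in `resident`); on a miss, evict
    the least-frequently-routed resident expert and 'load' the new one."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.usage = Counter()
        self.resident = set()

    def route(self, expert_id):
        self.usage[expert_id] += 1
        if expert_id not in self.resident:
            if len(self.resident) >= self.capacity:
                # Evict the least-used resident expert.
                victim = min(self.resident, key=lambda e: self.usage[e])
                self.resident.discard(victim)
            self.resident.add(expert_id)        # "swap in" from CPU/disk

cache = ExpertCache(capacity=2)
for e in [0, 0, 0, 1, 2]:       # expert 0 stays hot; 1 is evicted for 2
    cache.route(e)
assert 0 in cache.resident and len(cache.resident) == 2
```

The memory footprint is bounded by `capacity` regardless of the total expert count, which is the property that makes large MoEs deployable on constrained devices.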

Memory reductions of up to 70% and inference-latency improvements of 10.5% over prior offloading methods are reported for edge deployments of large MoE models (e.g., Switch-Base-128) using these techniques (Li et al., 10 Aug 2025). SwapMoE, on Jetson-class hardware, enables MoE models with >32 experts to run with only 20–33% of the original memory footprint, at minimal accuracy drop (Kong et al., 2023).

6. Application Domains and Performance Impact

Dynamic-capacity MoEs have been deployed and validated in diverse domains:

  • Foundation Models (Text/Lang): Uni-MoE-2.0-Omni (Li et al., 16 Nov 2025), Grove MoE (Wu et al., 11 Aug 2025), and DES-MoE (Li et al., 21 Sep 2025) report state-of-the-art or competitive performance on more than 80 multimodal and multi-domain benchmarks, including reasoning, coding, law, and translation. Dynamic capacity allows these models to adapt to domain shifts, prevent catastrophic forgetting, and retain generalist abilities under continual fine-tuning.
  • Multimodal Generation and Understanding: Hybrid dynamic-topology MoEs with routed/shared/null experts enable adaptive cross-modal alignment and efficient generation in text-image, video, and speech domains (Li et al., 16 Nov 2025, Liu et al., 15 Oct 2025).
  • Speech and Audio: SpeechMoE (You et al., 2021) utilizes dynamic one-hot expert routing, mean-importance balancing, and sparse-load regularization for improved parameter utilization and character error rate at matched compute.
  • Continual Learning and RL: DynamicMoE (Kim, 24 Nov 2025) demonstrates near-zero loss of plasticity after severe domain shifts by adding experts at every phase boundary, exploiting the modularity of the gating mechanism and the parameter efficiency of bottlenecked experts.
  • Memory/Budget-Limited Inference: SwapMoE and CoMoE (Kong et al., 2023, Li et al., 10 Aug 2025) validate dynamic-capacity deployment strategies in real-time, memory-constrained settings, extending viability of SOTA MoEs to edge devices.

7. Parameter Efficiency, Trade-offs, and Open Challenges

Dynamic-capacity MoEs consistently demonstrate higher parameter efficiency, throughput, and flexibility relative to static MoE and dense models, with representative results summarized below:

| Metric / Scenario | Static MoE | Dynamic-Capacity MoE | Relative Impact |
|---|---|---|---|
| Throughput (Uni-MoE-2.0-Omni) | Baseline (Top-2) | 1.7× (dynamic Top-p + null) | ~40% FLOP reduction, similar or better accuracy (Li et al., 16 Nov 2025) |
| BLEU (MT, SMoE) | 32.00 | 32.93 (~8% higher per FLOP) | Matches larger static MoE at half the parameter count (Xu et al., 2023) |
| Param. usage (Grove MoE) | Fixed per token | 3.14–3.28B/token avg., token-dependent | 5–20% lower avg. FLOPs, higher reasoning accuracy (Wu et al., 11 Aug 2025) |
| Memory cost (SwapMoE, CoMoE) | 14.2 GB | 4.7 GB | 67–70% reduction, minimal accuracy loss (Kong et al., 2023, Li et al., 10 Aug 2025) |
| Accuracy loss (Capacity-Aware Inference) | — | <1% (vs. capped), −2% (hard drop) | ~30% lower straggler latency, high throughput (He et al., 7 Mar 2025) |

Trade-offs include tuning complexity (PI controller gains, load balancer coefficients), potential for expert collapse in poor regularization regimes, variable per-token latency, and, in growth-based models, unbounded parameter count if expert pruning is not implemented.

Open challenges concern: fully online distribution shift detection for expert growth/pruning; further scaling to trillion-parameter MoEs; fusion of multi-granular token complexity signals in routing; and harmonizing expert activation with hardware/caching constraints for frontier AI systems.


The field of Dynamic-Capacity MoE continues to evolve, with ongoing research targeting optimally adaptive expert selection, improved routing regularization, hybrid/hierarchical expert pools, and seamless integration with multimodal, continually learning, and resource-heterogeneous environments (Li et al., 16 Nov 2025, Jin et al., 16 Dec 2025, He et al., 7 Mar 2025, Li et al., 10 Aug 2025, Wu et al., 11 Aug 2025, Xu et al., 2023, Liu et al., 15 Oct 2025, You et al., 2021, Kim, 24 Nov 2025, Li et al., 21 Sep 2025).
