Mixture-of-Experts Feedforward Layers
- MoE feedforward layers are a modular approach that replaces standard dense MLPs with a sparse mix of expert networks using dynamic routing.
- They decouple parameter count from per-token computation, enabling scalable model capacity and specialized performance in tasks like language and vision.
- Advanced designs incorporate top-K selection, load balancing, and tensor factorization to optimize efficiency, accuracy, and adaptability.
A mixture-of-experts (MoE) feedforward layer is a modularized subnetwork within modern deep learning architectures—most commonly Transformers for language, vision, and multimodal applications—in which the standard dense feedforward block is replaced by a sparse, conditionally activated combination of multiple “expert” networks, with a separate routing function (the “router” or gating network) selecting a small subset of experts for each input. This design paradigm enables scaling model capacity far beyond the limitations of dense computation, decoupling parameter count from per-token computational cost and memory, and allowing for specialization of subnetworks to distinct data regimes or tasks. MoE-based feedforward layers have become foundational in LLMs, vision transformers, advanced universal transformers, and various other domains.
1. Formal Structure of MoE Feedforward Layers
The canonical MoE feedforward layer replaces the classical two-layer MLP
$\mathrm{FFN}(x) = W_2\,\sigma(W_1 x)$
with a modular construction consisting of $N$ experts, each being an independent, typically two-layer MLP
$E_i(x) = W_2^{(i)}\,\sigma(W_1^{(i)} x)$, $i = 1, \dots, N$,
and a router network assigning each input a set of weights or selection probabilities over experts. The general MoE output is
$y = \sum_{i=1}^{N} g_i(x)\, E_i(x)$,
where $g_i(x)$ is determined by the routing mechanism (softmax, top-$K$, sigmoid, etc.) and often only $K \ll N$ of the experts are actually activated per input, achieving conditional computation (Eigen et al., 2013, Riquelme et al., 2021).
Sparse Routing and Top-K Mechanisms
Modern designs enforce sparsity via top-$K$ selection: only the $K$ largest gate values $g_i(x)$ are nonzero. Softmax or alternative activations (see Section 4) are used for score normalization: $g_i(x) = \mathrm{softmax}(h(x))_i$ for $i \in \mathcal{T}$ and $g_i(x) = 0$ otherwise, with $\mathcal{T}$ the set of top-$K$ indices.
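The structure above can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation of a top-K-routed MoE feedforward layer for a single token (dimensions, weight layout, and the renormalized-softmax gate are assumptions for the sketch, not any specific paper's design):

```python
import numpy as np

def moe_forward(x, W1, W2, Wg, k=2):
    """Sparse MoE feedforward for a single token vector x.

    W1: (N, d_ff, d)  first-layer weights of the N experts
    W2: (N, d, d_ff)  second-layer weights
    Wg: (N, d)        router weights; logits = Wg @ x
    Only the top-k experts by router score are evaluated.
    """
    logits = Wg @ x                      # (N,) routing logits
    topk = np.argsort(logits)[-k:]       # indices of the k largest scores
    # Softmax restricted to the selected experts (renormalized gates).
    g = np.exp(logits[topk] - logits[topk].max())
    g /= g.sum()
    # Conditional computation: only k of the N experts actually run.
    y = np.zeros_like(x)
    for gi, i in zip(g, topk):
        h = np.maximum(W1[i] @ x, 0.0)   # expert i, ReLU hidden layer
        y += gi * (W2[i] @ h)
    return y

rng = np.random.default_rng(0)
N, d, d_ff = 8, 16, 64
x = rng.normal(size=d)
y = moe_forward(x, rng.normal(size=(N, d_ff, d)),
                rng.normal(size=(N, d, d_ff)),
                rng.normal(size=(N, d)), k=2)
print(y.shape)  # (16,)
```

Note that the loop touches only $K$ experts, so per-token compute scales with $K$ while the parameter arrays scale with $N$.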
2. Gating Networks and Routing Strategies
The essential hallmark of MoE FFN layers is the decoupling of computation from parameter storage by routing. The gating network $g(x)$ computes expert-selection scores via one of several paradigms:
- Linear or small MLP transformations, optionally with non-linear activations, producing $N$ routing logits (Zheng et al., 30 Sep 2025).
- Softmax-based probabilistic routing: $g(x) = \mathrm{softmax}(W_g x)$, which normalizes the scores onto the probability simplex. This is historically standard (Riquelme et al., 2021, Shen et al., 16 Oct 2025).
- Alternative router activations: ReLU, sigmoid, kernel-derived routers, or product-key based (see Section 4).
- Residual or context-enhanced routing: e.g., MoE++ employs gating residuals from previous routing layers to enhance sequential consistency (Jin et al., 2024).
- Task/token-wise routers: Specialized gating for task-level or token-level multiplexing, facilitating multi-task and multi-modal specialization (Wu et al., 5 Jun 2025).
- Progressive Scaling Routing (PSR): Dynamic expansion of the candidate expert pool during training to encourage balanced and diverse routing (Tan et al., 20 Oct 2025).
Auxiliary regularizations—such as load-balancing losses—are commonly introduced to ensure expert utilization and avoid collapse to a single dominant expert (Riquelme et al., 2021, Sukhbaatar et al., 2024, Shen et al., 16 Oct 2025).
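As an illustration of such a regularizer, the sketch below implements one common variant of the load-balancing loss, in the style popularized by the Switch Transformer (the surveyed papers differ in details; the top-1 dispatch rule and the exact form here are assumptions of the sketch):

```python
import numpy as np

def load_balance_loss(router_logits):
    """Switch-Transformer-style auxiliary load-balancing loss (one common
    variant among several used in the literature).

    router_logits: (T, N) raw router scores for T tokens over N experts.
    Returns N * sum_i f_i * P_i, where f_i is the fraction of tokens
    dispatched (top-1) to expert i and P_i is the mean router probability
    for expert i.  The minimum value 1.0 is attained when routing is
    perfectly balanced; routing collapse drives the loss toward N.
    """
    T, N = router_logits.shape
    z = router_logits - router_logits.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)              # (T, N)
    f = np.bincount(probs.argmax(axis=1), minlength=N) / T  # dispatch fractions
    P = probs.mean(axis=0)                                  # mean router probs
    return N * float(f @ P)

rng = np.random.default_rng(0)
balanced = load_balance_loss(rng.normal(scale=0.01, size=(512, 4)))   # ~1.0
collapsed = load_balance_loss(np.tile([10.0, 0, 0, 0], (512, 1)))     # ~4.0
print(balanced, collapsed)
```

Minimizing this term pushes both the dispatch fractions and the mean router probabilities toward uniform, discouraging collapse onto a single dominant expert.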
3. Architectures, Factorizations, and Scaling
MoE feedforward layers admit substantial architectural diversity beyond basic sparse-expert designs.
Traditional MoE
The classic regime instantiates $N$ full MLPs as experts. Each token, via the router, is assigned to a sparse subset of experts (typically $K = 1$ or $2$), leading to a per-token computational cost scaling as $O(K \cdot d_{\mathrm{model}} \cdot d_{\mathrm{ff}})$, with parameter count scaling as $O(N \cdot d_{\mathrm{model}} \cdot d_{\mathrm{ff}})$ (Riquelme et al., 2021).
Knowledge Factorization and Subnetwork Reuse
Techniques such as FactorLLM (Zhao et al., 2024) demonstrate that a pretrained dense FFN can be partitioned into $N$ non-overlapping "experts" merely by permuting and chunking the weights along the hidden dimension. This allows immediate drop-in construction of an MoE layer, preserving the dense performance with negligible additional parameters and no pretraining of experts. Fine-tuning with a lightweight router and Prior-Approximation loss achieves high performance retention and significant computational acceleration.
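The chunking construction is easy to verify directly: because the hidden units are disjoint and the activation is elementwise, activating every chunked expert exactly reproduces the dense layer. A minimal sketch (bias-free FFN, plain index chunking without permutation, both simplifying assumptions):

```python
import numpy as np

def dense_ffn(x, W1, W2):
    # Standard two-layer FFN with ReLU; biases omitted for clarity.
    return W2 @ np.maximum(W1 @ x, 0.0)

def chunk_into_experts(W1, W2, n_experts):
    """Partition a dense FFN into non-overlapping experts by chunking
    the hidden dimension (FactorLLM-style construction; no expert
    pretraining needed).  Returns a list of (W1_i, W2_i) pairs."""
    W1_chunks = np.split(W1, n_experts, axis=0)  # rows of the up-projection
    W2_chunks = np.split(W2, n_experts, axis=1)  # matching down-proj columns
    return list(zip(W1_chunks, W2_chunks))

rng = np.random.default_rng(0)
d, d_ff, N = 8, 32, 4
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d_ff, d)), rng.normal(size=(d, d_ff))
experts = chunk_into_experts(W1, W2, N)

# Activating all experts exactly reproduces the dense FFN output.
y_moe = sum(dense_ffn(x, w1, w2) for w1, w2 in experts)
assert np.allclose(y_moe, dense_ffn(x, W1, W2))
```

Sparsity (and hence the speedup) then comes from the fine-tuned router selecting only a subset of these chunks per token.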
Multilinear and Tensor Factorizations
μMoE layers use a high-order tensor for the expert bank, with the weight tensor never materialized explicitly but held in low-rank tensor (CP, Tucker, Tensor-Train) decompositions. This yields compact, scalable, and differentiable MoE layers with thousands of sparse experts and no non-differentiable routing or top-$K$ selection (Oldfield et al., 2024).
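The key computational trick is that the gate-weighted mixture can be evaluated by contracting the CP factors directly, without ever materializing the $N \times d_{\mathrm{out}} \times d_{\mathrm{in}}$ expert tensor. A CP-rank-$R$ sketch (the factorization layout here is illustrative, not the exact μMoE parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out, R = 1000, 16, 16, 8   # many experts, small CP rank

# CP factors of the (never materialized) expert tensor:
#   W[i] = sum_r A[i, r] * outer(U[:, r], V[:, r])
A = rng.normal(size=(N, R))
U = rng.normal(size=(d_out, R))
V = rng.normal(size=(d_in, R))

g = rng.dirichlet(np.ones(N))         # dense, differentiable gate weights
x = rng.normal(size=d_in)

# Mixture output via factor contractions: O(R*(N + d_in + d_out)) work
# instead of materializing N separate expert matrices.
y_factored = U @ ((A.T @ g) * (V.T @ x))

# Sanity check against the naive materialized computation.
W = np.einsum('nr,or,ir->noi', A, U, V)       # (N, d_out, d_in)
y_naive = np.einsum('n,noi,i->o', g, W, x)
assert np.allclose(y_factored, y_naive)
```

Because the gates enter through an ordinary matrix product, the whole layer stays differentiable end to end, with no discrete top-$K$ step.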
Infinite and Ultra-Fine-Grained Expert Banks
∞-MoE generalizes the expert space to a continuous domain, with each token stochastically sampling expert indices and each index corresponding to a sparse masking over a shared FFN—a true infinite mixture (Takashiro et al., 25 Jan 2026). Similarly, PEER (He, 2024) implements efficient retrieval from a pool of over one million tiny experts using product-key hashing, with per-token activation fixed at a manageable budget.
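The product-key idea that makes million-expert pools tractable can be sketched compactly: each expert is indexed by a pair of sub-keys, so exact top-$k$ retrieval over $n_1 \times n_2$ experts reduces to two small top-$k$ searches plus a $k \times k$ candidate merge. This is a simplified illustration of the mechanism, not PEER's actual implementation:

```python
import numpy as np

def product_key_topk(q, K1, K2, k=4):
    """Simplified product-key retrieval: expert (a, b) has score
    s1[a] + s2[b], so the global top-k must lie in the k x k grid of
    per-side top-k sub-keys.

    q: (d,) query; K1: (n1, d/2) and K2: (n2, d/2) sub-key tables.
    Returns (expert indices, scores) of the k best experts.
    """
    d = q.shape[0]
    s1, s2 = K1 @ q[: d // 2], K2 @ q[d // 2 :]
    i1 = np.argsort(s1)[-k:]                     # top-k left sub-keys
    i2 = np.argsort(s2)[-k:]                     # top-k right sub-keys
    cand = s1[i1][:, None] + s2[i2][None, :]     # (k, k) candidate scores
    flat = np.argsort(cand.ravel())[-k:]
    r, c = np.unravel_index(flat, cand.shape)
    expert_ids = i1[r] * K2.shape[0] + i2[c]     # pair -> flat expert index
    return expert_ids, cand[r, c]

rng = np.random.default_rng(0)
n1 = n2 = 1024                    # 1024 * 1024 ~ 1M addressable experts
K1 = rng.normal(size=(n1, 32))
K2 = rng.normal(size=(n2, 32))
q = rng.normal(size=64)
ids, scores = product_key_topk(q, K1, K2, k=4)
print(ids.shape)  # (4,)
```

The correctness argument: if a pair's left sub-key is not in the left top-$k$, at least $k$ pairs beat it, so it cannot be in the global top-$k$; hence searching the $k \times k$ grid is exact while costing only $O(n_1 + n_2)$ score evaluations.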
Expert Reuse and Layer-Crossing Routing
ReXMoE (Tan et al., 20 Oct 2025) overcomes the limitations of layer-local routing by allowing routers to select experts from a cross-layer shared pool, significantly enriching the combinatorial potential of expert activation without increasing parameter budget. Progressive scaling curricula address expert under-utilization during routing expansion.
4. Router Function Design and Theoretical Analysis
The selection of the router function is intimately linked to expressivity, trainability, and utilization.
Alternative Routers
- Kernel-inspired Routers (KERN): The KERN router interprets expert selection as kernel regression, unifying softmax (exponential kernel), sigmoid, and FFN-like (ReLU followed by score normalization) gating. KERN demonstrates robust performance, balanced expert gradients, and no extra computational overhead (Zheng et al., 30 Sep 2025).
- Residual and Zero-Computation Experts: MoE++ proposes augmenting standard FFNs with zero-cost experts (discard, copy/skip, constant vector adjustment), allowing massive reductions in active compute without degrading model capacity (Jin et al., 2024).
- Collaboration Mechanisms: AdaMoE decouples expert selection (with the router) from expert weighting (with a scale adapter), allowing for collaborative expert activation and improved specialization (Shen et al., 16 Oct 2025).
Theoretical Expressivity
Recent theoretical work quantifies the function class MoE layers can represent. Shallow MoEs efficiently approximate functions supported on low-dimensional manifolds, overcoming the curse of dimensionality. Deep MoEs stack $L$ layers with $N$ experts each to exponentially increase the number of representable "pieces" (up to $N^L$ for $L$ layers), critical for tasks with compositional or locally sparse structure (Wang et al., 30 May 2025). Architectures with nonlinear gating achieve higher partitioning efficiency.
5. Computational and Efficiency Trade-offs
Conditional computation in MoE FFN layers underpins their scalability.
| Architecture | Parameter Count | Per-Token Compute | Scaling Characteristics |
|---|---|---|---|
| Dense FFN | $O(d_{\mathrm{model}} d_{\mathrm{ff}})$ | $O(d_{\mathrm{model}} d_{\mathrm{ff}})$ | Capacity ∝ compute and memory |
| Sparse MoE ($K$ of $N$) | $O(N d_{\mathrm{model}} d_{\mathrm{ff}})$ | $O(K d_{\mathrm{model}} d_{\mathrm{ff}})$ | Capacity ≫ compute per token |
| Factorized MoE (rank $r$) | $O(r (N + d_{\mathrm{model}} + d_{\mathrm{ff}}))$ | $O(r (d_{\mathrm{model}} + d_{\mathrm{ff}}))$ | Tunable by factorization rank |
| Infinite MoE (∞-MoE, sampling ratio $\rho$) | ≈ dense FFN (shared weights) | $O(\rho\, d_{\mathrm{model}} d_{\mathrm{ff}})$ | Trade accuracy-speed at runtime |
| PEER ($M$ tiny experts) | $O(M)$ (tiny experts) | $O(K)$, $K \ll M$ | Performance scales with $M$ for fixed compute |
Efficient batch assignment, load balancing, and auxiliary loss design are essential to achieve both high expert utilization and predictable throughput. MoE architectures enjoy substantial reductions in inference FLOPs compared to matched dense baselines, often with little to no accuracy loss, and—in settings such as PEER or ∞-MoE—can maintain or even improve accuracy while scaling to unprecedented numbers of experts (Zhao et al., 2024, He, 2024, Takashiro et al., 25 Jan 2026).
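The parameter/compute decoupling in the table can be made concrete with back-of-the-envelope arithmetic. The dimensions below are illustrative choices, not taken from any of the cited systems:

```python
# Per-token FLOPs and parameter counts for a dense FFN versus a sparse
# MoE layer, with illustrative (not paper-specific) sizes.
d_model, d_ff = 4096, 16384
N, K = 64, 2                          # experts total / experts active

dense_params = 2 * d_model * d_ff     # two weight matrices
ffn_flops = 2 * 2 * d_model * d_ff    # two matmuls, 2 FLOPs per MAC

moe_params = N * dense_params         # capacity scales with N
moe_flops = K * ffn_flops             # per-token compute scales only with K

print(f"dense: {dense_params/1e6:6.0f}M params, {ffn_flops/1e6:6.0f}M FLOPs/token")
print(f"MoE  : {moe_params/1e6:6.0f}M params, {moe_flops/1e6:6.0f}M FLOPs/token")
# 64x the parameters of the dense layer at only 2x its per-token compute
# (router cost, O(N * d_model), is negligible at these sizes).
```

This is the sense in which sparse MoE layers decouple capacity from compute: growing $N$ adds parameters and memory but leaves per-token FLOPs fixed by $K$.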
6. Applications, Specializations, and Recent Innovations
MoE FFN layers have been pivotal in:
- LLMs: Conditional compute enables training and deploying LLMs with trillions of parameters without prohibitive compute cost. Techniques such as FactorLLM and BTX enable efficient knowledge specialization, domain adaptation, and rapid mixture construction (Zhao et al., 2024, Sukhbaatar et al., 2024).
- Vision Transformers: V-MoE and μMoE demonstrate that MoE FFNs yield parameter scaling, compute savings, and improved scaling laws for image recognition tasks (Riquelme et al., 2021, Oldfield et al., 2024).
- Multi-modal and Multi-task Models: T2MIR and AdaMoE integrate token-wise and task-wise MoEs for in-context RL and collaborative routing in vision-language-action tasks, achieving substantial gains in task return and real-world experimental benchmarks (Wu et al., 5 Jun 2025, Shen et al., 16 Oct 2025).
- Universal Transformers: MoEUT combines fine-grained MoE FFN layers with shared, recurrent Universal Transformer blocks, outperforming parameter-matched dense Universal Transformers on language modeling and zero-shot evaluations (Csordás et al., 2024).
MoE++ further introduces heterogeneity with zero-computation experts—improving both compute efficiency and deployment friendliness by reducing cross-device communication (Jin et al., 2024).
7. Open Problems and Future Directions
- Expert specialization and modularity: Achieving and maintaining semantically coherent specialization among a large expert bank remains a design and training challenge, with ongoing work in explicit specialization and gating design (Oldfield et al., 2024, Sukhbaatar et al., 2024).
- Load balancing and utilization: Routing collapse is a persistent problem in large-scale MoEs, motivating work on advanced regularizers, collaborative routers (e.g., AdaMoE), and progressive curriculum strategies (PSR) (Shen et al., 16 Oct 2025, Tan et al., 20 Oct 2025).
- Continuous and hierarchical experts: The move toward infinite or hierarchical expert spaces points toward more fluid and fine-grained parameter allocation, circumventing the data sparsity and efficiency bottlenecks of large discrete banks (Takashiro et al., 25 Jan 2026, Oldfield et al., 2024).
- Practical deployment: Engineering optimized kernels for sparse expert lookup (as in PEER), memory-aware expert allocation, and zero-computation experts for deployment on heterogeneous hardware are active areas (He, 2024, Jin et al., 2024).
There is converging evidence that mixture-of-experts FFN layers represent not just an efficiency measure, but a new substrate for scalable, interpretable, and modular deep learning, enabling tractable training and inference at scales and specialization levels previously unreachable by monolithic architectures (Zhao et al., 2024, He, 2024, Tan et al., 20 Oct 2025, Wang et al., 30 May 2025, Oldfield et al., 2024).