
Linear-MoE Architecture

Updated 5 December 2025
  • Linear-MoE is a sparse mixture-of-experts framework that uses a gating network to assign inputs to linear expert mappings, ensuring efficient data partitioning.
  • The architecture offers strong universal approximation properties and robust scaling laws for both parametric and nonparametric function approximation.
  • Innovations like ERMoE routing and shared expert modules enhance expressivity, reduce latency, and maintain balanced utilization in large-scale models.

A Linear-MoE (Mixture-of-Experts with linear experts) architecture couples classical mixture-of-experts modeling with sparse, high-capacity neural architectures via a gating network that assigns inputs to expert linear mappings. This approach yields models that combine strong theoretical guarantees, architectural modularity, and scalable, efficient implementation for parametric and nonparametric function approximation, sequence modeling, and large-scale transformer-based systems. The following sections cover the core structure, theoretical properties, sparse routing, integration with sequence and transformer models, scaling laws, and recent architectural innovations.

1. Core Architecture and Mathematical Specification

A Linear-MoE consists of two primary components: a gating network and a collection of $K$ expert networks, each parameterizing a (vector-valued) linear map. For input $x \in \mathbb{R}^p$, the model operates as follows:

  • Gating Network: Computes mixture weights

$$g_i(x) = \frac{\exp(w_i^\top x + c_i)}{\sum_{j=1}^K \exp(w_j^\top x + c_j)}, \quad i = 1, \ldots, K$$

where $w_i \in \mathbb{R}^p$ and $c_i \in \mathbb{R}$ are gating parameters, enforcing $g_i(x) > 0$ and $\sum_i g_i(x) = 1$.

  • Expert Networks: In the scalar-output case, each expert computes a linear response

$$f_i(x) = a_i^\top x + b_i, \quad a_i \in \mathbb{R}^p,\ b_i \in \mathbb{R}$$

For multi-output $\mathbb{R}^q$, experts generalize to

$$f_i(x) = A_i x + b_i, \quad A_i \in \mathbb{R}^{q \times p},\ b_i \in \mathbb{R}^q$$

  • Model Output: The final prediction is the mixture

$$\mu(x) = \sum_{i=1}^K g_i(x)\, f_i(x)$$

or, if modeling conditional densities,

$$p(y \mid x) = \sum_{i=1}^K g_i(x)\, \mathcal{N}(y;\, A_i x + b_i,\, C_i)$$

where $C_i \in \mathbb{R}^{q \times q}$ is a positive-definite covariance (Nguyen et al., 2017).

The parameter counts scale as follows:

| Component | Parameters per Expert | Total Parameters |
|---|---|---|
| Gating (softmax) | $w_i \in \mathbb{R}^p$, $c_i \in \mathbb{R}$ | $K(p+1)$ |
| Expert | $A_i \in \mathbb{R}^{q \times p}$, $b_i \in \mathbb{R}^q$ | $K q (p+1)$ |
| Covariance (optional) | $C_i$: $q(q+1)/2$ | $K q(q+1)/2$ |
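
To make the specification concrete, the following is a minimal NumPy sketch of the dense (non-sparse) forward pass: softmax gating over $K$ affine experts and the mixture mean $\mu(x)$. The function and parameter names (`linear_moe_forward`, `W_g`, `c_g`, `A`, `b`) are illustrative choices for this sketch, not identifiers from the cited work.

```python
import numpy as np

def linear_moe_forward(x, W_g, c_g, A, b):
    """Dense Linear-MoE forward pass (mixture mean, no sparsity).

    x   : (p,)       input vector
    W_g : (K, p)     gating weights w_i
    c_g : (K,)       gating biases c_i
    A   : (K, q, p)  expert matrices A_i
    b   : (K, q)     expert biases b_i
    Returns mu(x) = sum_i g_i(x) * (A_i x + b_i), shape (q,).
    """
    logits = W_g @ x + c_g                           # (K,) gating logits
    logits -= logits.max()                           # numerical stability
    g = np.exp(logits) / np.exp(logits).sum()        # softmax gate, sums to 1
    expert_out = np.einsum("kqp,p->kq", A, x) + b    # (K, q) per-expert A_i x + b_i
    return (g[:, None] * expert_out).sum(axis=0)     # (q,) mixture mean

# Example: p=4 inputs, q=2 outputs, K=3 experts
rng = np.random.default_rng(0)
p, q, K = 4, 2, 3
mu = linear_moe_forward(
    rng.normal(size=p),
    rng.normal(size=(K, p)), rng.normal(size=K),
    rng.normal(size=(K, q, p)), rng.normal(size=(K, q)),
)
print(mu.shape)  # (2,)
```

Replacing the mixture mean with a mixture of Gaussians centered at the expert outputs, with covariances $C_i$, recovers the conditional-density form given above.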

2. Theoretical Properties and Approximation Results

Linear-MoE architectures possess strong universal approximation properties for both mean functions and conditional densities:

  • Density Approximation: For any collection of true conditional marginal densities $g_{Y_j \mid X}(y_j \mid x)$ on compact $X \subset \mathbb{R}^p$, there exists a Linear-MoE (with sufficiently large $K$) that approximates each $p(y_j \mid x)$ to arbitrary precision in conditional KL-divergence, given suitable smoothness and positivity conditions (Nguyen et al., 2017).
  • Mean-Function Denseness: The model class forms a dense subset of the space of continuous vector-valued functions. For any continuous $\mu: X \to \mathbb{R}^q$ and $\epsilon > 0$, one can choose $K$ and parameters so that

$$\|\hat{\mu} - \mu\|_{q,\infty} = \max_{x \in X} \sum_{j=1}^q |\hat{\mu}_j(x) - \mu_j(x)| < \epsilon$$

Thus, a Linear-MoE can approximate arbitrary continuous multivariate regression functions (Nguyen et al., 2017).

These properties are enabled by closure under summation (for mean functions) and permutation (for conditional densities) of the MoE class, allowing one to build multivariate approximators by composition of independent univariate MoEs and mixture-of-Gaussians (Nguyen et al., 2017).

3. Sparse Routing and Load Balancing

The canonical sparsity procedure is top-$k$ gating, in which, for each token or input, the $k$ experts with the highest gating probabilities $g_i(x)$ are selected. The normalized weights for the chosen subset $\mathcal{S}_x$ are:

$$w_e(x) = \frac{\max\{g_e(x), 0\}}{\sum_{e' \in \mathcal{S}_x} \max\{g_{e'}(x), 0\}}$$

Strict capacity constraints are enforced by this per-token top-$k$ selection (Harvey et al., 19 Jun 2025, Sun et al., 7 Mar 2025).

To promote balanced expert usage and avoid "stragglers," an auxiliary load-balancing loss term is often introduced:

$$\mathcal{L}_{\mathrm{aux}} = \alpha_{\mathrm{aux}}\, E \sum_{e=1}^E f_e\, g_e \quad\text{or}\quad \mathcal{L}_{\mathrm{aux}} = \alpha_{\mathrm{aux}}\, E \sum_{e=1}^E (g_e)^2 f_e$$

where $f_e$ is the fraction of tokens routed to expert $e$, $g_e$ is the average routing probability for expert $e$, and $\alpha_{\mathrm{aux}}$ is a hyperparameter (Harvey et al., 19 Jun 2025). This regularizer encourages uniform utilization but can degrade specialization if its weight is set too high (Cheng et al., 14 Nov 2025).
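
The routing and balancing terms above can be sketched as follows; this is an illustrative NumPy implementation under the stated definitions, with `topk_route` and `load_balancing_loss` as hypothetical helper names rather than functions from any of the cited systems.

```python
import numpy as np

def topk_route(gate_probs, k):
    """Per-token top-k selection with renormalized weights.

    gate_probs : (T, E) softmax gating probabilities g_i(x) per token
    k          : number of experts activated per token
    Returns (indices, weights), each (T, k); weights renormalize the
    selected probabilities to sum to 1 per token (over S_x).
    """
    idx = np.argsort(-gate_probs, axis=-1)[:, :k]        # top-k expert indices
    sel = np.take_along_axis(gate_probs, idx, axis=-1)   # their probabilities
    w = sel / sel.sum(axis=-1, keepdims=True)            # renormalize over S_x
    return idx, w

def load_balancing_loss(gate_probs, idx, alpha=0.01):
    """Auxiliary loss alpha * E * sum_e f_e * g_e (first form above)."""
    T, E = gate_probs.shape
    f = np.bincount(idx.ravel(), minlength=E) / idx.size  # fraction of routing slots sent to e
    g = gate_probs.mean(axis=0)                           # mean routing probability per e
    return alpha * E * float(np.sum(f * g))

# Example: 16 tokens, 8 experts, top-2 routing
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
idx, w = topk_route(probs, k=2)
print(idx.shape, w.shape, load_balancing_loss(probs, idx))
```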

4. Integration with Sequence and Transformer Architectures

The modern instantiations of Linear-MoE deploy MoE layers within various high-throughput sequence modeling and Transformer frameworks:

  • Linear Sequence Modeling (LSM) + MoE: Linear-MoE integrates linear-complexity sequence modules, including linear attention ($O(N)$), structured state-space models (SSMs), and linear RNNs, chaining each block as:

$$\mathrm{LN} \rightarrow \mathrm{LSM} \rightarrow + \rightarrow \mathrm{LN} \rightarrow \mathrm{MoE} \rightarrow +$$

All MoE layers share the gating/routing machinery above, while the LSM component provides the token-mixing operator, facilitating $O(N)$ runtime and efficient parallelism (Data, Tensor, Pipeline, Expert, Sequence Parallelism) (Sun et al., 7 Mar 2025). A minimal schematic of this block structure appears after this list.

  • Shared Experts in Attention and FFN: The UMoE model reformulates multi-head attention as a MoE sublayer, enabling shared expert modules across both attention and FFN and efficient parameter reuse. Each expert block is a two-layer FFN, and routers can be separated for attention and FFN sub-modules or tied for additional savings (Yang et al., 12 May 2025).
  • Hybrid Linear-MoE/Transformer-MoE: For tasks demanding inductive biases from both softmax attention and LSM, hybrid models alternate Linear-MoE and standard Transformer-MoE blocks, using the same routing/gating layers (Sun et al., 7 Mar 2025).
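
As referenced in the first bullet, here is a minimal schematic of one Linear-MoE block, written as a pre-norm residual pair: an LSM token mixer followed by an MoE channel mixer. The callables `lsm_mix` and `moe_ffn` are placeholders for whichever linear-attention/SSM module and routed expert layer an implementation uses; this is a structural sketch, not the reference code of the cited system.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def linear_moe_block(x, lsm_mix, moe_ffn):
    """One block: LN -> LSM -> residual add -> LN -> MoE -> residual add.

    x       : (T, d) token representations
    lsm_mix : callable (T, d) -> (T, d), linear-complexity token mixer
    moe_ffn : callable (T, d) -> (T, d), routed mixture-of-experts channel mixer
    """
    x = x + lsm_mix(layer_norm(x))   # token mixing with the O(N) sequence module
    x = x + moe_ffn(layer_norm(x))   # sparse expert channel mixing
    return x

# Example with placeholder mixers just to show the wiring
T, d = 8, 16
x = np.random.default_rng(0).normal(size=(T, d))
out = linear_moe_block(x, lsm_mix=lambda h: 0.5 * h, moe_ffn=lambda h: 0.5 * h)
print(out.shape)  # (8, 16)
```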

5. Scaling Laws, Efficiency, and Architectural Tradeoffs

Scaling theory for Linear-MoE predicts compute advantage and capacity via the Efficiency Leverage (EL) metric:

$$EL(X_{\mathrm{MoE}} \mid X_{\mathrm{Dense}};\, C_{\mathrm{target}}) = \frac{C_{\mathrm{dense}}}{C_{\mathrm{moe}}}$$

where $C$ is the training compute required to reach the same loss (up to a small $\epsilon$) (Tian et al., 23 Jul 2025). Key empirical findings:

  • Activation Ratio ($A$): The fraction of experts activated per token ($A = K / E$) primarily determines EL, with $EL \propto A^{-a}$ for $a \in [1, 1.5]$; lower $A$ (greater sparsity) yields higher leverage.
  • Expert Granularity ($G$): Defined as $G = 2 d_{\mathrm{model}} / d_{\mathrm{expert}}$; optimal EL emerges in the range $G \approx 8$–$12$ due to a U-shaped dependence of $\log EL$ on $\log G$.
  • Compute Budget ($C$): EL scales as a power law in $C$, with modest increases for large budgets.
  • Unified Law:

$$EL(A, G, C) = \bar{A}^{\,a + d \log C + \gamma (\log G)^2 + \beta \log G}$$

Coefficients are empirically fitted (see Table 3, Tian et al., 23 Jul 2025). This law accurately predicts that, for example, a model with $A = 3.4\%$, $G = 12$, and $C = 1{\times}10^{22}$ attains $EL \approx 7$, consistent with Ling-mini-beta (0.85B active parameters) matching a dense 6.1B model with $1/7$ of the training FLOPs.
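
The unified law can be evaluated directly once the fitted coefficients are known. The sketch below only encodes the functional form; the coefficient values (and the logarithm base) must come from the published fit in Table 3 of the cited paper, and the numbers used here are placeholders for illustration.

```python
import math

def efficiency_leverage(A, G, C, a, d, gamma, beta):
    """Functional form EL(A, G, C) = A ** (a + d*log C + gamma*(log G)**2 + beta*log G).

    A : activation ratio (fraction of experts active per token), 0 < A <= 1
    G : expert granularity, G = 2 * d_model / d_expert
    C : training compute budget in FLOPs
    The coefficients a, d, gamma, beta (and the log base) must be taken from
    the empirical fit; natural log and the values below are placeholders.
    """
    exponent = a + d * math.log(C) + gamma * math.log(G) ** 2 + beta * math.log(G)
    return A ** exponent

# Placeholder coefficients purely to illustrate the functional form.
el = efficiency_leverage(A=0.034, G=12, C=1e22, a=-0.5, d=0.0, gamma=0.0, beta=0.0)
print(f"EL = {el:.2f}")
```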

6. Comparative Routers and Recent Innovations

Several routing mechanisms have been studied for their trade-offs in expressivity, efficiency, and stability:

  • Linear Router: Minimal overhead, a single affine projection ($\sim$6k parameters for typical settings), extremely fast routing (0.07 ms/token), moderate entropy, and smooth, uniform expert usage (Harvey et al., 19 Jun 2025). However, it offers limited semantic awareness and nonlinear selectivity compared to multilayer alternatives.
  • Attention and MLP Routers: Offer greater expressivity, higher routing entropy ($\sim$2.08 bits vs. 1.95 for the linear router), and improved feature-space partitioning, but incur 4–16× parameter and latency costs (Harvey et al., 19 Jun 2025).
  • Eigenbasis/ERMoE Routing: ERMoE replaces gating logits with a cosine similarity (“Eigenbasis Score”) between the input and a learned orthonormal basis for each expert. This content-aware, geometry-based routing achieves highly stable expert utilization, obviates balancing losses, and produces anatomically or semantically interpretable specializations (Cheng et al., 14 Nov 2025). ERMoE achieves state-of-the-art accuracy on ImageNet, COCO, and clinical imaging benchmarks, and in 3D MRI brain-age prediction attains 2.31 MAE vs. 2.83 for the best dense model. A sketch of this routing rule follows the list below.
  • UMoE Shared Experts: Demonstrates that simultaneous expert-sharing across attention and FFN unlocks additional parameter efficiency and outperforms dense or prior MoE architectures in both perplexity and downstream accuracy (Yang et al., 12 May 2025).
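
As referenced above, here is a sketch of cosine-similarity ("eigenbasis") routing in the spirit of ERMoE: each expert carries an orthonormal basis, and tokens are scored by how well they align with each expert's subspace. The scoring rule below is one plausible reading of the description in (Cheng et al., 14 Nov 2025), not the paper's exact formulation.

```python
import numpy as np

def eigenbasis_scores(x, bases):
    """Geometry-based routing scores via cosine similarity to expert subspaces.

    x     : (d,)        input token representation
    bases : (E, r, d)   per-expert basis vectors (rows assumed orthonormal)
    Scores each expert by the cosine similarity between x and its projection
    onto that expert's subspace, an assumed stand-in for the Eigenbasis Score.
    """
    x_norm = x / (np.linalg.norm(x) + 1e-8)
    scores = []
    for B in bases:                      # B: (r, d)
        proj = B.T @ (B @ x)             # projection of x onto span(B)
        proj_norm = np.linalg.norm(proj) + 1e-8
        scores.append(float(x_norm @ (proj / proj_norm)))  # cosine(x, proj)
    return np.array(scores)              # (E,) content-aware routing scores

# Example: 4 experts, each with a random 8-dim orthonormal basis in R^64
rng = np.random.default_rng(0)
d, E, r = 64, 4, 8
bases = np.stack([np.linalg.qr(rng.normal(size=(d, r)))[0].T for _ in range(E)])
print(eigenbasis_scores(rng.normal(size=d), bases))
```

Because the score depends only on the geometry of the input relative to each expert's learned subspace, utilization tends to stay balanced without an auxiliary loss, which is the property highlighted above.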

7. Practical Implementation and Empirical Results

Linear-MoE systems achieve state-of-the-art empirical performance in production-scale settings. Key workflow steps include:

  1. Selection of $K$ (number of experts): Chosen via an information criterion or cross-validation to balance approximation power, overfitting, and computational footprint (Nguyen et al., 2017); a BIC-based sketch appears after this list.
  2. Parametrization and Training: Gating network (softmax or geometry-based), linear/MLP experts, and, if needed, auxiliary regularizers (for classic Linear routing); all parameters trained jointly with gradient or EM-based optimization (Sun et al., 7 Mar 2025).
  3. Parallel Execution: Leveraging advanced parallelism (Data/Tensor/Pipeline/Expert/Sequence) for strong-scaling efficiency at long sequence lengths and large parameter counts (Sun et al., 7 Mar 2025).
  4. Empirical Benchmarks: Across multiple model sizes (e.g., A0.3B–2B, A1B–7B), Linear-MoE attains efficiency gains (up to $2\times$ inference speed and $O(1)$ memory at up to 16K context) with accuracy competitive with dense and baseline MoE architectures (Sun et al., 7 Mar 2025, Tian et al., 23 Jul 2025). In vision and language tasks, modern Linear-MoE architectures consistently match or outperform their dense counterparts using $7\times$ less compute (Tian et al., 23 Jul 2025).
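
As an illustration of step 1, the helper below compares candidate $K$ by BIC using the parameter counts from the table in Section 1. The log-likelihood values in the example are hypothetical; in practice they would come from a fitted model (e.g., via EM).

```python
import math

def linear_moe_bic(log_lik, K, p, q, n, with_covariance=True):
    """BIC for a K-expert Linear-MoE, using the Section 1 parameter counts.

    log_lik : maximized log-likelihood of the fitted model
    K, p, q : number of experts, input dimension, output dimension
    n       : number of training observations
    """
    n_params = K * (p + 1) + K * q * (p + 1)        # gating + expert parameters
    if with_covariance:
        n_params += K * q * (q + 1) // 2            # per-expert covariance C_i
    return -2.0 * log_lik + n_params * math.log(n)

# Hypothetical log-likelihoods for K = 2..5 on n = 5000 samples, p = 10, q = 2
log_liks = {2: -14210.0, 3: -13950.0, 4: -13890.0, 5: -13875.0}
bics = {K: linear_moe_bic(ll, K, p=10, q=2, n=5000) for K, ll in log_liks.items()}
print(min(bics, key=bics.get), bics)   # pick the K with the lowest BIC
```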

Summary Table: Router Variants in Linear-MoE

| Router Type | Param Count (E=8, d=768) | Entropy H(P) (bits) | Latency (ms/token) | Notable Property |
|---|---|---|---|---|
| Linear | 6,144 | 1.95 | 0.07 | Minimal, classical, stable |
| Attention | 49,664 | 2.08 | 0.29 | Embedding-based, higher awareness |
| MLP | 101,000 | 2.08 | 0.23 | Nonlinear, expressive |
| ERMoE (basis) | $O(d^2)$ (basis, small) | Flattest | 0.3–0.4 | Content-aware, interpretable |
| MLP-Hadamard | 101,000 | 1.10 | 0.88 | Structured, sharp two-expert splits |
| Hash | 0 | 0 | 85.0 | Deterministic, not used in practice |

This comparative summary underscores both the efficiency–expressivity frontier and recent advances in geometry- or content-aware routing.
