
Linear-MoE Architecture

Updated 5 December 2025
  • Linear-MoE is a sparse mixture-of-experts framework that uses a gating network to assign inputs to linear expert mappings, ensuring efficient data partitioning.
  • The architecture offers strong universal approximation properties and robust scaling laws for both parametric and nonparametric function approximation.
  • Innovations like ERMoE routing and shared expert modules enhance expressivity, reduce latency, and maintain balanced utilization in large-scale models.

A Linear-MoE (Mixture-of-Experts with linear experts) architecture couples classical mixture-of-experts modeling with sparse, high-capacity neural architectures via a gating network that assigns inputs to expert linear mappings. This approach yields models that combine strong theoretical guarantees, architectural modularity, and scalable, efficient implementation for parametric and nonparametric function approximation, sequence modeling, and large-scale transformer-based systems. The following sections cover the core structure, theoretical properties, sparse routing, integration with sequence and transformer models, scaling laws, and recent architectural innovations.

1. Core Architecture and Mathematical Specification

A Linear-MoE consists of two primary components: a gating network and a collection of $K$ expert networks, each parameterizing a (vector-valued) linear map. For input $x \in \mathbb{R}^p$, the model operates as follows:

  • Gating Network: Computes mixture weights

$$g_i(x) = \frac{\exp(w_i^\top x + c_i)}{\sum_{j=1}^K \exp(w_j^\top x + c_j)}, \quad i = 1, \ldots, K$$

where $w_i \in \mathbb{R}^p$ and $c_i \in \mathbb{R}$ are gating parameters, enforcing $g_i(x) > 0$ and $\sum_i g_i(x) = 1$.

  • Expert Networks: In the scalar-output case, each expert computes a linear response

$$f_i(x) = a_i^\top x + b_i, \quad a_i \in \mathbb{R}^p,\ b_i \in \mathbb{R}$$

For multi-output $\mathbb{R}^q$, experts generalize to

$$f_i(x) = A_i x + b_i, \quad A_i \in \mathbb{R}^{q \times p},\ b_i \in \mathbb{R}^q$$

  • Model Output: The final prediction is the mixture

$$\mu(x) = \sum_{i=1}^K g_i(x)\, f_i(x)$$

or, if modeling conditional densities,

$$p(y \mid x) = \sum_{i=1}^K g_i(x)\, \mathcal{N}(y;\, A_i x + b_i,\, C_i)$$

where $C_i \in \mathbb{R}^{q \times q}$ is a positive-definite covariance (Nguyen et al., 2017).

The parameter counts scale as follows:

| Component | Parameters per Expert | Total Parameters |
|---|---|---|
| Gating (softmax) | $w_i \in \mathbb{R}^p$, $c_i \in \mathbb{R}$ | $K(p+1)$ |
| Expert | $A_i \in \mathbb{R}^{q \times p}$, $b_i \in \mathbb{R}^q$ | $K q (p+1)$ |
| Covariance (optional) | $C_i$: $q(q+1)/2$ | $K q(q+1)/2$ |
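
To make the specification concrete, the following is a minimal NumPy sketch of the dense (non-sparse) forward pass: softmax gating over $K$ affine experts and the mixture mean $\mu(x)$. The function and parameter names (`linear_moe_forward`, `W_g`, `c_g`, `A`, `b`) are illustrative choices for this sketch, not identifiers from the cited work.

```python
import numpy as np

def linear_moe_forward(x, W_g, c_g, A, b):
    """Dense Linear-MoE forward pass (mixture mean, no sparsity).

    x   : (p,)       input vector
    W_g : (K, p)     gating weights w_i
    c_g : (K,)       gating biases c_i
    A   : (K, q, p)  expert matrices A_i
    b   : (K, q)     expert biases b_i
    Returns mu(x) = sum_i g_i(x) * (A_i x + b_i), shape (q,).
    """
    logits = W_g @ x + c_g                           # (K,) gating logits
    logits -= logits.max()                           # numerical stability
    g = np.exp(logits) / np.exp(logits).sum()        # softmax gate, sums to 1
    expert_out = np.einsum("kqp,p->kq", A, x) + b    # (K, q) per-expert A_i x + b_i
    return (g[:, None] * expert_out).sum(axis=0)     # (q,) mixture mean

# Example: p=4 inputs, q=2 outputs, K=3 experts
rng = np.random.default_rng(0)
p, q, K = 4, 2, 3
mu = linear_moe_forward(
    rng.normal(size=p),
    rng.normal(size=(K, p)), rng.normal(size=K),
    rng.normal(size=(K, q, p)), rng.normal(size=(K, q)),
)
print(mu.shape)  # (2,)
```

Replacing the mixture mean with a mixture of Gaussians centered at the expert outputs, with covariances $C_i$, recovers the conditional-density form given above.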

2. Theoretical Properties and Approximation Results

Linear-MoE architectures possess strong universal approximation properties for both mean functions and conditional densities:

  • Density Approximation: For any collection of true conditional marginal densities $g_{Y_j \mid X}(y_j \mid x)$ on compact $X \subset \mathbb{R}^p$, there exists a Linear-MoE (with sufficiently large $K$) that approximates each $p(y_j \mid x)$ to arbitrary precision in conditional KL-divergence, given suitable smoothness and positivity conditions (Nguyen et al., 2017).
  • Mean-Function Denseness: The model class forms a dense subset of the space of continuous vector-valued functions. For any continuous $\mu: X \to \mathbb{R}^q$ and $\epsilon > 0$, one can choose $K$ and parameters so that

$$\|\hat{\mu} - \mu\|_{q,\infty} = \max_{x \in X} \sum_{j=1}^q |\hat{\mu}_j(x) - \mu_j(x)| < \epsilon$$

Thus, a Linear-MoE can approximate arbitrary continuous multivariate regression functions (Nguyen et al., 2017).

These properties are enabled by closure under summation (for mean functions) and permutation (for conditional densities) of the MoE class, allowing one to build multivariate approximators by composition of independent univariate MoEs and mixture-of-Gaussians (Nguyen et al., 2017).

3. Sparse Routing and Load Balancing

The canonical sparsity procedure is top-$k$ gating, in which, for each token or input, the $k$ experts with the highest gating probabilities $g_i(x)$ are selected. The normalized weights for the chosen subset $\mathcal{S}_x$ are:

$$w_e(x) = \frac{\max\{g_e(x), 0\}}{\sum_{e' \in \mathcal{S}_x} \max\{g_{e'}(x), 0\}}$$

Strict capacity constraints are enforced by this per-token top-$k$ selection (Harvey et al., 19 Jun 2025, Sun et al., 7 Mar 2025).

To promote balanced expert usage and avoid "stragglers," an auxiliary load-balancing loss term is often introduced:

$$\mathcal{L}_{\mathrm{aux}} = \alpha_{\mathrm{aux}}\, E \sum_{e=1}^E f_e\, g_e \quad\text{or}\quad \mathcal{L}_{\mathrm{aux}} = \alpha_{\mathrm{aux}}\, E \sum_{e=1}^E (g_e)^2 f_e$$

where $f_e$ is the fraction of tokens routed to expert $e$, $g_e$ is the average routing probability for expert $e$, and $\alpha_{\mathrm{aux}}$ is a hyperparameter (Harvey et al., 19 Jun 2025). This regularizer encourages uniform utilization but can degrade specialization if its weight is set too high (Cheng et al., 14 Nov 2025).
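
The routing and balancing terms above can be sketched as follows; this is an illustrative NumPy implementation under the stated definitions, with `topk_route` and `load_balancing_loss` as hypothetical helper names rather than functions from any of the cited systems.

```python
import numpy as np

def topk_route(gate_probs, k):
    """Per-token top-k selection with renormalized weights.

    gate_probs : (T, E) softmax gating probabilities g_i(x) per token
    k          : number of experts activated per token
    Returns (indices, weights), each (T, k); weights renormalize the
    selected probabilities to sum to 1 per token (over S_x).
    """
    idx = np.argsort(-gate_probs, axis=-1)[:, :k]        # top-k expert indices
    sel = np.take_along_axis(gate_probs, idx, axis=-1)   # their probabilities
    w = sel / sel.sum(axis=-1, keepdims=True)            # renormalize over S_x
    return idx, w

def load_balancing_loss(gate_probs, idx, alpha=0.01):
    """Auxiliary loss alpha * E * sum_e f_e * g_e (first form above)."""
    T, E = gate_probs.shape
    f = np.bincount(idx.ravel(), minlength=E) / idx.size  # fraction of routing slots sent to e
    g = gate_probs.mean(axis=0)                           # mean routing probability per e
    return alpha * E * float(np.sum(f * g))

# Example: 16 tokens, 8 experts, top-2 routing
rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
idx, w = topk_route(probs, k=2)
print(idx.shape, w.shape, load_balancing_loss(probs, idx))
```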

4. Integration with Sequence and Transformer Architectures

The modern instantiations of Linear-MoE deploy MoE layers within various high-throughput sequence modeling and Transformer frameworks:

  • Linear Sequence Modeling (LSM) + MoE: Linear-MoE integrates linear-complexity sequence modules, including linear attention ($O(N)$), structured state-space models (SSMs), and linear RNNs, chaining each block as:

$$\mathrm{LN} \rightarrow \mathrm{LSM} \rightarrow + \rightarrow \mathrm{LN} \rightarrow \mathrm{MoE} \rightarrow +$$

All MoE layers share the gating/routing machinery above, while the LSM component provides the token-mixing operator, facilitating $O(N)$ runtime and efficient parallelism (Data, Tensor, Pipeline, Expert, Sequence Parallelism) (Sun et al., 7 Mar 2025). A minimal schematic of this block structure appears after this list.

  • Shared Experts in Attention and FFN: The UMoE model reformulates multi-head attention as a MoE sublayer, enabling shared expert modules across both attention and FFN and efficient parameter reuse. Each expert block is a two-layer FFN, and routers can be separated for attention and FFN sub-modules or tied for additional savings (Yang et al., 12 May 2025).
  • Hybrid Linear-MoE/Transformer-MoE: For tasks demanding inductive biases from both softmax attention and LSM, hybrid models alternate Linear-MoE and standard Transformer-MoE blocks, using the same routing/gating layers (Sun et al., 7 Mar 2025).
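
As referenced in the first bullet, here is a minimal schematic of one Linear-MoE block, written as a pre-norm residual pair: an LSM token mixer followed by an MoE channel mixer. The callables `lsm_mix` and `moe_ffn` are placeholders for whichever linear-attention/SSM module and routed expert layer an implementation uses; this is a structural sketch, not the reference code of the cited system.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def linear_moe_block(x, lsm_mix, moe_ffn):
    """One block: LN -> LSM -> residual add -> LN -> MoE -> residual add.

    x       : (T, d) token representations
    lsm_mix : callable (T, d) -> (T, d), linear-complexity token mixer
    moe_ffn : callable (T, d) -> (T, d), routed mixture-of-experts channel mixer
    """
    x = x + lsm_mix(layer_norm(x))   # token mixing with the O(N) sequence module
    x = x + moe_ffn(layer_norm(x))   # sparse expert channel mixing
    return x

# Example with placeholder mixers just to show the wiring
T, d = 8, 16
x = np.random.default_rng(0).normal(size=(T, d))
out = linear_moe_block(x, lsm_mix=lambda h: 0.5 * h, moe_ffn=lambda h: 0.5 * h)
print(out.shape)  # (8, 16)
```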

5. Scaling Laws, Efficiency, and Architectural Tradeoffs

Scaling theory for Linear-MoE predicts compute advantage and capacity via the Efficiency Leverage (EL) metric:

$$EL(X_{\mathrm{MoE}} \mid X_{\mathrm{Dense}};\, C_{\mathrm{target}}) = \frac{C_{\mathrm{dense}}}{C_{\mathrm{moe}}}$$

where $C$ is the training compute required to reach the same loss (up to a small $\epsilon$) (Tian et al., 23 Jul 2025). Key empirical findings:

  • Activation Ratio ($A$): The fraction of experts activated per token ($A = K / E$) primarily determines EL, with $EL \propto A^{-a}$ for $a \in [1, 1.5]$; lower $A$ (greater sparsity) yields higher leverage.
  • Expert Granularity ($G$): Defined as $G = 2 d_{\mathrm{model}} / d_{\mathrm{expert}}$; optimal EL emerges in the range $G \approx 8$–$12$ due to a U-shaped dependence of $\log EL$ on $\log G$.
  • Compute Budget ($C$): EL scales as a power law in $C$, with modest increases for large budgets.
  • Unified Law:

$$EL(A, G, C) = \bar{A}^{\,a + d \log C + \gamma (\log G)^2 + \beta \log G}$$

Coefficients are empirically fitted (see Table 3, Tian et al., 23 Jul 2025). This law accurately predicts that, for example, a model with $A = 3.4\%$, $G = 12$, and $C = 1{\times}10^{22}$ attains $EL \approx 7$, consistent with Ling-mini-beta (0.85B active parameters) matching a dense 6.1B model with $1/7$ of the training FLOPs.
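
The unified law can be evaluated directly once the fitted coefficients are known. The sketch below only encodes the functional form; the coefficient values (and the logarithm base) must come from the published fit in Table 3 of the cited paper, and the numbers used here are placeholders for illustration.

```python
import math

def efficiency_leverage(A, G, C, a, d, gamma, beta):
    """Functional form EL(A, G, C) = A ** (a + d*log C + gamma*(log G)**2 + beta*log G).

    A : activation ratio (fraction of experts active per token), 0 < A <= 1
    G : expert granularity, G = 2 * d_model / d_expert
    C : training compute budget in FLOPs
    The coefficients a, d, gamma, beta (and the log base) must be taken from
    the empirical fit; natural log and the values below are placeholders.
    """
    exponent = a + d * math.log(C) + gamma * math.log(G) ** 2 + beta * math.log(G)
    return A ** exponent

# Placeholder coefficients purely to illustrate the functional form.
el = efficiency_leverage(A=0.034, G=12, C=1e22, a=-0.5, d=0.0, gamma=0.0, beta=0.0)
print(f"EL = {el:.2f}")
```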

6. Comparative Routers and Recent Innovations

Several routing mechanisms have been studied for their trade-offs in expressivity, efficiency, and stability:

  • Linear Router: Minimal overhead, a single affine projection ($\sim$6k parameters for typical settings), extremely fast routing (0.07 ms/token), moderate entropy, and smooth, uniform expert usage (Harvey et al., 19 Jun 2025). However, it offers limited semantic awareness and nonlinear selectivity compared to multilayer alternatives.
  • Attention and MLP Routers: Offer greater expressivity, higher routing entropy ($\sim$2.08 bits vs. 1.95 for the linear router), and improved feature-space partitioning, but incur 4–16× parameter and latency costs (Harvey et al., 19 Jun 2025).
  • Eigenbasis/ERMoE Routing: ERMoE replaces gating logits with a cosine similarity (“Eigenbasis Score”) between the input and a learned orthonormal basis for each expert. This content-aware, geometry-based routing achieves highly stable expert utilization, obviates balancing losses, and produces anatomically or semantically interpretable specializations (Cheng et al., 14 Nov 2025). ERMoE achieves state-of-the-art accuracy on ImageNet, COCO, and clinical imaging benchmarks, and in 3D MRI brain-age prediction attains 2.31 MAE vs. 2.83 for the best dense model. A sketch of this routing rule follows the list below.
  • UMoE Shared Experts: Demonstrates that simultaneous expert-sharing across attention and FFN unlocks additional parameter efficiency and outperforms dense or prior MoE architectures in both perplexity and downstream accuracy (Yang et al., 12 May 2025).
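
As referenced above, here is a sketch of cosine-similarity ("eigenbasis") routing in the spirit of ERMoE: each expert carries an orthonormal basis, and tokens are scored by how well they align with each expert's subspace. The scoring rule below is one plausible reading of the description in (Cheng et al., 14 Nov 2025), not the paper's exact formulation.

```python
import numpy as np

def eigenbasis_scores(x, bases):
    """Geometry-based routing scores via cosine similarity to expert subspaces.

    x     : (d,)        input token representation
    bases : (E, r, d)   per-expert basis vectors (rows assumed orthonormal)
    Scores each expert by the cosine similarity between x and its projection
    onto that expert's subspace, an assumed stand-in for the Eigenbasis Score.
    """
    x_norm = x / (np.linalg.norm(x) + 1e-8)
    scores = []
    for B in bases:                      # B: (r, d)
        proj = B.T @ (B @ x)             # projection of x onto span(B)
        proj_norm = np.linalg.norm(proj) + 1e-8
        scores.append(float(x_norm @ (proj / proj_norm)))  # cosine(x, proj)
    return np.array(scores)              # (E,) content-aware routing scores

# Example: 4 experts, each with a random 8-dim orthonormal basis in R^64
rng = np.random.default_rng(0)
d, E, r = 64, 4, 8
bases = np.stack([np.linalg.qr(rng.normal(size=(d, r)))[0].T for _ in range(E)])
print(eigenbasis_scores(rng.normal(size=d), bases))
```

Because the score depends only on the geometry of the input relative to each expert's learned subspace, utilization tends to stay balanced without an auxiliary loss, which is the property highlighted above.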

7. Practical Implementation and Empirical Results

Linear-MoE systems achieve state-of-the-art empirical performance in production-scale settings. Key workflow steps include:

  1. Selection of $K$ (number of experts): Chosen via an information criterion or cross-validation to balance approximation power, overfitting, and computational footprint (Nguyen et al., 2017); a BIC-based sketch appears after this list.
  2. Parametrization and Training: Gating network (softmax or geometry-based), linear/MLP experts, and, if needed, auxiliary regularizers (for classic Linear routing); all parameters trained jointly with gradient or EM-based optimization (Sun et al., 7 Mar 2025).
  3. Parallel Execution: Leveraging advanced parallelism (Data/Tensor/Pipeline/Expert/Sequence) for strong-scaling efficiency at long sequence lengths and large parameter counts (Sun et al., 7 Mar 2025).
  4. Empirical Benchmarks: Across multiple model sizes (e.g., A0.3B–2B, A1B–7B), Linear-MoE attains efficiency gains (up to $2\times$ inference speed and $O(1)$ memory at up to 16K context) with accuracy competitive with dense and baseline MoE architectures (Sun et al., 7 Mar 2025, Tian et al., 23 Jul 2025). In vision and language tasks, modern Linear-MoE architectures consistently match or outperform their dense counterparts using $7\times$ less compute (Tian et al., 23 Jul 2025).
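
As an illustration of step 1, the helper below compares candidate $K$ by BIC using the parameter counts from the table in Section 1. The log-likelihood values in the example are hypothetical; in practice they would come from a fitted model (e.g., via EM).

```python
import math

def linear_moe_bic(log_lik, K, p, q, n, with_covariance=True):
    """BIC for a K-expert Linear-MoE, using the Section 1 parameter counts.

    log_lik : maximized log-likelihood of the fitted model
    K, p, q : number of experts, input dimension, output dimension
    n       : number of training observations
    """
    n_params = K * (p + 1) + K * q * (p + 1)        # gating + expert parameters
    if with_covariance:
        n_params += K * q * (q + 1) // 2            # per-expert covariance C_i
    return -2.0 * log_lik + n_params * math.log(n)

# Hypothetical log-likelihoods for K = 2..5 on n = 5000 samples, p = 10, q = 2
log_liks = {2: -14210.0, 3: -13950.0, 4: -13890.0, 5: -13875.0}
bics = {K: linear_moe_bic(ll, K, p=10, q=2, n=5000) for K, ll in log_liks.items()}
print(min(bics, key=bics.get), bics)   # pick the K with the lowest BIC
```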

Summary Table: Router Variants in Linear-MoE

| Router Type | Param Count (E=8, d=768) | Entropy H(P) (bits) | Latency (ms/token) | Notable Property |
|---|---|---|---|---|
| Linear | 6,144 | 1.95 | 0.07 | Minimal, classical, stable |
| Attention | 49,664 | 2.08 | 0.29 | Embedding-based, higher awareness |
| MLP | 101,000 | 2.08 | 0.23 | Nonlinear, expressive |
| ERMoE (basis) | $O(d^2)$ (basis, small) | Flattest | 0.3–0.4 | Content-aware, interpretable |
| MLP-Hadamard | 101,000 | 1.10 | 0.88 | Structured, sharp two-expert splits |
| Hash | 0 | 0 | 85.0 | Deterministic, not used in practice |

This comparative summary underscores both the efficiency–expressivity frontier and recent advances in geometry- or content-aware routing.
