Eigenbasis-Guided Routing (EMoE)

Updated 12 May 2026

Eigenbasis-Guided Routing leverages orthonormal eigenbases to optimize token routing in Mixture-of-Experts models, enhancing balance and specialization.
This method addresses imbalance and homogeneity challenges by aligning routing decisions with principal data directions, eliminating auxiliary losses.
Applications of EMoE and ERMoE show improved results in vision and biomedical fields, highlighting the architecture's versatile performance.

Eigenbasis-Guided Routing (EMoE and ERMoE) is a class of Mixture-of-Experts (MoE) architectures in which routing decisions are grounded in projections onto learned orthonormal eigenbases derived from the input feature space or experts’ representation spaces. These methods address inherent challenges in sparse MoE models, notably load imbalance (“rich get richer”) and expert homogeneity, by leveraging principled geometric partitioning of the token or feature manifold. The approach obviates the need for auxiliary load-balancing losses, enhances utilization stability, and promotes diverse, interpretable expert specialization. Key instantiations include EMoE (Cheng et al., 17 Jan 2026) and ERMoE (Cheng et al., 14 Nov 2025), both demonstrating state-of-the-art results in large-scale vision, retrieval, and biomedical tasks.

1. Motivation and Core Problems in Mixture-of-Experts Routing

Mixture-of-Experts architectures scale neural network capacity by conditionally activating a sparse set of experts. However, practical deployments exhibit two recurrent problems:

Load imbalance (“rich get richer”): Standard MoE routers often concentrate the majority of tokens on a small subset of experts, leading to over-utilization, straggler bottlenecks, and underutilization of network capacity.
Expert homogeneity: Auxiliary load-balancing terms, designed to alleviate imbalance, tend to enforce uniform routing at the expense of expert specialization—experts converge to redundant, non-diverse representations, negating the intended benefits of modularity and heterogeneity.

Conventional MoE mechanisms, typically based on learned gating networks with cross-entropy or auxiliary losses, encounter a trade-off between specialization and balanced assignment. Eigenbasis-Guided Routing frameworks replace these learned routers and balancing heuristics with a geometric, content-aware partitioning that ties assignments directly to the data’s principal directions or experts’ learned subspaces (Cheng et al., 17 Jan 2026, Cheng et al., 14 Nov 2025).

2. Eigenbasis Construction and Orthonormality Constraints

EMoE: Shared Feature Eigenbasis

For each MoE layer, all $N$ token embeddings $h_t \in \mathbb{R}^D$ from a mini-batch are collected in $H \in \mathbb{R}^{N \times D}$.
The empirical feature covariance is

$\mathbf{C} = \frac{1}{N} H^\top H = \frac{1}{N} \sum_{t=1}^N h_t h_t^\top \in \mathbb{R}^{D \times D}$

The top-$r$ eigenvectors, stacked as the columns of $\mathbf{U} \in \mathbb{R}^{D \times r}$, are obtained by solving

$\mathbf{C}\,v_i = \lambda_i v_i, \quad i=1,\ldots, r$

with orthonormality enforced via the Frobenius penalty

$L_{\text{ortho}} = \lambda_{\text{ortho}} \|\mathbf{U}^\top \mathbf{U} - \mathbf{I}_r\|^2_F$

where $\lambda_{\text{ortho}}$ typically takes values in a range starting at $10^{-3}$.
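The construction above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the random mini-batch, and the choice of a direct eigendecomposition via `np.linalg.eigh` are assumptions made here, and a trained model would typically treat $\mathbf{U}$ as a learned parameter kept near-orthonormal by the penalty.

```python
import numpy as np

def top_r_eigenbasis(H, r):
    """Top-r eigenpairs of the empirical covariance C = H^T H / N.

    H has shape (N, D); no mean-centering is applied, matching the
    formula for C given in the text.
    """
    N = H.shape[0]
    C = H.T @ H / N                        # (D, D), symmetric PSD
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:r]  # indices of the r largest
    return eigvals[order], eigvecs[:, order]

def ortho_penalty(U, lam_ortho=1e-3):
    """lam_ortho * ||U^T U - I_r||_F^2 (squared Frobenius norm)."""
    r = U.shape[1]
    G = U.T @ U - np.eye(r)
    return lam_ortho * np.sum(G ** 2)

# Hypothetical mini-batch: N = 256 tokens, D = 16 features.
rng = np.random.default_rng(0)
H = rng.normal(size=(256, 16))
eigvals, U = top_r_eigenbasis(H, r=4)
penalty = ortho_penalty(U)   # numerically zero for an exact eigenbasis
```

For an exact eigendecomposition the penalty vanishes up to floating-point error; it only becomes active when $\mathbf{U}$ drifts away from orthonormality during gradient updates.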

ERMoE: Per-Expert Eigenbasis Reparameterization

Each expert’s linear transformation is reparameterized in terms of its own learned basis matrix, which is constrained to be orthonormal (its Gram matrix equals the identity).

Orthonormality is softly enforced for each basis via a light Frobenius penalty.

This eigenbasis construction grounds routing and specialization in explicit, geometrically meaningful subspaces—balancing token assignment and promoting interpretability.
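A rough sketch of how per-expert bases might be kept close to orthonormal is given below. The exact ERMoE parameterization is not reproduced here; the penalty weight of `1e-3`, the QR-based re-orthogonalization step, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def ortho_penalty(Q, weight):
    """Soft penalty weight * ||Q^T Q - I||_F^2 for one expert basis."""
    d = Q.shape[1]
    return weight * np.sum((Q.T @ Q - np.eye(d)) ** 2)

def reorthogonalize(Q):
    """Project a drifted basis back to orthonormal columns via QR."""
    Q_new, R = np.linalg.qr(Q)
    # Flip column signs so diag(R) >= 0, keeping each column's orientation.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q_new * signs

# Hypothetical setup: 4 experts, each with an 8 x 8 basis that has
# drifted slightly away from orthonormality.
rng = np.random.default_rng(0)
num_experts, d = 4, 8
bases = [np.eye(d) + 0.05 * rng.normal(size=(d, d)) for _ in range(num_experts)]

before = sum(ortho_penalty(Q, 1e-3) for Q in bases)
bases = [reorthogonalize(Q) for Q in bases]
after = sum(ortho_penalty(Q, 1e-3) for Q in bases)
```

The soft penalty acts continuously through the gradient, while the QR step is a hard projection that would be applied only occasionally; combining the two is one plausible design, not a confirmed detail of either paper.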

3. Routing Mechanisms Based on Principal Components

EMoE Algorithm

Projection: Each feature $h_t$ is projected into the $r$-dimensional eigen-subspace:

$z_t = \mathbf{U}^\top h_t \in \mathbb{R}^r$

Energy fractions: For each principal direction $j$,

$e_{t,j} = \frac{z_{t,j}^2}{\sum_{k=1}^{r} z_{t,k}^2}$

with $e_t$ on the probability simplex.

Expert scores: Each expert receives a score from a parameterized scoring function whose parameters comprise two scalars and a set of biases.

Sparse gating: A temperature-scaled softmax over the expert scores yields gating probabilities, and the token is routed to the top-1 expert. Only the selected expert’s MLP is executed, with its output added residually, scaled by a learned factor.
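The four steps above can be sketched end to end as follows. Two parts of this sketch are assumptions rather than the published method: energy fractions are taken to be normalized squared projections, and the expert scores are produced by a hypothetical linear map (`W`, `b`) of the energy fractions, standing in for the paper's scoring function.

```python
import numpy as np

def emoe_route(H, U, W, b, temperature=1.0):
    """Sketch of eigenbasis-guided top-1 routing.

    H: (N, D) token features, U: (D, r) orthonormal basis,
    W: (r, num_experts) and b: (num_experts,) are HYPOTHETICAL score
    parameters, not the scoring function used in the paper.
    """
    Z = H @ U                                    # (N, r) projections U^T h_t
    energy = Z ** 2
    fractions = energy / (energy.sum(axis=1, keepdims=True) + 1e-12)
    scores = fractions @ W + b                   # assumed linear scoring
    logits = scores / temperature
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    expert = probs.argmax(axis=1)                # top-1 expert per token
    return fractions, probs, expert

rng = np.random.default_rng(0)
N, D, r, num_experts = 32, 16, 4, 4
H = rng.normal(size=(N, D))
U, _ = np.linalg.qr(rng.normal(size=(D, r)))     # orthonormal columns
W = rng.normal(size=(r, num_experts))
b = np.zeros(num_experts)
fractions, probs, expert = emoe_route(H, U, W, b, temperature=0.5)
```

Lower temperatures sharpen the gating distribution; because only the arg-max expert is executed, the temperature mainly affects the gradient signal rather than the forward routing decision.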

ERMoE Algorithm

Projection and normalization: The input token and its context (from self-attention) are projected into each expert’s eigenbasis and normalized.

Eigenbasis score: Each expert receives a score measuring how well the projected token and context align within that expert’s eigenbasis.

Thresholded top-$k$ routing: A confidence threshold selects the eligible experts for each token; among these, the top-$k$ scores are used, with normalized mixture weights.

The output is the mixture of the selected experts’ outputs under these weights.
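The thresholded top-$k$ step can be illustrated in isolation, taking the per-expert scores as given. This is a sketch under stated assumptions: the weights are normalized by dividing by the sum of the selected scores, a token with no eligible expert falls back to its single best expert, and the names `tau`, `threshold_topk_weights`, and `mixture_output` are introduced here for illustration only.

```python
import numpy as np

def threshold_topk_weights(scores, tau, k):
    """Mixture weights for one token from per-expert scores.

    Experts with score >= tau are eligible; the k best eligible experts
    are kept and their scores renormalized to sum to 1 (assumed scheme).
    """
    scores = np.asarray(scores, dtype=float)
    eligible = np.flatnonzero(scores >= tau)
    if eligible.size == 0:
        # Assumed fallback: route to the single highest-scoring expert.
        eligible = np.array([scores.argmax()])
    top = eligible[np.argsort(scores[eligible])[::-1][:k]]
    weights = np.zeros_like(scores)
    total = scores[top].sum()
    if total > 0:
        weights[top] = scores[top] / total
    else:
        weights[top] = 1.0 / top.size   # degenerate scores: uniform weights
    return weights

def mixture_output(x, experts, weights):
    """Weighted sum of the selected experts' outputs; others are skipped."""
    return sum(w * f(x) for f, w in zip(experts, weights) if w > 0)

# Toy experts that just scale their input (hypothetical stand-ins for MLPs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
weights = threshold_topk_weights([0.9, 0.2, 0.6, 0.55], tau=0.5, k=2)
y = mixture_output(np.ones(3), experts, weights)
fallback = threshold_topk_weights([0.1, 0.3], tau=0.5, k=2)
```

In the toy call, experts 0, 2, and 3 pass the threshold, the two best (0 and 2) are kept, and their scores 0.9 and 0.6 renormalize to weights 0.6 and 0.4.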

Both approaches tie routing decisions to geometric alignment with data-driven or expert-specific subspaces, in contrast to free learned gating networks.

4. Balanced Utilization and Expert Specialization

Eigenbasis-guided routing enforces a form of intrinsic balancing based on the distribution of data variance across principal components. Key properties include:

Feature space partitioning along orthogonal principal directions, yielding natural diversity among experts.
Tokens with high variance alignments are routed in proportion to the data’s energy along each subspace, inherently preventing “starvation” of low-variance experts.
Empirical results demonstrate near-uniform expert utilization on datasets like ImageNet, with class subsets coherently mapped to specific experts on smaller datasets, but without expert collapse (“rich get richer”) (Cheng et al., 17 Jan 2026).
ERMoE achieves stable routing curves and interpretable class/expert correspondences, with late layers developing sharp but overlapping semantic preferences (Cheng et al., 14 Nov 2025).
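The link between projected energy and data variance follows directly from the covariance defined in Section 2. As a short derivation, assume $v_j$ is a unit-norm eigenvector of $\mathbf{C}$ with eigenvalue $\lambda_j$, and write $z_{t,j} = v_j^\top h_t$ for the projection of token $t$ onto it (notation introduced here for illustration):

$\frac{1}{N}\sum_{t=1}^N z_{t,j}^2 = \frac{1}{N}\sum_{t=1}^N v_j^\top h_t h_t^\top v_j = v_j^\top \mathbf{C}\, v_j = \lambda_j$

The batch-average squared projection onto direction $j$ therefore equals its eigenvalue, so the aggregate energy available to each principal direction is fixed by the data’s spectrum rather than by a learned gate.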

This geometric routing mechanism eliminates the need for auxiliary balancing losses, which previously interfered with gradient flow and expert specialization.

5. Training Regimes and Architectural Details

Hyperparameter	EMoE	ERMoE
Number of eigenvectors	…	… (full width per expert)
Number of experts	…	… (typically 8, sometimes more)
Gating temperature	…	Not used (thresholded top-$k$)
Orthonormality weight	…–…	…
Expert output scaling	Learned per layer	N/A
Losses	…	…

The detailed procedure for EMoE involves updating the eigenbasis $\mathbf{U}$ and its orthonormality loss at each step; for ERMoE, each expert maintains an independent basis, with routine re-orthogonalization and soft penalties. All parameters are trained via backpropagation; routing is sparse (EMoE: top-1, ERMoE: thresholded top-$k$).

In both approaches, no explicit router balance or auxiliary loss is used—the geometric formulation is sufficient for robust behavior.

6. Empirical Results and Domain Extensions

Computer Vision Benchmarks

ImageNet-1K: EMoE-ViT-H achieves Top-1/Top-5 accuracy of 88.14% / 98.27%, improving upon V-MoE and single-gated MoE baselines (Cheng et al., 17 Jan 2026). ERMoE attains 88.03% / 98.97% (ViT-B/16, top-2 routing) (Cheng et al., 14 Nov 2025).
Few-shot settings: On CIFAR-100 and Tiny-ImageNet, EMoE and ERMoE outperform previous MoE baselines by 3–7 percentage points in 5/10-shot regimes.
Multimodal retrieval: ERMoE integrated with CLIP improves COCO R@1 to 65.4%, surpassing CLIP-MoE’s 65.0%.

Biomedical Imaging

3D-CNN extension: EMoE-3D-CNN computes covariances over volumetric patch features, reducing MAE in brain-age estimation from 2.41 years to 2.16 years (Cheng et al., 17 Jan 2026).
ERMoE-ba (3 region experts + 5 free experts) achieves MAE = 2.31 years, beating 3D Swin/ViT/CNN baselines (2.83–3.52 years) (Cheng et al., 14 Nov 2025).

Load Balancing and Expert Activity

Heatmaps show tokens and classes distributed across experts according to feature structure, with all experts remaining active.
Expert utilization curves remain flat—no collapse to the “rich get richer” regime; peak-to-mean token count per expert remains within ±10% [(Cheng et al., 17 Jan 2026), Fig. 5; (Cheng et al., 14 Nov 2025), Fig. expert_comp].
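A utilization check of this kind can be computed from the per-token expert assignments. The sketch below is illustrative only: the function name and the toy assignments are made up, and reading "within ±10%" as every expert's token count lying within 10% of the mean is an interpretation, not a definition taken from the papers.

```python
import numpy as np

def utilization_stats(assignments, num_experts):
    """Token counts per expert plus peak-to-mean and min-to-mean ratios."""
    counts = np.bincount(assignments, minlength=num_experts)
    mean = counts.mean()
    return counts, counts.max() / mean, counts.min() / mean

# Toy routing trace: 12 tokens assigned to 4 experts (made-up data).
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 2])
counts, peak_to_mean, min_to_mean = utilization_stats(assignments, num_experts=4)

# Assumed reading of the balance criterion: all counts within 10% of the mean.
balanced = (peak_to_mean <= 1.10) and (min_to_mean >= 0.90)
```

Passing `minlength=num_experts` matters: without it, experts that receive no tokens at the high end of the index range would be silently dropped from the counts, hiding exactly the collapse the metric is meant to detect.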

These results substantiate the claim that eigenbasis-guided routing achieves both high performance and superior utilization balance without auxiliary losses.

7. Interpretability, Limitations, and Future Directions

Interpretability and specialization arise naturally from the geometric grounding of routing:

In vision layers, class–expert heatmaps reveal that deeper layers develop crisp, semantically structured expert preferences.
In medical 3D imaging, region-ablation probes demonstrate that experts’ eigenbases align with anatomically meaningful subspaces (white matter, gray matter, cerebrospinal fluid) over training epochs.

Principal limitations and open problems include:

Computational overhead: Maintaining and orthogonalizing the eigenbases adds a modest overhead per layer (incurred per expert in ERMoE).
Eigenbasis size choice: Diminishing returns are observed in EMoE once the number of eigenvectors grows beyond a certain size (Cheng et al., 17 Jan 2026); the impact of the confidence threshold and the orthogonality weight in ERMoE is subject to further study (Cheng et al., 14 Nov 2025).
Extension to large expert counts: Efficient eigen-updating strategies at very large scale, including large numbers of experts, are not yet resolved.
Behavior in low-data and zero-shot regimes: Remains an open question, as these setups may challenge the stability of eigenbasis estimation and expert interpretability.
Theoretical analysis: Deeper study of geometric partitioning dynamics, the emergence of diverse eigenbases, and the trade-offs in threshold/top-$k$ selection is ongoing.

Future directions include dynamic selection of the eigenbasis and routing hyperparameters, sharing eigenbases across modalities, scaling to high-dimensional settings, and formalizing the theoretical underpinnings of geometry-guided conditional computation.

References

Cheng et al. (2026). EMoE: Eigenbasis-Guided Routing for Mixture-of-Experts.

Cheng et al. (2025). ERMoE: Eigen-Reparameterized Mixture-of-Experts for Stable Routing and Interpretable Specialization.
