
Eigenbasis-Guided Routing (EMoE)

Updated 12 May 2026
  • Eigenbasis-Guided Routing leverages orthonormal eigenbases to optimize token routing in Mixture-of-Experts models, enhancing balance and specialization.
  • This method addresses imbalance and homogeneity challenges by aligning routing decisions with principal data directions, eliminating auxiliary losses.
  • Applications of EMoE and ERMoE show improved results in vision and biomedical fields, highlighting the architecture's versatile performance.

Eigenbasis-Guided Routing (EMoE and ERMoE) is a class of Mixture-of-Experts (MoE) architectures in which routing decisions are grounded in projections onto learned orthonormal eigenbases derived from the input feature space or experts’ representation spaces. These methods address inherent challenges in sparse MoE models, notably load imbalance (“rich get richer”) and expert homogeneity, by leveraging principled geometric partitioning of the token or feature manifold. The approach obviates the need for auxiliary load-balancing losses, enhances utilization stability, and promotes diverse, interpretable expert specialization. Key instantiations include EMoE (Cheng et al., 17 Jan 2026) and ERMoE (Cheng et al., 14 Nov 2025), both demonstrating state-of-the-art results in large-scale vision, retrieval, and biomedical tasks.

1. Motivation and Core Problems in Mixture-of-Experts Routing

Mixture-of-Experts architectures scale neural network capacity by conditionally activating a sparse set of experts. However, practical deployments exhibit two recurrent problems:

  • Load imbalance (“rich get richer”): Standard MoE routers often concentrate the majority of tokens on a small subset of experts, leading to over-utilization, straggler bottlenecks, and underutilization of network capacity.
  • Expert homogeneity: Auxiliary load-balancing terms, designed to alleviate imbalance, tend to enforce uniform routing at the expense of expert specialization—experts converge to redundant, non-diverse representations, negating the intended benefits of modularity and heterogeneity.

Conventional MoE mechanisms, typically based on learned gating networks with cross-entropy or auxiliary losses, encounter a trade-off between specialization and balanced assignment. Eigenbasis-Guided Routing frameworks replace these learned routers and balancing heuristics with a geometric, content-aware partitioning that ties assignments directly to the data’s principal directions or experts’ learned subspaces (Cheng et al., 17 Jan 2026, Cheng et al., 14 Nov 2025).

2. Eigenbasis Construction and Orthonormality Constraints

EMoE: Shared Feature Eigenbasis

  • For each MoE layer, all $N$ token embeddings $h_t \in \mathbb{R}^D$ from a mini-batch are collected in $H \in \mathbb{R}^{N \times D}$.
  • The empirical feature covariance is

$$\mathbf{C} = \frac{1}{N} H^\top H = \frac{1}{N} \sum_{t=1}^N h_t h_t^\top \in \mathbb{R}^{D \times D}$$

  • The top-$r$ eigenvectors $\mathbf{U} \in \mathbb{R}^{D \times r}$ are obtained by solving

$$\mathbf{C}\,v_i = \lambda_i v_i, \quad i=1,\ldots, r$$

with orthonormality enforced via the Frobenius penalty

$$L_{\text{ortho}} = \lambda_{\text{ortho}} \|\mathbf{U}^\top \mathbf{U} - \mathbf{I}_r\|^2_F$$

where $\lambda_{\text{ortho}}$ is typically around $10^{-3}$.
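The construction above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `top_r_eigenbasis` and `ortho_penalty` are hypothetical, the eigenvectors are computed with an exact symmetric eigendecomposition rather than learned, and the covariance is left uncentered to match the formula in the text.

```python
import numpy as np

def top_r_eigenbasis(H, r):
    """Top-r eigenvectors (as columns) and eigenvalues of C = H^T H / N."""
    N = H.shape[0]
    C = H.T @ H / N                        # D x D empirical covariance (uncentered)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:r]  # indices of the r largest eigenvalues
    return eigvecs[:, order], eigvals[order]

def ortho_penalty(U, lam=1e-3):
    """Frobenius penalty lam * ||U^T U - I_r||_F^2 (lam is an assumed value)."""
    r = U.shape[1]
    G = U.T @ U - np.eye(r)
    return lam * np.sum(G ** 2)

rng = np.random.default_rng(0)
H = rng.normal(size=(256, 16))             # N = 256 tokens, D = 16 features
U, lams = top_r_eigenbasis(H, r=4)

penalty_exact = ortho_penalty(U)           # numerically zero for an exact eigenbasis
penalty_perturbed = ortho_penalty(U + 0.1 * rng.normal(size=U.shape))
```

An exact eigendecomposition already returns orthonormal columns, so the penalty only becomes active when the basis is treated as a trainable parameter that can drift, as the perturbed case illustrates.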

ERMoE: Per-Expert Eigenbasis Reparameterization

  • Each expert’s linear transformation is reparameterized in terms of its own basis:

[…]

with the basis matrix constrained to be orthonormal.

  • Orthonormality is softly enforced for each basis via a light Frobenius penalty.

This eigenbasis construction grounds routing and specialization in explicit, geometrically meaningful subspaces—balancing token assignment and promoting interpretability.
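As a rough illustration of keeping per-expert bases near-orthonormal, the sketch below pairs a soft Frobenius penalty with a periodic re-orthogonalization step. The QR-based repair, the helper names, the penalty weight, and the sizes are all assumptions made for the example. The text specifies only a light Frobenius penalty, not how or how often the bases are repaired.

```python
import numpy as np

def frobenius_ortho_penalty(B, weight):
    """weight * ||B^T B - I||_F^2 for a single expert basis B."""
    d = B.shape[1]
    return weight * np.linalg.norm(B.T @ B - np.eye(d), ord="fro") ** 2

def reorthogonalize(B):
    """Replace B by the orthonormal Q factor of its QR decomposition (assumed method)."""
    Q, R = np.linalg.qr(B)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs                       # flip column signs so Q stays close to B

rng = np.random.default_rng(1)
num_experts, d = 4, 8                      # hypothetical sizes
bases = [np.linalg.qr(rng.normal(size=(d, d)))[0] for _ in range(num_experts)]

# Simulate drift away from orthonormality, e.g. after some gradient steps.
drifted = [B + 0.05 * rng.normal(size=B.shape) for B in bases]
repaired = [reorthogonalize(B) for B in drifted]

penalties_drifted = [frobenius_ortho_penalty(B, weight=1.0) for B in drifted]
penalties_repaired = [frobenius_ortho_penalty(B, weight=1.0) for B in repaired]
```

The soft penalty discourages drift during training, while the hard repair restores exact orthonormality when it is applied.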

3. Routing Mechanisms Based on Principal Components

EMoE Algorithm

  • Projection: Each feature $h_t$ is projected into the $r$-dimensional eigen-subspace:

[…]

  • Energy fractions: For each principal direction,

[…]

with the fractions for each token lying on the probability simplex.

  • Expert scores: Each expert receives a score:

[…]

whose parameters include two scalars and a set of biases.

  • Sparse gating: A softmax with temperature yields the routing probabilities, and the token is routed to the top-1 expert. Only that expert’s MLP is executed; its output is added residually, scaled by a learned factor.
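The four steps above can be sketched as a single routing function. Several parts are assumptions rather than the published formulas: the energy fraction is taken to be the squared projection coefficient normalized over the $r$ directions, and the expert score is a hypothetical linear map `W_score` of those fractions plus a bias. The names `emoe_route`, `W_score`, and `temperature` are illustrative only.

```python
import numpy as np

def emoe_route(H, U, W_score, bias, temperature=1.0, eps=1e-12):
    """Project, compute energy fractions, score experts, and pick the top-1 expert."""
    Z = H @ U                                          # (N, r) coordinates in the eigen-subspace
    energy = Z ** 2
    frac = energy / (energy.sum(axis=1, keepdims=True) + eps)  # each row sums to ~1
    logits = (frac @ W_score.T + bias) / temperature   # (N, E); assumed score form
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    choice = probs.argmax(axis=1)                      # top-1 expert per token
    return frac, probs, choice

rng = np.random.default_rng(2)
H = rng.normal(size=(32, 16))                          # 32 tokens, D = 16
U = np.linalg.qr(rng.normal(size=(16, 4)))[0]          # stand-in orthonormal basis, r = 4
W_score = np.eye(4)                                    # toy case: one expert per direction
bias = np.zeros(4)

frac, probs, choice = emoe_route(H, U, W_score, bias, temperature=0.5)
```

In this toy configuration each token is sent to the expert tied to the direction carrying most of its energy, which shows how routing follows the geometry of the features instead of a free gating network.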

ERMoE Algorithm

  • Projection and normalization: The input token and its context (from self-attention) are projected into each expert’s eigenbasis:

[…]

  • Eigenbasis Score:

[…]

  • Thresholded top-$k$ routing: A confidence threshold selects eligible experts for each token; the top-$k$ scores are used, with normalized mixture weights:

[…]

The output is the correspondingly weighted mixture of the selected experts’ outputs.
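The thresholded top-k step can be sketched independently of how the eigenbasis scores are computed. In the example below, the scores are simply given as a non-negative array. Normalizing by the sum of the kept scores and assigning zero weight when no expert clears the threshold are both assumptions, since the text does not spell out either choice. The function name `thresholded_topk` is hypothetical.

```python
import numpy as np

def thresholded_topk(scores, k, threshold):
    """Mixture weights (N, E): keep at most k experts per token whose score >= threshold."""
    eligible = scores >= threshold
    masked = np.where(eligible, scores, -np.inf)
    topk_idx = np.argsort(-masked, axis=1)[:, :k]      # indices of the k largest scores
    keep = np.zeros_like(eligible)
    np.put_along_axis(keep, topk_idx, True, axis=1)
    keep &= eligible                                   # drop top-k slots filled by ineligible experts
    w = np.where(keep, scores, 0.0)
    total = w.sum(axis=1, keepdims=True)
    # Rows with no eligible expert stay all-zero (assumed fallback).
    return np.divide(w, total, out=np.zeros_like(w), where=total > 0)

scores = np.array([
    [0.6, 0.3, 0.05, 0.05],   # two experts clear the threshold
    [0.1, 0.1, 0.1, 0.1],     # none do
    [0.5, 0.4, 0.3, 0.0],     # three do, only the top 2 are kept
])
weights = thresholded_topk(scores, k=2, threshold=0.2)
```

Given per-expert outputs, the layer output for a token would then be the sum of those outputs weighted by the corresponding row of `weights`.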

Both approaches tie routing decisions to geometric alignment with data-driven or expert-specific subspaces, in contrast to free learned gating networks.

4. Balanced Utilization and Expert Specialization

Eigenbasis-guided routing enforces a form of intrinsic balancing based on the distribution of data variance across principal components. Key properties include:

  • Feature space partitioning along orthogonal principal directions, yielding natural diversity among experts.
  • Tokens with high variance alignments are routed in proportion to the data’s energy along each subspace, inherently preventing “starvation” of low-variance experts.
  • Empirical results demonstrate near-uniform expert utilization on datasets like ImageNet, with class subsets coherently mapped to specific experts on smaller datasets, but without expert collapse (“rich get richer”) (Cheng et al., 17 Jan 2026).
  • ERMoE achieves stable routing curves and interpretable class/expert correspondences, with late layers developing sharp but overlapping semantic preferences (Cheng et al., 14 Nov 2025).

This geometric routing mechanism eliminates the need for auxiliary balancing losses, which previously interfered with gradient flow and expert specialization.

5. Training Regimes and Architectural Details

| Hyperparameter | EMoE | ERMoE |
| Number of eigenvectors | […] | […] (full width per expert) |
| Number of experts | […] | […] (typically 8, sometimes more) |
| Gating temperature | […] | Not used (thresholded top-$k$) |
| Orthonormality weight | […] | […] |
| Expert output scaling | Learned per layer | N/A |
| Losses | […] | […] |

The EMoE procedure updates the eigenbasis $\mathbf{U}$ and its orthonormality loss at each step. In ERMoE, each expert maintains its own basis, kept near-orthonormal by routine re-orthogonalization and soft penalties. All parameters are trained via backpropagation, and routing is sparse (EMoE: top-1; ERMoE: thresholded top-$k$).

In both approaches, no explicit router-balancing or auxiliary load-balancing loss is used; the geometric formulation is reported to be sufficient for robust behavior.

6. Empirical Results and Domain Extensions

Computer Vision Benchmarks

  • ImageNet-1K: EMoE-ViT-H achieves Top-1/Top-5 accuracy of 88.14% / 98.27%, improving upon V-MoE and single-gated MoE baselines (Cheng et al., 17 Jan 2026). ERMoE attains 88.03% / 98.97% (ViT-B/16, top-2 routing, […]) (Cheng et al., 14 Nov 2025).
  • Few-shot settings: On CIFAR-100 and Tiny-ImageNet, EMoE and ERMoE outperform previous MoE baselines by 3–7 percentage points in 5/10-shot regimes.
  • Multimodal retrieval: ERMoE integrated with CLIP improves COCO R@1 to 65.4%, surpassing CLIP-MoE’s 65.0%.

Biomedical Imaging

  • 3D-CNN extension: EMoE-3D-CNN computes covariances over volumetric patch features, reducing MAE in brain-age estimation from 2.41 years to 2.16 years (Cheng et al., 17 Jan 2026).
  • ERMoE-ba (3 region experts + 5 free experts) achieves MAE = 2.31 years, beating 3D Swin/ViT/CNN baselines (2.83–3.52 years) (Cheng et al., 14 Nov 2025).

Load Balancing and Expert Activity

  • Heatmaps show tokens and classes distributed across experts according to feature structure, with all experts remaining active.
  • Expert utilization curves remain flat—no collapse to the “rich get richer” regime; peak-to-mean token count per expert remains within ±10% [(Cheng et al., 17 Jan 2026), Fig. 5; (Cheng et al., 14 Nov 2025), Fig. expert_comp].
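A balance figure such as the peak-to-mean token count can be computed directly from the routing assignments. The sketch below uses a small made-up assignment vector and a hypothetical helper `expert_load_stats`; it illustrates the metric only and does not reproduce any reported result.

```python
import numpy as np

def expert_load_stats(assignments, num_experts):
    """Per-expert token counts plus peak-to-mean and min-to-mean load ratios."""
    counts = np.bincount(assignments, minlength=num_experts)
    mean = counts.mean()
    return counts, counts.max() / mean, counts.min() / mean

# Toy top-1 assignments for 12 tokens over 4 experts (illustrative data).
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0])
counts, peak_to_mean, min_to_mean = expert_load_stats(assignments, num_experts=4)
```

A peak-to-mean ratio of 1.0 means perfectly uniform load, and a ratio within 1.10 corresponds to the busiest expert staying no more than 10% above the mean.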

These results substantiate the claim that eigenbasis-guided routing achieves both high performance and superior utilization balance without auxiliary losses.

7. Interpretability, Limitations, and Future Directions

Interpretability and specialization arise naturally from the geometric grounding of routing:

  • In vision layers, class–expert heatmaps reveal that deeper layers develop crisp, semantically structured expert preferences.
  • In medical 3D imaging, region-ablation probes demonstrate that experts’ eigenbases align with anatomically meaningful subspaces (white matter, gray matter, cerebrospinal fluid) over training epochs.

Principal limitations and open problems include:

  • Computational overhead: Maintenance and orthogonalization of eigenbases induce a modest […] (or […] per expert in ERMoE) overhead per layer.
  • Eigenbasis size choice: Diminishing returns are observed in EMoE beyond […] (Cheng et al., 17 Jan 2026); the impact of the routing threshold and the orthogonality weight in ERMoE is subject to further study (Cheng et al., 14 Nov 2025).
  • Extension to large expert counts: Efficient eigen-updating strategies for very large expert counts or […] are not yet resolved.
  • Behavior in low-data and zero-shot regimes: Remains an open question, as these setups may challenge the stability of eigenbasis estimation and expert interpretability.
  • Theoretical analysis: Deeper study of geometric partitioning dynamics, the emergence of diverse eigenbases, and the trade-offs in threshold/top-$k$ selection is ongoing.

Future directions include dynamic selection of […]/[…], sharing eigenbases across modalities, scaling to high-dimensional settings, and formalizing the theoretical underpinnings of geometry-guided conditional computation.
