
CartesianMoE: Cartesian Routing in MoE

Updated 25 May 2026
  • CartesianMoE is a Mixture-of-Experts architecture that partitions experts into two sub-pools and uses Cartesian product routing for enhanced group-wise parameter sharing.
  • It mitigates routing noise through a multiplicative fusion of sub-expert pools, leading to lower perplexity and improved downstream accuracy compared to conventional MoE methods.
  • Empirical evaluations show significant improvements in perplexity and task performance across diverse model scales, highlighting its robustness and scalability in large language models.

CartesianMoE is a Mixture-of-Experts (MoE) architecture designed to achieve enhanced knowledge sharing, improved robustness to routing noise, and superior performance in LLMs by employing a Cartesian product routing mechanism inspired by collective matrix factorization. In contrast to conventional MoE approaches that aggregate expert outputs additively and often result in isolated expert learning, CartesianMoE leverages a multiplicative fusion of sub-expert pools, enabling group-wise parameter sharing between experts and leading to substantial empirical gains in perplexity and downstream tasks (Su et al., 2024).

1. Limitations of Conventional Mixture-of-Experts Methods

Standard MoE architectures deploy $N$ independent expert feedforward networks ($\mathrm{FFN}_k$), with each expert trained in isolation. This independence inhibits knowledge sharing; as a result, the token representations processed by an MoE layer are primarily determined by the small subset of experts selected by the routing network. Fluctuations in routing, termed "routing noise", directly affect outputs and make the model more sensitive to routing errors.

Early mitigation approaches, such as Switch Transformer and GLaM, introduce shared experts whose activations are added to the outputs of dynamically routed experts:

$$y_t = \sum_{i \in \text{Top-K}} \alpha_i \mathrm{FFN}_i(h_t) + \sum_{j \in \text{shared}} \beta_j \mathrm{FFN}_j(h_t)$$

While this additive fusion delivers a global component, its sharing is coarse-grained: every routed expert receives the identical shared component, precluding more differentiated, group-wise sharing schemes.
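The additive fusion of routed and shared experts can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ffn_stub` is a hypothetical stand-in for a real expert FFN (it just scales its input), and the gate weights `alphas` and `betas` are passed in directly instead of being produced by a router.

```python
def ffn_stub(scale):
    """Hypothetical stand-in for an expert FFN: scales every coordinate."""
    return lambda h: [scale * x for x in h]


def additive_moe(h, routed, shared, alphas, betas, top_k):
    """y = sum over top-k routed experts of alpha_i * FFN_i(h)
         + sum over shared experts of beta_j * FFN_j(h)."""
    # Indices of the top_k routed experts by gate weight.
    top = sorted(range(len(routed)), key=lambda i: alphas[i], reverse=True)[:top_k]
    y = [0.0] * len(h)
    for i in top:
        out = routed[i](h)
        y = [acc + alphas[i] * v for acc, v in zip(y, out)]
    # Shared experts are always active, whatever the router selected.
    for j, expert in enumerate(shared):
        out = expert(h)
        y = [acc + betas[j] * v for acc, v in zip(y, out)]
    return y


h = [1.0, 2.0]
routed = [ffn_stub(1.0), ffn_stub(2.0), ffn_stub(3.0)]
shared = [ffn_stub(10.0)]
y = additive_moe(h, routed, shared, alphas=[0.1, 0.6, 0.3], betas=[1.0], top_k=2)
```

Note that the shared term is the same for every choice of routed experts, which is the coarse-grained sharing described above.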

2. Cartesian-Product Routing: Formal Definition and Motivation

Motivated by collective matrix factorization, CartesianMoE partitions the expert pool into two equal sub-pools, denoted A and B, each comprising $e$ sub-experts. The effective "expert" at inference or training time corresponds to a unique pairing (i.e., an element of the Cartesian product) of one sub-expert from A and one from B, producing $e^2$ distinct virtual experts. This structure permits parameter and representational sharing across experts that share common sub-expert components.
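As a small worked example of the counting argument, the sketch below enumerates the virtual experts for a toy pool size. The labels `A0`, `B0`, and so on are hypothetical names used only for illustration.

```python
from itertools import product


def virtual_experts(e):
    """Enumerate all (A, B) pairings for two sub-pools of e sub-experts each."""
    pool_a = [f"A{i}" for i in range(e)]
    pool_b = [f"B{j}" for j in range(e)]
    return list(product(pool_a, pool_b))


pairs = virtual_experts(4)
num_virtual = len(pairs)                       # e^2 pairings from 2e sub-experts
sharing_a0 = [p for p in pairs if p[0] == "A0"]  # virtual experts that reuse A0
```

With `e = 4`, eight sub-experts yield sixteen virtual experts, and each sub-expert of A is reused by four of them; this reuse is the group-wise sharing the text refers to.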

A single Cartesian Product Layer (CPL) processes inputs as follows:

  1. Sub-layer A Routing:

$$\tilde{h}^{(1)} = \sum_{i=1}^{e} r^{(1)}_i(h^t)\, \mathrm{FFN}_i(h^t) \quad \text{s.t.}\ \|r^{(1)}(h^t)\|_0 = k$$

where $h^t$ is the input and $r^{(1)}_i(h^t)$ is the gating score for sub-expert $i$.

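  2. Residual Update and Sub-layer B Routing:

$$h^{(1)} = h^t + \tilde{h}^{(1)}$$

$$\tilde{h}^{(2)} = \sum_{j=1}^{e} r^{(2)}_j(h^{(1)})\, \mathrm{FFN}^{(2)}_j(h^{(1)}) \quad \text{s.t.}\ \|r^{(2)}(h^{(1)})\|_0 = k$$

where $\mathrm{FFN}^{(2)}_j$ is the $j$-th sub-expert of pool B and $r^{(2)}_j$ its gating score.

  3. Final Output:

$$h^{(2)} = h^{(1)} + \tilde{h}^{(2)}$$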

This construction associates each computation path through CPL with a particular (A, B) pair, achieving knowledge sharing through multiplicative residual composition.
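The three steps above can be sketched as a serial composition of two sparse mixtures with residual connections. This is a minimal illustration, not the authors' implementation: `make_expert` is a hypothetical toy sub-expert, the gate functions are assumed to return already-sparsified gating scores, and the form of the second stage (routing on the updated hidden state, followed by a second residual) is inferred from the step labels.

```python
def make_expert(scale):
    """Hypothetical toy sub-expert: an 'FFN' that scales its input."""
    return lambda h: [scale * x for x in h]


def add(u, v):
    """Element-wise vector addition (the residual connection)."""
    return [a + b for a, b in zip(u, v)]


def sparse_mix(h, experts, gates):
    """Gate-weighted sum of expert outputs; experts with zero gate are skipped."""
    out = [0.0] * len(h)
    for g, expert in zip(gates, experts):
        if g != 0.0:
            out = [o + g * v for o, v in zip(out, expert(h))]
    return out


def cartesian_product_layer(h, pool_a, gate_fn_a, pool_b, gate_fn_b):
    # Step 1: route over sub-pool A using the layer input.
    h_tilde_1 = sparse_mix(h, pool_a, gate_fn_a(h))
    # Step 2: residual update, then route over sub-pool B on the updated state.
    h1 = add(h, h_tilde_1)
    h_tilde_2 = sparse_mix(h1, pool_b, gate_fn_b(h1))
    # Step 3: second residual gives the layer output.
    return add(h1, h_tilde_2)


pool_a = [make_expert(2.0), make_expert(3.0)]
pool_b = [make_expert(4.0), make_expert(10.0)]
# Fixed sparse gates for illustration: A picks its first expert, B its second.
out = cartesian_product_layer(
    [1.0, -1.0],
    pool_a, lambda h: [1.0, 0.0],
    pool_b, lambda h: [0.0, 0.5],
)
```

Because sub-layer B sees the output of whichever A sub-expert was selected, each computation path corresponds to one (A, B) pair.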

3. Architectural Specifications

Every Mixture-of-Experts feedforward layer in the Transformer backbone is replaced by a Cartesian Product Layer composed of two serial fine-grained MoE sub-layers:

  • Expert Pools: Each sub-layer contains $e$ fine-grained (half-sized) FFNs. The intermediate dimension of each FFN is the canonical dense FFN width divided by the split granularity.
  • Shared Experts: Each sub-layer may incorporate a fixed, always-on shared expert for global knowledge assimilation.
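  • Gating Networks: For each MoE sub-layer $\ell \in \{1, 2\}$, the gating scores are a softmax over router logits:

$$r^{(\ell)}(x) = \mathrm{softmax}\left(W^{(\ell)} x\right)$$

where $x$ is the sub-layer input and $W^{(\ell)}$ the router's projection matrix. Post-softmax, only the top-$k$ elements are retained (sparsification).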

  • Forward Pass Sequence: Multi-Head Attention → residual addition → CPL sub-layer A → residual → CPL sub-layer B → residual → LayerNorm.
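The gating step (softmax followed by top-k sparsification) can be sketched as follows. This is an illustrative sketch under assumptions: the router logits are passed in directly, and the retained scores are not renormalized after sparsification, since the text does not say whether they are.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def top_k_gates(logits, k):
    """Softmax first, then keep only the k largest scores and zero the rest."""
    probs = softmax(logits)
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    # Assumption: retained scores are left as-is (no renormalization).
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


gates = top_k_gates([2.0, 0.0, 1.0, -1.0], k=2)
active = [i for i, g in enumerate(gates) if g > 0.0]
```

The resulting gate vector has exactly `k` non-zero entries, which is the sparsity constraint written as an $\ell_0$ norm in the routing equations.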

4. Training Objective and Regularization

The overall training objective for CartesianMoE integrates the standard cross-entropy language modeling loss and a load-balancing regularizer:

  1. Language Modeling Loss:

$$\mathcal{L}_{\mathrm{LM}} = -\sum_{t} \log p\left(x_t \mid x_{<t}\right)$$

  2. Load-Balance Loss: Encourages uniform expert utilization, preventing router collapse. For each sub-layer $\ell$ and expert $i$, the regularizer takes the usual form

$$\mathcal{L}^{(\ell)}_{\mathrm{bal}} = e \sum_{i=1}^{e} f^{(\ell)}_i P^{(\ell)}_i$$

where $f^{(\ell)}_i$ is the fraction of tokens routed to expert $i$ of sub-layer $\ell$ and $P^{(\ell)}_i$ is the mean gating score assigned to that expert over the batch.

  3. Total Objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda \sum_{\ell=1}^{2} \mathcal{L}^{(\ell)}_{\mathrm{bal}}$$

where $\mathcal{L}^{(\ell)}_{\mathrm{bal}}$ is the load-balance loss of sub-layer $\ell$ and $\lambda$ is its weighting coefficient.
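To make the load-balancing idea concrete, here is a small pure-Python sketch. It assumes a Switch-Transformer-style regularizer (number of experts times the sum, over experts, of dispatch fraction times mean gate probability); this is a common choice and is not confirmed here as the exact form used by CartesianMoE. The function names and the coefficient `lam` are hypothetical.

```python
def load_balance_loss(gate_probs, top_k):
    """gate_probs: one list of softmax gate probabilities per token.
    Returns e * sum_i f_i * P_i for a single MoE sub-layer (assumed form)."""
    n_tokens = len(gate_probs)
    e = len(gate_probs[0])
    counts = [0] * e
    prob_sums = [0.0] * e
    for probs in gate_probs:
        # Experts this token is dispatched to under top-k routing.
        chosen = sorted(range(e), key=lambda i: probs[i], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (n_tokens * top_k) for c in counts]   # dispatch fraction per expert
    p_mean = [s / n_tokens for s in prob_sums]     # mean gate probability
    return e * sum(fi * pi for fi, pi in zip(f, p_mean))


def total_loss(lm_loss, balance_losses, lam):
    """LM loss plus the weighted sum of per-sub-layer balance losses."""
    return lm_loss + lam * sum(balance_losses)


balanced = load_balance_loss([[0.7, 0.3], [0.3, 0.7]], top_k=1)
collapsed = load_balance_loss([[0.9, 0.1], [0.8, 0.2]], top_k=1)
objective = total_loss(2.0, [balanced, collapsed], lam=0.01)
```

Under this assumed form, evenly spread routing gives the smaller value and routing that collapses onto one expert is penalized more heavily.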

5. Empirical Evaluation: Datasets, Baselines, and Quantitative Findings

The empirical analysis is based on pretraining and downstream tasks using The Pile (825 GB, 32k-token LLaMA tokenizer). Models are trained up to 400B tokens.

Model and Baselines:

  • Base Model: 12 layers, FFN=3072; MoE replaces every other FFN with CPL.
  • Large Model: 24 layers, FFN=4096.
  • Baselines include: Dense LLaMA, SMoE-Share (top-2 + 1 shared), SMoE-Top3, Hash Layer (random experts), Fine-grained Routing, and TopP Routing.

Perplexity Results (Pile val. set):

Model          MoE-Base   MoE-Large
Dense          8.55       6.95
SMoE-Share     7.37       6.13
Fine-grained   7.33       6.16
CartesianMoE   7.19       6.08

Downstream Task Accuracy:

  • MoE-Base: CartesianMoE achieves best performance on 7/8 benchmarks.
  • MoE-Large: Best on 6/8 benchmarks, often improving by 1–2 percentage points over prior MoE variants.

Scalability:

  • At 400B tokens (Large): CartesianMoE reaches perplexity 5.69 vs. 5.78 (Fine-grained).
  • At 7.25B parameters: 4.92 (CartesianMoE) vs. 4.99 (Fine-grained), with CartesianMoE outperforming on all downstream evaluations.

Granularity Study:

Across the FFN split granularities tested, CartesianMoE robustly outperforms the fine-grained baselines.

6. Routing Robustness and Ablation Studies

Disabled Top-1 Expert Test:

Masking out tokens' highest-scoring expert and re-routing:

  • SMoE-Share: Perplexity increases dramatically (6.13 → 75.7),
  • Fine-grained: Rises from 6.16 → 30.9,
  • CartesianMoE: Only rises from 6.08 → 6.35 (an increase of 0.27).
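The size of these degradations is easier to compare as relative increases. The short calculation below uses only the perplexity values listed above; the helper name `relative_increase` is illustrative.

```python
def relative_increase(before, after):
    """Fractional change in perplexity after disabling the top-1 expert."""
    return (after - before) / before


# (perplexity before, perplexity after) for the MoE-Large models listed above.
results = {
    "SMoE-Share": (6.13, 75.7),
    "Fine-grained": (6.16, 30.9),
    "CartesianMoE": (6.08, 6.35),
}

increases = {name: relative_increase(b, a) for name, (b, a) in results.items()}
for name, inc in increases.items():
    print(f"{name}: {inc:.1%}")
```

CartesianMoE's perplexity grows by a few percent, while both baselines grow by several hundred percent or more.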

Shared-Expert Ablation:

Removing the shared expert in CartesianMoE slightly degrades perplexity (6.08 → 6.15), suggesting that globally shared knowledge still helps even when group-wise sharing is present.

Granularity Ablation:

CartesianMoE surpasses flat fine-grained routing regardless of granularity, though underfitting is observed when the FFN is split too finely.

7. Advantages, Limitations, and Broader Implications

Advantages:

  • Group-wise knowledge sharing: Multiplicative fusion ensures each virtual expert shares sub-structures, smoothing routing noise and promoting generalization.
  • Performance: Lower perplexity than the compared MoE variants at both model scales, and the best accuracy on most downstream benchmarks (7/8 at Base, 6/8 at Large).
  • Robustness: Perplexity remains stable despite routing perturbations.
  • Flexibility: The architecture integrates with existing Transformer backbones, supporting shared experts and standard objectives.

Limitations:

  • Currently restricted to a two-sub-pool Cartesian product; extension to three or more would require increasing hidden-state dimensionality, with risk of sub-expert under-training.
  • Added implementation complexity due to dual routers and residual management.

Broader Implications:

CartesianMoE demonstrates the effectiveness of multiplicative, group-wise knowledge sharing compared to additive formulations. This suggests that as LLM parameter counts grow to the trillion scale, group-wise MoE designs such as CartesianMoE could become foundational for accurate, scalable, and robust sparse model deployments (Su et al., 2024).
