Cartesian Product Routing in MoE

Updated 18 April 2026
  • Cartesian Product Routing is a mechanism that factorizes expert computation into sequential sub-layers, enhancing knowledge sharing in MoE architectures.
  • It decomposes the feed-forward network into two sub-layers with fine-grained and shared experts, achieving improved perplexity and downstream performance.
  • Empirical evaluations demonstrate that CartesianMoE attains lower perplexity and superior task accuracy with increased routing robustness compared to traditional MoEs.

Cartesian Product Routing is a specialized expert routing and composition mechanism introduced in the context of Transformer-based Mixture-of-Experts (MoE) neural architectures. Its primary function is to enable more effective and robust knowledge sharing across experts in LLMs by factorizing the expert computation into a sequential, multiplication-style composition of sub-experts, drawing formal motivation from collective matrix factorization. This mechanism is embodied in the CartesianMoE architecture, which demonstrates empirically superior perplexity, downstream accuracy, and routing robustness compared to traditional MoE designs (Su et al., 2024).

1. Architectural Principles of Cartesian Product Routing

In CartesianMoE, each MoE layer replaces the standard Feed-Forward Network (FFN) in a Transformer block with a sequence of two sub-layers, each comprising a set of fine-grained sub-experts and a default global “shared expert.” The routing process proceeds as follows:

  1. Input hidden state $h^{l} \in \mathbb{R}^d$ is processed by sub-layer A, consisting of $e$ fine-grained FFN sub-experts $\{\mathrm{FFN}^A_1, \dots, \mathrm{FFN}^A_e\}$, plus an optional shared FFN. The router $r_1$ selects the top-$k$ sub-experts for weighted combination.
  2. The result, $\hat h$, is summed residually with the input and normalized to obtain $\bar h$ via LayerNorm.
  3. Sub-layer B, with another set of $e$ FFN sub-experts $\{\mathrm{FFN}^B_{e+1}, \dots, \mathrm{FFN}^B_{2e}\}$ (plus optional shared), receives $\bar h$. Router $r_2$ again selects the top-$k$ sub-experts for combination to produce the sub-layer B output $\tilde h$.
  4. A final residual adds $\tilde h$ to the original input and applies LayerNorm to yield the output of the block.

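Because each sub-layer routes to only its top-$k$ sub-experts, the computation cost stays matched to that of the baseline fine-grained MoE, yet the sequential structure exposes $e \times e = e^2$ implicit composite experts, with $k \times k$ of them active for any given token. Each composite expert is realized as a composition $\mathrm{FFN}^B_j \circ \mathrm{FFN}^A_i$ of one sub-expert from A followed by one from B. This forms the Cartesian product structure.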
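The four steps above can be sketched in NumPy for a single token. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear router, the two-layer ReLU FFNs, the parameter-free LayerNorm, and all names (`cartesian_block`, `sub_layer`, `make_ffn`, the `p` dictionary keys) are hypothetical choices made only for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the feature axis (learned scale/shift omitted).
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def make_ffn(rng, d, d_ff):
    # Small two-layer ReLU FFN; sizes and activation are illustrative assumptions.
    w_in = rng.normal(0.0, 0.1, (d, d_ff))
    w_out = rng.normal(0.0, 0.1, (d_ff, d))
    return lambda x: np.maximum(x @ w_in, 0.0) @ w_out

def top_k_gate(logits, k):
    # Softmax over sub-experts, keep the k largest scores, renormalize to sum to 1.
    p = np.exp(logits - logits.max())
    p /= p.sum()
    keep = np.argsort(p)[-k:]
    gates = np.zeros_like(p)
    gates[keep] = p[keep] / p[keep].sum()
    return gates

def sub_layer(x, experts, shared, w_router, k):
    # Always-on shared expert plus the gate-weighted sum of the top-k routed sub-experts.
    gates = top_k_gate(x @ w_router, k)  # linear router: an assumption
    out = shared(x)
    for i in np.flatnonzero(gates):
        out = out + gates[i] * experts[i](x)
    return out, gates

def cartesian_block(h, p, k):
    h_hat, g_a = sub_layer(h, p["A"], p["shared_A"], p["router_A"], k)        # step 1
    h_bar = layer_norm(h + h_hat)                                             # step 2
    h_tilde, g_b = sub_layer(h_bar, p["B"], p["shared_B"], p["router_B"], k)  # step 3
    return layer_norm(h + h_tilde), g_a, g_b                                  # step 4

rng = np.random.default_rng(0)
d, d_ff, e, k = 8, 16, 4, 2
p = {
    "A": [make_ffn(rng, d, d_ff) for _ in range(e)],
    "B": [make_ffn(rng, d, d_ff) for _ in range(e)],
    "shared_A": make_ffn(rng, d, d_ff),
    "shared_B": make_ffn(rng, d, d_ff),
    "router_A": rng.normal(0.0, 0.1, (d, e)),
    "router_B": rng.normal(0.0, 0.1, (d, e)),
}
h = rng.normal(size=d)
out, g_a, g_b = cartesian_block(h, p, k)
```

Each call evaluates only `k` routed sub-experts per sub-layer, while the pair of gate vectors `(g_a, g_b)` selects among `e * e` possible A-then-B compositions.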

2. Mathematical Formulation and Composite-Expert View

The transformation across sub-layers can be formalized as follows:

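  • Sub-layer A routing:

$$\hat h = \mathrm{FFN}^A_{s}(h^{l}) + \sum_{i=1}^{e} r_{1,i}(h^{l})\, \mathrm{FFN}^A_i(h^{l})$$

$$\bar h = \mathrm{LayerNorm}(h^{l} + \hat h)$$

  • Sub-layer B routing:

$$\tilde h = \mathrm{FFN}^B_{s}(\bar h) + \sum_{j=e+1}^{2e} r_{2,j}(\bar h)\, \mathrm{FFN}^B_j(\bar h)$$

$$h^{\mathrm{out}} = \mathrm{LayerNorm}(h^{l} + \tilde h)$$

  • Full composite mapping (schematically, omitting shared experts, residual connections, and LayerNorm):

$$h^{\mathrm{out}} \approx \sum_{i=1}^{e} \sum_{j=e+1}^{2e} r_{1,i}(h^{l})\, r_{2,j}(\bar h)\, \big(\mathrm{FFN}^B_j \circ \mathrm{FFN}^A_i\big)(h^{l})$$

Here $r_{1,i}$ and $r_{2,j}$ denote the top-$k$ gate weights of routers $r_1$ and $r_2$ (zero for unselected sub-experts), and $\mathrm{FFN}^A_{s}$, $\mathrm{FFN}^B_{s}$ denote the shared experts of the two sub-layers.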

Each composite expert thus corresponds to one ordered pair $(i, j)$ drawn from the sub-expert sets of sub-layers A and B, with mixture weight given by the product of the two gate values, $r_{1,i} \cdot r_{2,j}$, realizing a “multiplicative” combination rather than the additive blending typical of previous shared expert methods.
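The product form of the mixture weights can be illustrated with an outer product of the two gate vectors. The gate values below are made up for the example (a top-2 selection over four sub-experts per sub-layer), and the snippet shows only the weight bookkeeping: since the FFNs are nonlinear and a LayerNorm sits between the sub-layers, the block output is not literally a linear mixture of the compositions.

```python
import numpy as np

# Hypothetical top-2 gate vectors over e = 4 sub-experts in each sub-layer.
g_a = np.array([0.0, 0.7, 0.0, 0.3])  # sub-layer A gates, sum to 1
g_b = np.array([0.6, 0.0, 0.4, 0.0])  # sub-layer B gates, sum to 1

# w[i, j] is the implied weight of the composite expert "A_i then B_j".
w = np.outer(g_a, g_b)

# Ordered pairs (i, j) with non-zero weight: k * k = 4 active composites.
active = [(int(i), int(j)) for i, j in zip(*np.nonzero(w))]
total = float(w.sum())
```

Because each gate vector sums to one, the composite weights also sum to one, so the pair weights form a valid mixture over the active composite experts.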

3. Routing (Gating) Mechanisms and Regularization

The gating functions $r_1$ and $r_2$ govern expert selection in sub-layers A and B, respectively. Each gating function operates as follows:

  1. The router computes logits $z_i$, one per sub-expert, from the sub-layer input: $h^{l}$ for A and $\bar h$ for B.
  2. The logits are converted via softmax to normalized scores: $p_i = \exp(z_i) / \sum_{j} \exp(z_j)$.
  3. Top-$k$ masking retains only the $k$ largest $p_i$, and these are re-normalized to sum to $1$.
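The gating steps above can be sketched as a batched NumPy function. The function name `top_k_gate` and the example logits are hypothetical; how the logits themselves are produced is left outside the sketch.

```python
import numpy as np

def top_k_gate(logits, k):
    """logits: array of shape (tokens, experts). Returns gates of the same shape."""
    # Step 2: numerically stable softmax over the expert axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Step 3: keep the k largest scores per token, zero the rest, renormalize.
    idx = np.argsort(p, axis=-1)[:, -k:]
    mask = np.zeros_like(p, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    gates = np.where(mask, p, 0.0)
    return gates / gates.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.5, 1.0, -1.0],
                   [0.0, 3.0, 0.0, 1.0]])
gates = top_k_gate(logits, k=2)
```

Renormalizing the retained softmax scores is equivalent to taking a softmax over only the selected logits, so for the first token the two surviving gates depend only on the logits 2.0 and 1.0.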

Each sub-layer includes a “shared” expert whose gate is always activated, providing a global pathway for information flow. To avoid bottlenecks and promote uniformly distributed routing, a load-balance regularizer penalizes deviation from an equal routing probability distribution across all sub-experts in a batch. The joint objective combines the standard cross-entropy loss with a weighted auxiliary load-balance loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \alpha\, \mathcal{L}_{\mathrm{balance}}$$

where $\alpha$ is the weight of the auxiliary term.
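One simple way to penalize deviation from uniform routing is sketched below. This is an illustrative instantiation, not necessarily the auxiliary loss used in the paper: the squared-deviation form, the function names, and the default weight of 0.01 are all assumptions made for the example.

```python
import numpy as np

def load_balance_penalty(probs):
    """probs: (tokens, experts) softmax routing probabilities for one sub-layer.

    Returns a scaled squared deviation of the batch-mean routing probability
    from the uniform value 1 / experts (an assumed form of the penalty).
    """
    n_experts = probs.shape[1]
    mean_prob = probs.mean(axis=0)
    return float(n_experts * np.sum((mean_prob - 1.0 / n_experts) ** 2))

def joint_loss(ce_loss, penalties, weight=0.01):
    # Cross-entropy plus the weighted sum of per-sub-layer balance penalties.
    return ce_loss + weight * sum(penalties)

uniform = np.full((6, 4), 0.25)             # perfectly balanced routing
collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (6, 1))  # every token picks expert 0

balanced_penalty = load_balance_penalty(uniform)
collapsed_penalty = load_balance_penalty(collapsed)
loss = joint_loss(2.0, [collapsed_penalty, balanced_penalty])
```

The penalty is zero when the batch routes uniformly and grows as routing collapses onto a few sub-experts, which is the qualitative behavior the regularizer is meant to encourage.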

4. Collective Matrix Factorization Motivation

Cartesian Product Routing draws direct formal inspiration from collective matrix factorization (CMF) [Singh & Gordon 2008]. In CMF, several observation matrices, e.g., $X$ and $Y$, are factorized using shared latent structures, e.g., $X \approx U V^\top$, $Y \approx V Z^\top$, so $V$ provides shared knowledge across “views.” In CartesianMoE, each full expert is decomposed into a product of two sub-experts, such that:

  • Every composite expert shares sub-expert $\mathrm{FFN}^A_i$ across all $(i, \cdot)$ pairs and $\mathrm{FFN}^B_j$ across all $(\cdot, j)$ pairs.
  • The always-on shared expert in each factor contributes to global knowledge transfer, while the possible combinations enable group-wise and expert-specific sharing.

By contrast, MoEs with flattened fine-grained experts only offer limited sharing unless explicit shared experts are present, resulting in isolated specialization and less cross-expert knowledge flow.
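For reference, a squared-loss instance of the CMF objective for two matrices that share one factor can be written as follows. The symbols $X$, $Y$, $U$, $V$, $Z$ are the conventional ones for this setting and are used here only for illustration; the analogy to sub-experts in the last line is a loose reading, not a formal equivalence.

```latex
% Two observation matrices X and Y share the latent factor V.
\min_{U, V, Z} \; \lVert X - U V^\top \rVert_F^2 + \lVert Y - V Z^\top \rVert_F^2

% Loose analogy in CartesianMoE: composite expert (i, j) reuses the same
% sub-expert FFN^A_i for every j, and the same FFN^B_j for every i.
E_{ij} = \mathrm{FFN}^B_j \circ \mathrm{FFN}^A_i
```

In both cases, a factor that participates in several products receives learning signal from all of them, which is the mechanism for knowledge sharing that the analogy is meant to convey.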

5. Empirical Findings and Comparative Performance

Empirical investigations on the Pile (100B tokens) and standard downstream NLP benchmarks demonstrate the following:

| Model | #Params (base / large) | #Activated Params (base / large) | Perplexity (base / large) |
|---|---|---|---|
| Dense | 162M / 468M | 162M / 468M | 8.55 / 6.95 |
| MoE (shared) | 842M / 2.88B | 247M / 770M | 7.37 / 6.13 |
| MoE (fine) | 842M / 2.88B | 247M / 770M | 7.33 / 6.16 |
| CartesianMoE | 842M / 2.88B | 247M / 770M | 7.19 / 6.08 |

  • On eight NLP benchmarks (HellaSwag, LAMBADA, PIQA, StoryCloze, Winogrande, TriviaQA, WebQuestions, NaturalQuestions), CartesianMoE attains the highest zero-/few-shot accuracy on 7/8 tasks in the base regime and 6/8 in the large regime.
  • Routing robustness is demonstrated by masking the top-1 routed expert at inference; under this masking, CartesianMoE’s perplexity increases less than that of flat fine-grained MoEs. This suggests improved resilience to routing errors, attributed to enhanced knowledge pathways.
  • Training for extended durations (400B tokens) and scaling up to 7.25B parameters (1.61B activated) preserves or increases CartesianMoE’s advantage, with lower perplexity and systematic improvement on all downstream tasks.
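The figures in the table above can be put in relative terms with a few lines of arithmetic. The numbers are copied from the table; the helper names `activated_fraction` and `relative_ppl_reduction` are hypothetical.

```python
# Values copied from the table above, stored as (base, large).
table = {
    "Dense":        {"params": (162e6, 468e6),  "active": (162e6, 468e6), "ppl": (8.55, 6.95)},
    "MoE (shared)": {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.37, 6.13)},
    "MoE (fine)":   {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.33, 6.16)},
    "CartesianMoE": {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.19, 6.08)},
}
BASE, LARGE = 0, 1

def activated_fraction(model, size):
    # Share of total parameters that are active for a token.
    row = table[model]
    return row["active"][size] / row["params"][size]

def relative_ppl_reduction(model, baseline, size):
    # Fractional perplexity reduction of `model` relative to `baseline`.
    base_ppl = table[baseline]["ppl"][size]
    return (base_ppl - table[model]["ppl"][size]) / base_ppl

for size, label in ((BASE, "base"), (LARGE, "large")):
    frac = activated_fraction("CartesianMoE", size)
    gain = relative_ppl_reduction("CartesianMoE", "MoE (fine)", size)
    print(f"{label}: {frac:.1%} of parameters active, "
          f"{gain:.1%} lower perplexity than MoE (fine)")
```

All three MoE variants activate the same fraction of parameters (roughly 29% at base scale and 27% at large scale), so the perplexity differences are not explained by extra per-token compute.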

6. Knowledge Sharing and Robustness Analysis

The multiplicative structure of Cartesian Product Routing yields several distinct knowledge-sharing mechanisms:

  • Global sharing: enforced by always-on shared experts in both sub-layers, ensuring universal information flow.
  • Group-wise sharing: every sub-expert in A (resp. B) is shared across all composite experts involving its row (resp. column) in the Cartesian product matrix, supporting modular transfer.
  • Expert-specific specialization: defined by the unique sequential composition of a selected pair $(i, j)$.
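The row and column sharing groups can be made concrete by enumerating the Cartesian product for a small, hypothetical configuration with four sub-experts per sub-layer. The labels (`A1`..`A4`, `B5`..`B8`) are illustrative only.

```python
from itertools import product

e = 4  # sub-experts per sub-layer (hypothetical size)
a_experts = [f"A{i}" for i in range(1, e + 1)]          # A1..A4
b_experts = [f"B{j}" for j in range(e + 1, 2 * e + 1)]  # B5..B8

# Every ordered pair (A_i, B_j) is one implicit composite expert.
composites = list(product(a_experts, b_experts))

# Group-wise sharing: A2 is reused by its whole row, B7 by its whole column.
row_group = [c for c in composites if c[0] == "A2"]
col_group = [c for c in composites if c[1] == "B7"]

# Expert-specific specialization: exactly one composite lies in both groups.
specific = [c for c in row_group if c in col_group]
```

With `2 * e = 8` stored sub-experts, the product yields `e * e = 16` composites; each sub-expert participates in `e = 4` of them, and any two groups from different sub-layers intersect in a single pair.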

This configuration mitigates the sensitivity of model predictions to routing decisions and fosters transferability and regularization across tasks and tokens. Compared to additive shared expert schemes, CartesianMoE’s multiplicative gating enables a more nuanced partition of knowledge, with evidence of greater robustness and parameter efficiency (Su et al., 2024).

7. Significance and Broader Implications

Cartesian Product Routing represents a structurally novel approach in MoE architectures, combining routing sparsity, compositionally induced capacity, and a grounding in collective matrix factorization. Its alignment with matrix factorization concepts offers theoretical insight into parameter sharing and latent factor modeling. The empirical improvements in perplexity, routing robustness, and downstream accuracy highlight its utility for large-scale language modeling under compute constraints. A plausible implication is broader applicability in other domains requiring scalable sparse expert selection with compositional structure.
