Cartesian Product Routing in MoE

Updated 18 April 2026

Cartesian Product Routing is a mechanism that factorizes expert computation into sequential sub-layers, enhancing knowledge sharing in MoE architectures.
It decomposes the feed-forward network into two sub-layers with fine-grained and shared experts, achieving improved perplexity and downstream performance.
Empirical evaluations demonstrate that CartesianMoE attains lower perplexity and superior task accuracy with increased routing robustness compared to traditional MoEs.

Cartesian Product Routing is a specialized expert routing and composition mechanism introduced in the context of Transformer-based Mixture-of-Experts (MoE) neural architectures. Its primary function is to enable more effective and robust knowledge sharing across experts in LLMs by factorizing the expert computation into a sequential, multiplication-style composition of sub-experts, drawing formal motivation from collective matrix factorization. This mechanism is embodied in the CartesianMoE architecture, which demonstrates empirically superior perplexity, downstream accuracy, and routing robustness compared to traditional MoE designs (Su et al., 2024).

1. Architectural Principles of Cartesian Product Routing

In CartesianMoE, each MoE layer replaces the standard Feed-Forward Network (FFN) in a Transformer block with a sequence of two sub-layers, each comprising a set of fine-grained sub-experts and a default global “shared expert.” The routing process proceeds as follows:

Input hidden state $h^{l} \in \mathbb{R}^d$ is processed by sub-layer A, consisting of $e$ fine-grained FFN sub-experts $\{\mathrm{FFN}^A_1, \dots, \mathrm{FFN}^A_e\}$, plus an optional shared FFN. The router $r_1$ selects the top-$k$ sub-experts for weighted combination.
The result, $\hat h$, is summed residually with the input and normalized to obtain $\bar h$ via LayerNorm.
Sub-layer B, with another set of $e$ FFN sub-experts $\{\mathrm{FFN}^B_{e+1}, \dots, \mathrm{FFN}^B_{2e}\}$ (plus an optional shared FFN), receives $\bar h$. Router $r_2$ again selects the top-$k$ sub-experts for combination to produce $\tilde h$.
A final residual adds $\tilde h$ to the original input and applies LayerNorm to yield the output of the block.

By using $k$ routed experts in each sub-layer, the computation cost matches the baseline MoE with $2k$ activated fine-grained experts, yet the sequential structure yields $k^2$ implicit composite experts per token (out of $e^2$ possible pairs), each realized as the composition $\mathrm{FFN}^B_j \circ \mathrm{FFN}^A_i$. This forms the Cartesian product structure.
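The counting behind the Cartesian product structure can be checked with a short sketch. The function name `composite_expert_counts` and the values `e = 4`, `k = 2` are illustrative assumptions, not settings taken from the paper, and shared experts are left out of the count.

```python
from itertools import product

def composite_expert_counts(e: int, k: int) -> dict:
    """Count sub-experts and implied composite experts for one layer.

    e: routed sub-experts per sub-layer (shared experts ignored here).
    k: routed sub-experts selected per token in each sub-layer.
    """
    sub_a = [f"A{i}" for i in range(1, e + 1)]
    sub_b = [f"B{j}" for j in range(e + 1, 2 * e + 1)]
    all_pairs = list(product(sub_a, sub_b))  # every (A_i, B_j) composite
    return {
        "stored_sub_experts": len(sub_a) + len(sub_b),  # 2e
        "composite_experts": len(all_pairs),            # e * e
        "activated_sub_experts_per_token": 2 * k,       # k in A plus k in B
        "activated_composites_per_token": k * k,        # k choices in A times k in B
    }

counts = composite_expert_counts(e=4, k=2)
print(counts)
```

With these toy values, 8 stored sub-experts imply 16 composite experts, and each token touches 4 sub-experts while activating 4 composite pairs.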

2. Mathematical Formulation and Composite-Expert View

The transformation across sub-layers can be formalized as follows, where $g^A_i$ and $g^B_j$ denote the gate weights produced by routers $r_1$ and $r_2$ (zero for unselected sub-experts) and shared-expert terms are omitted for brevity:

Sub-layer A routing:

$\hat h = \sum_{i=1}^{e} g^A_i(h^l)\,\mathrm{FFN}^A_i(h^l)$

$\bar h = \mathrm{LayerNorm}(h^l + \hat h)$

Sub-layer B routing:

$\tilde h = \sum_{j=e+1}^{2e} g^B_j(\bar h)\,\mathrm{FFN}^B_j(\bar h)$

$h^{l+1} = \mathrm{LayerNorm}(h^l + \tilde h)$

Full composite mapping:

$\tilde h = \sum_{j=e+1}^{2e} g^B_j(\bar h)\,\mathrm{FFN}^B_j\left(\mathrm{LayerNorm}\left(h^l + \sum_{i=1}^{e} g^A_i(h^l)\,\mathrm{FFN}^A_i(h^l)\right)\right)$

Each composite expert thus corresponds to one ordered pair $(i, j)$ drawn from sub-expert sets $\{\mathrm{FFN}^A_i\} \times \{\mathrm{FFN}^B_j\}$, with mixture weight $g^A_i(h^l)\,g^B_j(\bar h)$, realizing a “multiplicative” combination rather than the additive blending typical of previous shared expert methods.
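A minimal sketch of the two-sub-layer forward pass, in plain Python, may help make the composite-expert view concrete. Everything named here (`cartesian_block`, `toy_expert`, the router matrices, the toy input) is a hypothetical illustration rather than the authors' implementation. It assumes LayerNorm without learned parameters, omits the shared experts, uses 0-based indices for both sub-layers, and takes the final residual to the block input as described in Section 1.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift (kept minimal for the sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gates_top_k(x, router, k):
    """Softmax over router logits, keep the k largest scores, renormalize."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in router]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    denom = sum(exps)
    scores = [v / denom for v in exps]
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept_mass = sum(scores[i] for i in keep)
    return {i: scores[i] / kept_mass for i in keep}

def routed_sum(x, experts, gates):
    """Gate-weighted sum of the selected sub-experts' outputs."""
    out = [0.0] * len(x)
    for i, g in gates.items():
        out = [o + g * y for o, y in zip(out, experts[i](x))]
    return out

def cartesian_block(h, experts_a, router_a, experts_b, router_b, k):
    gates_a = gates_top_k(h, router_a, k)
    h_hat = routed_sum(h, experts_a, gates_a)
    h_bar = layer_norm([a + b for a, b in zip(h, h_hat)])
    gates_b = gates_top_k(h_bar, router_b, k)
    h_tilde = routed_sum(h_bar, experts_b, gates_b)
    out = layer_norm([a + b for a, b in zip(h, h_tilde)])
    # Mixture weight of each activated composite expert (i, j).
    pair_weights = {(i, j): ga * gb
                    for i, ga in gates_a.items()
                    for j, gb in gates_b.items()}
    return out, pair_weights

def toy_expert(scale):
    """Stand-in for an FFN sub-expert: an elementwise tanh with a fixed scale."""
    return lambda x: [math.tanh(scale * v) for v in x]

experts_a = [toy_expert(s) for s in (0.5, 1.0, 1.5)]
experts_b = [toy_expert(s) for s in (-0.5, 2.0, 0.25)]
router_a = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
router_b = [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
h = [0.2, -1.0, 0.7, 0.4]

out, pair_weights = cartesian_block(h, experts_a, router_a, experts_b, router_b, k=2)
print(out)
print(pair_weights)
```

With `k=2` the block activates four ordered pairs, and their product weights sum to one because each sub-layer's gates are renormalized separately.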

3. Routing (Gating) Mechanisms and Regularization

The gating functions $g^A$ and $g^B$ govern expert selection in sub-layers A and B, respectively. Each gating function operates on its sub-layer input $x$ ($h^l$ for A, $\bar h$ for B) as follows:

The router computes logits: $z = W x$, with $W \in \{W^A, W^B\}$ for A and B.
The logits are converted via softmax to normalized scores: $s_i = \exp(z_i) / \sum_{j} \exp(z_j)$.
Top-$k$ masking retains only the $k$ largest $s_i$, and these are re-normalized to sum to $1$.
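The three gating steps can be traced on concrete numbers. The sketch below is illustrative only: the function name `top_k_gate` and the example logits are assumptions, and the logits are passed in directly rather than computed from a weight matrix.

```python
import math

def top_k_gate(logits, k):
    """Softmax the logits, keep the k largest scores, renormalize them."""
    m = max(logits)                                # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    scores = [v / total for v in exps]             # normalized scores s_i
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept_mass = sum(scores[i] for i in keep)
    gates = [0.0] * len(scores)                    # unselected experts get gate 0
    for i in keep:
        gates[i] = scores[i] / kept_mass           # selected gates sum to 1
    return scores, gates

scores, gates = top_k_gate([2.0, 1.0, 0.1, -1.0], k=2)
print(scores)
print(gates)
```

Renormalizing the surviving softmax scores gives the same result as a softmax restricted to the selected logits, so the two retained gates here depend only on the logit gap of 1.0 between the top two experts.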

Each sub-layer includes a “shared” expert whose gate ($g_{\mathrm{shared}} = 1$) is always activated, providing a global pathway for information flow. To avoid bottlenecking and promote uniformly distributed routing, load-balance regularization is introduced, penalizing deviation from an equal routing probability distribution across all sub-experts in a batch. The joint objective combines the standard cross-entropy loss with an auxiliary load-balance loss weighted by a coefficient $\alpha$:

$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\mathrm{balance}}$
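The text does not spell out the exact form of the load-balance term, so the sketch below substitutes a commonly used auxiliary loss of the Switch Transformer style: the number of experts times the sum over experts of (fraction of routing slots assigned) times (mean router score). This form, the function names, and the default `alpha=0.01` are assumptions for illustration, not the paper's definition.

```python
def load_balance_loss(scores_per_token, k):
    """Assumed auxiliary loss: n_experts * sum_i f_i * p_i.

    f_i: fraction of top-k routing slots in the batch assigned to expert i.
    p_i: mean softmax score of expert i over the batch.
    """
    n_tokens = len(scores_per_token)
    n_experts = len(scores_per_token[0])
    frac_routed = [0.0] * n_experts
    mean_score = [0.0] * n_experts
    for scores in scores_per_token:
        top = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)[:k]
        for i in top:
            frac_routed[i] += 1.0 / (n_tokens * k)
        for i, s in enumerate(scores):
            mean_score[i] += s / n_tokens
    return n_experts * sum(f * p for f, p in zip(frac_routed, mean_score))

def total_loss(ce_loss, balance_loss, alpha=0.01):
    """Joint objective: cross-entropy plus weighted load-balance loss."""
    return ce_loss + alpha * balance_loss

# Each token prefers a different expert (balanced) vs. all prefer expert 0.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = load_balance_loss(balanced, k=1)
loss_collapsed = load_balance_loss(collapsed, k=1)
print(loss_balanced, loss_collapsed)
print(total_loss(2.0, loss_collapsed))
```

Under this assumed form, evenly spread routing gives a loss of about 1, while routing every token to the same expert raises it to about 2.8, which is the behaviour a balance penalty is meant to have.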

4. Collective Matrix Factorization Motivation

Cartesian Product Routing draws direct formal inspiration from collective matrix factorization (CMF) [Singh & Gordon 2008]. In CMF, several observation matrices such as $X$ and $Y$ are factorized using shared latent structures, e.g., $X \approx U V^\top$, $Y \approx V Z^\top$, so $V$ provides shared knowledge across “views.” In CartesianMoE, each full expert is decomposed into a product of two sub-experts, such that:

Every composite expert $(i, j)$ shares sub-expert $\mathrm{FFN}^A_i$ across all $(i, \cdot)$ pairs and $\mathrm{FFN}^B_j$ across all $(\cdot, j)$ pairs.
The always-on shared expert in each factor contributes to global knowledge transfer, while the possible combinations enable group-wise and expert-specific sharing.

By contrast, MoEs with flattened fine-grained experts only offer limited sharing unless explicit shared experts are present, resulting in isolated specialization and less cross-expert knowledge flow.

5. Empirical Findings and Comparative Performance

Empirical investigations on the Pile (100B tokens) and standard downstream NLP benchmarks demonstrate the following:

Model	#Params (base/large)	#Activated Params	Perplexity (base/large)
Dense	162M / 468M	162M / 468M	8.55 / 6.95
MoE (shared)	842M / 2.88B	247M / 770M	7.37 / 6.13
MoE (fine)	842M / 2.88B	247M / 770M	7.33 / 6.16
CartesianMoE	842M / 2.88B	247M / 770M	7.19 / 6.08
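A few derived quantities make the table easier to read. The sketch below only re-expresses the figures from the table; the dictionary layout and helper names are illustrative assumptions.

```python
# Values copied from the table; index 0 is the base model, 1 the large model.
results = {
    "Dense":        {"params": (162e6, 468e6),  "active": (162e6, 468e6), "ppl": (8.55, 6.95)},
    "MoE (shared)": {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.37, 6.13)},
    "MoE (fine)":   {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.33, 6.16)},
    "CartesianMoE": {"params": (842e6, 2.88e9), "active": (247e6, 770e6), "ppl": (7.19, 6.08)},
}

def activated_fraction(model, size):
    """Share of total parameters that are active per token."""
    row = results[model]
    return row["active"][size] / row["params"][size]

def relative_ppl_reduction(model, baseline, size):
    """Relative perplexity reduction of `model` against `baseline`."""
    base = results[baseline]["ppl"][size]
    return (base - results[model]["ppl"][size]) / base

for size, name in enumerate(("base", "large")):
    print(name,
          round(activated_fraction("CartesianMoE", size), 3),
          round(relative_ppl_reduction("CartesianMoE", "MoE (fine)", size), 4))
```

All three MoE variants activate the same share of their parameters, so the perplexity differences in the table are not explained by extra per-token compute.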

On eight NLP benchmarks (HellaSwag, LAMBADA, PIQA, StoryCloze, Winogrande, TriviaQA, WebQuestions, NaturalQuestions), CartesianMoE attains the highest zero-/few-shot accuracy on 7/8 tasks in the base regime and 6/8 in the large regime.
Routing robustness is demonstrated by masking the top-1 routed expert at inference; CartesianMoE’s perplexity increases by a smaller margin than that of flat fine-grained MoEs. This suggests improved resilience to routing errors, attributed to enhanced knowledge pathways.
Training for extended durations (400B tokens) and scaling up to 7.25B parameters (1.61B activated) preserves or increases CartesianMoE’s advantage, with lower perplexity and systematic improvement on all downstream tasks.
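The masking probe used for the robustness result can be sketched as follows. The source states only that the top-1 routed expert is masked at inference; re-selecting `k` experts from the remaining candidates and renormalizing their gates is an assumption of this sketch, as are the function names and the example scores.

```python
def route(scores, k, masked=()):
    """Pick the k highest-scoring experts not in `masked`; renormalize gates."""
    candidates = [i for i in range(len(scores)) if i not in masked]
    keep = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    kept_mass = sum(scores[i] for i in keep)
    return {i: scores[i] / kept_mass for i in keep}

def route_with_top1_masked(scores, k):
    """Drop the single best-scoring expert, then route among the rest."""
    top1 = max(range(len(scores)), key=lambda i: scores[i])
    return top1, route(scores, k, masked={top1})

scores = [0.5, 0.3, 0.15, 0.05]
normal_gates = route(scores, k=2)
top1, masked_gates = route_with_top1_masked(scores, k=2)
print(normal_gates)
print(top1, masked_gates)
```

Comparing perplexity under `normal_gates` and `masked_gates` style routing is what measures how much a model depends on its first-choice expert.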

6. Knowledge Sharing and Robustness Analysis

The multiplicative structure of Cartesian Product Routing yields several distinct knowledge-sharing mechanisms:

Global sharing: enforced by always-on shared experts in both sub-layers, ensuring universal information flow.
Group-wise sharing: every sub-expert in A (resp. B) is shared across all composite experts involving its row (resp. column) in the Cartesian product matrix, supporting modular transfer.
Expert-specific specialization: defined by the unique sequential composition of a selected pair $(\mathrm{FFN}^A_i, \mathrm{FFN}^B_j)$.
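Group-wise sharing can be read off the grid of composite experts: fixing a sub-expert in A selects a row, and fixing one in B selects a column. The helper below is a hypothetical illustration that follows the article's indexing (A uses 1..e, B uses e+1..2e) with a toy value of `e = 3`.

```python
def composites_sharing(side, index, e):
    """List the composite experts (i, j) that reuse one sub-expert.

    side: "A" or "B"; index: the sub-expert's index in the article's numbering.
    """
    if side == "A" and 1 <= index <= e:
        return [(index, j) for j in range(e + 1, 2 * e + 1)]  # one row
    if side == "B" and e + 1 <= index <= 2 * e:
        return [(i, index) for i in range(1, e + 1)]          # one column
    raise ValueError("index does not belong to the given sub-layer")

e = 3
row = composites_sharing("A", 2, e)     # composites built on FFN^A_2
column = composites_sharing("B", 5, e)  # composites built on FFN^B_5
specific = set(row) & set(column)       # the single pair using both
print(row)
print(column)
print(specific)
```

Each sub-expert is reused by `e` composite experts, while any row and column intersect in exactly one pair, which is the expert-specific part of the structure.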

This configuration mitigates the sensitivity of model predictions to routing decisions and fosters transferability and regularization across tasks and tokens. Compared to additive shared expert schemes, CartesianMoE’s multiplicative gating enables a more nuanced partition of knowledge, with evidence of greater robustness and parameter efficiency (Su et al., 2024).

7. Significance and Broader Implications

Cartesian Product Routing represents a structurally novel approach in MoE architectures, combining routing sparsity, compositionally induced capacity, and a grounding in collective matrix factorization. Its alignment with matrix factorization concepts offers theoretical insight into parameter sharing and latent factor modeling. The empirical improvements in perplexity, routing robustness, and downstream accuracy highlight its utility for large-scale language modeling under compute constraints. A plausible implication is broader applicability in other domains requiring scalable sparse expert selection with compositional structure.

References

Su et al. (2024). CartesianMoE: Boosting Knowledge Sharing among Experts via Cartesian Product Routing in Mixture-of-Experts.
