Cartesian Product Routing in MoE
- Cartesian Product Routing is a mechanism that factorizes expert computation into sequential sub-layers, enhancing knowledge sharing in MoE architectures.
- It decomposes the feed-forward network into two sub-layers with fine-grained and shared experts, achieving improved perplexity and downstream performance.
- Empirical evaluations demonstrate that CartesianMoE attains lower perplexity and superior task accuracy with increased routing robustness compared to traditional MoEs.
Cartesian Product Routing is a specialized expert routing and composition mechanism introduced in the context of Transformer-based Mixture-of-Experts (MoE) neural architectures. Its primary function is to enable more effective and robust knowledge sharing across experts in LLMs by factorizing the expert computation into a sequential, multiplication-style composition of sub-experts, drawing formal motivation from collective matrix factorization. This mechanism is embodied in the CartesianMoE architecture, which demonstrates empirically superior perplexity, downstream accuracy, and routing robustness compared to traditional MoE designs (Su et al., 2024).
1. Architectural Principles of Cartesian Product Routing
In CartesianMoE, each MoE layer replaces the standard Feed-Forward Network (FFN) in a Transformer block with a sequence of two sub-layers, each comprising a set of fine-grained sub-experts and a default global “shared expert.” The routing process proceeds as follows:
- The input hidden state $x$ is processed by sub-layer A, which consists of fine-grained FFN sub-experts $\{E^A_i\}$ plus an optional shared FFN. The router selects the top-$k$ sub-experts for weighted combination.
- The result, $h^A$, is added residually to the input and normalized via LayerNorm to obtain $\tilde{x} = \mathrm{LayerNorm}(x + h^A)$.
- Sub-layer B, with another set of FFN sub-experts $\{E^B_j\}$ (plus an optional shared FFN), receives $\tilde{x}$. Router $g^B$ again selects the top-$k$ sub-experts for combination to produce $h^B$.
- A final residual connection adds $h^B$ to the original input, and LayerNorm is applied to yield the output of the block.
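By routing each token to $k$ sub-experts in each sub-layer, the computation cost matches that of a baseline MoE activating $2k$ fine-grained experts, yet the sequential structure yields $k^2$ implicit composite experts per token, each realized as the composition $E^B_j \circ E^A_i$ (a sub-expert $E^A_i$ from sub-layer A followed by a sub-expert $E^B_j$ from sub-layer B). This forms the Cartesian product structure.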
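The routing steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the reference implementation: the linear router, the ReLU FFN experts, the LayerNorm without learned scale or bias, and all names (`make_sublayer`, `cartesian_moe_forward`, the sizes `d`, `hidden`, `n_experts`, `k`) are hypothetical choices made for the example.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/bias (simplifying assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def top_k_gate(logits, k):
    """Softmax, keep the k largest scores, renormalize; returns {index: weight}."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    kept = sum(scores[i] for i in keep)
    return {i: scores[i] / kept for i in keep}

def make_ffn(rng, d, hidden):
    """A small two-layer ReLU FFN standing in for one (sub-)expert."""
    W1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    W2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
    return lambda x: matvec(W2, [max(0.0, v) for v in matvec(W1, x)])

def make_sublayer(rng, d, hidden, n_experts):
    return {
        # Linear router: one row of weights per sub-expert (assumption).
        "router": [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(n_experts)],
        "experts": [make_ffn(rng, d, hidden) for _ in range(n_experts)],
        "shared": make_ffn(rng, d, hidden),
    }

def sublayer_forward(layer, x, k):
    gates = top_k_gate(matvec(layer["router"], x), k)
    out = layer["shared"](x)  # always-on shared expert
    for i, g in gates.items():
        out = [o + g * v for o, v in zip(out, layer["experts"][i](x))]
    return out, gates

def cartesian_moe_forward(layer_a, layer_b, x, k):
    h_a, gates_a = sublayer_forward(layer_a, x, k)
    x_mid = layer_norm([xi + hi for xi, hi in zip(x, h_a)])
    h_b, gates_b = sublayer_forward(layer_b, x_mid, k)
    # The final residual is taken from the original input, as the text describes.
    y = layer_norm([xi + hi for xi, hi in zip(x, h_b)])
    return y, gates_a, gates_b

rng = random.Random(0)
d, hidden, n_experts, k = 8, 16, 4, 2
layer_a = make_sublayer(rng, d, hidden, n_experts)
layer_b = make_sublayer(rng, d, hidden, n_experts)
x = [rng.gauss(0, 1) for _ in range(d)]
y, gates_a, gates_b = cartesian_moe_forward(layer_a, layer_b, x, k)
```

Each token here evaluates `k` routed sub-experts per sub-layer, so `2 * k` routed FFN calls in total, while the pairs of selected indices in `gates_a` and `gates_b` span `k * k` compositions.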
2. Mathematical Formulation and Composite-Expert View
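The transformation across sub-layers can be formalized as follows, writing $x$ for the input hidden state, $E^A_i$ and $E^B_j$ for the routed sub-experts, $E^A_s$ and $E^B_s$ for the shared experts, and $g^A_i$, $g^B_j$ for the gate weights (zero outside the top-$k$):
- Sub-layer A routing:
$$h^A = E^A_s(x) + \sum_{i} g^A_i(x)\, E^A_i(x)$$
$$\tilde{x} = \mathrm{LayerNorm}(x + h^A)$$
- Sub-layer B routing:
$$h^B = E^B_s(\tilde{x}) + \sum_{j} g^B_j(\tilde{x})\, E^B_j(\tilde{x})$$
$$y = \mathrm{LayerNorm}(x + h^B)$$
- Full composite mapping:
$$y = \mathrm{LayerNorm}\Big(x + E^B_s(\tilde{x}) + \sum_{j} g^B_j(\tilde{x})\, E^B_j(\tilde{x})\Big), \quad \tilde{x} = \mathrm{LayerNorm}\Big(x + E^A_s(x) + \sum_{i} g^A_i(x)\, E^A_i(x)\Big)$$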
Each composite expert thus corresponds to one ordered pair $(i, j)$ drawn from the sub-expert sets $\{E^A_i\} \times \{E^B_j\}$, with mixture weight $g^A_i\, g^B_j$, realizing a “multiplicative” combination rather than the additive blending typical of previous shared-expert methods.
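A small worked example may make the composite-expert view concrete. The gate values below are made up for illustration; the point is only that the products of the per-sub-layer gate weights form a distribution over ordered pairs $(i, j)$. Because the experts and LayerNorm are nonlinear, this should be read as an interpretation of the routing, not as an exact decomposition of the layer output.

```python
from itertools import product

# Hypothetical renormalized top-2 gate weights for one token.
gates_a = {0: 0.7, 2: 0.3}  # sub-layer A: sub-expert index -> weight
gates_b = {1: 0.6, 3: 0.4}  # sub-layer B: sub-expert index -> weight

# One implicit composite expert per ordered pair (i, j), weighted by the product.
pair_weights = {
    (i, j): ga * gb
    for (i, ga), (j, gb) in product(gates_a.items(), gates_b.items())
}

for pair, weight in sorted(pair_weights.items()):
    print(pair, round(weight, 2))
```

With two active sub-experts per sub-layer, four composite pairs receive nonzero weight, and the weights sum to 1 because each factor does.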
3. Routing (Gating) Mechanisms and Regularization
The gating functions $g^A$ and $g^B$ govern expert selection in sub-layers A and B, respectively. Each gating function operates as follows:
- The router computes one logit per sub-expert from the sub-layer input: $z^{(\ell)}_i$, with $\ell \in \{A, B\}$.
- The logits are converted via softmax to normalized scores: $s^{(\ell)}_i = \exp(z^{(\ell)}_i) / \sum_{j} \exp(z^{(\ell)}_j)$.
- Top-$k$ masking retains only the $k$ largest $s^{(\ell)}_i$, and these are re-normalized to sum to $1$.
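The three gating steps can be sketched as a single function. This is an illustrative sketch; the function name `top_k_gate` and the example logits are assumptions, and how the logits are produced is left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Softmax the logits, keep the k largest scores, renormalize them to sum to 1."""
    scores = softmax(logits)
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    kept_total = sum(scores[i] for i in keep)
    gates = [0.0] * len(scores)
    for i in keep:
        gates[i] = scores[i] / kept_total
    return gates

# Four sub-experts, two of which are selected for this token.
gates = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
```

Sub-experts outside the top-$k$ receive a gate of exactly zero, so their FFNs need not be evaluated.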
Each sub-layer includes a “shared” expert whose gate ($g_s = 1$) is always activated, providing a global pathway for information flow. To avoid bottlenecking and promote uniformly distributed routing, load-balance regularization is introduced, penalizing the deviation from equal routing probability distributions across all sub-experts in a batch. The joint objective combines the standard cross-entropy loss with an auxiliary load-balance loss weighted by a coefficient $\alpha$:
$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \alpha\, \mathcal{L}_{\mathrm{balance}}$$
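The text does not give the exact form of the load-balance term, so the sketch below substitutes a common Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of routed-token fraction times mean router probability). It is an assumption for illustration, not necessarily the loss used in CartesianMoE; the function names and the default `alpha=0.01` are likewise hypothetical.

```python
def load_balance_loss(scores_per_token, k):
    """Switch-style auxiliary loss; equals 1.0 when routing is perfectly uniform."""
    n_tokens = len(scores_per_token)
    n_experts = len(scores_per_token[0])
    counts = [0] * n_experts        # how often each expert is selected
    mean_prob = [0.0] * n_experts   # mean softmax score per expert
    for scores in scores_per_token:
        top = sorted(range(n_experts), key=scores.__getitem__, reverse=True)[:k]
        for i in top:
            counts[i] += 1
        for i, s in enumerate(scores):
            mean_prob[i] += s / n_tokens
    frac = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

def total_loss(ce_loss, balance_loss, alpha=0.01):
    """Joint objective: cross-entropy plus weighted auxiliary term."""
    return ce_loss + alpha * balance_loss

# Router scores for four tokens over four sub-experts (made-up values).
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers sub-expert 0

loss_balanced = load_balance_loss(balanced, k=1)
loss_collapsed = load_balance_loss(collapsed, k=1)
```

Under this assumed form, the collapsed routing pattern is penalized more heavily than the balanced one, which is the behavior the regularizer is meant to encourage. In a two-sub-layer design, such a term would presumably be computed for each router.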
4. Collective Matrix Factorization Motivation
Cartesian Product Routing draws direct formal inspiration from collective matrix factorization (CMF) [Singh & Gordon 2008]. In CMF, several observation matrices $X_1, X_2, \dots$ are factorized using shared latent structures, e.g., $X_1 \approx U V_1^\top$, $X_2 \approx U V_2^\top$, so $U$ provides shared knowledge across “views.” In CartesianMoE, each full expert is decomposed into a product of two sub-experts, such that:
- Every sub-expert $E^A_i$ is shared across all pairs $(i, j)$ over $j$, and every sub-expert $E^B_j$ is shared across all pairs $(i, j)$ over $i$.
- The always-on shared expert in each factor contributes to global knowledge transfer, while the possible combinations enable group-wise and expert-specific sharing.
By contrast, MoEs with flattened fine-grained experts only offer limited sharing unless explicit shared experts are present, resulting in isolated specialization and less cross-expert knowledge flow.
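The sharing pattern can be made explicit by enumerating the Cartesian product. The count `n_sub = 4` is a hypothetical value chosen for the example.

```python
from itertools import product

n_sub = 4  # hypothetical number of routed sub-experts per sub-layer

# Every ordered pair (i, j) names one implicit composite expert.
composites = list(product(range(n_sub), range(n_sub)))

# How many composite experts reuse each sub-expert of A and of B.
reuse_a = {i: sum(1 for a, _ in composites if a == i) for i in range(n_sub)}
reuse_b = {j: sum(1 for _, b in composites if b == j) for j in range(n_sub)}

print(len(composites), "composite experts from", 2 * n_sub, "sub-experts")
print("reuse of each A sub-expert:", reuse_a)
print("reuse of each B sub-expert:", reuse_b)
```

Here 8 sub-experts define 16 composite experts, and each sub-expert is reused by 4 of them, whereas a flat MoE with the same 8 experts has no such overlap between experts.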
5. Empirical Findings and Comparative Performance
Empirical investigations on the Pile (100B tokens) and standard downstream NLP benchmarks demonstrate the following:
| Model | #Params (base/large) | #Activated Params | Perplexity (base/large) |
|---|---|---|---|
| Dense | 162M / 468M | 162M / 468M | 8.55 / 6.95 |
| MoE (shared) | 842M / 2.88B | 247M / 770M | 7.37 / 6.13 |
| MoE (fine) | 842M / 2.88B | 247M / 770M | 7.33 / 6.16 |
| CartesianMoE | 842M / 2.88B | 247M / 770M | 7.19 / 6.08 |
- On eight NLP benchmarks (HellaSwag, LAMBADA, PIQA, StoryCloze, Winogrande, TriviaQA, WebQuestions, NaturalQuestions), CartesianMoE attains the highest zero-/few-shot accuracy on 7/8 tasks in the base regime and 6/8 in the large regime.
- Routing robustness is demonstrated by masking the top-1 routed expert at inference; CartesianMoE’s perplexity increases by a smaller margin than that of flat fine-grained MoEs. This suggests improved resilience to routing errors, attributed to enhanced knowledge pathways.
- Training for extended durations (400B tokens) and scaling up to 7.25B parameters (1.61B activated) preserves or increases CartesianMoE’s advantage, with lower perplexity and systematic improvement on all downstream tasks.
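The masking probe in the robustness experiment can be simulated at the gate level. The sketch below is one plausible way to do it, assuming masking means excluding the highest-scoring sub-expert before top-$k$ selection; the source does not specify the exact procedure, and the helper names are hypothetical.

```python
import math

def top_k_gate(logits, k):
    """Softmax over logits, keep the k largest scores, renormalize to sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    scores = [e / total for e in exps]
    keep = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    kept = sum(scores[i] for i in keep)
    return [scores[i] / kept if i in keep else 0.0 for i in range(len(scores))]

def mask_top1(logits):
    """Return a copy of the logits with the best-scoring expert disabled."""
    top = max(range(len(logits)), key=logits.__getitem__)
    masked = list(logits)
    masked[top] = -math.inf
    return masked, top

logits = [2.0, 0.5, 1.0, -1.0]
normal = top_k_gate(logits, k=2)
masked_logits, dropped = mask_top1(logits)
perturbed = top_k_gate(masked_logits, k=2)
```

Comparing model perplexity under `normal` and `perturbed` gates, for every token and MoE layer, would give the kind of degradation measurement the bullet above describes.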
6. Knowledge Sharing and Robustness Analysis
The multiplicative structure of Cartesian Product Routing yields several distinct knowledge-sharing mechanisms:
- Global sharing: enforced by always-on shared experts in both sub-layers, ensuring universal information flow.
- Group-wise sharing: every sub-expert in A (resp. B) is shared across all composite experts involving its row (resp. column) in the Cartesian product matrix, supporting modular transfer.
- Expert-specific specialization: defined by the unique sequential composition of a selected pair $(i, j)$.
This configuration mitigates the sensitivity of model predictions to routing decisions and fosters transferability and regularization across tasks and tokens. Compared to additive shared expert schemes, CartesianMoE’s multiplicative gating enables a more nuanced partition of knowledge, with evidence of greater robustness and parameter efficiency (Su et al., 2024).
7. Significance and Broader Implications
Cartesian Product Routing represents a structurally novel approach in MoE architectures, combining routing sparsity, compositionally induced capacity, and a grounding in collective matrix factorization. Its alignment with matrix factorization concepts offers theoretical insight into parameter sharing and latent factor modeling. The empirical improvements in perplexity, routing robustness, and downstream accuracy highlight its utility for large-scale language modeling under compute constraints. A plausible implication is broader applicability in other domains requiring scalable sparse expert selection with compositional structure.