CartesianMoE: Cartesian Routing in MoE
- CartesianMoE is a Mixture-of-Experts architecture that partitions experts into two sub-pools and uses Cartesian product routing for enhanced group-wise parameter sharing.
- It mitigates routing noise through a multiplicative fusion of sub-expert pools, leading to lower perplexity and improved downstream accuracy compared to conventional MoE methods.
- Empirical evaluations show significant improvements in perplexity and task performance across diverse model scales, highlighting its robustness and scalability in large language models.
CartesianMoE is a Mixture-of-Experts (MoE) architecture designed to achieve enhanced knowledge sharing, improved robustness to routing noise, and superior performance in LLMs by employing a Cartesian product routing mechanism inspired by collective matrix factorization. In contrast to conventional MoE approaches that aggregate expert outputs additively and often result in isolated expert learning, CartesianMoE leverages a multiplicative fusion of sub-expert pools, enabling group-wise parameter sharing between experts and leading to substantial empirical gains in perplexity and downstream tasks (Su et al., 2024).
1. Limitations of Conventional Mixture-of-Experts Methods
Standard MoE architectures deploy independent expert feedforward networks (FFN), with each expert trained in isolation. This independence inhibits knowledge sharing; as a result, the token representations processed by an MoE layer are primarily determined by the small subset of experts selected by the routing network. Fluctuations in routing—termed "routing noise"—directly affect outputs, raising model sensitivity to routing errors.
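Earlier mitigation approaches, such as DeepSpeed-MoE and DeepSeekMoE, introduce shared experts whose activations are added to the outputs of the dynamically routed experts:
$$y = E_{\mathrm{shared}}(x) + \sum_{i \in \mathcal{T}} g_i(x)\, E_i(x)$$
where $\mathcal{T}$ is the set of experts selected by the router and $g_i(x)$ is the gating score of expert $E_i$. While this additive fusion delivers a global component, its sharing is coarse-grained: every routed expert receives the identical shared component, which precludes more differentiated, group-wise sharing schemes.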
2. Cartesian-Product Routing: Formal Definition and Motivation
Motivated by collective matrix factorization, CartesianMoE partitions the expert pool into two equal sub-pools, denoted A and B, each comprising $n$ sub-experts. The effective "expert" at inference or training time corresponds to a unique pairing of one sub-expert from A and one from B (an element of the Cartesian product $A \times B$), producing $n^2$ distinct virtual experts. This structure permits parameter and representational sharing across virtual experts that have a sub-expert in common.
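The pairing can be illustrated with a few lines of Python. This is only a sketch: the pool size of 4 and the labels `A0`, `B0`, and so on are hypothetical and not taken from the paper.

```python
from itertools import product

n = 4  # hypothetical number of sub-experts per pool
pool_a = [f"A{i}" for i in range(n)]
pool_b = [f"B{j}" for j in range(n)]

# Each virtual expert is one (A, B) pairing from the Cartesian product.
virtual_experts = list(product(pool_a, pool_b))

# Virtual experts that reuse sub-expert A0 and therefore share its parameters.
share_a0 = [pair for pair in virtual_experts if pair[0] == "A0"]

print(len(virtual_experts))  # 16 pairings from 2 * 4 = 8 sub-experts
print(share_a0)
```

With these sizes, 8 sub-experts yield 16 virtual experts, and every sub-experts's parameters are reused by 4 of them, which is the group-wise sharing described above.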
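A single Cartesian Product Layer (CPL) processes inputs as follows:
- Sub-layer A Routing:
$$h^{A} = \sum_{i \in \mathcal{T}^{A}} g^{A}_{i}(x)\, E^{A}_{i}(x)$$
where $x$ is the input, $\mathcal{T}^{A}$ is the set of sub-experts selected by the router of sub-layer A, and $g^{A}_{i}(x)$ is the gating score for sub-expert $E^{A}_{i}$.
- Residual Update and Sub-layer B Routing:
$$x' = x + h^{A}, \qquad h^{B} = \sum_{j \in \mathcal{T}^{B}} g^{B}_{j}(x')\, E^{B}_{j}(x')$$
- Final Output:
$$y = x' + h^{B}$$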
This construction associates each computation path through CPL with a particular (A, B) pair, achieving knowledge sharing through multiplicative residual composition.
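The two-stage computation can be sketched in NumPy for a single token. This is a minimal illustration under stated assumptions, not the authors' implementation: the ReLU feed-forward sub-experts, the helper names (`ffn`, `moe_sublayer`, `cartesian_product_layer`), the top-1 routing, and all sizes are hypothetical, and shared experts and normalization are omitted.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Small ReLU feed-forward network standing in for one sub-expert."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def moe_sublayer(x, experts, w_gate, top_k):
    """Sparse MoE sub-layer: gate-weighted sum over the top-k sub-experts."""
    logits = x @ w_gate
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                        # softmax over sub-experts
    chosen = np.argsort(gates)[::-1][:top_k]    # top-k indices, best first
    out = sum(gates[i] * ffn(x, *experts[i]) for i in chosen)
    return out, tuple(int(i) for i in chosen)

def cartesian_product_layer(x, pool_a, gate_a, pool_b, gate_b, top_k=1):
    h_a, picked_a = moe_sublayer(x, pool_a, gate_a, top_k)
    x_mid = x + h_a                             # residual update after sub-layer A
    h_b, picked_b = moe_sublayer(x_mid, pool_b, gate_b, top_k)
    return x_mid + h_b, (picked_a, picked_b)    # output and the (A, B) path taken

rng = np.random.default_rng(0)
d, d_hidden, n = 8, 16, 4  # hypothetical sizes

def make_pool():
    return [(rng.normal(size=(d, d_hidden)) * 0.1,
             rng.normal(size=(d_hidden, d)) * 0.1) for _ in range(n)]

pool_a, pool_b = make_pool(), make_pool()
gate_a, gate_b = rng.normal(size=(d, n)), rng.normal(size=(d, n))
x = rng.normal(size=d)
y, path = cartesian_product_layer(x, pool_a, gate_a, pool_b, gate_b)
print(y.shape, path)
```

Because sub-layer B sees the input already updated by sub-layer A, the output depends on the chosen (A, B) pair jointly rather than on two independent additive terms.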
3. Architectural Specifications
Every Mixture-of-Experts feedforward layer in the Transformer backbone is replaced by a Cartesian Product Layer composed of two serial fine-grained MoE sub-layers:
- Expert Pools: Each sub-layer contains $n$ fine-grained (half-sized) FFNs. The intermediate dimension per FFN is set to $d_{\mathrm{ff}} / m$, where $d_{\mathrm{ff}}$ is the canonical dense FFN width and $m$ the split granularity.
- Shared Experts: Each sub-layer may incorporate a fixed, always-on shared expert for global knowledge assimilation.
- Gating Networks: For each MoE sub-layer $s \in \{A, B\}$:
$$g^{s}(x) = \mathrm{softmax}\left(W^{s}_{g}\, x\right)$$
Post-softmax, only the top-$K$ elements are retained (sparsification).
- Forward Pass Sequence: Multi-Head Attention → residual addition → CPL sub-layer A → residual → CPL sub-layer B → residual → LayerNorm.
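The gating step described in the list above (softmax followed by top-K sparsification) can be sketched for a batch of tokens as follows. The function name `top_k_gates`, the shapes, and the choice to leave the retained scores un-renormalized are assumptions made for illustration.

```python
import numpy as np

def top_k_gates(x, w_gate, k):
    """Softmax over sub-experts, then zero every score outside each row's top-k."""
    logits = x @ w_gate                                   # (tokens, experts)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    top_idx = np.argsort(probs, axis=-1)[:, -k:]          # top-k indices per token
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, top_idx, 1.0, axis=-1)
    return probs * mask, probs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 tokens, hidden size 8 (hypothetical)
w_gate = rng.normal(size=(8, 4))   # 4 sub-experts (hypothetical)
sparse, dense = top_k_gates(x, w_gate, k=2)
print((sparse > 0).sum(axis=-1))   # two non-zero gates per token
```

Each sub-layer would hold its own gating matrix, so a CPL runs this step twice, once on the layer input and once on the residual-updated input.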
4. Training Objective and Regularization
The overall training objective for CartesianMoE integrates the standard cross-entropy language modeling loss and a load-balancing regularizer:
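- Language Modeling Loss:
$$\mathcal{L}_{\mathrm{LM}} = -\sum_{t} \log p\left(w_t \mid w_{<t}\right)$$
- Load-Balance Loss: Encourages uniform expert utilization, preventing router collapse. For each sub-layer $s$ and expert $i$,
$$\mathcal{L}^{s}_{\mathrm{bal}} = n \sum_{i=1}^{n} f^{s}_{i}\, P^{s}_{i}$$
where
$$f^{s}_{i} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[i \in \mathcal{T}^{s}(x_t)\right], \qquad P^{s}_{i} = \frac{1}{T} \sum_{t=1}^{T} g^{s}_{i}(x_t)$$
with $T$ the number of tokens in the batch and $\mathcal{T}^{s}(x_t)$ the set of experts selected for token $x_t$ in sub-layer $s$.
- Total Objective:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \alpha \sum_{s \in \{A, B\}} \mathcal{L}^{s}_{\mathrm{bal}}$$
where $\alpha$ is the balancing coefficient.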
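As a rough illustration of how a load-balancing term behaves, the sketch below uses the common Switch-Transformer-style form (number of experts times the sum over experts of routed-token fraction times mean gate score). That specific form, the function name `load_balance_loss`, and the toy inputs are assumptions; the paper's exact regularizer may differ.

```python
import numpy as np

def load_balance_loss(probs, top_idx):
    """Switch-style auxiliary loss: n * sum_i f_i * p_i.

    probs:   (tokens, n) softmax gate scores
    top_idx: (tokens, k) indices of the experts each token was routed to
    """
    tokens, n = probs.shape
    counts = np.bincount(top_idx.ravel(), minlength=n)
    f = counts / tokens                # fraction of tokens routed to each expert
    p = probs.mean(axis=0)             # mean gate score per expert
    return n * float(np.sum(f * p))

n = 4
uniform = np.full((8, n), 1.0 / n)
balanced_idx = np.arange(8).reshape(8, 1) % n       # each expert gets 2 tokens
print(load_balance_loss(uniform, balanced_idx))     # 1.0 for perfectly uniform routing

collapsed = np.tile([0.97, 0.01, 0.01, 0.01], (8, 1))
collapsed_idx = np.zeros((8, 1), dtype=int)         # every token goes to expert 0
print(load_balance_loss(collapsed, collapsed_idx))  # close to 3.88, well above 1.0
```

In a CartesianMoE setting this term would be computed separately for sub-layers A and B and added, with a small coefficient, to the language modeling loss.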
5. Empirical Evaluation: Datasets, Baselines, and Quantitative Findings
The empirical analysis covers pretraining and downstream tasks. Models are pretrained on The Pile (825 GB), tokenized with the 32k-vocabulary LLaMA tokenizer, for up to 400B tokens.
Model and Baselines:
- Base Model: 12 layers, FFN=3072; MoE replaces every other FFN with CPL.
- Large Model: 24 layers, FFN=4096.
- Baselines include: Dense LLaMA, SMoE-Share (top-2 + 1 shared), SMoE-Top3, Hash Layer (random experts), Fine-grained Routing, and TopP Routing.
Perplexity Results (Pile val. set):
| Model | MoE-Base | MoE-Large |
|---|---|---|
| Dense | 8.55 | 6.95 |
| SMoE-Share | 7.37 | 6.13 |
| Fine-grained | 7.33 | 6.16 |
| CartesianMoE | 7.19 | 6.08 |
Downstream Task Accuracy:
- MoE-Base: CartesianMoE achieves best performance on 7/8 benchmarks.
- MoE-Large: Best on 6/8 benchmarks, often improving by 1–2 percentage points over prior MoE variants.
Scalability:
- At 400B tokens (Large): CartesianMoE reaches perplexity 5.69 vs. 5.78 (Fine-grained).
- At 7.25B parameters: 4.92 (CartesianMoE) vs. 4.99 (Fine-grained), with CartesianMoE outperforming on all downstream evaluations.
Granularity Study:
Across the FFN split granularities tested, CartesianMoE outperforms the fine-grained baselines at every setting.
6. Routing Robustness and Ablation Studies
Disabled Top-1 Expert Test:
Each token's highest-scoring expert is masked out and the token is re-routed to the remaining experts:
- SMoE-Share: Perplexity rises from 6.13 → 75.7.
- Fine-grained: Perplexity rises from 6.16 → 30.9.
- CartesianMoE: Perplexity rises only from 6.08 → 6.35 (+0.27).
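The perturbation used in this test can be sketched as a routing function that masks each token's best expert before selecting. The function name `route` and the toy gate scores are hypothetical, and the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def route(probs, k, disable_top1=False):
    """Return each token's top-k expert indices, optionally after masking its top-1."""
    scores = probs.copy()
    if disable_top1:
        best = scores.argmax(axis=-1)
        scores[np.arange(len(scores)), best] = -np.inf   # mask the top-1 expert
    return np.argsort(scores, axis=-1)[:, ::-1][:, :k]   # best remaining first

probs = np.array([[0.50, 0.30, 0.15, 0.05],
                  [0.10, 0.20, 0.60, 0.10]])
print(route(probs, k=1).ravel())                     # [0 2]
print(route(probs, k=1, disable_top1=True).ravel())  # [1 1]
```

Under this perturbation every token is served by its second-choice expert, so the size of the perplexity increase indicates how much each expert's knowledge is isolated from the others.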
Shared-Expert Ablation:
Removing the shared expert in CartesianMoE slightly degrades perplexity (6.08 → 6.15), indicating that global shared knowledge still helps when group-wise sharing is present.
Granularity Ablation:
CartesianMoE surpasses flat fine-grained routing regardless of granularity, though at the finest splits the sub-experts show signs of underfitting.
7. Advantages, Limitations, and Broader Implications
Advantages:
- Group-wise knowledge sharing: Multiplicative fusion ensures each virtual expert shares sub-structures, smoothing routing noise and promoting generalization.
- Performance: Lower perplexity than the compared MoE baselines at both model scales, and the best accuracy on most downstream benchmarks (7/8 for Base, 6/8 for Large).
- Robustness: Perplexity remains stable despite routing perturbations.
- Flexibility: The architecture integrates with existing Transformer backbones, supporting shared experts and standard objectives.
Limitations:
- Currently restricted to a two-sub-pool Cartesian product; extension to three or more would require increasing hidden-state dimensionality, with risk of sub-expert under-training.
- Added implementation complexity due to dual routers and residual management.
Broader Implications:
CartesianMoE demonstrates the effectiveness of multiplicative, group-wise knowledge sharing compared to additive formulations. This suggests that as LLM parameter counts grow to the trillion scale, group-wise MoE designs such as CartesianMoE could become foundational for accurate, scalable, and robust sparse model deployments (Su et al., 2024).