ReXMoE: Efficient Cross-Layer Expert Reuse
- ReXMoE is a Mixture-of-Experts architecture that reuses experts across adjacent Transformer layers using a progressive scaling routing strategy.
- It decouples expert width from routing diversity, allowing richer expert combinations under fixed parameter budgets for improved language modeling and downstream tasks.
- Experimental results highlight that a moderate reuse group size (r=4) achieves optimal load balancing and performance gains with minimal additional computational overhead.
ReXMoE (Reusing Experts with Minimal Overhead in Mixture-of-Experts) is a Mixture-of-Experts (MoE) architecture for Transformer-based LLMs that enables the reuse of experts across adjacent layers. By decoupling expert dimensionality from per-layer routing and introducing a progressive scaling routing (PSR) strategy, ReXMoE achieves increased routing diversity and expressiveness under fixed parameter and computational budgets, leading to improved language modeling and downstream task performance with minimal overhead (Tan et al., 20 Oct 2025).
1. Motivation and Background
Traditional MoE frameworks in Transformers assign each layer $l$ an independent pool of $N$ experts, $\mathcal{E}^{(l)} = \{E^{(l)}_1, \dots, E^{(l)}_N\}$. Token-level routing (e.g., Top-$K$) selects a subset of these local experts per token, providing computational efficiency. Under a fixed total parameter budget $P$, there is an inherent trade-off: increasing $N$ enhances routing diversity but reduces the representational capacity of each expert (per-expert width shrinks roughly as $1/N$), unless overall model size is increased. As model sizes grow, state-of-the-art MoE LLMs (e.g., Mixtral, Qwen3, DeepSeek-V3) push $N$ to 128–256, fragmenting the FFN into fine-grained experts. However, each layer's router remains confined to its own local pool, exacerbating the diversity/capacity trade-off and often reducing per-expert width in large-$N$ regimes (Tan et al., 20 Oct 2025).
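The standard Top-$K$ MoE routing operates via:
$$g^{(l)}(x) = \mathrm{TopK}\big(\mathrm{softmax}(W^{(l)} x)\big)$$
with each token output
$$y = \sum_{i=1}^{N} g^{(l)}_i(x)\, E^{(l)}_i(x)$$
Here, $g^{(l)}_i(x)$ are routing weights (zero for experts outside the Top-$K$), and each $E^{(l)}_i$ is an independent FFN with hidden size $d_{\text{ff}}$.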
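The layer-local Top-K routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the scalar "experts", the hand-written router logits, and the choice to apply softmax before the Top-K selection are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gates(logits, k):
    """Map expert index -> gate weight for the k highest-scoring experts."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return {i: probs[i] for i in ranked[:k]}

def moe_forward(x, logits, experts, k):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    gates = top_k_gates(logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Toy usage: four scalar "experts" and hand-written router logits (both hypothetical).
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
router_logits = [2.0, 0.0, 1.0, -1.0]
y = moe_forward(3.0, router_logits, experts, k=2)
```

Only the two experts with the largest logits contribute to `y`, which is where the computational saving of sparse routing comes from.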
2. ReXMoE Architecture: Cross-Layer Expert Reuse
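ReXMoE mitigates the layer-local routing bottleneck by introducing cross-layer expert reuse. Every $r$ consecutive layers are grouped into a reuse block, allowing the router at a given layer $l$ to access a pooled set of $rN$ experts, $\tilde{\mathcal{E}}^{(l)} = \bigcup_{k \in \mathcal{G}(l)} \mathcal{E}^{(k)}$, where $\mathcal{E}^{(k)}$ is the local expert pool of layer $k$ and $\mathcal{G}(l)$ covers the $r$ adjacent layers in the block containing $l$. Routing thus operates as:
$$y = \sum_{j=1}^{rN} \tilde{g}^{(l)}_j(x)\, \tilde{E}^{(l)}_j(x), \qquad \tilde{g}^{(l)}(x) = \mathrm{TopK}\big(\mathrm{softmax}(\tilde{W}^{(l)} x)\big)$$
where $j$ indexes into the expert pool $\tilde{\mathcal{E}}^{(l)}$ pooled from $r$ layers, parameterized via the widened router $\tilde{W}^{(l)} \in \mathbb{R}^{rN \times d}$ for model dimension $d$.
This cross-layer pool allows routers to "look sideways" in the depth dimension, leveraging a much larger candidate set without increasing the total number of FFN parameters. Expert selection therefore becomes combinatorially richer, decoupling routing diversity from per-expert capacity under a constant parameter budget $P$.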
| Symbol | Meaning |
|---|---|
| $L$ | Total Transformer layers |
| $N$ | Experts per layer |
| $r$ | Reuse group size (reuse frequency) |
| $\mathcal{E}^{(l)}$ | Experts local to layer $l$ |
| $\tilde{\mathcal{E}}^{(l)}$ | Expanded pool across $r$ layers at $l$ |
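The grouping of layers into reuse blocks can be made concrete with a short sketch. It assumes zero-based layer indices and non-overlapping blocks of consecutive layers; the function names `reuse_group` and `pooled_experts` are hypothetical and chosen only for this illustration.

```python
def reuse_group(layer, r, n_layers):
    """Layers that share experts with `layer`, assuming non-overlapping blocks of size r."""
    start = (layer // r) * r
    return list(range(start, min(start + r, n_layers)))

def pooled_experts(layer, r, n_layers, n_experts):
    """Candidate experts for the router at `layer`, as (owner_layer, local_index) pairs."""
    return [(k, i) for k in reuse_group(layer, r, n_layers) for i in range(n_experts)]

# Hypothetical sizes: 16 layers, 8 experts per layer, reuse group size 4.
pool = pooled_experts(layer=5, r=4, n_layers=16, n_experts=8)
```

With these sizes the router at layer 5 sees the experts owned by layers 4 through 7, so its candidate set has 32 entries while the number of expert FFNs in the model is unchanged.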
3. Progressive Scaling Routing (PSR) Strategy
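Directly routing over $rN$ experts from training onset introduces severe load imbalance among experts, with many never activated. To address this, PSR gradually increases the candidate set. For training iterations $t$, let $T_{\text{start}}$ and $T_{\text{end}}$ denote the PSR start and end steps, and let $N_t$ be the number of candidate experts visible to the router:
$$N_t = \begin{cases} N, & t \le T_{\text{start}} \\ \text{increasing from } N \text{ to } rN, & T_{\text{start}} < t < T_{\text{end}} \\ rN, & t \ge T_{\text{end}} \end{cases}$$
At each step, $rN - N_t$ experts in $\tilde{\mathcal{E}}^{(l)}$ are randomly masked before routing. This curriculum facilitates early specialization (local candidates only), with gradual exploration of cross-layer combinations as $N_t$ rises. PSR thus smooths the path from exploitation to exploration without requiring explicit diversity regularization.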
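A rough sketch of such a curriculum follows. Two things here are assumptions rather than details taken from the source: the candidate count grows linearly between the start and end steps, and the layer's own experts are always kept visible while only cross-layer experts are masked. The step values and function names are hypothetical.

```python
import random

def psr_candidate_count(step, t_start, t_end, n_local, r):
    """Number of visible experts at `step`, using an assumed linear ramp from N to r*N."""
    full = r * n_local
    if step <= t_start:
        return n_local
    if step >= t_end:
        return full
    progress = (step - t_start) / (t_end - t_start)
    return n_local + int(progress * (full - n_local))

def psr_active_experts(step, t_start, t_end, local_ids, remote_ids, rng):
    """Unmasked expert ids: all local experts plus a random subset of cross-layer ones."""
    r = (len(local_ids) + len(remote_ids)) // len(local_ids)
    n_active = psr_candidate_count(step, t_start, t_end, len(local_ids), r)
    extra = rng.sample(remote_ids, n_active - len(local_ids))
    return set(local_ids) | set(extra)

# Hypothetical setup: 8 local experts, 24 cross-layer experts (r = 4), ramp from step 1000 to 5000.
local = list(range(8))
remote = list(range(8, 32))
active = psr_active_experts(3000, 1000, 5000, local, remote, random.Random(0))
```

Halfway through the ramp the router sees 20 of the 32 pooled experts; the remaining 12 would have their logits masked before the Top-K selection.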
4. Capacity, Complexity, and Scalability
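In baseline MoE, total FFN parameters are $P_{\text{FFN}} = L \cdot N \cdot P_E$, where $P_E$ is the parameter count of one expert, and routers add $L \cdot N \cdot d$ for model dimension $d$, giving $P_{\text{MoE}} = P_{\text{FFN}} + LNd$. ReXMoE introduces no new FFN parameters but expands routers to $L \cdot rN \cdot d$, resulting in $P_{\text{ReX}} = P_{\text{FFN}} + rLNd$. Since routers are parameter-light compared to FFNs, overhead is modest: $(r-1)LNd \ll P_{\text{FFN}}$.
Per-token compute increases by $O((r-1)Nd)$ per layer from the wider router, but expert computation cost is unchanged. This enables expert width $d_{\text{ff}}$ to remain large even as routing diversity (size of expert pool) scales with $r$. The architecture thus decouples expert width from routing diversity, a key advantage for parameter-efficient scaling.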
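To get a feel for how small the router overhead is, the following back-of-the-envelope calculation compares router and expert parameter counts. All sizes are hypothetical and not taken from the paper's configurations, and the expert is assumed to be a gated FFN with three weight matrices of shape `d_model x d_ff`, which is an assumption about the expert design.

```python
def router_params(n_layers, n_candidates, d_model):
    """One linear router per layer, mapping d_model features to n_candidates logits."""
    return n_layers * n_candidates * d_model

def ffn_params(n_layers, n_experts, d_model, d_ff, matrices_per_expert=3):
    """Expert parameters, assuming each expert holds `matrices_per_expert` d_model x d_ff matrices."""
    return n_layers * n_experts * matrices_per_expert * d_model * d_ff

# Hypothetical model: 32 layers, 64 experts per layer, d_model 2048, d_ff 512, reuse group size 4.
n_layers, n_experts, d_model, d_ff, r = 32, 64, 2048, 512, 4

baseline_router = router_params(n_layers, n_experts, d_model)
rex_router = router_params(n_layers, r * n_experts, d_model)
extra = rex_router - baseline_router
overhead_ratio = extra / ffn_params(n_layers, n_experts, d_model, d_ff)
```

Under these assumptions the extra router weights amount to roughly 0.2% of the expert parameters, since the ratio simplifies to `(r - 1) / (3 * d_ff)`.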
5. Experimental Protocols
ReXMoE was evaluated across 0.5B, 2.3B, and 7B parameter LLM variants using the following setup:
- Architectures: MoE-0.5B-A0.07B (16 layers), MoE-2.3B-A0.3B (32 layers), MoE-7B-A3B-SE (32 layers, routed experts plus 2 shared experts).
- 100B-token pretraining dataset (fineweb-edu), batch: 2M tokens, sequence length: 4096.
- Optimization: AdamW (weight decay 0.1), gradient clipping 1.0, learning rate with a 100-step warmup followed by cosine decay.
- PSR schedule: defined by a start step $T_{\text{start}}$ and an end step $T_{\text{end}}$, both on the order of thousands of steps.
- Expert parallelism used for routing >8 experts. Hardware: 4 nodes, 32 Hopper GPUs.
Evaluation metrics:
- Language modeling perplexity (WikiText validation split)
- Average and per-task zero-shot accuracy (ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, LogiQA, OpenBookQA, PIQA, SciQ, SIQA, WinoGrande via lm-eval-harness)
- Inference throughput (prefill, decoding via vLLM)
6. Core Results and Ablations
Across all tested model sizes, ReXMoE demonstrates improved downstream performance and perplexity compared to baseline MoE:
| Model | Avg. Accuracy | WikiText PPL |
|---|---|---|
| MoE-2.3B-A0.3B | 49.15% | 21.19 |
| ReX-R2 | 49.65% | — |
| ReX-R4 | 50.23% | 20.73 |
Key findings:
- ReXMoE-R4 (reuse group size $r=4$) yields the highest average accuracy across all scales.
- Simple cross-layer reuse ($r=4$, without PSR) yields marginal improvement (+0.13 points of accuracy, small PPL drop).
- The PSR curriculum adds a further +0.95 points of accuracy and a significant reduction in perplexity, consistent with the step from 49.15% to 50.23% in the table above.
- On open-source benchmarks, ReX-7B-A3B-SE-R3 trained with 1T tokens matches or surpasses LLaMA-7B on LogiQA, SciQ, and other tasks.
- Prefill throughput for short sequences drops up to 15% (increased router overhead); negligible impact on decoding throughput (3\% change).
- Optimal reuse group size is $r=4$. Larger $r$ (16, 32) initially match but later underperform due to expert under-utilization and load imbalance, as measured by Load Balance Violation (LBV).
- Activation ratio heatmaps show that ReXMoE enables specialization of certain experts for specific tasks, while vanilla MoE experts remain uniformly utilized.
7. Analysis, Limitations, and Future Prospects
ReXMoE breaks the core limitation of layer-local routing, allowing parameter-efficient decoupling of expert width and routing diversity. Cross-layer reuse of experts, in combination with progressive scaling routing, enables richer expert combinations without increasing FFN parameters and encourages both exploration and expert specialization during training.
Salient points:
- The choice of $r=4$ emerges as a strong sweet spot, balancing diversity and overhead.
- Excessively large $r$ (e.g., 16 or 32) causes expert collapse and imbalance unless load-balancing regularization is introduced.
- Some prefill I/O overhead marginally reduces inference speed for short context windows.
- Future work could enhance large-$r$ performance using explicit load-balancing objectives or learned/adaptive groupings (non-uniform $r$). Extending cross-layer reuse to multi-task or continual settings is a promising direction for knowledge sharing.
In summary, ReXMoE establishes a new design dimension in MoE-based LLMs: cross-layer expert reuse, operationalized via a lightweight router parameter extension and a curriculum-based progressive scaling routing schedule. This architecture achieves consistent improvements in language modeling and downstream tasks with minimal incremental overhead, advancing the pursuit of scalable, parameter-efficient MoE LLMs (Tan et al., 20 Oct 2025).