
ReXMoE: Efficient Cross-Layer Expert Reuse

Updated 25 May 2026
  • ReXMoE is a Mixture-of-Experts architecture that reuses experts across adjacent Transformer layers using a progressive scaling routing strategy.
  • It decouples expert width from routing diversity, allowing richer expert combinations under fixed parameter budgets for improved language modeling and downstream tasks.
  • Experimental results highlight that a moderate reuse group size (r=4) achieves optimal load balancing and performance gains with minimal additional computational overhead.

ReXMoE (Reusing Experts with Minimal Overhead in Mixture-of-Experts) is a Mixture-of-Experts (MoE) architecture for Transformer-based LLMs that enables the reuse of experts across adjacent layers. By decoupling expert dimensionality from per-layer routing and introducing a progressive scaling routing (PSR) strategy, ReXMoE achieves increased routing diversity and expressiveness under fixed parameter and computational budgets, leading to improved language modeling and downstream task performance with minimal overhead (Tan et al., 20 Oct 2025).

1. Motivation and Background

Traditional MoE frameworks in Transformers assign to each layer $l$ an independent pool of $N$ experts $\mathcal{E}^{(l)} = \{E_{l,1}, \dots, E_{l,N}\}$. Token-level routing (e.g., Top-$K$) selects a subset of these local experts per token, providing computational efficiency. Under a fixed total parameter budget $P_{\rm tot}$, there is an inherent trade-off: increasing $N$ enhances routing diversity but reduces the representational capacity of each expert (its hidden width $d_f$), unless overall model size is increased. As model sizes grow, state-of-the-art MoE LLMs have moved from coarse-grained designs such as Mixtral (8 experts per layer) to fine-grained ones such as Qwen3 and DeepSeek-V3, which push $N$ to 128–256 and fragment the FFN into many narrow experts. However, each layer's router remains confined to its own local pool, exacerbating the diversity/capacity trade-off and often reducing per-expert width in large-$N$ regimes (Tan et al., 20 Oct 2025).

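The standard Top-$K$ MoE routing operates via:

$$g^{(l)}(x) = \mathrm{TopK}\big(\mathrm{softmax}(W_r^{(l)} x),\, K\big)$$

with each token output

$$y = \sum_{i=1}^{N} g_i^{(l)}(x)\, E_{l,i}(x)$$

Here, $g_i^{(l)}(x)$ are routing weights (zero for experts outside the Top-$K$), and each $E_{l,i}$ is an independent FFN with hidden size $d_f$.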
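
As a rough illustration of the Top-K routing just described, the following pure-Python sketch routes a single token over a toy expert pool. The function names (`top_k_route`, `moe_output`), the toy experts, the router scores, and the renormalization of the selected softmax weights are all assumptions made for illustration, not details taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Assumption: weights are the softmax probabilities of the selected
    experts, renormalized to sum to 1 (some MoE variants skip this step).
    """
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}

def moe_output(x, logits, experts, k):
    """Weighted sum of the selected experts' outputs for one token."""
    gates = top_k_route(logits, k)
    out = [0.0] * len(x)
    for i, g in gates.items():
        y = experts[i](x)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
logits = [0.1, 2.0, -1.0, 1.5]   # hypothetical router scores for one token
gates = top_k_route(logits, k=2)
y = moe_output([1.0, -1.0], logits, experts, k=2)
```

Only the two selected experts are evaluated, which is where the computational saving of sparse routing comes from.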

2. ReXMoE Architecture: Cross-Layer Expert Reuse

ReXMoE mitigates the layer-local routing bottleneck by introducing cross-layer expert reuse. Every $r$ consecutive layers are grouped into a reuse block, allowing the router at a given layer $l$ to access a pooled set of $rN$ experts, $\tilde{\mathcal{E}}^{(l)} = \bigcup_{l' \in \mathcal{G}(l)} \mathcal{E}^{(l')}$, where $\mathcal{G}(l)$ covers $r$ adjacent layers. Routing thus operates as:

$$y = \sum_{i=1}^{rN} \tilde{g}_i^{(l)}(x)\, \tilde{E}_i(x), \qquad \tilde{g}^{(l)}(x) = \mathrm{TopK}\big(\mathrm{softmax}(\tilde{W}_r^{(l)} x),\, K\big)$$

where $i$ indexes into the expert pool $\tilde{\mathcal{E}}^{(l)}$ gathered from $r$ layers, and the router is parameterized via $\tilde{W}_r^{(l)} \in \mathbb{R}^{rN \times d}$, with $d$ the model hidden size.

This cross-layer pool allows routers to "look sideways" in the depth dimension, leveraging a much larger candidate set without increasing the total number of FFN parameters. Expert selection therefore becomes combinatorially richer, decoupling routing diversity from per-expert capacity under constant $P_{\rm tot}$.
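
The pooling idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: layers are 0-indexed, reuse blocks are consecutive and non-overlapping, the layer count is divisible by the group size, and the helper names `reuse_group` and `pooled_experts` are hypothetical.

```python
def reuse_group(layer, r):
    """Layers that share one expert pool with `layer`.

    Assumption: blocks are consecutive, non-overlapping runs of r layers.
    """
    start = (layer // r) * r
    return list(range(start, start + r))

def pooled_experts(layer, r, n_experts):
    """(owner_layer, expert_index) ids visible to the router at `layer`."""
    return [(l, e) for l in reuse_group(layer, r) for e in range(n_experts)]

# Hypothetical sizes: 8 layers, 4 experts per layer, reuse group size 4.
L, N, r = 8, 4, 4
pool = pooled_experts(5, r, N)   # router at layer 5 sees layers 4..7

# Every layer's router sees r * N candidates, yet the set of distinct
# experts in the whole model is still L * N: nothing is duplicated.
all_experts = {e for l in range(L) for e in pooled_experts(l, r, N)}
```

The last line makes the key property explicit: the candidate set per router grows by a factor of the group size while the total expert count stays fixed.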

Symbol                        Meaning
$L$                           Total Transformer layers
$N$                           Experts per layer
$r$                           Reuse group size (reuse frequency)
$\mathcal{E}^{(l)}$           Experts local to layer $l$
$\tilde{\mathcal{E}}^{(l)}$   Expanded pool across $r$ layers at layer $l$

3. Progressive Scaling Routing (PSR) Strategy

Directly routing over $rN$ experts from training onset introduces severe load imbalance among experts, with many never activated. To address this, PSR gradually increases the candidate set. For training iteration $t$, let $T_s$ and $T_e$ denote the PSR start and end steps; the number of routable candidates $n(t)$ then ramps from the local pool to the full reuse pool, for instance linearly:

$$n(t) = \begin{cases} N, & t \le T_s \\ N + (r-1)\,N\,\dfrac{t - T_s}{T_e - T_s}, & T_s < t < T_e \\ rN, & t \ge T_e \end{cases}$$

At each step, $rN - n(t)$ experts in the $rN$-expert reuse pool are randomly masked before routing. This curriculum facilitates early specialization (local candidates only), with gradual exploration of cross-layer combinations as $n(t)$ rises. PSR thus smooths the path from exploitation to exploration without requiring explicit diversity regularization.
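
A curriculum of this kind could be sketched as follows. The linear ramp, the choice to keep local experts always visible, and the names `psr_candidates` and `psr_mask` are assumptions for illustration; the text above only fixes the start point (local candidates only), the end point (the full pool), and the use of random masking.

```python
import random

def psr_candidates(step, t_start, t_end, n_local, r):
    """Number of routable experts at `step` (linear ramp is an assumption)."""
    full = r * n_local
    if step <= t_start:
        return n_local
    if step >= t_end:
        return full
    frac = (step - t_start) / (t_end - t_start)
    return n_local + int(frac * (full - n_local))

def psr_mask(step, t_start, t_end, local_ids, remote_ids, rng):
    """Return the set of expert ids left unmasked at `step`.

    Assumption: local experts are always kept, and the remaining slots are
    filled by a uniform random sample of the cross-layer experts.
    """
    r = (len(local_ids) + len(remote_ids)) // len(local_ids)
    n = psr_candidates(step, t_start, t_end, len(local_ids), r)
    extra = rng.sample(remote_ids, n - len(local_ids))
    return set(local_ids) | set(extra)

rng = random.Random(0)
local_ids = list(range(4))        # hypothetical: 4 experts owned by this layer
remote_ids = list(range(4, 16))   # 12 experts from the other 3 layers in the group
for step in (0, 600, 1100):
    visible = psr_mask(step, 100, 1100, local_ids, remote_ids, rng)
    print(step, len(visible))
```

With a start step of 100 and an end step of 1100, the loop reports 4, 10, and 16 visible experts, moving from the local pool to the full reuse pool.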

4. Capacity, Complexity, and Scalability

In baseline MoE, total FFN parameters scale as $P_{\rm FFN} \propto L \cdot N \cdot d \cdot d_f$ (with $d$ the model hidden size), and routers add $P_{\rm router} = L \cdot N \cdot d$, giving $P_{\rm tot} = P_{\rm FFN} + P_{\rm router}$. ReXMoE introduces no new FFN parameters but expands routers to $L \cdot rN \cdot d$, resulting in $P_{\rm tot}' = P_{\rm FFN} + r\,P_{\rm router}$. Since routers are parameter-light compared to FFNs, overhead is modest: $\Delta P = (r-1)\,L N d \ll P_{\rm FFN}$.

Per-token compute increases by $\mathcal{O}((r-1)\,N d)$ per layer from the wider router, but expert computation cost is unchanged. This enables expert width $d_f$ to remain large even as routing diversity (size of expert pool) scales with $r$. The architecture thus decouples expert width from routing diversity, a key advantage for parameter-efficient scaling.
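
To get a feel for how small the router overhead is, the following back-of-the-envelope sketch counts parameters for a hypothetical configuration. The sizes (32 layers, 64 experts, model width 2048, expert width 512), the three-matrix gated FFN, and the function name `moe_param_counts` are assumptions chosen for illustration and do not correspond to any model in the paper.

```python
def moe_param_counts(n_layers, n_experts, d_model, d_ff, r=1, mats_per_expert=3):
    """Rough FFN and router parameter counts (biases ignored).

    Assumptions: mats_per_expert=3 models a gated (SwiGLU-style) FFN, use 2
    for a plain two-matrix FFN; the router holds one d_model-sized vector
    per routable expert, so it widens by a factor of r under reuse.
    """
    ffn = n_layers * n_experts * mats_per_expert * d_model * d_ff
    router = n_layers * (r * n_experts) * d_model
    return ffn, router

# Hypothetical configuration, baseline (r=1) versus reuse group size 4.
base_ffn, base_router = moe_param_counts(32, 64, 2048, 512)
rex_ffn, rex_router = moe_param_counts(32, 64, 2048, 512, r=4)

# Extra router parameters relative to the baseline total.
overhead = (rex_router - base_router) / (base_ffn + base_router)
print(f"FFN unchanged: {rex_ffn == base_ffn}, router overhead: {overhead:.4%}")
```

For this toy configuration the extra router parameters amount to well under 1% of the baseline total, which matches the qualitative claim that routers are parameter-light compared to FFNs.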

5. Experimental Protocols

ReXMoE was evaluated across 0.5B, 2.3B, and 7B parameter LLM variants using the following setup:

  • Architectures: MoE-0.5B-A0.07B (16L, […]), MoE-2.3B-A0.3B (32L, […]), MoE-7B-A3B-SE (32L, […] routed + 2 shared).
  • 100B-token pretraining dataset (fineweb-edu), batch: 2M tokens, sequence length: 4096.
  • Optimization: AdamW ([…], […], weight decay 0.1), gradient clipping 1.0, LR: […] warmup to […] cosine decay, warmup: 100 steps.
  • PSR schedule: start […]k steps, end […]k.
  • Expert parallelism used for routing >8 experts. Hardware: 4 nodes […] 32 Hopper GPUs.

Eval metrics:

  • Language modeling perplexity (WikiText validation split)
  • Average and per-task zero-shot accuracy (ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, LogiQA, OpenBookQA, PIQA, SciQ, SIQA, WinoGrande via lm-eval-harness)
  • Inference throughput (prefill, decoding via vLLM)

6. Core Results and Ablations

Across all tested model sizes, ReXMoE demonstrates improved downstream performance and perplexity compared to baseline MoE:

Model            Avg. Accuracy   WikiText PPL
MoE-2.3B-A0.3B   49.15%          21.19
ReX-R2           49.65%          n/a
ReX-R4           50.23%          20.73

Key findings:

  • ReXMoE-R4 (reuse group size $r=4$) yields the highest average accuracy across all scales.
  • Simple cross-layer reuse (without PSR) yields marginal improvement (+0.13 percentage points in accuracy, small PPL drop).
  • The PSR curriculum adds a further +0.95 percentage points in accuracy and a significant reduction in perplexity.
  • On open-source benchmarks, ReX-7B-A3B-SE-R3 trained with 1T tokens matches or surpasses LLaMA-7B on LogiQA, SciQ, and other tasks.
  • Prefill throughput for short sequences drops by up to 15% (increased router overhead), with negligible impact on decoding throughput ([…]% change).
  • Optimal reuse group size is $r=4$. Larger $r$ (16, 32) initially matches but later underperforms due to expert under-utilization and load imbalance, as measured by Load Balance Violation (LBV).
  • Activation ratio heatmaps show that ReXMoE enables specialization of certain experts for specific tasks, while vanilla MoE experts remain uniformly utilized.
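
The load-imbalance and activation-ratio observations in the findings above can be made concrete with a small sketch. The imbalance statistic used here, (max load - mean load) / mean load, is an illustrative stand-in and is not claimed to be the paper's LBV definition; the function names and the toy routing traces are likewise hypothetical.

```python
from collections import Counter

def activation_ratios(assignments, n_experts):
    """Fraction of routed token slots received by each expert."""
    counts = Counter(assignments)
    total = len(assignments)
    return [counts.get(e, 0) / total for e in range(n_experts)]

def max_violation(ratios):
    """(max load - mean load) / mean load; 0 means perfectly balanced.

    Illustrative imbalance statistic, not necessarily the paper's LBV.
    """
    mean = sum(ratios) / len(ratios)
    return (max(ratios) - mean) / mean

# Hypothetical routing traces over 4 experts (one expert id per token slot).
balanced = [0, 1, 2, 3] * 25
skewed = [0] * 70 + [1] * 30   # experts 2 and 3 are never activated

balanced_ratios = activation_ratios(balanced, 4)
skewed_ratios = activation_ratios(skewed, 4)
print(max_violation(balanced_ratios), max_violation(skewed_ratios))
```

A trace in which some experts are never selected, as reported for very large reuse groups, shows up as zero entries in the ratio vector and a large violation score.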

7. Analysis, Limitations, and Future Prospects

ReXMoE breaks the core limitation of layer-local routing, allowing parameter-efficient decoupling of expert width and routing diversity. Cross-layer reuse of experts, in combination with progressive scaling routing, enables richer expert combinations without increasing FFN parameters and encourages both exploration and expert specialization during training.

Salient points:

  • The choice of $r=4$ emerges as a strong sweet spot, balancing diversity and overhead.
  • Excessively large $r$ ($r \ge 16$) causes expert collapse and imbalance unless load-balancing regularization is introduced.
  • Some prefill I/O overhead marginally reduces inference speed for short context windows.
  • Future work could enhance large-$r$ performance using explicit load-balancing objectives or learned/adaptive groupings (non-uniform $r$). Extending cross-layer reuse to multi-task or continual settings is a promising direction for knowledge sharing.

In summary, ReXMoE establishes a new design dimension in MoE-based LLMs: cross-layer expert reuse, operationalized via a lightweight router parameter extension and a curriculum-based progressive scaling routing schedule. This architecture achieves consistent improvements in language modeling and downstream tasks with minimal incremental overhead, advancing the pursuit of scalable, parameter-efficient MoE LLMs (Tan et al., 20 Oct 2025).
