ReXMoE: Efficient Cross-Layer Expert Reuse

Updated 25 May 2026

ReXMoE is a Mixture-of-Experts architecture that reuses experts across adjacent Transformer layers using a progressive scaling routing strategy.
It decouples expert width from routing diversity, allowing richer expert combinations under fixed parameter budgets for improved language modeling and downstream tasks.
Experimental results highlight that a moderate reuse group size (r=4) achieves optimal load balancing and performance gains with minimal additional computational overhead.

ReXMoE (Reusing Experts with Minimal Overhead in Mixture-of-Experts) is a Mixture-of-Experts (MoE) architecture for Transformer-based LLMs that enables the reuse of experts across adjacent layers. By decoupling expert dimensionality from per-layer routing and introducing a progressive scaling routing (PSR) strategy, ReXMoE achieves increased routing diversity and expressiveness under fixed parameter and computational budgets, leading to improved language modeling and downstream task performance with minimal overhead (Tan et al., 20 Oct 2025).

1. Motivation and Background

Traditional MoE frameworks in Transformers assign to each layer $l$ an independent pool of $N$ experts $\mathcal{E}^{(l)} = \{E_{l,1}, \dots, E_{l,N}\}$. Token-level routing (e.g., Top-$K$) selects a subset of these local experts per token, so only a fraction of the layer's parameters is active for any given token. Under a fixed total parameter budget $P_{\rm tot}$, there is an inherent trade-off: increasing $N$ enhances routing diversity but shrinks the hidden width $d_f$ of each expert, and with it the expert's representational capacity, unless the overall model size is increased. Recent MoE LLMs have moved from a handful of experts per layer (Mixtral uses 8) to 128–256 (Qwen3, DeepSeek-V3), fragmenting the FFN into fine-grained experts. However, each layer's router remains confined to its own local pool, which sharpens the diversity/capacity trade-off and often reduces per-expert width in large-$N$ regimes (Tan et al., 20 Oct 2025).

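The standard Top-$K$ MoE routing operates via:

$$g(x) = \mathrm{softmax}\big(W_g^{(l)} x\big) \in \mathbb{R}^{N}, \qquad \mathcal{T}(x) = \mathrm{TopK}\big(g(x), K\big)$$

with each token output

$$y = \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_{l,i}(x)$$

Here, $g_i(x)$ are routing weights produced by the layer-$l$ router matrix $W_g^{(l)}$, and each $E_{l,i}$ is an independent FFN with hidden size $d_f$.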
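The layer-local Top-$K$ routing described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names (`softmax`, `top_k_route`, `moe_forward`) are hypothetical, experts are modeled as arbitrary callables, and the selected softmax weights are used without renormalization, which is an assumption (some MoE implementations renormalize over the selected experts).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (indices, weights) of the k highest-probability experts.
    Weights are the raw softmax probabilities (assumption: no renormalization)."""
    probs = softmax(logits)
    idx = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return idx, [probs[i] for i in idx]

def moe_forward(x, router_w, experts, k):
    """x: token vector (list of floats); router_w: N rows of length d;
    experts: N callables mapping a vector to a vector of the same length."""
    # Router logits: one dot product per expert.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_w]
    idx, weights = top_k_route(logits, k)
    out = [0.0] * len(x)
    # Weighted sum of the selected experts' outputs.
    for i, g in zip(idx, weights):
        y = experts[i](x)
        out = [o + g * yj for o, yj in zip(out, y)]
    return out, idx
```

Only the $K$ selected experts are evaluated, which is where the computational saving of sparse routing comes from.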

2. ReXMoE Architecture: Cross-Layer Expert Reuse

ReXMoE mitigates the layer-local routing bottleneck by introducing cross-layer expert reuse. Every $r$ consecutive layers are grouped into a reuse block, allowing the router at a given layer $l$ to access a pooled set of $rN$ experts, $\tilde{\mathcal{E}}^{(l)} = \bigcup_{j \in \mathcal{G}(l)} \mathcal{E}^{(j)}$, where $\mathcal{G}(l)$ covers $r$ adjacent layers. Routing thus operates as:

$$y = \sum_{i \in \mathrm{TopK}(\tilde{g}(x), K)} \tilde{g}_i(x)\, \tilde{E}_i(x), \qquad \tilde{g}(x) = \mathrm{softmax}\big(\tilde{W}_g^{(l)} x\big)$$

where $i$ indexes into the expert pool $\tilde{\mathcal{E}}^{(l)}$ gathered from $r$ layers, and the router is parameterized via $\tilde{W}_g^{(l)} \in \mathbb{R}^{rN \times d}$, with $d$ the model hidden size.

This cross-layer pool allows routers to "look sideways" in the depth dimension, leveraging a much larger candidate set without increasing the total number of FFN parameters. Expert selection therefore becomes combinatorially richer, decoupling routing diversity from per-expert capacity under constant $P_{\rm tot}$.

Symbol	Meaning
$L$	Total Transformer layers
$N$	Experts per layer
$r$	Reuse group size (reuse frequency)
$\mathcal{E}^{(l)}$	Experts local to layer $l$
$\tilde{\mathcal{E}}^{(l)}$	Expanded pool across $r$ layers at $l$
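To make the grouping concrete, the sketch below enumerates which experts a router can see when every $r$ consecutive layers form a reuse block. It assumes 0-indexed layers and non-overlapping blocks that start at multiples of $r$; the helper names `reuse_group` and `pooled_expert_ids` are hypothetical and not taken from the paper.

```python
def reuse_group(layer, r, num_layers):
    """Layers that share an expert pool with `layer` (0-indexed),
    assuming consecutive, non-overlapping blocks of r layers."""
    start = (layer // r) * r
    return list(range(start, min(start + r, num_layers)))

def pooled_expert_ids(layer, r, num_layers, n_experts):
    """(layer, expert) pairs visible to the router at `layer`.
    With a full block this yields r * n_experts candidates."""
    return [(j, e)
            for j in reuse_group(layer, r, num_layers)
            for e in range(n_experts)]

# Example: 16 layers, 8 experts per layer, reuse group size 4.
pool = pooled_expert_ids(5, 4, 16, 8)
print(reuse_group(5, 4, 16), len(pool))
```

Every layer in a block sees the same pool, so no FFN weights are added; only the router's output dimension grows from $N$ to $rN$.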

3. Progressive Scaling Routing (PSR) Strategy

Directly routing over $rN$ experts from training onset introduces severe load imbalance among experts, with many never activated. To address this, PSR gradually increases the candidate set. For training iterations $t$, let $T_s$ and $T_e$ denote the PSR start and end steps, and let $n(t)$ be the number of routable candidates, ramping from $N$ to $rN$:

$$n(t) = N + \big\lfloor \rho(t)\,(r-1)N \big\rfloor, \qquad \rho(t) = \min\!\Big(1, \max\!\Big(0, \frac{t - T_s}{T_e - T_s}\Big)\Big)$$

At each step, $rN - n(t)$ experts in $\tilde{\mathcal{E}}^{(l)}$ are randomly masked before routing. This curriculum facilitates early specialization (local candidates only), with gradual exploration of cross-layer combinations as $n(t)$ rises. PSR thus smooths the path from exploitation to exploration without requiring explicit diversity regularization.
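The curriculum can be sketched as a candidate-count schedule plus a random mask. Two assumptions are made here that the text does not confirm: the ramp between the start and end steps is linear, and local experts are always kept while only non-local experts are sampled. The names `psr_num_candidates` and `psr_mask` are hypothetical.

```python
import random

def psr_num_candidates(step, t_start, t_end, n_local, n_pool):
    """Number of routable experts at `step`: n_local up to t_start,
    n_pool from t_end on, linear in between (assumed ramp shape)."""
    if step <= t_start:
        return n_local
    if step >= t_end:
        return n_pool
    frac = (step - t_start) / (t_end - t_start)
    return n_local + int(frac * (n_pool - n_local))

def psr_mask(step, t_start, t_end, local_ids, pool_ids, rng=random):
    """Set of expert ids the router may choose from at `step`.
    Assumption: local experts are never masked; the remaining slots
    are filled by uniformly sampling non-local experts."""
    n = psr_num_candidates(step, t_start, t_end, len(local_ids), len(pool_ids))
    local = set(local_ids)
    non_local = [e for e in pool_ids if e not in local]
    extra = rng.sample(non_local, n - len(local))
    return local | set(extra)

# Example: 8 local experts inside a pool of 32, ramp between steps 100 and 200.
allowed = psr_mask(150, 100, 200, list(range(8)), list(range(32)),
                   rng=random.Random(0))
print(len(allowed))
```

In a real router the mask would typically be applied by setting the logits of disallowed experts to negative infinity before the Top-$K$ selection.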

4. Capacity, Complexity, and Scalability

In baseline MoE, total FFN parameters are $P_{\rm FFN} = c\,L N d\,d_f$ (with $d$ the model hidden size and $c$ the number of weight matrices per expert), and routers add $P_{\rm router} = L N d$, giving $P_{\rm tot} = P_{\rm FFN} + L N d$. ReXMoE introduces no new FFN parameters but expands routers to $L \cdot rN \cdot d$, resulting in $P_{\rm tot}^{\rm ReX} = P_{\rm FFN} + r L N d$. Since routers are parameter-light compared to FFNs, overhead is modest: $\Delta P / P_{\rm FFN} = (r-1)/(c\,d_f)$.

Per-token compute increases by $\mathcal{O}((r-1) N d)$ from the wider router, but expert computation cost is unchanged. This enables expert width $d_f$ to remain large even as routing diversity (size of expert pool) scales with $r$. The architecture thus decouples expert width from routing diversity, a key advantage for parameter-efficient scaling.
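A quick back-of-the-envelope check shows how small the router overhead is relative to the FFN parameters. The dimensions below are illustrative values chosen for the arithmetic, not the paper's configurations, and `mats_per_expert=3` assumes a gated FFN with three weight matrices per expert; `router_overhead` is a hypothetical helper.

```python
def router_overhead(num_layers, n_experts, d_model, d_ff, r, mats_per_expert=3):
    """Return (ffn_params, base_router_params, rex_router_params, overhead),
    where overhead is the added router parameters divided by FFN parameters."""
    ffn = num_layers * n_experts * mats_per_expert * d_model * d_ff
    base_router = num_layers * n_experts * d_model      # d_model -> N logits
    rex_router = num_layers * r * n_experts * d_model   # d_model -> r*N logits
    overhead = (rex_router - base_router) / ffn
    return ffn, base_router, rex_router, overhead

# Hypothetical configuration: 32 layers, 64 experts, d_model=1024, d_ff=512, r=4.
ffn, base, rex, overhead = router_overhead(32, 64, 1024, 512, 4)
print(f"router params: {base} -> {rex}, overhead vs FFN: {overhead:.4%}")
```

The ratio reduces to $(r-1)$ divided by the product of the matrix count and the expert width, so it is independent of the layer count, expert count, and model hidden size.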

5. Experimental Protocols

ReXMoE was evaluated across 0.5B, 2.3B, and 7B parameter LLM variants using the following setup:

Architectures: MoE-0.5B-A0.07B (16L, $N$ 1), MoE-2.3B-A0.3B (32L, $N$ 2), MoE-7B-A3B-SE (32L, $N$ 3 routed + 2 shared).
100B-token pretraining dataset (fineweb-edu), batch: 2M tokens, sequence length: 4096.
Optimization: AdamW ( $N$ 4, $N$ 5, weight decay 0.1), gradient clipping 1.0, LR: $N$ 6 warmup to $N$ 7 cosine decay, warmup: 100 steps.
PSR schedule: $N$ 8k steps, $N$ 9k.
Expert parallelism used for routing >8 experts. Hardware: 4 nodes $d_f$ 0 32 Hopper GPUs.

Eval metrics:

Language modeling perplexity (WikiText validation split)
Average and per-task zero-shot accuracy (ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, LAMBADA, LogiQA, OpenBookQA, PIQA, SciQ, SIQA, WinoGrande via lm-eval-harness)
Inference throughput (prefill, decoding via vLLM)

6. Core Results and Ablations

Across all tested model sizes, ReXMoE demonstrates improved downstream performance and perplexity compared to baseline MoE:

Model	Avg. Accuracy	WikiText PPL
MoE-2.3B-A0.3B	49.15%	21.19
ReX-R2	49.65%	—
ReX-R4	50.23%	20.73

Key findings:

ReXMoE-R4 (reuse group $r=4$) yields the highest average accuracy across all scales.
Simple cross-layer reuse ($r=4$) yields marginal improvement (+0.13% accuracy, small PPL drop).
The PSR curriculum adds a further +0.95% accuracy and significant reduction in perplexity.
On open-source benchmarks, ReX-7B-A3B-SE-R3 trained with 1T tokens matches or surpasses LLaMA-7B on LogiQA, SciQ, and other tasks.
Prefill throughput for short sequences drops up to 15% (increased router overhead); negligible impact on decoding throughput ( $d_f$ 3\% change).
Optimal reuse group size is $r=4$. Larger $r$ (16, 32) initially match but later underperform due to expert under-utilization and load imbalance, as measured by Load Balance Violation (LBV).
Activation ratio heatmaps show that ReXMoE enables specialization of certain experts for specific tasks, while vanilla MoE experts remain uniformly utilized.
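Expert utilization of the kind discussed in these findings can be inspected with a simple per-expert activation ratio. The sketch below assumes the ratio is the fraction of tokens that route to each expert; this is an illustrative definition, not the paper's LBV metric, whose formula is not given here. The function names are hypothetical.

```python
from collections import Counter

def activation_ratio(selected, n_experts):
    """selected: one list of chosen expert ids per token.
    Returns, for each expert, the fraction of tokens that routed to it."""
    counts = Counter(e for token in selected for e in set(token))
    n_tokens = len(selected)
    return [counts[e] / n_tokens for e in range(n_experts)]

def unused_experts(selected, n_experts):
    """Experts that no token selected, a symptom of under-utilization."""
    ratios = activation_ratio(selected, n_experts)
    return [e for e, ratio in enumerate(ratios) if ratio == 0.0]

# Example: 3 tokens, Top-2 routing over 4 experts.
routes = [[0, 1], [0, 2], [0, 1]]
print(activation_ratio(routes, 4), unused_experts(routes, 4))
```

Computing these ratios separately per task and per layer gives the rows of a heatmap; a growing list of unused experts as the pool size increases would be consistent with the under-utilization reported for large reuse groups.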

7. Analysis, Limitations, and Future Prospects

ReXMoE breaks the core limitation of layer-local routing, allowing parameter-efficient decoupling of expert width and routing diversity. Cross-layer reuse of experts, in combination with progressive scaling routing, enables richer expert combinations without increasing FFN parameters and encourages both exploration and expert specialization during training.

Salient points:

The choice of $r=4$ emerges as a strong sweet spot, balancing diversity and overhead.
Excessively large $r$ ($r \geq 16$) causes expert collapse and imbalance unless load-balancing regularization is introduced.
Some prefill I/O overhead marginally reduces inference speed for short context windows.
Future work could enhance large-$r$ performance using explicit load-balancing objectives or learned/adaptive groupings (non-uniform $r$). Extending cross-layer reuse to multi-task or continual settings is a promising direction for knowledge sharing.

In summary, ReXMoE establishes a new design dimension in MoE-based LLMs: cross-layer expert reuse, operationalized via a lightweight router parameter extension and a curriculum-based progressive scaling routing schedule. This architecture achieves consistent improvements in language modeling and downstream tasks with minimal incremental overhead, advancing the pursuit of scalable, parameter-efficient MoE LLMs (Tan et al., 20 Oct 2025).


References (1)

Tan et al. (2025). ReXMoE: Reusing Experts with Minimal Overhead in Mixture-of-Experts.
