Hierarchical Shared-Routed MoE

Updated 30 November 2025
  • Hierarchical Shared-Routed MoE is a sparse, multi-level expert selection architecture that leverages both global and local routing to optimize parameter efficiency.
  • It features shared routing mechanisms across layers, spatial locations, or expert groups to enhance robustness, promote specialization, and reduce overfitting.
  • Practical applications include improved speech recognition, neural machine translation, MRI reconstruction, and CTR prediction with demonstrated gains in WER, BLEU, and PSNR metrics.

A Hierarchical Shared-Routed Mixture-of-Experts (MoE) is a sparse conditional computation architecture in which expert selection is controlled through multiple levels of routing, often leveraging both global and local (or context/task-specific) information, and is frequently augmented by sharing router parameters across depth, experts, or spatial locations. Recent advances articulate a variety of hierarchical and shared routing mechanisms, aiming to maximize parameter efficiency, expert specialization, robust scaling, and deployment flexibility across diverse domains such as speech recognition, neural machine translation, MRI image reconstruction, and CTR prediction. Distinct instantiations include DLG-MoE (Dynamic Language Group-MoE) for code-switching ASR (Huang et al., 26 Jul 2024), SpeechMoE (You et al., 2021), THOR-MoE for NMT (Liang et al., 20 May 2025), HiFi-MambaV2 for MRI (Fang et al., 23 Nov 2025), Omni-router Transformer (Gu et al., 8 Jul 2025), HiLoMoE for CTR prediction (Zeng et al., 12 Oct 2025), and MixER for dynamical system learning (Nzoyem et al., 7 Feb 2025). Despite differing in application, these models share an emphasis on hierarchically structured, often shared and/or cross-depth routers, enabling modular and adaptive expert utilization under parameter and compute constraints.

1. Architectural Foundations and Routing Hierarchies

Hierarchical shared-routed MoE architectures interpose multiple routing stages prior to expert invocation. Canonical instances involve:

  • Task/group-level routers, which partition tokens/frames/samples into coarse semantic or linguistic groups (e.g., language ID in DLG-MoE (Huang et al., 26 Jul 2024), domain/language in THOR-MoE (Liang et al., 20 May 2025)).
  • Local or unsupervised routers, which further distribute within-group (or within-domain) data across group-specific experts based on learned, often latent, features.

In models such as DLG-MoE, a "shared Language Router" provides framewise group assignment using multitask CTC-trained heads, followed by a per-group unsupervised router (unshared) dispatching to $k$ of $n_g$ group experts. THOR-MoE analogously introduces a task predictor (sentence-level) that establishes a soft mask over experts, with token-level, context-augmented routers then acting on the pruned expert set (Liang et al., 20 May 2025).

In the shared-router paradigm, a single parameterization of the routing function (e.g., one projection matrix or convolutional kernel) is reused across model depth (Omni-router (Gu et al., 8 Jul 2025)), across spatial locations (HiFi-MambaV2 (Fang et al., 23 Nov 2025)), or among multiple MoE blocks (HiLoMoE (Zeng et al., 12 Oct 2025)). Enforcing inter-layer or inter-location consistency in this way increases cross-depth and global cooperation, reduces overfitting, and has empirically been found to promote robust expert specialization.
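As a concrete illustration of this paradigm, the sketch below reuses one linear router across every MoE layer while keeping per-layer experts independent; the module layout, top-1 dispatch, and dimensions are assumptions for exposition, not any of the cited implementations.

```python
import torch
import torch.nn as nn


class DepthSharedRouterMoE(nn.Module):
    """Toy stack of MoE layers that all score tokens with one shared router."""

    def __init__(self, d_model: int, n_experts: int, n_layers: int):
        super().__init__()
        # Single routing matrix shared by every layer (Omni-router style).
        self.shared_router = nn.Linear(d_model, n_experts, bias=False)
        # Each layer keeps its own independent set of expert FFNs.
        self.layers = nn.ModuleList([
            nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)
            ])
            for _ in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [tokens, d_model]
        for experts in self.layers:
            probs = self.shared_router(x).softmax(dim=-1)  # same router at every depth
            top1 = probs.argmax(dim=-1)                    # top-1 dispatch for simplicity
            out = torch.zeros_like(x)
            for e, expert in enumerate(experts):
                mask = top1 == e
                if mask.any():
                    out[mask] = probs[mask, e].unsqueeze(-1) * expert(x[mask])
            x = x + out                                    # residual MoE update
        return x
```

Because `shared_router` is the only routing parameterization, its gradients accumulate signal from every depth, which is the cross-depth coupling described above.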

2. Routing and Gating Mechanism Details

Expert routing is typically realized via a sequence of gating functions, which may consist of softmaxes, top-$k$ selections, or explicit clustering-based matching algorithms. Core features include:

  • Explicit group assignment (DLG-MoE, THOR-MoE): Language/domain routers operate on high-level features and allocate each token/frame to a group (e.g., Zh vs. En).
  • Sparse local dispatching: Within each group, a linear router projects features to expert logits, from which a top-$k$ or top-1 subset is selected, with gating weights formed by normalizing (softmax) over the selected subset.

Formal notation, using DLG-MoE (Huang et al., 26 Jul 2024) as an example (a code sketch follows the list):

  • For group $g$, the router score $s_t^g = h_t^g W_R^g$,
  • Top-$k$ indices $I_t = \text{TopKIndices}(s_t^g, k)$,
  • Gating $G_t[I_t] = \text{Softmax}(s_t^g[I_t])$,
  • Output $h_{\text{out},t}^g = \sum_{i \in I_t} G_t[i]\, E_i^g(h_t^g)$.
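A minimal sketch of these per-group routing equations, assuming one group's frames `h` of shape `[T, d]`, a router matrix `W_R` of shape `[d, n_g]`, and a list of expert callables; names and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F


def route_group(h, W_R, experts, k):
    """Dispatch each frame of one group to its top-k group experts.

    h: [T, d] hidden states h_t^g; W_R: [d, n_g] router weights W_R^g;
    experts: list of n_g callables E_i^g; k: experts selected per frame.
    """
    scores = h @ W_R                              # s_t^g = h_t^g W_R^g  -> [T, n_g]
    top_scores, top_idx = scores.topk(k, dim=-1)  # I_t = TopKIndices(s_t^g, k)
    gates = F.softmax(top_scores, dim=-1)         # G_t[I_t] = Softmax(s_t^g[I_t])
    out = torch.zeros_like(h)
    for t in range(h.size(0)):                    # h_out,t^g = sum_{i in I_t} G_t[i] E_i^g(h_t^g)
        for j in range(k):
            out[t] += gates[t, j] * experts[int(top_idx[t, j])](h[t])
    return out
```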

Shared routers use the same parameter set across spatial positions or depth, e.g., Omni-router's $W_r^{\text{shared}}$ for all Transformer layers (Gu et al., 8 Jul 2025), or HiFi-MambaV2's convolutional local router applied identically across all pixels and groups (Fang et al., 23 Nov 2025). In MixER (Nzoyem et al., 7 Feb 2025), top-1 gating is optimized via K-means clustering of context embeddings, followed by a least-squares fit.

Dynamic top-$k$ (as in DLG-MoE and THOR-MoE) enables selecting different numbers of experts at training and at inference, providing a trade-off between computation and model capacity.
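Continuing the `route_group` sketch above, dynamic top-$k$ amounts to calling the same routing function with a smaller `k` at inference than at training; the shapes and values below are illustrative only.

```python
import torch

# Same routing weights and experts, different expert budgets (illustrative sizes).
h, W_R = torch.randn(100, 256), torch.randn(256, 4)
experts = [torch.nn.Linear(256, 256) for _ in range(4)]

y_train = route_group(h, W_R, experts, k=2)  # wider budget during training
y_infer = route_group(h, W_R, experts, k=1)  # tighter budget at inference
```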

3. Parameterization, Expert Specialization, and Sharing

Expert networks are generally independent per group or layer, parameterized as 2-layer FFNs (e.g., Conformer FFN in DLG-MoE; rank-1 LoRA in HiLoMoE), or lightweight channelwise MLPs (HiFi-MambaV2 (Fang et al., 23 Nov 2025)). Parameter sharing can occur:

  • In base weights (HiLoMoE): All layers share a base matrix, with per-layer, per-expert LoRA updates (Zeng et al., 12 Oct 2025); see the sketch after this list.
  • In routers: Parameterizing routers with shared kernels or matrices across layers/positions enforces consistency.
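A hedged sketch of the base-weight sharing described above: one base matrix is shared, and each expert adds only a private rank-1 LoRA update. The factorization and initialization here are assumptions for exposition, not HiLoMoE's exact parameterization.

```python
import torch
import torch.nn as nn


class Rank1LoRAExpert(nn.Module):
    """Expert = shared base linear map plus a private rank-1 update: (W + a b^T) x."""

    def __init__(self, shared_base: nn.Linear):
        super().__init__()
        self.base = shared_base                        # one matrix shared by all experts
        d_out, d_in = shared_base.weight.shape
        self.a = nn.Parameter(torch.zeros(d_out))      # zero init: expert starts as the base map
        self.b = nn.Parameter(torch.randn(d_in) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [T, d_in]
        return self.base(x) + (x @ self.b).unsqueeze(-1) * self.a


# Only the rank-1 factors a, b (and the router, not shown) are expert-specific.
base = nn.Linear(64, 64, bias=False)
experts = [Rank1LoRAExpert(base) for _ in range(8)]
```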

This structuring promotes expert specialization. For instance, in HiFi-MambaV2, shared experts learn globally stable anatomical/frequency priors, while routed (local) experts specialize to textural or boundary phenomena at specific spatial regions and scales (Fang et al., 23 Nov 2025). In NLP/ASR, group- or task-level routers relieve within-group routers of handling global variation, letting them focus on latent style, accent, or domain variation.

4. Training Objectives, Load Balancing, and Regularization

Hierarchical shared-routed MoE requires regularization to prevent expert collapse and ensure effective capacity utilization:

  • Load balancing: Auxiliary losses penalize uneven expert usage (see the sketch after this list). Typical forms include $\mathcal{L}_{\text{load}} = N \sum_j f_j \rho_j$ (with $f_j$ the fraction of tokens routed to expert $j$ and $\rho_j$ the mean router probability for expert $j$) (Gu et al., 8 Jul 2025), the squared mean-importance loss $\bar{L}_m = n \sum_j (\text{Imp}_j)^2$ (You et al., 2021), or the pixel-wise $L_{\text{bal}} = N_r \sum_e p_e^2$ (Fang et al., 23 Nov 2025).
  • Sparsity/entropy regularization: $L_1$ loss on router outputs to encourage one-hotness (You et al., 2021).
  • Task/domain prediction losses: Where hierarchical routers operate on explicit or inferred attributes, an auxiliary loss ensures correct prediction (e.g., $L_{tp} = -\log P^t[t_{\text{gold}}]$ in THOR-MoE (Liang et al., 20 May 2025)).
  • Three-stage curriculum/training: HiLoMoE uses progressive unfreezing and warmup to avoid route/expert collapse (Zeng et al., 12 Oct 2025).
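The sketch below implements the first load-balancing form listed above, $\mathcal{L}_{\text{load}} = N \sum_j f_j \rho_j$, assuming token-level top-1 assignments; tensor shapes and names are illustrative rather than taken from any of the cited codebases.

```python
import torch


def load_balance_loss(router_probs: torch.Tensor, expert_index: torch.Tensor) -> torch.Tensor:
    """L_load = N * sum_j f_j * rho_j.

    router_probs: [T, N] softmax router outputs per token;
    expert_index: [T] index of the expert each token was routed to.
    """
    n_experts = router_probs.size(-1)
    # f_j: fraction of tokens dispatched to expert j.
    counts = torch.bincount(expert_index, minlength=n_experts).float()
    f = counts / expert_index.numel()
    # rho_j: mean router probability assigned to expert j.
    rho = router_probs.mean(dim=0)
    return n_experts * torch.sum(f * rho)
```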

Gradient flow is sometimes restricted (e.g., only gating networks receive gradients from auxiliary losses), further controlling specialization dynamics.

5. Applications and Empirical Results

Hierarchical shared-routed MoE architectures are deployed in a range of domains:

  • Code-switching ASR (DLG-MoE (Huang et al., 26 Jul 2024)): DLG-MoE achieves lower mixed error rates (MER) on CS benchmarks, enabling monolingual expert extraction for deployment.
  • Speech recognition (Omni-router, SpeechMoE): Shared routers outperform per-layer routing, yielding 8–11% relative WER/CER reductions (Gu et al., 8 Jul 2025, You et al., 2021).
  • Neural machine translation (THOR-MoE): Task-guided routing and context-fusion achieve up to +1.74 BLEU and +0.93 BLEU on multi-domain and multilingual NMT, with 22% of experts active at inference (Liang et al., 20 May 2025).
  • MRI image reconstruction (HiFi-MambaV2): Hierarchical shared-routed MoE in an unrolled Mamba network improves PSNR and SSIM versus dense, flat, and prior models (+0.56 dB PSNR over HiFi-MambaV1) (Fang et al., 23 Nov 2025).
  • CTR prediction (HiLoMoE): Achieves +0.15–0.20 pp AUC over non-MoE baselines and an 18.5% FLOPs reduction via parameter-efficient rank-1 LoRA experts and hierarchical routing (Zeng et al., 12 Oct 2025).
  • Dynamical systems (MixER): Excels in discovering hierarchical clusters in ODE families via explicit context clustering and top-1 routing (Nzoyem et al., 7 Feb 2025).

6. Flexibility, Pruning, and Deployment Considerations

Hierarchical shared routing architectures enable flexible and efficient deployment:

  • Dynamic inference: Models such as DLG-MoE and THOR-MoE can vary $k$ at inference, trading off throughput and capacity without retraining (Huang et al., 26 Jul 2024, Liang et al., 20 May 2025).
  • Monolingual/Task-Specific submodels: Routers and experts outside a desired group/language can be pruned, yielding compact, task-specific deployments without retraining the entire backbone (Huang et al., 26 Jul 2024).
  • Streaming: Frame/token-synchronous routers (no lookahead) support true online processing (Huang et al., 26 Jul 2024).
  • Parallel expert execution: HiLoMoE routes are calculated sequentially, but all experts are applied in a single fused pass for inference efficiency (Zeng et al., 12 Oct 2025).
  • Plug-and-play: Modular router designs (THOR-MoE) allow hierarchical gates to integrate with a diverse set of standard MoE/Top-$k$/Top-$p$ schemes (Liang et al., 20 May 2025).

7. Limitations and Open Challenges

While the hierarchical shared-routed MoE framework demonstrates substantial gains, several limitations persist:

  • Expert collapse and specialization bottlenecks: In low-data regimes, or when tasks and contexts are highly related, experts may be underutilized, or context ambiguity may lead to poor routing and sub-optimal adaptation (shown in MixER and HiLoMoE) (Nzoyem et al., 7 Feb 2025, Zeng et al., 12 Oct 2025).
  • Training instability: Deep or multi-stage gating demands careful curriculum (progressive warmup), load balancing, and initialization strategies (Zeng et al., 12 Oct 2025).
  • Router expressivity: Current shared router parameterizations are typically linear; more expressive, non-linear routers or those accommodating dynamic expert sets may further improve adaptability.
  • Scalability: Combinatorial growth of possible expert paths in deep hierarchies can present computational and optimization challenges; managing this growth remains an open direction.

A plausible implication is that increasing the flexibility and expressivity of hierarchical/shared routing, while maintaining parameter and computational efficiency, will drive future developments in MoE-based modeling across language, perception, and scientific discovery tasks.
