HiLoMoE: Hierarchical LoRA Mixture-of-Experts

Updated 26 March 2026

HiLoMoE is a parameter-efficient adaptation framework that combines Low-Rank Adaptation with Mixture-of-Experts using a hierarchical schedule.
It employs non-uniform expert allocation with dynamic routing to specialize higher layers and mitigate catastrophic forgetting.
Reported experiments show HiLoMoE outperforming standard LoRA and uniform MoE-LoRA baselines in medical LLM adaptation, click-through-rate (CTR) prediction, and automatic speech recognition (ASR), while reducing computational cost.

Hierarchical LoRA MoE (HiLoMoE) is a family of parameter-efficient adaptation methods for large neural networks, primarily Transformers, that combine Low-Rank Adaptation (LoRA) with Mixture-of-Experts (MoE). Its defining characteristics are multiple input-routed LoRA adapters at each model layer, arranged according to a hierarchical (non-uniform, often top-heavy) schedule across network depth, and typically layer-wise, token-wise, or domain/task-aware routing. The approach facilitates specialization in higher layers and robustness against catastrophic forgetting, while maintaining computational efficiency suitable for LLMs, speech recognition, recommendation systems, and multi-modal or continual learning tasks.

1. Structural Principles and Architectural Variants

The central mechanism is a frozen backbone (e.g., LLaMA-3-8B) augmented, in each layer $l$, with multiple trainable LoRA adapters $\{(A_{l,i}, B_{l,i})\}_{i=1}^{N_l}$, each providing a low-rank update $\Delta W_{l,i} = B_{l,i}A_{l,i}$ to the main weight matrix. Instead of allocating experts uniformly across layers, HiLoMoE employs a hierarchical schedule in which the expert count $N_l$ and the ranks $r_{l,i}$ increase monotonically with depth. This reflects the empirical observation that higher network layers encode semantically complex, task-specialized representations and thus demand greater adaptation capacity (Yang et al., 12 Jan 2026, Gao et al., 2024).

HiLoMoE can either implement per-layer token routing via softmax-gated scores with a learnable temperature (enabling soft or sparse merging), or more sophisticated hierarchical routers (e.g., combining global domain recognition and local feature-dependent adapters, or dual-path (beat-level/morphology and rhythm) gating in time-series models) (Zeng et al., 12 Oct 2025, Xu et al., 4 Mar 2026, Mu et al., 2024). In multi-domain, multi-lingual, or continual learning scenarios, variants split experts into "base" (for knowledge preservation) and "specialist" components, applying additional regularization to retain pre-trained knowledge (Yang et al., 12 Jan 2026, Zheng et al., 2 Jan 2026, Jia et al., 5 Jun 2025).

2. Mathematical Formulation and Routing Mechanisms

The output of each adapted layer, for an input $x$, is given by:

$h_l = W_0x + \sum_{i=1}^{N_l} g_{l,i}(x)\, (B_{l,i}A_{l,i})x$

with $W_0$ frozen, and $g_{l,i}(x)$ computed via

$s_l(x) = W_g^{(l)}x \in \mathbb{R}^{N_l}, \qquad g_{l,i}(x) = \frac{\exp(s_{l,i}/\tau^{(l)})}{\sum_{j} \exp(s_{l,j}/\tau^{(l)})}$

where $\tau^{(l)}$ is a learnable (or annealed) temperature.
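A minimal sketch of this gated update in plain Python, for a single input vector, may help make the formula concrete. The function names (`matvec`, `softmax`, `hilomoe_layer`) and the toy dimensions are illustrative assumptions and do not come from any cited implementation; a practical version would use batched tensor operations.

```python
import math

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def softmax(scores, tau=1.0):
    """Temperature-scaled softmax; lower tau gives sharper routing."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hilomoe_layer(x, W0, W_g, experts, tau=1.0):
    """h = W0 x + sum_i g_i(x) * B_i (A_i x), with W0 frozen.

    experts is a list of (A_i, B_i) pairs; A_i is r x d_in, B_i is d_out x r.
    """
    gates = softmax(matvec(W_g, x), tau)
    h = matvec(W0, x)
    for g, (A, B) in zip(gates, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update applied to x
        h = [h_k + g * d_k for h_k, d_k in zip(h, delta)]
    return h, gates

# Toy example (hypothetical values): d_in = d_out = 2, two rank-1 experts.
W0 = [[1.0, 0.0], [0.0, 1.0]]
W_g = [[1.0, 0.0], [0.0, 1.0]]        # router: one score per expert
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),    # expert 1: A is 1x2, B is 2x1
    ([[0.0, 1.0]], [[0.0], [1.0]]),    # expert 2
]
h, gates = hilomoe_layer([1.0, 1.0], W0, W_g, experts, tau=1.0)
```

Computing `A x` before applying `B` keeps the per-expert cost proportional to the rank rather than to the full weight matrix, which is the reason the low-rank product is never materialized.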

Crucially, the number of experts $N_l$ and the adaptation rank $r_{l,i}$ follow a non-linear schedule in network depth $l$, governed by a curvature parameter chosen so that more experts and higher rank are assigned at higher $l$ (Yang et al., 12 Jan 2026, Gao et al., 2024, Cong et al., 6 Feb 2025).
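As an illustration only, a top-heavy depth schedule for the expert count $N_l$ and the rank could be generated as follows. The power-law form, the name `depth_schedule`, and the parameter `gamma` are assumptions made for this sketch; the cited works may use a different functional form.

```python
def depth_schedule(num_layers, v_min, v_max, gamma=2.0):
    """Assumed power-law schedule: value grows from v_min to v_max with depth.

    gamma > 1 gives a convex (top-heavy) curve; gamma = 1 is linear.
    This functional form is an illustrative assumption.
    """
    values = []
    for l in range(1, num_layers + 1):
        # Relative depth in [0, 1]; a single-layer model gets the maximum.
        frac = (l - 1) / (num_layers - 1) if num_layers > 1 else 1.0
        values.append(v_min + round((v_max - v_min) * frac ** gamma))
    return values

# Hypothetical 8-layer configuration.
experts_per_layer = depth_schedule(8, v_min=1, v_max=8, gamma=2.0)
ranks_per_layer = depth_schedule(8, v_min=4, v_max=16, gamma=2.0)
```

With `gamma` above 1, most of the growth is deferred to the last few layers, which matches the top-heavy allocation described above; `gamma = 1` recovers a linear ramp.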

Routing need not be layer-local. CTR applications employ queries recursively aggregated from prior layer router outputs for hierarchical score computation, which is computationally efficient and parallelizable (Zeng et al., 12 Oct 2025). HDMoLE (Mu et al., 2024) combines a frozen global (domain/accent) router with trainable local softmax gating and dynamic, differentiable thresholds, allowing variable numbers of experts per input and layer.
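The effect of threshold-based gating, where the number of active experts varies per input, can be sketched as below. This is a simplified stand-in that covers only the local gating step: it uses a fixed hard threshold, whereas the thresholds described for HDMoLE are dynamic and differentiable, and it omits the global router. The name `threshold_route` and the fallback rule are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def threshold_route(scores, threshold):
    """Keep experts whose gate weight reaches `threshold`, then renormalize.

    Falls back to the single best expert if none passes, so at least
    one expert is always active (an assumption of this sketch).
    """
    gates = softmax(scores)
    active = [i for i, g in enumerate(gates) if g >= threshold]
    if not active:
        active = [max(range(len(gates)), key=gates.__getitem__)]
    total = sum(gates[i] for i in active)
    return {i: gates[i] / total for i in active}

# A confident router activates one expert; an ambiguous one activates two.
confident = threshold_route([4.0, 0.0, 0.0, 0.0], threshold=0.2)
ambiguous = threshold_route([1.0, 1.0, 0.0, 0.0], threshold=0.2)
```

Unlike fixed top-k selection, the number of experts evaluated here depends on how peaked the gate distribution is for the given input.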

Variants such as S’MoRE (Zeng et al., 8 Apr 2025) employ multi-depth, tree-structured composition of LoRA residuals, with each "route" corresponding to a path through such a tree, and token-dependent routing decisions made hierarchically.

3. Training Objectives, Knowledge Preservation, and Parameter Efficiency

The overall loss typically combines:

Main task loss (cross-entropy, MSE, CTC, or multi-objective scalarized rewards).
Auxiliary load-balancing losses to avoid expert collapse.
Stability/identity penalties for knowledge-preserving experts.
Orthogonality and singular-value regularizers in continual/incremental setups (Yang et al., 12 Jan 2026, Jia et al., 5 Jun 2025).
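One way such terms might be combined is a weighted sum, sketched below. The specific load-balancing penalty (the number of experts times the sum of squared mean gate weights), the weights `lambda_balance` and `lambda_stability`, and all function names are assumptions chosen for illustration; the cited papers define their own auxiliary losses.

```python
def load_balance_penalty(gate_rows):
    """N * sum_i (mean gate weight of expert i)^2.

    Equals 1.0 when experts are used uniformly on average and N when
    every token is routed to a single expert.
    """
    num_tokens = len(gate_rows)
    num_experts = len(gate_rows[0])
    mean_gate = [sum(row[i] for row in gate_rows) / num_tokens
                 for i in range(num_experts)]
    return num_experts * sum(m * m for m in mean_gate)

def total_loss(task_loss, gate_rows, stability_penalty=0.0,
               lambda_balance=0.01, lambda_stability=0.1):
    """Weighted sum of the task loss and auxiliary terms (assumed weights)."""
    return (task_loss
            + lambda_balance * load_balance_penalty(gate_rows)
            + lambda_stability * stability_penalty)

# Each row holds one token's gate weights over two experts.
balanced = [[0.5, 0.5], [0.5, 0.5]]
collapsed = [[1.0, 0.0], [1.0, 0.0]]
```

Because the penalty is smallest when average usage is uniform, adding it to the task loss discourages the router from sending every token to the same expert.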

For multi-task alignment, preference or task vectors steer routing (via simplex proximity or policy-weighted combination), and LoRA adapters can be statically fused from SVD-compressed parameter deltas (Li et al., 27 May 2025). Continual learning variants introduce SVD-based orthogonal updates—freezing top singular vectors post-task, constraining future updates to residual subspaces and thus mitigating catastrophic forgetting (Jia et al., 5 Jun 2025).
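The idea of constraining future updates to a residual subspace can be illustrated with a plain projection step. The sketch assumes the frozen directions are already available as orthonormal vectors (for example, top singular vectors retained after an earlier task) and does not compute the SVD itself; `project_to_residual` is a hypothetical helper, not a function from the cited work.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_to_residual(update, frozen_dirs):
    """Remove the components of `update` along each frozen direction.

    `frozen_dirs` must be orthonormal (as singular vectors are), so the
    projections can be subtracted independently.
    """
    residual = list(update)
    for u in frozen_dirs:
        coeff = dot(update, u)
        residual = [r - coeff * u_k for r, u_k in zip(residual, u)]
    return residual

# Hypothetical example: two frozen directions in a 3-dimensional space.
frozen = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
step = project_to_residual([0.3, -0.2, 0.5], frozen)
```

After projection the update has no component along the frozen directions, so whatever those directions encode from earlier tasks is left unchanged by the new step.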

The adapter parameter count scales as $\mathcal{O}\big(\sum_{l}\sum_{i=1}^{N_l} r_{l,i}\,(d_{\text{in}} + d_{\text{out}})\big)$ for LoRA experts attached to a $d_{\text{out}} \times d_{\text{in}}$ weight matrix, substantially below full-rank or flat MoE alternatives. Inference cost scales with the number of experts active per token and their ranks, both of which are kept small by sparse gating and hierarchical rank scheduling (Yang et al., 12 Jan 2026, Gao et al., 2024, Cong et al., 6 Feb 2025, Zeng et al., 12 Oct 2025).
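A short worked count shows the scale involved. The sketch assumes every expert in a layer shares one rank and adapts a single $d_{\text{out}} \times d_{\text{in}}$ matrix, with a linear router per layer; the dimensions and per-layer settings are hypothetical.

```python
def lora_moe_params(d_in, d_out, experts_per_layer, ranks_per_layer):
    """Trainable parameters, assuming every expert in layer l shares the
    rank r_l and adapts one d_out x d_in weight matrix.

    Returns (adapter parameters, router parameters); the router is
    assumed to be a single N_l x d_in matrix per layer.
    """
    adapters = sum(n * r * (d_in + d_out)
                   for n, r in zip(experts_per_layer, ranks_per_layer))
    routers = sum(n * d_in for n in experts_per_layer)
    return adapters, routers

adapters, routers = lora_moe_params(
    d_in=4096, d_out=4096,
    experts_per_layer=[1, 2, 4],   # hypothetical top-heavy allocation
    ranks_per_layer=[4, 8, 16],
)
full_rank = 3 * 4096 * 4096  # full fine-tuning of the same three matrices
```

In this toy setting the adapters and routers together come to well under 2% of the parameters that full fine-tuning of the same three matrices would train.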

4. Empirical Performance and Ablation Findings

Reported experiments across domains show HiLoMoE outperforming plain LoRA and uniform MoE-LoRA baselines:

Model	Medical Benchmarks (avg %)	General Knowledge Δ (GSM8K/MMLU)	CTR AUC Δ (%)	Multi-Accent ASR CER (%)
Standard LoRA	53.8	-3.6 / -0.5	baseline	19.98
Uniform MoE-LoRA	56.8	-	baseline	18.76
HiLoMoE (Top-heavy)	59.5	-0.3 / -0.5	+0.20	16.58

Specific findings:

Medical LLMs: HiLoMoE achieved 65.8% on MedQA (+6.9% over LoRA); hierarchical allocation yielded +2.7% on average vs. uniform (Yang et al., 12 Jan 2026).
CTR: 0.20% average AUC gain, 18.5% FLOPs reduction (Zeng et al., 12 Oct 2025).
Multi-accent ASR: CER reduction vs. flat MoE; dynamic thresholds and hierarchical routers boosted domain adaptation while controlling forgetting (Mu et al., 2024).
Hierarchical rank and expert allocation consistently outperformed flat allocations under equal or smaller parameter budgets (Cong et al., 6 Feb 2025).
S’MoRE achieved up to +2 pp accuracy gains compared to single-level LoRA-MoE, at marginally higher or even lower parameter cost (Zeng et al., 8 Apr 2025).
Continual learning: SVD-based HiLoMoE halved forgetting metrics relative to non-hierarchical or non-orthogonalized LoRA variants (Jia et al., 5 Jun 2025).

5. Theoretical Insights, Extensions, and Implementation Considerations

The number of distinct computational graphs that can be routed per input grows exponentially with depth and with the number of experts per layer, especially in tree- or multi-depth settings, which increases the expressivity of HiLoMoE (Zeng et al., 8 Apr 2025). Lower network layers exhibit high redundancy among LoRA experts; global parameter efficiency is achieved by concentrating high-rank, diverse experts near the output (Gao et al., 2024).
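The growth in routing paths can be made concrete by counting expert selections. The sketch assumes hard top-k routing with independent choices in every layer, which is a simplification: soft gating has no discrete routes, and tree-structured variants count paths differently. `count_routes` is a hypothetical helper.

```python
import math

def count_routes(experts_per_layer, k):
    """Number of distinct per-token top-k expert selections across all layers."""
    total = 1
    for n in experts_per_layer:
        total *= math.comb(n, min(k, n))  # choose k of the n experts in this layer
    return total

# Four experts per layer, top-2 routing: 6 choices per layer.
routes_by_depth = {depth: count_routes([4] * depth, k=2) for depth in (2, 4, 8)}
```

Each additional layer multiplies the count by the number of choices available in that layer, so the total grows geometrically with depth.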

Extensions include: adaptive per-layer and per-expert rank selection; deeper-than-2-level hierarchical routing; multi-modal grouping for domain/task identification; integration with training-free selection (Gaussian likelihood over LoRA embeddings as in HiLoRA (Han et al., 14 Oct 2025)); and coverage of non-LLM modalities (e.g., ECG foundation modeling, CTR, ASR).

Practically, HiLoMoE architectures maintain inference costs nearly on par with standard LoRA, as only a few experts are active at each step. Three-stage training frameworks stabilize optimization in deep or heavily-MoE-augmented models (Zeng et al., 12 Oct 2025). Dynamic thresholds and domain-aware routers further improve coverage, efficiency, and cross-domain generalization, particularly for low-resource or streaming scenarios (Mu et al., 2024, Han et al., 14 Oct 2025).

6. Notable Applications Across Modalities

HiLoMoE schemes have been validated in:

Medical LLMs for clinical diagnosis, summarization, drug interaction extraction (Yang et al., 12 Jan 2026).
NLP benchmarks (ScienceQA, CommonsenseQA, OpenbookQA, MRPC, COLA, RTE), showing maximal gain in representation-rich (deep, semantic) tasks (Gao et al., 2024).
Click-through rate prediction with parallelizable hierarchical routing yielding compute savings and improved AUC (Zeng et al., 12 Oct 2025).
Multilingual and multi-accent ASR via language-agnostic hierarchical LoRA-MoE; dynamic routing based on intermediate LID posteriors (Zheng et al., 2 Jan 2026, Mu et al., 2024).
Continual embodied learning, supporting hierarchical task/planning structure and preserving prior skills using SVD-orthogonality (Jia et al., 5 Jun 2025).
Multi-objective LLM alignment using hierarchical preference and router expert layers, producing Pareto-dominant trade-off curves (Li et al., 27 May 2025).
Domain generalization in LLM adapters with training-free, hierarchical rank-one/component routing strategies (Han et al., 14 Oct 2025).

7. Limitations, Future Research, and Open Directions

Performance plateaus are observed with excessive vertical scaling in tasks with limited sequential dependencies (Zeng et al., 12 Oct 2025). Hyperparameter tuning (number and distribution of experts, ranks, gating temperatures) can be nontrivial, and the optimal split point for shared vs. domain-specific adaptation remains application-dependent (Zheng et al., 2 Jan 2026). SVD-based knowledge preservation demands careful balancing of orthogonality regularization strengths, while dynamic gating mechanisms introduce complexity in deployment for low-latency systems.

Future work encompasses generalization to deeper hierarchical configurations, further adaptive (sample-dependent) expert selection, integration with multimodal encoders, and improved parameter-count/capacity management through adaptive rank allocation or expert merging (Cong et al., 6 Feb 2025, Li et al., 27 May 2025, Han et al., 14 Oct 2025).