Mixture of Rank-Wise Experts (MoRE)

Updated 7 December 2025
  • Mixture of Rank-Wise Experts (MoRE) is a neural network architecture that decomposes low-rank adaptation modules into independent rank-level experts, enabling dynamic and input-conditioned sparse routing.
  • It generalizes methods like Mixture-of-Experts and blockwise partitioning in LoRA, achieving enhanced parameter efficiency and specialization for tasks in LLM adaptation, federated fine-tuning, and image super-resolution.
  • Empirical results demonstrate that MoRE attains state-of-the-art performance with reduced computation and robust multi-task learning, though it requires careful tuning of routing strategies and load-balancing mechanisms.

A Mixture of Rank-Wise Experts (MoRE) is a neural network architecture that encodes expert modularity at the finest possible granularity by decomposing low-rank adaptation modules into independent rank-level "experts" and applying dynamic, often input- or task-conditioned, sparse soft routing at the rank level. MoRE generalizes and subsumes prior designs such as Mixture-of-Experts (MoE) layers and blockwise partitioning in LoRA, enabling both fine-grained parameter efficiency and enhanced specialization. The concept is now deployed in LLM adaptation, federated fine-tuning, image super-resolution, and robust multi-task learning, with theoretical and empirical demonstrations of improved tradeoffs over conventional PEFT schemes.

1. Mathematical Foundation and Model Formulation

At the core, MoRE begins from the low-rank decomposition of a learnable weight update for parameter-efficient adaptation, as in LoRA:

$$\Delta W = B\,A, \qquad B \in \mathbb{R}^{d_{\text{out}} \times r}, \quad A \in \mathbb{R}^{r \times d_{\text{in}}}$$

where $r \ll \min(d_{\text{out}}, d_{\text{in}})$ is the effective rank. MoRE treats each rank-1 component as a distinct expert:

$$\Delta W\,x = \sum_{i=1}^{r} B_{:,i}\,(A_{i,:}\,x) = \sum_{i=1}^{r} E_i(x), \qquad E_i(x) = B_{:,i}\,(A_{i,:}\,x)$$

Similarly, in generalized expert mixtures with learnable soft gating $g(x)$ and top-$K$ sparse activation:

$$\Delta W(x) = \sum_{i \in \mathcal{S}} g_i(x)\, B_{:,i}\,(A_{i,:}\,x)$$

where $\mathcal{S}$ is the support set of active experts for input $x$ (typically $|\mathcal{S}| = K \ll r$).
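
The rank-wise rewriting of the LoRA update is an exact identity, which the short sketch below verifies numerically (PyTorch; all dimensions are hypothetical and chosen only for illustration):

```python
import torch

# Minimal numerical check of the rank-wise decomposition above (hypothetical sizes).
d_in, d_out, r = 32, 64, 8
A = torch.randn(r, d_in)
B = torch.randn(d_out, r)
x = torch.randn(d_in)

# Standard LoRA update applied to x: ΔW x = B (A x)
delta_full = B @ (A @ x)

# The same update written as a sum of r rank-1 "experts" E_i(x) = B[:, i] * (A[i, :] @ x)
delta_rankwise = sum(B[:, i] * (A[i, :] @ x) for i in range(r))

assert torch.allclose(delta_full, delta_rankwise, atol=1e-4)
```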

This fine partitioning is mathematically equivalent to a block-diagonal gating on a larger LoRA, unifying blockwise MoE-LoRA and rank-wise routed LoRA within a single framework (Zhao et al., 25 Jan 2025). MoRE architectures can also be extended to hierarchical settings, with rank and expert allocation schedules defined per layer or across modality-specific streams (Cong et al., 6 Feb 2025, Zhang et al., 20 May 2025).

2. Gating, Routing, and Selection Strategies

MoRE implementations utilize either explicit routing networks or implicit mechanisms. In explicit routing, a learnable function (an MLP, shallow convolution, or linear layer) computes gating logits per expert,

$$g(x) = \mathrm{softmax}(W_{\text{router}}\, x),$$

followed by top-$K$ selection and normalization (Wu et al., 30 Nov 2025, Zhao et al., 25 Jan 2025). Inputs are dynamically assigned to the most relevant rank-level experts, enabling input-conditional specialization and parameter sparsity. Top-$K$ sparsity is crucial for meeting resource constraints and compute/energy budgets, as in federated fine-tuning (Wu et al., 30 Nov 2025).
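
A minimal sketch of this explicit routing step, assuming a linear router applied per token (function and variable names are illustrative, not taken from any cited implementation):

```python
import torch
import torch.nn.functional as F

def topk_gate(x: torch.Tensor, w_router: torch.Tensor, k: int) -> torch.Tensor:
    """Compute sparse, renormalized gate weights over r rank-level experts.

    x:        (batch, d_in) token representations
    w_router: (r, d_in) router weights (a linear router; an MLP or conv would also work)
    Returns:  (batch, r) gates with exactly k nonzeros per row, each row summing to 1.
    """
    logits = x @ w_router.t()                           # (batch, r) gating logits
    gates = F.softmax(logits, dim=-1)
    topk_vals, topk_idx = gates.topk(k, dim=-1)         # keep the k largest gates
    sparse = torch.zeros_like(gates).scatter_(-1, topk_idx, topk_vals)
    return sparse / (sparse.sum(dim=-1, keepdim=True) + 1e-9)  # renormalize over active experts
```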

In implicit routing, as exemplified by FlyLoRA, the router is folded into a fixed, frozen sparse random projection. The active indices in $A x$ (for $A$ a sparse, random LoRA down-projection) serve as expert selectors:

$$\mathcal{S}(x) = \operatorname{TopK}\big(\left| A x \right|,\, K\big)$$

achieving lightweight, parameter-free expert selection and preserving semantic neighborhoods (Zou et al., 9 Oct 2025).
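
A sketch of this implicit selection rule, assuming a frozen sparse random down-projection; the construction of the projection shown here is only an illustrative assumption, and FlyLoRA's actual scheme may differ:

```python
import torch

def implicit_topk_select(x: torch.Tensor, A_frozen: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k rank-level experts with the largest |A x| responses (no learned router).

    x:        (batch, d_in) inputs
    A_frozen: (r, d_in) fixed sparse random down-projection, never updated during training
    Returns:  (batch, k) indices of the active ranks S(x).
    """
    h = x @ A_frozen.t()                      # (batch, r) down-projected activations
    return h.abs().topk(k, dim=-1).indices

# Example frozen projection: each of the r rows carries a few random ±1 entries.
r, d_in, nnz = 16, 128, 4
A_frozen = torch.zeros(r, d_in)
for i in range(r):
    cols = torch.randperm(d_in)[:nnz]
    A_frozen[i, cols] = torch.randint(0, 2, (nnz,)).float() * 2 - 1
```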

Some MoRE variants, such as T-REX, also incorporate semantic priors by clustering token embeddings and injecting cluster-affinity scores into the routing (Zhang et al., 13 Apr 2024). These hybrid schemes further regularize or guide routing toward task- or domain-appropriate experts.
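
One generic way to inject such a semantic prior is to add cluster-affinity scores to the learned router logits. The sketch below assumes one precomputed centroid per expert and a blending weight beta; both are illustrative assumptions rather than the exact T-REX recipe:

```python
import torch
import torch.nn.functional as F

def semantic_prior_logits(x: torch.Tensor, w_router: torch.Tensor,
                          centroids: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Blend learned router logits with cluster-affinity scores (generic sketch).

    x:         (batch, d) token embeddings
    w_router:  (r, d) learned router weights
    centroids: (r, d) precomputed token-cluster centroids, one associated with each expert
    """
    learned = x @ w_router.t()                                # (batch, r) learned logits
    affinity = F.cosine_similarity(x.unsqueeze(1),            # (batch, 1, d)
                                   centroids.unsqueeze(0),    # (1, r, d)
                                   dim=-1)                    # (batch, r) cluster affinity
    return learned + beta * affinity                          # prior-guided routing logits
```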

3. Hierarchical and Structural Extensions

Hierarchical MoRE designs, most notably S'MoRE and HILO, move beyond flat or per-layer rank partitioning by organizing experts into multilayered trees or adaptive hierarchies. Each layer or group of residuals within S'MoRE spans a hierarchy of low-rank projections, with a multi-layer router traversing the expert subspace; the layer output is

$$x_{\ell+1}^{i} = \sum_{n \in N_\ell(i)} \alpha_{\ell,i,n}\, \sigma\!\big(B_\ell^{n} A_\ell^{n}\, x + W_\ell\, x_\ell^{n}\big)$$

This yields exponential "structural flexibility" in the number of possible computation paths, achieving superior expressivity under a fixed parameter/FLOP budget compared to flat mixtures (Zeng et al., 8 Apr 2025). HILO, by contrast, adapts both the number and rank of experts per layer according to a hierarchical schedule, allocating more capacity to deeper, more semantically demanding layers (Cong et al., 6 Feb 2025).
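
The recurrence above can be made concrete with a heavily simplified, single-step sketch: each residual at the next level is a gated mixture over the previous level's residuals, each passed through its own low-rank expert. The shapes, the shared W, the ReLU nonlinearity, and the router producing the α weights are all assumptions for illustration, not the exact S'MoRE design:

```python
import torch
import torch.nn as nn

class HierarchicalRankLayerSketch(nn.Module):
    """Simplified sketch of one hierarchical step:
    x_{l+1}^i = sum_n alpha_{l,i,n} * sigma(B^n A^n x + W x_l^n)."""

    def __init__(self, d: int, r: int, n_prev: int, n_next: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(n_prev, r, d) * 0.01)  # per-expert down-projections
        self.B = nn.Parameter(torch.zeros(n_prev, d, r))         # per-expert up-projections
        self.W = nn.Linear(d, d, bias=False)                     # shared residual mixing
        self.router = nn.Linear(d, n_next * n_prev)              # produces alpha_{i,n}

    def forward(self, x: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # x: (batch, d) layer input; prev: (batch, n_prev, d) residuals from the level below
        b, n_prev, _ = prev.shape
        low_rank = torch.einsum('nrd,bd->bnr', self.A, x)        # A^n x for every expert n
        cand = torch.einsum('ndr,bnr->bnd', self.B, low_rank) + self.W(prev)
        cand = torch.relu(cand)                                  # sigma(B^n A^n x + W x_l^n)
        alpha = self.router(x).view(b, -1, n_prev).softmax(-1)   # (batch, n_next, n_prev)
        return torch.einsum('bin,bnd->bid', alpha, cand)         # (batch, n_next, d) outputs
```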

4. Training Strategies and Loss Formulations

MoRE frameworks in practice combine three interconnected layers of innovation: parameter-efficient expert composition, sparse and balanced routing, and multi-objective optimization. Loss functions typically mix the primary supervision loss (e.g., cross-entropy for LLMs, regression/classification losses for multimodal/vision tasks) with a load-balancing or utilization-promoting auxiliary loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha\, \mathcal{L}_{\text{load-balance}}$$

where $\mathcal{L}_{\text{load-balance}}$ penalizes imbalanced router usage, e.g. via squared expected gating frequencies (Cong et al., 6 Feb 2025). Degradation-aware or resource-aware schemes further condition the router budget or loss weight on input complexity or device constraints (He et al., 20 Nov 2025, Wu et al., 30 Nov 2025).
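
A minimal sketch of one such auxiliary term, assuming the routing weights for a batch of tokens are available as a (tokens × experts) matrix; the exact formulation differs across the cited papers:

```python
import torch

def load_balance_loss(gates: torch.Tensor) -> torch.Tensor:
    """gates: (num_tokens, num_experts) routing weights (post-softmax or sparse).
    Penalizes the squared expected gate mass per expert; scaled by the number of
    experts so the value equals 1 under perfectly uniform usage (rows summing to 1)."""
    expert_usage = gates.mean(dim=0)                  # expected gating frequency per expert
    return gates.shape[-1] * (expert_usage ** 2).sum()

def total_loss(task_loss: torch.Tensor, gates: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """L_total = L_task + alpha * L_load-balance."""
    return task_loss + alpha * load_balance_loss(gates)
```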

Advanced optimization strategies deploy Riemannian preconditioners to ensure the union of LoRA subspaces produces unbiased updates, countering the scaling bias in conventional MoE-LoRA (Sun et al., 20 Feb 2025). Engineering approximations tune the scaling of gradients with respect to gates, and in federated settings, only router weights need be synchronized across clients, minimizing communication (Wu et al., 30 Nov 2025).
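
For the federated case, the communication saving comes from exchanging only the router's parameters between clients and server. A minimal sketch, assuming router modules can be identified by a "router" substring in their parameter names (a naming convention assumed here, not prescribed by the cited work):

```python
import torch

def extract_router_state(model: torch.nn.Module) -> dict:
    """Collect only the router parameters for transmission to the server."""
    return {k: v.detach().cpu() for k, v in model.state_dict().items() if "router" in k}

def load_router_state(model: torch.nn.Module, router_state: dict) -> None:
    """Merge aggregated router weights back into the local model, leaving all
    other (expert and base) weights untouched."""
    model.load_state_dict(router_state, strict=False)
```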

5. Representative Applications and Empirical Outcomes

MoRE architectures have demonstrated broad utility:

  • Multimodal and Multi-task Learning: MMoLRE achieves SOTA on multimodal sentiment analysis and competitive results on emotion recognition with roughly 1/6 of the parameter/FLOP cost of full-rank MoE (Zhang et al., 20 May 2025).
  • Federated Fine-Tuning: SmartFed, with MoRE and Elastic Expert Quota Allocation (EEQA), surpasses parameter- and communication-inefficient baselines by 4–10% absolute accuracy across composition tasks, achieves up to a 31× reduction in communication cost, and shows robust data efficiency (Wu et al., 30 Nov 2025).
  • LLM and PEFT: SMoRA and T-REX deliver improved multi-task accuracy and superior merging robustness, activating a small subset of rank-wise experts per token with negligible overhead (Zhao et al., 25 Jan 2025, Zhang et al., 13 Apr 2024). S’MoRE architectures attain 1–2% absolute gains vs. state-of-the-art LoRA/MoE baselines at effectively matched parameter cost (Zeng et al., 8 Apr 2025).
  • Image Super-Resolution: SeemoRe and MoR-based diffusion models exploit rank-wise expert mining, degradation-aware gating, and shared+routed partitions to outperform dense LoRA and conventional MoEs in both PSNR/SSIM and no-reference metrics, while enabling dynamic computation-skipping via zero-expert slots (Zamfir et al., 5 Feb 2024, He et al., 20 Nov 2025).

The empirical principle is consistent: finer granularity in expert partitioning with dynamic sparse routing yields superior accuracy, parameter efficiency, and specialization in both unimodal and multimodal, as well as single- and multi-task regimes.

6. Efficiency, Parameter Allocation, and Theoretical Guarantees

MoRE architectures yield consistent improvements in several axes:

  • Parameter and Compute Efficiency: Only a small fraction $K/r$ (with $K \ll r$) of adapter parameters is activated per token, driving down both memory footprint and energy cost; SMoRA activates 12.5% of LoRA parameters per token and still outperforms dense LoRA and blockwise MoE baselines (Zhao et al., 25 Jan 2025). See the worked example after this list.
  • Subspace Expansion and Expressivity: Quadratic subspace growth via “mix-and-match” of rank-1 experts (T-REX) achieves much of the approximation power of high-rank LoRA with linear parameter cost (Zhang et al., 13 Apr 2024). S’MoRE demonstrates exponential structural flexibility over flat MoEs (Zeng et al., 8 Apr 2025).
  • Robustness to Negative Transfer: Fine partitioning and sparse expert activation mitigate inter-task and intra-task interference, enabling robust merging and continual adaptation (Zou et al., 9 Oct 2025, Zhao et al., 25 Jan 2025).
  • Dynamic and Context-Adaptive Routing: Incorporating input or degradation context (task, device, sample “hardness”) with routing priors or auxiliary signals allows adaptive expert budget allocation and improved resource utilization under budget constraints (Wu et al., 30 Nov 2025, He et al., 20 Nov 2025).
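
As a concrete illustration of the $K/r$ argument in the first bullet, the snippet below counts active adapter parameters for one hypothetical configuration (r = 64, K = 8; the sizes are assumptions chosen to reproduce the 12.5% figure quoted for SMoRA):

```python
# Worked example (hypothetical sizes): fraction of adapter parameters active per token.
d_in, d_out = 4096, 4096
r, k = 64, 8                                   # total ranks vs. ranks activated per token

total_adapter_params = r * (d_in + d_out)      # params in A (r x d_in) and B (d_out x r)
active_adapter_params = k * (d_in + d_out)     # only the selected rank-1 experts are used

print(active_adapter_params / total_adapter_params)   # k / r = 0.125, i.e. 12.5%
```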

A summary table highlights these empirical findings:

| Framework | Partition Granularity | Routing/Selection | Key Efficiency/Performance |
|---|---|---|---|
| MMoLRE (Zhang et al., 20 May 2025) | Expert (low-rank) | Softmax, top-k | 5.7× param. reduction, SOTA on MSA |
| SMoRA (Zhao et al., 25 Jan 2025) | Rank-wise | Per-rank router, top-k | +1–1.2 pts avg., <13% params used |
| FlyLoRA (Zou et al., 9 Oct 2025) | Rank-wise | Implicit, fixed projection | 0.13% params, 2.02% drop after model merging |
| S'MoRE (Zeng et al., 8 Apr 2025) | Hierarchical (layer) | Tree, GNN-style | +1.4–2.1% over best MoE/LoRA |
| T-REX (Zhang et al., 13 Apr 2024) | Rank-1, mix-and-match | Semantic-aware, task-guided | +1.78% acc., 30–40% param. reduction |

7. Limitations and Areas for Future Research

MoRE approaches introduce additional complexity: tuning $K$, load-balance loss weights, the partition schedule (rank $r$, number of experts $E$, hierarchical depth), and the router type is often task- and hardware-dependent. Efficient sparse matmul and specialized kernels may be required to realize the full benefit in high-rank or resource-constrained settings (Zhao et al., 25 Jan 2025).

Extensions under investigation include token-adaptive rank selection (learning both which experts and how many to use per input), meta-learned hierarchical schedules (HILO), and further integration of semantic priors into the router. Dynamic or learned fan-out in hierarchical MoREs, and incorporation into other attention-based adaptation points, offer additional axes of flexibility (Cong et al., 6 Feb 2025, Zeng et al., 8 Apr 2025). At scale, ensuring the router remains robust to overfitting and diversifies expert utilization is an area of active research.

In summary, the Mixture of Rank-Wise Experts paradigm enables highly parameter-efficient, specialization-driven neural fine-tuning by exploiting the modularity inherent in low-rank decompositions, unlocking new tradeoffs between accuracy, compute, and adaptability in modern multimodal and multitask architectures.
