ReMix: RL Routing for Mixture-of-LoRAs

Updated 4 July 2026

The paper introduces ReMix, a method that employs reinforcement learning to select discrete subsets of LoRA experts with fixed, equal contributions.
It replaces traditional soft routing weights with a constant weight scheme, effectively addressing weight imbalance and maximizing expert utilization.
Empirical evaluations on GSM8K, HumanEval, and ARC-c benchmarks demonstrate that ReMix outperforms prior methods in both accuracy and efficiency.

Reinforcement Routing for Mixture-of-LoRAs (ReMix) is a parameter-efficient finetuning method for LLMs that redefines how Mixture-of-LoRAs routers use expert weights. Instead of letting a router assign learned soft weights to active LoRA experts, ReMix selects a subset of experts per layer and gives every selected expert the same fixed contribution, while training the discrete router with a reinforcement-learning estimator based on REINFORCE leave-one-out (RLOO) (Qiu et al., 10 Mar 2026). The method is motivated by the observation that prior Mixture-of-LoRAs routers often exhibit severe routing-weight imbalance, so that one expert dominates and the others become only nominally active; ReMix treats this as a structural weakness of learnable routing weights rather than a mere optimization artifact (Qiu et al., 10 Mar 2026).

1. Conceptual setting and motivation

ReMix is formulated in the standard low-rank adaptation setting. A frozen pretrained model is augmented with multiple LoRA experts per layer, each contributing a trainable low-rank perturbation of the form $\Delta W = BA$, with $A \in \mathbb{R}^{r \times D}$ and $B \in \mathbb{R}^{D \times r}$ for rank $r \ll D$ (Qiu et al., 10 Mar 2026). A Mixture-of-LoRAs layer then attaches several such low-rank experts to the same frozen backbone layer and uses a router to decide which experts should contribute for a given input.

The central pathology identified by ReMix is routing-weight collapse. In prior soft routers, a layer computes a softmax distribution over experts and forms a weighted sum of expert outputs. The paper argues that these routing weights are often extremely imbalanced in practice, so that even when $k>1$ experts are notionally activated, one LoRA receives weight close to $1$ and the others receive weights near $0$ (Qiu et al., 10 Mar 2026). This reduces the effective number of experts, wastes activated compute, and suppresses gradients flowing to underweighted experts. ReMix quantifies the phenomenon with the effective support size (ESS),

$\operatorname{ESS}(\boldsymbol{\pi}^{(l)}) := \frac{\left(\sum_{i=1}^n |\pi_i^{(l)}|\right)^2}{\sum_{i=1}^n |\pi_i^{(l)}|^2} = \left(\frac{\|\boldsymbol{\pi}^{(l)}\|_1}{\|\boldsymbol{\pi}^{(l)}\|_2}\right)^2,$

where $\operatorname{ESS}=1$ for one-hot routing and $\operatorname{ESS}=n$ for uniform routing over $n$ experts (Qiu et al., 10 Mar 2026).
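As a quick numerical illustration of the ESS definition, the following minimal Python sketch evaluates it on three weight vectors. The function name `effective_support_size` and the example vectors are illustrative assumptions, not values from the paper.

```python
def effective_support_size(weights):
    """ESS = (sum_i |w_i|)^2 / sum_i |w_i|^2 for a routing-weight vector."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    if l2_squared == 0.0:
        raise ValueError("ESS is undefined for an all-zero weight vector")
    return (l1 * l1) / l2_squared


# Hypothetical routing-weight vectors over n = 4 experts.
one_hot = [1.0, 0.0, 0.0, 0.0]          # a single expert carries all weight
uniform = [0.25, 0.25, 0.25, 0.25]      # all experts weighted equally
collapsed = [0.97, 0.01, 0.01, 0.01]    # nominally 4 active, one dominates

for name, vec in [("one_hot", one_hot), ("uniform", uniform), ("collapsed", collapsed)]:
    print(name, round(effective_support_size(vec), 4))
```

The collapsed vector nominally activates four experts, yet its ESS is only about 1.06, which is the kind of imbalance the ESS measure is meant to expose.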

This diagnosis places ReMix within a broader routing literature but also distinguishes it from earlier Mixture-of-LoRAs systems. Mixture-of-LoRAs (MoA) uses explicit supervised sequence-level routing with domain labels rather than reinforcement learning (Feng et al., 2024). LoRA-Mixer uses token-level soft routing over projection-layer LoRA updates and trains its router by differentiable backpropagation rather than policy gradients (Li et al., 17 Jun 2025). Mixture of Routers replaces a single router with multiple sub-routers but still uses standard gradient-based training (Zhang et al., 30 Mar 2025). ReMix departs from all of these by making expert selection discrete and treating router training as a reinforcement-learning problem (Qiu et al., 10 Mar 2026).

2. Router parameterization and equal-weight expert activation

The prior Mixture-of-LoRAs formulation analyzed by ReMix uses a router matrix $P^{(l)} \in \mathbb{R}^{n \times D}$ to produce routing weights

$\boldsymbol{\pi}^{(l)} := \operatorname{softmax}\left(P^{(l)} x^{(l)}\right) \in \mathbb{R}^{n},$

and the layer output is

$y^{(l)} = W^{(l)} x^{(l)} + \sum_{i=1}^{n} \pi_i^{(l)} B_i^{(l)} A_i^{(l)} x^{(l)}.$

Here $x^{(l)}$ and $y^{(l)}$ are the layer input and output, $W^{(l)}$ is the frozen pretrained weight, and $B_i^{(l)} A_i^{(l)}$ is the $i$-th LoRA update (Qiu et al., 10 Mar 2026).
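The soft-routed layer described above can be sketched in a few lines of plain Python. This is a toy illustration only: the helper names (`matvec`, `softmax`, `soft_mixture_layer`), the list-of-lists matrix representation, and the tiny dimensions are assumptions made for readability, not the paper's implementation.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_mixture_layer(W, experts, P, x):
    """Return (y, pi) with pi = softmax(P x) and y = W x + sum_i pi_i * B_i (A_i x)."""
    pi = softmax(matvec(P, x))
    y = matvec(W, x)
    for weight, (A, B) in zip(pi, experts):
        delta = matvec(B, matvec(A, x))  # low-rank update B_i A_i x
        y = [y_j + weight * d_j for y_j, d_j in zip(y, delta)]
    return y, pi


# Toy setup (assumed): D = 2, rank r = 1, n = 2 experts.
W = [[1.0, 0.0], [0.0, 1.0]]                 # frozen weight (identity here)
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),          # (A_1, B_1)
    ([[0.0, 1.0]], [[0.0], [1.0]]),          # (A_2, B_2)
]
P = [[0.0, 0.0], [0.0, 0.0]]                 # zero router gives uniform weights
x = [2.0, 4.0]

y, pi = soft_mixture_layer(W, experts, P, x)
print(pi, y)
```

With a zero router matrix the weights are uniform; a trained router whose logits differ strongly would instead push one entry of `pi` toward 1, which is the collapse the text describes.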

ReMix preserves the learnable router distribution but changes the role it plays. For each layer input $x^{(l)}$, the router with matrix $P^{(l)}$ first computes

$\boldsymbol{q}^{(l)} := \operatorname{softmax}\left(P^{(l)} x^{(l)}\right) \in \mathbb{R}^{n}.$

This vector is not used as the final coefficient vector in the layer output. Instead, it parameterizes selection of a subset

$\mathcal{I}^{(l)} \subseteq \{1, \dots, n\}, \qquad |\mathcal{I}^{(l)}| = k.$

During training, ReMix samples $k$ experts from $\boldsymbol{q}^{(l)}$ without replacement; during inference, it uses the top-$k$ experts according to $\boldsymbol{q}^{(l)}$ (Qiu et al., 10 Mar 2026).
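The two selection modes can be contrasted with a small Python sketch. It assumes that "without replacement" is realized by drawing experts one at a time and renormalizing over the remaining ones; the paper may implement the sampling differently, and the function names are hypothetical.

```python
import random


def sample_without_replacement(q, k, rng):
    """Draw k distinct expert indices sequentially, in proportion to q.

    Assumption: after each draw the chosen expert is removed and the
    remaining probabilities are implicitly renormalized.
    """
    remaining = list(range(len(q)))
    chosen = []
    for _ in range(k):
        weights = [q[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def top_k(q, k):
    """Deterministic inference-time selection: indices of the k largest entries of q."""
    return sorted(range(len(q)), key=lambda i: q[i], reverse=True)[:k]


q = [0.1, 0.4, 0.2, 0.3]            # hypothetical router distribution, n = 4
rng = random.Random(0)               # seeded for reproducibility
train_subset = sample_without_replacement(q, 2, rng)
infer_subset = top_k(q, 2)
print(train_subset, infer_subset)
```

Sampling lets low-probability experts be explored during training, while the top-k rule makes inference deterministic.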

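The defining step is the replacement of learned expert weights with constant routing weights:

$\pi_i^{(l)} := \begin{cases} \omega, & i \in \mathcal{I}^{(l)}, \\ 0, & \text{otherwise}, \end{cases}$

where $\mathcal{I}^{(l)}$ is the selected subset of $k$ experts and $\omega$ is a constant scale. The paper considers two choices for the constant scale $\omega$.

Under this design, every active expert contributes equally, and the effective support size becomes exactly

$\operatorname{ESS}(\boldsymbol{\pi}^{(l)}) = \frac{(k\omega)^2}{k\omega^2} = k$

by construction (Qiu et al., 10 Mar 2026).

The corresponding ReMix layer is

$y^{(l)} = W^{(l)} x^{(l)} + \omega \sum_{i \in \mathcal{I}^{(l)}} B_i^{(l)} A_i^{(l)} x^{(l)}.$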

The paper’s core claim is that this turns nominally active experts into equally effective active experts and thereby restores the expressive capacity that soft routing often suppresses (Qiu et al., 10 Mar 2026).
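A minimal Python sketch of the equal-weight layer follows. The names `remix_layer` and `constant_weights` are hypothetical, the toy matrices are made up, and the scale `omega=0.5` is only a placeholder, not one of the paper's two choices.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def remix_layer(W, experts, selected, x, omega):
    """y = W x + omega * sum over selected experts of B_i (A_i x)."""
    y = matvec(W, x)
    for i in selected:
        A, B = experts[i]
        delta = matvec(B, matvec(A, x))
        y = [y_j + omega * d_j for y_j, d_j in zip(y, delta)]
    return y


def constant_weights(n, selected, omega):
    """Routing-weight vector: omega on selected experts, 0 elsewhere."""
    chosen = set(selected)
    return [omega if i in chosen else 0.0 for i in range(n)]


# Toy setup (assumed): D = 2, rank r = 1, n = 3 experts, k = 2 selected.
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
    ([[1.0, 1.0]], [[1.0], [1.0]]),
]
x = [1.0, 2.0]
selected = [0, 2]
omega = 0.5  # placeholder scale

y = remix_layer(W, experts, selected, x, omega)
weights = constant_weights(len(experts), selected, omega)
ess = sum(weights) ** 2 / sum(w * w for w in weights)
print(y, weights, ess)
```

Because every selected expert receives the same weight, the ESS of the weight vector equals the number of selected experts regardless of the value of `omega`.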

3. Reinforcement-learning formulation and RLOO estimator

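Because the final routing weights are constant and the selected subset $\mathcal{I}^{(l)}$ is discrete, the output is no longer directly differentiable with respect to router parameters. ReMix therefore treats routing as a reinforcement-learning problem. For one model input, let

$\mathcal{I} := \left(\mathcal{I}^{(1)}, \dots, \mathcal{I}^{(L)}\right)$

denote the selected expert subsets across all $L$ layers, and let $\mathcal{L}(\mathcal{I})$ be the supervised finetuning loss under this routing configuration (Qiu et al., 10 Mar 2026).

The router acts as the policy, the selected subsets are the actions, and the reward is the negative supervised finetuning loss. The objective is to optimize

$\min_{P} \; \mathbb{E}_{\mathcal{I}}\left[\mathcal{L}(\mathcal{I})\right],$

where $P$ collects the router parameters and the expectation is taken over routing selections sampled from the router; minimizing the expected loss is equivalent to maximizing the expected reward.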

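For each layer, ReMix samples $k$ experts without replacement. If $\mathcal{I}_m$ is the $m$-th full routing selection, its probability under the router policy is written $Q(\mathcal{I}_m)$.

ReMix samples $M$ such routing configurations per input and uses an RLOO-style REINFORCE estimator for the router gradient:

$\hat{g} := \frac{1}{M} \sum_{m=1}^{M} \left(\mathcal{L}(\mathcal{I}_m) - b_m\right) \nabla_{P} \log Q(\mathcal{I}_m),$

with

$b_m := \frac{1}{M-1} \sum_{m' \neq m} \mathcal{L}(\mathcal{I}_{m'}),$

where $\mathcal{L}(\mathcal{I}_m)$ is the supervised finetuning loss under the $m$-th selection and $P$ denotes the router parameters. The paper states that this estimator is unbiased:

$\mathbb{E}\left[\hat{g}\right] = \nabla_{P} \, \mathbb{E}_{\mathcal{I} \sim Q}\left[\mathcal{L}(\mathcal{I})\right].$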

This RLOO baseline is used to reduce variance while preserving unbiasedness (Qiu et al., 10 Mar 2026).
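The leave-one-out baseline can be made concrete with a short Python sketch. Two assumptions are made here and are not confirmed by the source: the estimator takes the standard RLOO form (each sample's loss minus the mean loss of the other samples), and the selection log-probability is that of an ordered sequential draw without replacement. All function names are hypothetical.

```python
import math


def ordered_selection_log_prob(q, chosen):
    """Log-probability of drawing `chosen` in order, without replacement.

    Assumption: each draw is proportional to q over the experts not yet chosen.
    """
    log_p = 0.0
    remaining_mass = 1.0
    for i in chosen:
        log_p += math.log(q[i] / remaining_mass)
        remaining_mass -= q[i]
    return log_p


def rloo_gradient(losses, grad_log_probs):
    """RLOO-style estimate of the gradient of the expected loss.

    losses[m] is the loss of the m-th sampled routing configuration and
    grad_log_probs[m] is the gradient of its log-probability with respect
    to the router parameters, given as a flat list of floats.
    """
    num_samples = len(losses)
    if num_samples < 2:
        raise ValueError("the leave-one-out baseline needs at least two samples")
    total = sum(losses)
    grad = [0.0] * len(grad_log_probs[0])
    for loss_m, g_m in zip(losses, grad_log_probs):
        baseline = (total - loss_m) / (num_samples - 1)  # mean loss of the others
        advantage = loss_m - baseline
        for j, g in enumerate(g_m):
            grad[j] += advantage * g / num_samples
    return grad


# Toy numbers (made up) with M = 2 samples and two router parameters.
example_grad = rloo_gradient([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
example_prob = math.exp(ordered_selection_log_prob([0.5, 0.3, 0.2], [0, 1]))
print(example_grad, example_prob)
```

Since the baseline for each sample is computed only from the other samples, it is independent of that sample's selection, which is why subtracting it lowers variance without introducing bias. When all sampled losses are equal, the estimate is exactly zero.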

The training algorithm is hybrid. LoRA parameters $A_i^{(l)}$ and $B_i^{(l)}$ are still updated by ordinary backpropagation through the supervised loss, while router parameters $P^{(l)}$ are updated by the policy-gradient estimator (Qiu et al., 10 Mar 2026). Inference is deterministic: ReMix selects

$\mathcal{I}^{(l)} := \operatorname{Top}_k\left(\boldsymbol{q}^{(l)}\right)$

and reuses the same equal-weight activation rule. The paper also states a sufficient condition for top-$k$ inference: if the trained router assigns the optimal subset a probability above a stated threshold, then the top-$k$ entries of the router distribution $\boldsymbol{q}^{(l)}$ recover that optimal subset exactly (Qiu et al., 10 Mar 2026).

A distinctive feature of this formulation is that the number of route samples $M$ becomes an explicit training-compute budget. This suggests a route-sampling scaling axis that deterministic soft routers do not expose in the same way (Qiu et al., 10 Mar 2026).

4. Empirical results and ablations

ReMix is evaluated on Llama 3 8B and three benchmarks: GSM8K for mathematical reasoning, HumanEval for code generation, and ARC-c for knowledge recall and reasoning (Qiu et al., 10 Mar 2026). HumanEval is finetuned through CodeAlpaca because HumanEval lacks a training set. The comparison set includes Prompt Tuning, P-Tuning, Prefix Tuning, $(\text{IA})^3$, LoRA, DoRA, rsLoRA, VB-LoRA, MixLoRA, and HydraLoRA (Qiu et al., 10 Mar 2026).

The main reported results are:

GSM8K: LoRA 59.21, rsLoRA 62.47, MixLoRA 61.87, HydraLoRA 62.47, ReMix 65.66.
HumanEval (Pass@1): LoRA 26.83, DoRA 31.10, rsLoRA 28.66, MixLoRA 28.05, HydraLoRA 20.12, ReMix 32.93.
ARC-c: LoRA 83.05, DoRA 83.39, rsLoRA 82.71, MixLoRA 82.37, HydraLoRA 82.71, ReMix 83.73.
Average: the strongest non-ReMix baseline average is reported around 57.95 for rsLoRA, while ReMix reaches 60.77 (Qiu et al., 10 Mar 2026).
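The reported averages are consistent with an unweighted mean over the three benchmarks, as the short Python check below shows. The per-benchmark numbers are copied from the list above; the assumption that the average is an unweighted mean is ours.

```python
# Scores copied from the reported results above.
reported = {
    "rsLoRA": {"GSM8K": 62.47, "HumanEval": 28.66, "ARC-c": 82.71},
    "ReMix": {"GSM8K": 65.66, "HumanEval": 32.93, "ARC-c": 83.73},
}


def mean_score(scores):
    """Unweighted mean over benchmarks."""
    return sum(scores.values()) / len(scores)


averages = {method: round(mean_score(scores), 2) for method, scores in reported.items()}
gap = round(averages["ReMix"] - averages["rsLoRA"], 2)
print(averages, gap)  # {'rsLoRA': 57.95, 'ReMix': 60.77} 2.82
```

On these numbers ReMix leads the strongest baseline average by 2.82 points.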

The activated-parameter comparison is central to the paper’s fairness claim. Average activated or trainable parameter counts are reported as 0.101B for MixLoRA, 0.084B for HydraLoRA, 0.028B for rsLoRA, and 0.070B for ReMix (Qiu et al., 10 Mar 2026). This is presented as evidence that ReMix is not simply scaling parameter count.

Several ablations support the proposed interpretation of routing diversity. With the total number of experts $n$ held fixed, GSM8K accuracy improves as the number of activated experts $k$ increases (Qiu et al., 10 Mar 2026).

By contrast, a single rank-$kr$ LoRA with the same LoRA-parameter count performs worse:

rank-$kr$ LoRA: 56.10, 54.51, and 59.21 for the three tested values of $k$;
ReMix with $k$ rank-$r$ LoRAs: 56.18, 59.67, and 64.22 for the same three values of $k$ (Qiu et al., 10 Mar 2026).

The route-sampling budget $M$ also matters. On GSM8K, increasing $M$ improves accuracy from 56.03% to 58.83% (Qiu et al., 10 Mar 2026). This is one of the paper’s most distinctive empirical claims: router learning benefits from larger Monte Carlo budgets. A training-efficiency comparison with MixLoRA reports 8.95 s/step, total 1:12:56, accuracy 50.34 for MixLoRA, versus 9.87 s/step, total 1:28:21, accuracy 58.38 for ReMix (Qiu et al., 10 Mar 2026).
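The relative training cost in that comparison can be worked out directly. The sketch below assumes the totals are in H:MM:SS format, which the source does not state explicitly; the helper name is hypothetical.

```python
def hms_to_seconds(text):
    """Convert an 'H:MM:SS' string to seconds (format is an assumption)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# Figures copied from the training-efficiency comparison above.
mixlora = {"s_per_step": 8.95, "total": "1:12:56", "accuracy": 50.34}
remix = {"s_per_step": 9.87, "total": "1:28:21", "accuracy": 58.38}

mixlora_total = hms_to_seconds(mixlora["total"])
remix_total = hms_to_seconds(remix["total"])

step_ratio = remix["s_per_step"] / mixlora["s_per_step"]
total_ratio = remix_total / mixlora_total
accuracy_gain = remix["accuracy"] - mixlora["accuracy"]

print(f"per-step cost ratio: {step_ratio:.3f}")
print(f"total time ratio:    {total_ratio:.3f}")
print(f"accuracy gain:       {accuracy_gain:.2f} points")
```

Under that reading, ReMix costs roughly 10% more per step and about 21% more wall-clock time in total, in exchange for a gain of about 8 accuracy points in this comparison.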

The paper also reports that removing RLOO or removing top-$k$ inference selection hurts GSM8K accuracy, and that the method is not very sensitive to which of the two constant-scale choices is used (Qiu et al., 10 Mar 2026). Together, these results support the claim that ReMix’s gain is tied not just to discrete routing, but to the combination of equal-weight active experts, top-$k$ inference, and RLOO-based router optimization.

5. Position within Mixture-of-LoRAs research

ReMix belongs to a rapidly diversifying family of multi-LoRA routing methods, but its routing philosophy is unusually specific. Mixture-of-LoRAs (MoA) treats each domain-specific LoRA as an expert and uses sequence-level supervised routing with domain labels; the router is trained jointly with the experts using a language-model loss plus a classification loss, and no reinforcement learning is used (Feng et al., 2024). LoRA-Mixer routes token-level projection-layer LoRA updates through a hard-soft hybrid, trains with a specialization-balance objective, and switches to sparse Top-3 at inference, but remains fully differentiable rather than policy-gradient based (Li et al., 17 Jun 2025). Mixture of Routers replaces a single router with multiple sub-routers plus a main router, again under standard backpropagation and without any reward-based optimization (Zhang et al., 30 Mar 2025).

Other contemporaneous systems explore different answers to the same routing problem. SEQR formalizes training-free unsupervised routing as activation norm maximization and gives an exact top-1 norm-maximizing router in the shared-$A$ regime, with strict guarantees for that proxy objective but no learned reward optimization (Fleshman et al., 22 Sep 2025). LD-MoLE replaces fixed TopK routing with a differentiable Sparsegen-based sparse projection and learns a token-dependent, layer-wise number of active experts, explicitly positioning itself as a differentiable alternative to discrete routing rather than an RL method (Zhuang et al., 30 Sep 2025). “Learning to Select, Not Relearn” proposes Hard-Routed MoR-LoRA, where independently trained reasoning LoRAs are frozen and a hard top-1 token router is trained with a straight-through estimator; reinforcement learning is used chiefly to create the experts, not the router (Molavi et al., 30 Jun 2026).

From this landscape, ReMix can be characterized precisely. It does not use supervised domain labels like MoA, it does not relax routing into a differentiable sparse projection like LD-MoLE, and it does not reduce routing to a hand-designed norm proxy like SEQR. Its defining choice is to keep the router’s learned softmax only as a sampling distribution over subsets, while eliminating learned post-selection weighting entirely (Qiu et al., 10 Mar 2026). This suggests that ReMix is less about discovering more expressive weighting functions than about forcing subset selection itself to carry the expert-allocation burden.

6. Interpretive debates, assumptions, and limitations

Several limitations are implicit in the ReMix paper. RL-style router training can have higher variance than ordinary backpropagation, performance depends on the route-sampling budget $M$, and the top-$k$ inference theorem assumes that the optimal subset receives probability above a stated threshold, which is a strong condition (Qiu et al., 10 Mar 2026). The empirical evidence is also concentrated on three benchmarks with Llama 3 8B, so broader model and task generalization remains an open empirical question (Qiu et al., 10 Mar 2026).

A broader debate concerns what mixture-of-LoRA routing can fundamentally achieve. The position paper “Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness” argues that reusing independently trained LoRAs often fails to logically integrate knowledge across disjoint fine-tuning datasets, and that successful reuse may reflect shallow pattern matching, pretraining familiarity, or target-pattern leakage rather than genuine compositional generalization (Chen et al., 16 Jun 2025). Its theoretical analysis further claims that linear combinations of independently trained LoRAs tend to produce superpositions of stored edits rather than functional composition (Chen et al., 16 Jun 2025). ReMix does not use learned mixture weights in the same way as those linear-combination baselines, but it still operates over a library of fixed low-rank experts. This suggests that benchmark gains from ReMix should be interpreted first as improved expert selection within an available adapter library, unless further evaluations demonstrate held-out compositional transfer without target-pattern exposure.

A possible misconception concerns the role of reinforcement learning itself. ReMix does not use RL to create new domain experts or to optimize sequence-level human preference scores; it uses reinforcement learning narrowly and technically, as a method for estimating gradients of a discrete router whose selected subset is non-differentiable (Qiu et al., 10 Mar 2026). In that sense, its contribution is not generic “RL for finetuning,” but a specific policy-gradient solution to expert-subset selection under constant active-expert weights.

Taken together, these points define ReMix’s place in the literature. It is a Mixture-of-LoRAs method whose central claim is architectural: learned soft expert weights are the wrong primitive because they collapse effective expert usage, and equal-weight subset selection trained with RLOO better realizes the intended capacity of a multi-LoRA system (Qiu et al., 10 Mar 2026). Whether that routing principle also solves the harder problem of cross-expert knowledge composition remains, on current evidence, a separate question (Chen et al., 16 Jun 2025).