The paper presents a rigorous study of low-rank adaptation in multi-task settings by integrating a Mixture of Experts (MoE) mechanism directly into the LoRA architecture. In contrast to conventional approaches where LoRA uniformly updates all rank components, the proposed method treats each rank as an independent expert. This is achieved via a dynamic, rank-wise sparse activation strategy that enables fine-grained parameter decoupling while preserving shared information across heterogeneous tasks.
The core contributions and technical findings can be summarized as follows:
Unified Framework and Equivalence Analysis
- The work first establishes that a multi-LoRA MoE system, in which multiple LoRA modules serve as experts with independent gating, is mathematically equivalent to a single LoRA module with block-wise activation. In particular, by partitioning the full rank space into smaller blocks, the authors demonstrate that the forward pass of a multi-expert system can be reformulated as
  $$h = W_0 x + B\,G(x)\,A\,x,$$
  where $W_0$ denotes the base weight matrix, $A$ and $B$ are the low-rank matrices, $G(x)$ is a diagonal gating matrix generated via a top-k routing mechanism, and $x$ is the input (a numerical sketch of this equivalence follows the list below).
- This reformulation provides insight into the advantage of finer parameter segmentation: it enables a more precise allocation of parameters per task, thus mitigating task interference.
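To make the equivalence concrete, here is a minimal numerical sketch (illustrative code, not the authors' implementation; the dimensions and fixed gate values are assumptions) showing that a two-expert LoRA MoE produces the same output as a single LoRA whose rank space is gated block-wise:

```python
import torch

torch.manual_seed(0)
d, r, n_experts = 16, 4, 2            # hidden size, rank per expert, number of experts
x = torch.randn(d)
W0 = torch.randn(d, d)                # frozen base weight

# Multi-LoRA MoE: each expert i has its own A_i (r x d), B_i (d x r) and gate g_i.
A = [torch.randn(r, d) for _ in range(n_experts)]
B = [torch.randn(d, r) for _ in range(n_experts)]
g = torch.tensor([0.7, 0.3])          # illustrative gate values (e.g., from top-k routing)

y_moe = W0 @ x + sum(g[i] * (B[i] @ (A[i] @ x)) for i in range(n_experts))

# Single LoRA with block-wise activation: stack the experts along the rank dimension
# and apply a diagonal gating matrix G that repeats each gate over its rank block.
A_full = torch.cat(A, dim=0)            # (n_experts*r, d)
B_full = torch.cat(B, dim=1)            # (d, n_experts*r)
G = torch.diag(g.repeat_interleave(r))  # block-wise diagonal gate

y_block = W0 @ x + B_full @ (G @ (A_full @ x))

print(torch.allclose(y_moe, y_block, atol=1e-5))  # True
```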
Proposed SMoRA: Dynamic Rank-wise Activation
- The Single-ranked Mixture of Experts LoRA (SMoRA) method treats each rank of the LoRA update as a separate expert and employs a dynamic routing function defined as
  $$G(x) = \mathrm{diag}\big(\mathrm{TopK}(W_g\,x + b)\big),$$
  where $W_g$ is a learnable projection matrix, $b$ is an auxiliary bias term crucial for load balancing, and $\mathrm{TopK}(\cdot)$ selects the top-k ranks per input token.
- With this formulation, only the most relevant ranks are activated on a per-token basis, achieving an effective trade-off between computational efficiency and expressive capacity. An adaptive update of the gating bias (e.g., $b_i \leftarrow b_i + \gamma \cdot \mathrm{sign}(\bar{c} - c_i)$, where $\bar{c} - c_i$ quantifies the deviation of rank $i$'s load $c_i$ from the mean load $\bar{c}$) is introduced to ensure balanced routing across ranks, as sketched below.
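The rank-wise routing and bias-based load balancing can be sketched as a hypothetical PyTorch module (the class name, parameter names, initialization scales, and the sigmoid gate are assumptions for illustration, not the paper's exact formulation):

```python
import torch
import torch.nn as nn


class RankWiseLoRA(nn.Module):
    """LoRA layer whose ranks act as experts, with top-k rank activation per token."""

    def __init__(self, d_in, d_out, rank=64, k=8, gamma=0.01):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.W_g = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # router projection
        self.register_buffer("bias", torch.zeros(rank))          # load-balancing bias
        self.k, self.gamma = k, gamma

    def forward(self, x):                      # x: (batch, d_in)
        scores = x @ self.W_g.T + self.bias    # (batch, rank)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topk_idx, torch.sigmoid(topk_val))

        # Equivalent to B @ diag(gates) @ A @ x, but only the top-k ranks contribute.
        delta = (gates * (x @ self.A.T)) @ self.B.T

        # Bias update: raise the bias of under-loaded ranks, lower it for over-loaded ones.
        if self.training:
            load = gates.gt(0).float().mean(dim=0)   # fraction of tokens routed to each rank
            self.bias += self.gamma * torch.sign(load.mean() - load)
        return delta                            # added to the frozen base output W0 @ x
```

Because the gate values of unselected ranks are zero, only the k activated rank components contribute to the update for a given token.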
Efficient Sparse Computation via Custom CUDA Kernel
- To address computational bottlenecks inherent in sparse matrix operations, the authors implement an indexed matrix multiplication kernel using TVM. By leveraging the top-k indices from the gating function, the kernel performs dynamic extraction of the required rows and columns from the low-rank matrices. This approach significantly reduces both the computational overhead and GPU memory usage compared to standard PyTorch operators and for-loop implementations.
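For reference, the gather-then-multiply pattern the kernel accelerates can be expressed in plain PyTorch as below (a hypothetical reference implementation; the actual TVM-generated kernel fuses the index gather with the matrix multiplication instead of materializing the gathered slices):

```python
import torch

def indexed_lora_matmul(x, A, B, topk_idx, gate_vals):
    """Reference for indexed low-rank matmul: for each token, use only the
    rows of A and columns of B selected by its top-k rank indices.

    x:        (batch, d_in)
    A:        (rank, d_in)      B: (d_out, rank)
    topk_idx: (batch, k) int    gate_vals: (batch, k)
    """
    A_sel = A[topk_idx]                          # (batch, k, d_in): gathered rows of A
    B_sel = B.T[topk_idx]                        # (batch, k, d_out): gathered columns of B
    h = torch.einsum("bkd,bd->bk", A_sel, x)     # per-token projection onto active ranks
    h = h * gate_vals                            # apply gate weights
    return torch.einsum("bk,bko->bo", h, B_sel)  # map back to the output dimension
```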
Empirical Evaluation and Ablation Studies
- The experimental setup covers a wide array of tasks spanning FLAN-v2 (both NLU and NLG) and a multi-domain benchmark including MMLU, GSM8K, and HumanEval. Experiments are conducted on models such as Llama-2-7b and Llama-2-13b.
- Key numerical results include:
- SMoRA, while activating only 8 out of 64 total ranks, achieves a 1.73% improvement over a fully fine-tuned 64-rank LoRA on Llama-2-7b.
- When compared to an 8-rank LoRA, SMoRA shows an 11.16% performance improvement on Llama-2-7b.
- In comparisons with MoE variants employing block-wise top-1 routing, SMoRA outperforms them by 6.13% on Llama-2-7b.
- An ablation on the number of activated ranks reveals that performance peaks with 8 activated experts: activating too many leads to excessive knowledge sharing (and consequent task interference), while activating too few leaves insufficient parameters available per task.
- Visualization of the routing distributions confirms that the dynamic rank-wise activation enables distinct task-specific expert allocations, with similar tasks naturally sharing more parameters.
Comparison with Related Approaches
- The paper contrasts SMoRA with state-of-the-art PEFT methods such as HydraLoRA, SMEAR, MoSLoRA, and various MoE-based LoRA frameworks. While traditional MoE approaches rely on coarse block-level routing, SMoRA’s rank-wise mechanism allows for much finer-grained parameter adaptation. Furthermore, unlike methods that mix all available ranks (often with fixed mixture matrices), SMoRA’s dynamic routing not only reduces the number of activated parameters but also improves adaptability across tasks without additional training overhead.
In conclusion, the paper provides a comprehensive analysis and empirical validation of a novel parameter-efficient fine-tuning approach that embeds an MoE structure within a single LoRA module. Through dynamic rank-wise activation, training-time load balancing, and efficient sparse computation via a custom CUDA kernel with TVM, SMoRA achieves superior performance on multi-task benchmarks while substantially reducing the number of parameters actively updated per token.