MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for Multi-Task Learning
The paper presents MoSLD, a parameter-efficient fine-tuning approach for multi-task learning with large language models (LLMs). It addresses the limitations of low-rank adaptation (LoRA) in multi-task settings by combining a mixture-of-experts (MoE) architecture with a parameter-sharing scheme, improving performance while keeping the number of trainable parameters low.
Technical Contributions
The paper identifies two pressing issues in fine-tuning LLMs: high computational cost and knowledge interference across tasks. LoRA, a widely used parameter-efficient fine-tuning method, struggles to generalize across diverse tasks within a single model instance. MoE architectures are a promising avenue for multi-task learning, but replicating adapters per expert inflates the parameter count and computational cost. To address these challenges, the authors propose MoSLD, built on two components:
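To make the parameter trade-off concrete, here is a rough count using standard LoRA notation (the symbols are mine, not drawn from the paper): a rank-$r$ adapter on a $d_{\text{out}} \times d_{\text{in}}$ weight adds $r(d_{\text{in}} + d_{\text{out}})$ trainable parameters, and a naive mixture of $N$ independent LoRA experts multiplies that cost by $N$:

$$
\Delta W = BA,\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\; A \in \mathbb{R}^{r \times d_{\text{in}}},\qquad
\underbrace{r\,(d_{\text{in}} + d_{\text{out}})}_{\text{one LoRA}}
\;\longrightarrow\;
\underbrace{N\,r\,(d_{\text{in}} + d_{\text{out}})}_{N\text{-expert MoE of LoRAs}}.
$$

Sharing one of the two matrices across all experts, as MoSLD does, removes the per-expert copy of that matrix and roughly halves the adapter parameters for large $N$ (when $d_{\text{in}} \approx d_{\text{out}}$).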
- Sharing Mechanism for LoRAs: MoSLD shares the LoRA upper projection matrix across experts to extract and transfer general knowledge between tasks, while each expert retains its own lower projection matrix to capture task-specific features (see the sketch below).
- Dropout Strategy: To alleviate the overfitting and optimization imbalance that arise around the heavily reused general-feature matrix, a dropout scheme is applied to that shared matrix, reducing parameter redundancy.
Together, these components keep the parameter count well below that of a standard MoE of full LoRA pairs while preserving the ability to generalize across tasks.
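The sharing mechanism and the dropout strategy can be made concrete with a short PyTorch sketch. This is a minimal reconstruction of the idea as described above, not the authors' implementation: the class and parameter names (MoSLDLayer, A_shared, B_experts), the softmax top-k router, the rank and scaling values, and the placement of dropout directly on the shared matrix are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoSLDLayer(nn.Module):
    """Illustrative mixture-of-shared-LoRAs layer (not the authors' code).

    One general-feature matrix A is shared by every expert, while each
    expert keeps its own task-specific matrix B. A router mixes the top-k
    expert outputs, and dropout on the shared matrix is meant to curb
    overfitting and parameter redundancy, per the paper's description.
    """

    def __init__(self, d_in, d_out, rank=8, num_experts=4, top_k=2,
                 a_dropout=0.1, alpha=16.0):
        super().__init__()
        # Frozen pretrained projection (stand-in for an attention weight).
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)

        # Shared "upper" projection A: general, cross-task features.
        self.A_shared = nn.Parameter(torch.empty(rank, d_in))
        nn.init.kaiming_uniform_(self.A_shared, a=5 ** 0.5)

        # Expert-specific "lower" projections B: task-specific features.
        self.B_experts = nn.Parameter(torch.zeros(num_experts, d_out, rank))

        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.a_dropout = nn.Dropout(a_dropout)  # dropout on the shared matrix
        self.top_k = top_k
        self.scaling = alpha / rank

    def forward(self, x):                              # x: (batch, d_in)
        h = self.base(x)                               # frozen base output
        gates = F.softmax(self.router(x), dim=-1)      # (batch, num_experts)
        top_w, top_i = gates.topk(self.top_k, dim=-1)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)

        # General features through the (dropout-regularized) shared matrix.
        shared = x @ self.a_dropout(self.A_shared).T   # (batch, rank)

        for k in range(self.top_k):
            B = self.B_experts[top_i[:, k]]            # (batch, d_out, rank)
            delta = torch.bmm(B, shared.unsqueeze(-1)).squeeze(-1)
            h = h + self.scaling * top_w[:, k:k + 1] * delta
        return h
```

Only A_shared, B_experts, and the router are trainable here; because A is shared, adding an expert costs a single extra B matrix rather than a full LoRA pair, which is where the parameter savings over a conventional mixture of LoRAs come from.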
Empirical Findings
Experiments show strong performance in both single-task and multi-task scenarios. MoSLD surpasses standard LoRA configurations, especially in multi-task settings, indicating that it mitigates interference between tasks. The reported results improve markedly over the baselines and show better generalization to out-of-domain data.
- In single-task learning, MoSLD performs competitively against conventional MoE approaches, demonstrating the efficacy of the proposed sharing mechanism.
- In multi-task settings, MoSLD outperforms several baselines by disentangling domain-specific knowledge from general knowledge, striking a balance between task specificity and cross-domain transferability.
Implications and Future Work
MoSLD has practical implications for efficient LLM fine-tuning, particularly where a single model must handle multiple tasks simultaneously. Its combination of parameter efficiency and improved generalization makes it a candidate for resource-constrained deployments and for applications that require dynamic task handling.
From a theoretical perspective, this approach might inform future explorations into the optimization of sparse neural architectures, potentially extending beyond NLP into other domains where multi-task challenges persist.
Looking ahead, future work could apply similar parameter-sharing strategies to other LLM components, such as the feed-forward layers, or study the effect of dropout on matrix B and other parts of the architecture. Scaling the approach to more intricate multi-task scenarios, including highly diverse or conflicting task mixes, would also clarify its computational behavior.
In conclusion, MoSLD offers a compelling answer to the main challenges of multi-task learning with LLMs, improving parameter efficiency without compromising performance, and it sets a useful precedent for future fine-tuning strategies.