LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin
Abstract: Supervised fine-tuning (SFT) is a crucial step for LLMs, enabling them to align with human instructions and enhancing their capabilities on downstream tasks. Substantially increasing the instruction data is a direct way to align a model with a broader range of downstream tasks or to notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them with a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of the LoRAs to focus on leveraging world knowledge to solve downstream tasks, thereby alleviating world knowledge forgetting. Experimental results show that, as instruction data increases, LoRAMoE significantly improves the ability to handle downstream tasks while maintaining the world knowledge stored in the LLM.
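The core mechanism described in the abstract — a frozen base layer augmented by several LoRA adapters whose outputs are combined by a learned router — can be sketched as follows. This is a minimal illustration, not the paper's exact architecture (which additionally constrains a subset of experts toward world-knowledge tasks); all class and parameter names here are hypothetical, and standard LoRA conventions (zero-initialized up-projection, `alpha / r` scaling) are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRAExpert:
    """One low-rank adapter: the weight update is B @ A with rank r << d."""
    def __init__(self, d_in, d_out, r, alpha=16):
        self.A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
        self.B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized
        self.scale = alpha / r

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        return (x @ self.A.T) @ self.B.T * self.scale

class LoRAMoELayer:
    """A frozen base linear layer plus a router-weighted mixture of LoRA experts."""
    def __init__(self, d_in, d_out, n_experts=4, r=8):
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen backbone weight
        self.experts = [LoRAExpert(d_in, d_out, r) for _ in range(n_experts)]
        self.router = rng.standard_normal((n_experts, d_in)) * 0.01  # trainable router

    def __call__(self, x):
        logits = x @ self.router.T                       # (batch, n_experts)
        gates = np.exp(logits - logits.max(-1, keepdims=True))
        gates /= gates.sum(-1, keepdims=True)            # softmax over experts
        frozen = x @ self.W.T                            # backbone path, never updated
        delta = sum(gates[:, i:i + 1] * e(x) for i, e in enumerate(self.experts))
        return frozen + delta

layer = LoRAMoELayer(d_in=16, d_out=16)
x = rng.standard_normal((2, 16))
y = layer(x)
# Because B is zero-initialized, the layer initially reproduces the frozen backbone exactly,
# so fine-tuning starts from the pretrained model's behavior.
assert np.allclose(y, x @ layer.W.T)
```

Only the adapters and the router hold trainable parameters, which is why world knowledge in the frozen backbone is preserved at initialization; the paper's contribution is then steering some experts to keep exploiting that knowledge as training proceeds.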