
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin (2312.09979v4)

Published 15 Dec 2023 in cs.CL

Abstract: Supervised fine-tuning (SFT) is a crucial step for LLMs, enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks, to alleviate world knowledge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks, while maintaining the world knowledge stored in the LLM.

Authors (16)
  1. Shihan Dou (46 papers)
  2. Enyu Zhou (12 papers)
  3. Yan Liu (420 papers)
  4. Songyang Gao (28 papers)
  5. Jun Zhao (469 papers)
  6. Wei Shen (181 papers)
  7. Yuhao Zhou (78 papers)
  8. Zhiheng Xi (37 papers)
  9. Xiao Wang (507 papers)
  10. Xiaoran Fan (23 papers)
  11. Shiliang Pu (106 papers)
  12. Jiang Zhu (82 papers)
  13. Rui Zheng (79 papers)
  14. Tao Gui (127 papers)
  15. Qi Zhang (785 papers)
  16. Xuanjing Huang (287 papers)
Citations (21)

Summary

Overview of LoRAMoE

Supervised fine-tuning (SFT) is commonly employed to enhance the performance of LLMs on specific tasks by aligning them with human instructions. An underlying challenge, however, is that as the amount of fine-tuning data grows substantially, models tend to forget the world knowledge stored in their parameters, a phenomenon referred to as knowledge forgetting.

Addressing Knowledge Forgetting

LoRAMoE is proposed to mitigate knowledge forgetting in LLMs while preserving their ability to handle downstream tasks. It adapts the Mixture of Experts (MoE) architecture into a plugin: several low-rank adapter (LoRA) experts are attached to each model layer and combined by a router network. During training, these experts specialize so that some focus on task-related data while others concentrate on preserving world knowledge, depending on the type of data they handle. Notably, the backbone model parameters are frozen throughout, which helps protect previously acquired knowledge.
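
To make this concrete, below is a minimal PyTorch sketch of a LoRAMoE-style layer: a frozen linear backbone augmented with several LoRA experts whose outputs are combined by a learned router. The class name, hyperparameters (number of experts, rank, scaling) and the softmax routing are illustrative assumptions based on the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELayer(nn.Module):
    """Sketch of a LoRAMoE-style plugin layer: a frozen linear backbone
    plus several LoRA experts combined by a router network."""

    def __init__(self, base_linear: nn.Linear, num_experts: int = 6,
                 rank: int = 4, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():        # backbone stays frozen
            p.requires_grad = False

        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(num_experts)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_in, num_experts)   # per-token expert weights
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)     # (..., num_experts)
        out = self.base(x)                           # frozen backbone path
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            expert_out = F.linear(F.linear(x, A), B) * self.scaling
            out = out + gate[..., i:i + 1] * expert_out   # weighted expert contribution
        return out
```

Only the LoRA matrices and the router are trainable, so the layer can be dropped onto an existing linear projection without touching the backbone weights.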

Expert Coordination through Constraints

A critical aspect of LoRAMoE is the localized balancing constraint applied to the router. This constraint divides the experts into two groups with different responsibilities: one group concentrates on learning from a wide range of downstream task data, while the other aligns the world knowledge stored inside the LLM with human instructions. This division allows LoRAMoE to preserve world knowledge while still improving downstream performance; one possible form of the constraint is sketched below.
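
The summary describes the constraint only at a high level, so the sketch below illustrates one way such a localized balancing term could be computed: measure how much router weight each expert receives from each data type, softly bias the importance toward the matching expert group, and penalize imbalance. The function name, the two-way type encoding, the bias coefficient delta, and the coefficient-of-variation penalty are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def localized_balancing_loss(gate_weights: torch.Tensor,
                             sample_types: torch.Tensor,
                             expert_types: torch.Tensor,
                             delta: float = 0.1) -> torch.Tensor:
    """Illustrative localized balancing term for a LoRAMoE-style router.

    gate_weights: (batch, num_experts) router weights for each sample.
    sample_types: (batch,) data type per sample, 0 = world-knowledge, 1 = downstream-task.
    expert_types: (num_experts,) group per expert, 0 = knowledge-oriented, 1 = task-oriented.
    delta:        strength of the bias toward the matching expert group.
    """
    num_types, num_experts = 2, gate_weights.size(1)

    # Importance: total router weight each expert receives from each data type.
    importance = torch.zeros(num_types, num_experts, device=gate_weights.device)
    for t in range(num_types):
        mask = sample_types == t
        if mask.any():
            importance[t] = gate_weights[mask].sum(dim=0)

    # Bias coefficients: boost experts whose specialty matches the data type, damp the rest.
    type_ids = torch.arange(num_types, device=gate_weights.device).unsqueeze(1)   # (2, 1)
    match = expert_types.unsqueeze(0) == type_ids                                 # (2, num_experts)
    coeff = torch.full_like(importance, 1.0 - delta)
    coeff[match] = 1.0 + delta

    weighted = coeff * importance
    # Penalize dispersion of the weighted importance (coefficient-of-variation style).
    return weighted.var() / (weighted.mean() ** 2 + 1e-8)
```

Added to the task loss with a small weight, a term like this encourages each expert group to stay balanced internally while specializing on its assigned data type.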

Experimental Validation

The effectiveness of LoRAMoE has been verified through extensive testing. Results show that, with the LoRAMoE approach, increasing instruction data no longer causes knowledge forgetting. The model retains its world knowledge and even outperforms traditional single-task fine-tuning on specific tasks. Moreover, LoRAMoE demonstrates potential for efficient multi-task learning, as it improves performance across a variety of downstream tasks.

To further validate this capability specialization, the authors provide visualizations of expert utilization. These indicate that, depending on the task, the router allocates more weight to the expert group with the most relevant skills, whether that means handling task-related data or drawing on world knowledge.
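
A simple way to reproduce this kind of analysis is to average the router's weights over a batch of token states from a given task and aggregate them per expert group; comparing the numbers across tasks shows which group the router favors. The helper below is a hypothetical sketch, with the expert grouping encoded as in the earlier example.

```python
import torch

@torch.no_grad()
def expert_group_utilization(router: torch.nn.Linear,
                             hidden_states: torch.Tensor,
                             expert_types: torch.Tensor) -> dict:
    """Average router weight given to each expert group over a batch of token states."""
    gate = torch.softmax(router(hidden_states), dim=-1)   # (num_tokens, num_experts)
    return {
        "knowledge_experts": gate[:, expert_types == 0].sum(dim=-1).mean().item(),
        "task_experts": gate[:, expert_types == 1].sum(dim=-1).mean().item(),
    }

# Comparing the returned utilization on, say, a closed-book QA batch versus a
# summarization batch shows which expert group the router favors for each task.
```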

Conclusion

In summary, LoRAMoE emerges as a promising method for training LLMs. It offers a solution to the paramount issue of knowledge forgetting during large-scale fine-tuning, without compromising the performance of LLMs on a wide range of tasks. This approach ensures that the integrity of world knowledge is protected while also catering to the diverse requirements of downstream applications.
