
LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin (2312.09979v4)

Published 15 Dec 2023 in cs.CL

Abstract: Supervised fine-tuning (SFT) is a crucial step for LLMs, enabling them to align with human instructions and enhance their capabilities in downstream tasks. Increasing instruction data substantially is a direct solution to align the model with a broader range of downstream tasks or notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them by using a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of LoRAs to focus on leveraging world knowledge to solve downstream tasks, to alleviate world knowledge forgetting. Experimental results show that, as the instruction data increases, LoRAMoE can significantly improve the ability to process downstream tasks, while maintaining the world knowledge stored in the LLM.

Authors (16)
  1. Shihan Dou (46 papers)
  2. Enyu Zhou (12 papers)
  3. Yan Liu (420 papers)
  4. Songyang Gao (28 papers)
  5. Jun Zhao (469 papers)
  6. Wei Shen (181 papers)
  7. Yuhao Zhou (78 papers)
  8. Zhiheng Xi (37 papers)
  9. Xiao Wang (507 papers)
  10. Xiaoran Fan (23 papers)
  11. Shiliang Pu (106 papers)
  12. Jiang Zhu (82 papers)
  13. Rui Zheng (79 papers)
  14. Tao Gui (127 papers)
  15. Qi Zhang (785 papers)
  16. Xuanjing Huang (287 papers)
Citations (21)

Summary

Overview of LoRAMoE

Supervised fine-tuning (SFT) is commonly employed to enhance the performance of LLMs on specific tasks by aligning them with human instructions. An underlying challenge, however, is that as the amount of fine-tuning data grows substantially, models tend to forget the world knowledge stored in their parameters, a phenomenon referred to as knowledge forgetting.

Addressing Knowledge Forgetting

LoRAMoE is proposed to mitigate knowledge forgetting in LLMs while preserving their ability to handle downstream tasks. It adapts the Mixture of Experts (MoE) architecture into a plugin: several low-rank adapter (LoRA) experts are attached to each model layer and combined by a router network. During training, these experts specialize so that some focus on task-related data while others concentrate on preserving world knowledge, depending on the type of data they handle. Notably, the backbone model parameters are frozen throughout, which helps protect previously acquired knowledge.
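
To make this concrete, below is a minimal PyTorch sketch of a LoRAMoE-style layer: a frozen linear backbone augmented with several LoRA experts whose outputs are combined by a learned router. The class name, hyperparameters (number of experts, rank, scaling) and the softmax routing are illustrative assumptions based on the description above, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAMoELayer(nn.Module):
    """Sketch of a LoRAMoE-style plugin layer: a frozen linear backbone
    plus several LoRA experts combined by a router network."""

    def __init__(self, base_linear: nn.Linear, num_experts: int = 6,
                 rank: int = 4, alpha: float = 32.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():        # backbone stays frozen
            p.requires_grad = False

        d_in, d_out = base_linear.in_features, base_linear.out_features
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(num_experts)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(num_experts)]
        )
        self.router = nn.Linear(d_in, num_experts)   # per-token expert weights
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.softmax(self.router(x), dim=-1)     # (..., num_experts)
        out = self.base(x)                           # frozen backbone path
        for i, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            expert_out = F.linear(F.linear(x, A), B) * self.scaling
            out = out + gate[..., i:i + 1] * expert_out   # weighted expert contribution
        return out
```

Only the LoRA matrices and the router are trainable, so the layer can be dropped onto an existing linear projection without touching the backbone weights.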

Expert Coordination through Constraints

A critical aspect of LoRAMoE is the localized balancing constraint applied to the router. This constraint divides the experts into two groups with different responsibilities: one group concentrates on learning from a wide range of downstream task data, while the other aligns the world knowledge stored inside the LLM with human instructions. This division allows LoRAMoE to preserve world knowledge while still improving downstream performance; one possible form of the constraint is sketched below.
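
The summary describes the constraint only at a high level, so the sketch below illustrates one way such a localized balancing term could be computed: measure how much router weight each expert receives from each data type, softly bias the importance toward the matching expert group, and penalize imbalance. The function name, the two-way type encoding, the bias coefficient delta, and the coefficient-of-variation penalty are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def localized_balancing_loss(gate_weights: torch.Tensor,
                             sample_types: torch.Tensor,
                             expert_types: torch.Tensor,
                             delta: float = 0.1) -> torch.Tensor:
    """Illustrative localized balancing term for a LoRAMoE-style router.

    gate_weights: (batch, num_experts) router weights for each sample.
    sample_types: (batch,) data type per sample, 0 = world-knowledge, 1 = downstream-task.
    expert_types: (num_experts,) group per expert, 0 = knowledge-oriented, 1 = task-oriented.
    delta:        strength of the bias toward the matching expert group.
    """
    num_types, num_experts = 2, gate_weights.size(1)

    # Importance: total router weight each expert receives from each data type.
    importance = torch.zeros(num_types, num_experts, device=gate_weights.device)
    for t in range(num_types):
        mask = sample_types == t
        if mask.any():
            importance[t] = gate_weights[mask].sum(dim=0)

    # Bias coefficients: boost experts whose specialty matches the data type, damp the rest.
    type_ids = torch.arange(num_types, device=gate_weights.device).unsqueeze(1)   # (2, 1)
    match = expert_types.unsqueeze(0) == type_ids                                 # (2, num_experts)
    coeff = torch.full_like(importance, 1.0 - delta)
    coeff[match] = 1.0 + delta

    weighted = coeff * importance
    # Penalize dispersion of the weighted importance (coefficient-of-variation style).
    return weighted.var() / (weighted.mean() ** 2 + 1e-8)
```

Added to the task loss with a small weight, a term like this encourages each expert group to stay balanced internally while specializing on its assigned data type.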

Experimental Validation

The effectiveness of LoRAMoE has been verified through extensive testing. Results show that, with the LoRAMoE approach, increasing instruction data no longer causes knowledge forgetting. The model retains its world knowledge and even outperforms traditional single-task fine-tuning on specific tasks. Moreover, LoRAMoE demonstrates potential for efficient multi-task learning, as it improves performance across a variety of downstream tasks.

To further validate this capability specialization, the authors provide visualizations of expert utilization. These indicate that, depending on the task, the router allocates more weight to the expert group with the most relevant skills, whether that means handling task-related data or drawing on world knowledge.
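
A simple way to reproduce this kind of analysis is to average the router's weights over a batch of token states from a given task and aggregate them per expert group; comparing the numbers across tasks shows which group the router favors. The helper below is a hypothetical sketch, with the expert grouping encoded as in the earlier example.

```python
import torch

@torch.no_grad()
def expert_group_utilization(router: torch.nn.Linear,
                             hidden_states: torch.Tensor,
                             expert_types: torch.Tensor) -> dict:
    """Average router weight given to each expert group over a batch of token states."""
    gate = torch.softmax(router(hidden_states), dim=-1)   # (num_tokens, num_experts)
    return {
        "knowledge_experts": gate[:, expert_types == 0].sum(dim=-1).mean().item(),
        "task_experts": gate[:, expert_types == 1].sum(dim=-1).mean().item(),
    }

# Comparing the returned utilization on, say, a closed-book QA batch versus a
# summarization batch shows which expert group the router favors for each task.
```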

Conclusion

In summary, LoRAMoE emerges as a promising method for training LLMs. It offers a solution to the paramount issue of knowledge forgetting during large-scale fine-tuning, without compromising the performance of LLMs on a wide range of tasks. This approach ensures that the integrity of world knowledge is protected while also catering to the diverse requirements of downstream applications.
