LoRAMoE: Alleviate World Knowledge Forgetting in Large Language Models via MoE-Style Plugin (2312.09979v4)
Abstract: Supervised fine-tuning (SFT) is a crucial step for LLMs, enabling them to align with human instructions and enhance their capabilities in downstream tasks. Substantially increasing the amount of instruction data is a direct way to align the model with a broader range of downstream tasks or to notably improve its performance on a specific task. However, we find that large-scale increases in instruction data can damage the world knowledge previously stored in LLMs. To address this challenge, we propose LoRAMoE, a novel framework that introduces several low-rank adapters (LoRA) and integrates them with a router network, like a plugin version of Mixture of Experts (MoE). It freezes the backbone model and forces a portion of the LoRAs to focus on leveraging world knowledge to solve downstream tasks, thereby alleviating world knowledge forgetting. Experimental results show that, as instruction data increases, LoRAMoE significantly improves the ability to handle downstream tasks while maintaining the world knowledge stored in the LLM.
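To make the abstract's description more concrete, below is a minimal sketch of what a LoRAMoE-style layer could look like: a frozen base projection plus several low-rank adapters ("experts") whose outputs are mixed by a learned router, in the spirit of an MoE plugin. This is an illustrative reconstruction, not the authors' implementation; the class name `LoRAMoELayer`, the softmax router, the expert count, and all dimensions are assumptions.

```python
# Illustrative sketch only: a frozen linear layer augmented with
# router-mixed LoRA experts. Not the paper's official code.
import torch
import torch.nn as nn


class LoRAMoELayer(nn.Module):
    def __init__(self, d_in: int, d_out: int, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen backbone projection: the pretrained weights are never updated,
        # so the world knowledge stored in them stays untouched.
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)
        # Each expert is a standard low-rank pair (A, B), as in LoRA.
        self.lora_A = nn.ModuleList(
            [nn.Linear(d_in, rank, bias=False) for _ in range(num_experts)])
        self.lora_B = nn.ModuleList(
            [nn.Linear(rank, d_out, bias=False) for _ in range(num_experts)])
        # Router produces per-token mixing weights over the experts.
        self.router = nn.Linear(d_in, num_experts, bias=False)
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.softmax(self.router(x), dim=-1)            # (..., num_experts)
        expert_out = torch.stack(
            [B(A(x)) for A, B in zip(self.lora_A, self.lora_B)], dim=-1
        )                                                        # (..., d_out, num_experts)
        delta = (expert_out * gate.unsqueeze(-2)).sum(dim=-1)    # router-weighted mixture
        return self.base(x) + self.scaling * delta


# Example: a batch of 2 sequences, 16 tokens each, hidden size 1024.
layer = LoRAMoELayer(d_in=1024, d_out=1024)
out = layer(torch.randn(2, 16, 1024))
```

Under this setup only the router and the LoRA experts receive gradients; the paper's additional mechanism for steering a subset of experts toward world-knowledge tasks is not reproduced here.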
Authors: Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang