Higher Layers Need More LoRA Experts (2402.08562v1)

Published 13 Feb 2024 in cs.CL and cs.AI

Abstract: Parameter-efficient tuning (PEFT) techniques such as low-rank adaptation (LoRA) make training LLMs efficient, but their impact on model performance remains limited. Recent efforts integrate LoRA with Mixture-of-Experts (MoE) to improve the performance of PEFT methods. Despite promising results, research on improving the efficiency of LoRA with MoE is still in its early stages. Recent studies have shown that experts in the MoE architecture have different strengths and also exhibit some redundancy. Does this observation also apply to parameter-efficient MoE? In this paper, we introduce a novel parameter-efficient MoE method, MoE-LoRA with Layer-wise Expert Allocation (MoLA), for Transformer-based models, where each model layer has the flexibility to employ a varying number of LoRA experts. We investigate several architectures with varying layer-wise expert configurations. Experiments on six well-known NLP and commonsense QA benchmarks demonstrate that MoLA achieves equal or superior performance compared to all baselines. We find that allocating more LoRA experts to higher layers further enhances the effectiveness of models with a fixed total number of experts. With far fewer parameters, this allocation strategy outperforms the setting with the same number of experts in every layer. This work can be widely used as a plug-and-play parameter-efficient tuning approach for various applications. The code is available at https://github.com/GCYZSL/MoLA.
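To make the idea concrete, the sketch below is a minimal, illustrative PyTorch rendering of a MoE-of-LoRA linear layer with a configurable number of experts per layer, plus a layer-wise allocation that gives higher layers more experts. It is not the authors' implementation (see the linked repository for that); the class name `MoLALinear`, the variable `experts_per_layer`, and the specific 2/4/6/8 grouping are illustrative assumptions.

```python
# Minimal sketch of a MoE-of-LoRA linear layer with a per-layer expert count,
# assuming a frozen base projection, a top-k softmax router, and standard LoRA
# initialization (A small random, B zero). Not the official MoLA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoLALinear(nn.Module):
    """A frozen linear layer augmented with `num_experts` LoRA experts and a top-k router."""

    def __init__(self, in_features, out_features, num_experts, r=8, top_k=2, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():            # base weights stay frozen (PEFT)
            p.requires_grad_(False)
        self.router = nn.Linear(in_features, num_experts)
        # LoRA init: A small random, B zero, so training starts from the base model.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(r, in_features) * 0.01) for _ in range(num_experts)]
        )
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_features, r)) for _ in range(num_experts)]
        )
        self.top_k = min(top_k, num_experts)
        self.scaling = alpha / r

    def forward(self, x):                           # x: (batch, in_features)
        out = self.base(x)
        gate = F.softmax(self.router(x), dim=-1)    # (batch, num_experts)
        topk_w, topk_idx = gate.topk(self.top_k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)
        mixed = torch.zeros_like(out)
        for e in range(len(self.lora_A)):
            # Routing weight of expert e for each example (0 if e was not selected).
            w = (topk_w * (topk_idx == e)).sum(dim=-1, keepdim=True)
            if (w > 0).any():
                delta = (x @ self.lora_A[e].t()) @ self.lora_B[e].t()
                mixed = mixed + w * delta
        return out + self.scaling * mixed


# Layer-wise allocation in the spirit of "higher layers need more experts":
# e.g. 32 layers split into four 8-layer groups with 2/4/6/8 experts each.
experts_per_layer = [2] * 8 + [4] * 8 + [6] * 8 + [8] * 8
layers = [MoLALinear(512, 512, num_experts=n) for n in experts_per_layer]
```

The point the sketch mirrors is that each Transformer layer can carry a different expert count, so a fixed adapter budget can be shifted toward higher layers without touching the frozen backbone.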

Authors (10)
  1. Chongyang Gao (21 papers)
  2. Kezhen Chen (12 papers)
  3. Jinmeng Rao (19 papers)
  4. Baochen Sun (11 papers)
  5. Ruibo Liu (42 papers)
  6. Daiyi Peng (17 papers)
  7. Yawen Zhang (23 papers)
  8. Xiaoyuan Guo (14 papers)
  9. Jie Yang (516 papers)
  10. VS Subrahmanian (2 papers)
Citations (23)