MoELoRA: Contrastive Learning Guided Mixture of Experts on Parameter-Efficient Fine-Tuning for Large Language Models (2402.12851v1)

Published 20 Feb 2024 in cs.CL

Abstract: Fine-tuning is often necessary to enhance the adaptability of large language models (LLMs) to downstream tasks. Nonetheless, the process of updating billions of parameters demands significant computational resources and training time, which poses a substantial obstacle to the widespread application of large-scale models in various scenarios. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) has emerged as a prominent paradigm in recent research. However, current PEFT approaches that employ a limited set of global parameters (such as LoRA, which adds low-rank approximation matrices to all weights) face challenges in flexibly combining different computational modules in downstream tasks. In this work, we introduce a novel PEFT method: MoELoRA. We treat LoRA as a Mixture of Experts (MoE), and to mitigate the random routing phenomenon observed in MoE, we propose using contrastive learning to encourage experts to learn distinct features. We conducted experiments on 11 tasks in math reasoning and common-sense reasoning benchmarks. With the same number of parameters, our approach outperforms LoRA significantly. In math reasoning, MoELoRA achieved an average performance 4.2% higher than LoRA, and demonstrated competitive performance compared to the 175B GPT-3.5 on several benchmarks.
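The abstract describes the core idea only at a high level: split the LoRA adapter into several low-rank "experts" behind a router, and add a contrastive term so that routing does not degenerate into effectively random expert selection. The sketch below is a minimal, hedged illustration of that idea, not the authors' reference implementation; the names (`MoELoRALayer`, `expert_contrastive_loss`), the token-level softmax router, and the particular supervised-contrastive formulation over expert outputs are assumptions made for clarity.

```python
# Minimal sketch (assumed formulation, not the paper's official code) of a
# LoRA adapter organized as a mixture of low-rank experts, plus a contrastive
# auxiliary loss that pushes different experts toward distinct features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALayer(nn.Module):
    def __init__(self, d_in, d_out, rank=4, num_experts=4):
        super().__init__()
        self.num_experts = num_experts
        # Each expert i holds its own low-rank pair (A_i, B_i), as in LoRA.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.router = nn.Linear(d_in, num_experts)  # input-dependent gating

    def forward(self, x):                            # x: (batch, d_in)
        gate = F.softmax(self.router(x), dim=-1)     # (batch, num_experts)
        # Per-expert low-rank updates: (batch, num_experts, d_out)
        expert_out = torch.einsum("bd,edr,ero->beo", x, self.A, self.B)
        y = torch.einsum("be,beo->bo", gate, expert_out)  # gated mixture
        return y, expert_out

def expert_contrastive_loss(expert_out, temperature=0.1):
    """Pull together outputs of the same expert across the batch and push
    apart outputs of different experts, so experts specialize rather than
    collapse into interchangeable (randomly routed) modules."""
    b, e, d = expert_out.shape
    feats = F.normalize(expert_out, dim=-1).reshape(b * e, d)
    labels = torch.arange(e).repeat(b)               # expert id of each row
    sim = feats @ feats.t() / temperature
    sim.fill_diagonal_(float("-inf"))                # ignore self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos.fill_diagonal_(0.0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()

# Usage: combine the auxiliary term with the task loss at a small weight.
layer = MoELoRALayer(d_in=768, d_out=768)
x = torch.randn(8, 768)
y, expert_out = layer(x)
aux_loss = expert_contrastive_loss(expert_out)
```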

Authors (7)
  1. Tongxu Luo (9 papers)
  2. Jiahe Lei (7 papers)
  3. Fangyu Lei (19 papers)
  4. Weihao Liu (19 papers)
  5. Shizhu He (51 papers)
  6. Jun Zhao (469 papers)
  7. Kang Liu (207 papers)
Citations (9)