Mixture-of-LoRAs: An Efficient Multitask Tuning for Large Language Models (2403.03432v1)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: Instruction Tuning has the potential to stimulate or enhance specific capabilities of LLMs. However, achieving the right balance of data is crucial to prevent catastrophic forgetting and interference between tasks. To address these limitations and enhance training flexibility, we propose the Mixture-of-LoRAs (MoA) architecture which is a novel and parameter-efficient tuning method designed for multi-task learning with LLMs. In this paper, we start by individually training multiple domain-specific LoRA modules using corresponding supervised corpus data. These LoRA modules can be aligned with the expert design principles observed in Mixture-of-Experts (MoE). Subsequently, we combine the multiple LoRAs using an explicit routing strategy and introduce domain labels to facilitate multi-task learning, which help prevent interference between tasks and ultimately enhances the performance of each individual task. Furthermore, each LoRA model can be iteratively adapted to a new domain, allowing for quick domain-specific adaptation. Experiments on diverse tasks demonstrate superior and robust performance, which can further promote the wide application of domain-specific LLMs.

Mixture-of-LoRAs: Revolutionizing Multitask Learning in LLMs Through Efficient Tuning

Introduction to Mixture-of-LoRAs Architecture

The ever-growing range of tasks demanded of LLMs makes it challenging to preserve their versatility while deepening their domain-specific capabilities. Traditional tuning methods, although effective, often suffer from catastrophic forgetting and task interference, particularly in multitask settings. To address these constraints, the Mixture-of-LoRAs (MoA) architecture offers a parameter-efficient tuning method designed for multi-task learning in LLMs. MoA trains individual domain-specific LoRA modules on their corresponding datasets and integrates them through an explicit routing strategy. This integration both prevents interference among tasks and improves performance on each individual task, while allowing the model to adapt quickly to new domains.

Methodology Behind MoA

The MoA architecture adopts a two-stage methodology to enrich LLMs with multi-task learning capabilities without succumbing to the common pitfalls of conventional methods.

Learning Algorithm: First, separate LoRA modules are trained on their respective domain tasks, building domain-specific expertise while mitigating catastrophic forgetting. These modules, treated as domain-specific experts, are then combined through a routing mechanism that selects the appropriate expert during both training and inference; a sketch of this first stage follows.
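
A minimal, self-contained sketch of this first stage in plain PyTorch (the toy shapes, toy data, and MSE stand-in loss are my own assumptions, not the authors' code): one LoRA adapter is trained per domain while the shared base weights stay frozen.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update (B @ A)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # the base model stays frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def train_domain_adapter(base_linear, domain_batches, steps=100, lr=1e-3):
    """Train one LoRA adapter on a single domain's (input, target) batches."""
    model = LoRALinear(base_linear)
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    loss_fn = nn.MSELoss()                        # stand-in for the real LM loss
    for _, (x, y) in zip(range(steps), domain_batches):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

# One adapter per domain, each trained only on its own supervised corpus.
base = nn.Linear(64, 64)
toy_batches = lambda: iter([(torch.randn(4, 64), torch.randn(4, 64))] * 200)
experts = {d: train_domain_adapter(base, toy_batches())
           for d in ["finance", "medicine", "code"]}
```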

Routing Strategy: A distinctive feature of the MoA architecture is its sequence-level routing strategy, which uses domain labels to direct each input to the appropriate LoRA expert. Unlike conventional token-level routing, a single routing decision is made for the whole sequence, allowing precise expert selection, more efficient inference, and stronger task-specific performance; a sketch of such a router appears below.
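
A hedged sketch of a sequence-level router of this kind: a small classifier scores a pooled representation of the prompt, is supervised with the ground-truth domain label during training, and selects the arg-max expert at inference. The mean-pooling choice and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequenceRouter(nn.Module):
    """Scores a whole sequence and picks one LoRA expert for it."""
    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_experts)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> one decision per sequence
        pooled = hidden_states.mean(dim=1)        # mean-pool across the sequence
        return self.classifier(pooled)            # (batch, num_experts) logits

router = SequenceRouter(hidden_size=64, num_experts=3)
opt = torch.optim.AdamW(router.parameters(), lr=1e-3)

# Training: explicit domain labels supervise the routing decision.
hidden = torch.randn(8, 16, 64)                   # toy hidden states
domain_labels = torch.randint(0, 3, (8,))         # e.g. 0=finance, 1=medicine, 2=code
loss = F.cross_entropy(router(hidden), domain_labels)
loss.backward()
opt.step()

# Inference: select one expert per sequence, not per token.
with torch.no_grad():
    expert_ids = router(hidden).argmax(dim=-1)    # (batch,) chosen expert indices
```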

Architecture Realization: In practice, MoA places multiple LoRA modules alongside each transformer layer of the LLM, with a router that selects the pertinent expert for the task at hand. This setup not only supports serving multiple domain tasks simultaneously but also allows individual modules to be extended or optimized independently; an illustrative layer is sketched below.
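
An illustrative (assumed) realization of one such multi-LoRA projection: a frozen base linear keeps K LoRA experts next to it, and the routing index chosen at the sequence level selects which expert's low-rank delta is applied. Module names and shapes here are hypothetical, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """A frozen base projection with K LoRA experts, selected per sequence."""
    def __init__(self, base: nn.Linear, num_experts: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(num_experts, r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor, expert_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features); expert_id: (batch,) -- one expert per sequence
        A = self.A[expert_id]                     # (batch, r, in_features)
        B = self.B[expert_id]                     # (batch, out_features, r)
        delta = torch.einsum("bsi,bri,bor->bso", x, A, B)
        return self.base(x) + self.scale * delta

layer = MultiLoRALinear(nn.Linear(64, 64), num_experts=3)
x = torch.randn(8, 16, 64)
expert_ids = torch.randint(0, 3, (8,))            # e.g. produced by the sequence router
out = layer(x, expert_ids)                        # (8, 16, 64)
```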

Experimental Validation

The effectiveness of MoA was validated across a suite of SFT datasets spanning heterogeneous domains, including finance, medicine, and coding. The evaluation used perplexity, BLEU, and ROUGE-L, comparing MoA against single-LoRA baselines and a LoRA model trained on the mixed-domain data. The results show that MoA consistently improves the LLM's capability across tasks, demonstrating its robustness and adaptability; a minimal perplexity sketch follows for reference.
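
For reference, perplexity is the exponential of the mean per-token negative log-likelihood; the toy logits and labels below are placeholders, not the paper's evaluation harness.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 16, 32000)                # (batch, seq_len, vocab) toy outputs
labels = torch.randint(0, 32000, (4, 16))         # toy reference tokens
nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
perplexity = torch.exp(nll)                       # perplexity = exp(mean token NLL)
print(f"perplexity: {perplexity.item():.2f}")
```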

Implications and Future Directions

MoA sets a precedent in multitask learning by introducing an efficient, scalable, and flexible architecture that preserves domain-specific knowledge while enabling knowledge sharing across tasks. The architecture not only charts a path toward more versatile LLMs but also opens avenues for research into better routing strategies and unsupervised domain adaptation.

Concluding Remarks

Mixture-of-LoRAs marks a meaningful step toward versatile, adaptable LLMs capable of multitask learning. By mitigating task interference and improving performance on individual domains, MoA advances the practical use of domain-specific LLMs and is well placed to inspire further work in artificial intelligence and natural language processing.

Authors (5)
  1. Wenfeng Feng
  2. Chuzhan Hao
  3. Yuewei Zhang
  4. Yu Han
  5. Hao Wang