Towards Modular LLMs by Building and Reusing a Library of LoRAs (2405.11157v1)

Published 18 May 2024 in cs.LG and cs.CL

Abstract: The growing number of parameter-efficient adaptations of a base LLM calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.

Summary

  • The paper introduces Model-Based Clustering (MBC), a method that groups task-specific LoRA adapters by the similarity of their parameters, compressing information from many tasks into fewer adapters.
  • It explores several routing strategies, including Arrow, a novel zero-shot mechanism that dynamically selects the most relevant adapters for new inputs.
  • Experiments show that an MBC-built LoRA library with appropriate routing matches or outperforms traditional joint training in both zero-shot and supervised scenarios.

Towards Modular LLMs by Building and Reusing a Library of LoRAs

Introduction

Parameter-efficient fine-tuning (PEFT) techniques such as Low-Rank Adaptation (LoRA) have made it much simpler to adapt LLMs to a wide variety of tasks. Imagine having not just one or two but hundreds of such adapters readily available. This paper explores how to build and reuse a library of LoRA adapters to make LLMs more modular, adaptable, and better at handling unseen tasks.
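
As background, here is a minimal sketch of the LoRA idea: a frozen linear layer augmented with a trainable low-rank update. The class name, rank, and scaling below are illustrative defaults, not the paper's code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the base model stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero-initialized
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```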

Building the LoRA Library

The core idea presented involves creating a library of task-specific adapters that enable the LLM to perform well on both seen and unseen tasks. The main methods proposed to build this library include:

  • Private Adapters: Each adapter is trained individually on a specific task. This method works well in decentralized setups but doesn't leverage multi-task learning.
  • Shared Adapter: A single adapter is trained on data from all tasks combined, promoting task transfer but risking negative transfer due to task interference.
  • Poly/MHR Adapters: These methods train a small set of 'basis' adapters on multi-task data and then form each task's adapter as a learned linear combination of the bases (a rough sketch follows below).
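
As a rough sketch of the Poly-style combination, assume a library of k basis LoRA factors mixed into a single per-task update by learned weights; the shapes, names, and softmax mixing below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PolyStyleAdapter(nn.Module):
    """Sketch: k basis LoRA adapters mixed into one per-task low-rank update."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, k: int = 4, n_tasks: int = 10):
        super().__init__()
        self.A = nn.Parameter(torch.randn(k, r, in_features) * 0.01)  # basis down-projections
        self.B = nn.Parameter(torch.zeros(k, out_features, r))        # basis up-projections
        self.logits = nn.Parameter(torch.zeros(n_tasks, k))           # per-task mixing logits

    def task_delta(self, task_id: int) -> torch.Tensor:
        w = torch.softmax(self.logits[task_id], dim=-1)               # mixing weights over the k bases
        # Weighted sum of the k low-rank updates B_i @ A_i -> one (out, in) delta for this task
        return torch.einsum("k,kor,kri->oi", w, self.B, self.A)
```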

The novel contribution here is Model-Based Clustering (MBC), a two-stage approach:

  1. Train an individual LoRA adapter for each task for an initial number of training steps.
  2. Cluster these adapters based on the similarity of their weights, then train one adapter per cluster for the remaining steps.

This method allows for transferring useful information between similar tasks, effectively compressing the information from a large set of tasks into fewer, more generalized adapters.
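
A minimal sketch of the clustering step, assuming each task's LoRA weights have been flattened into one vector per task; the k-means call, normalization, and function names are illustrative choices rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_adapters(task_adapters: dict[str, np.ndarray], n_clusters: int) -> dict[int, list[str]]:
    """Group tasks by the similarity of their flattened LoRA adapter weights."""
    names = list(task_adapters)
    X = np.stack([task_adapters[name].ravel() for name in names])   # one row per task adapter
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)       # cosine-style normalization
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    clusters: dict[int, list[str]] = {}
    for name, label in zip(names, labels):
        clusters.setdefault(int(label), []).append(name)
    return clusters  # one adapter is then trained per cluster for the remaining steps
```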

Reusing the LoRA Library

Once a library of adapters is built, selecting and reusing the right adapters for new tasks is crucial. The authors explore several routing strategies for this purpose:

  1. μ Routing: All adapters are equally weighted, essentially averaging their outputs.
  2. Task Predictor (TP) Routing: A classifier is trained to predict the relevant task for a given input, guiding which adapters to use.
  3. Centroid Matching (CM) Routing: Each adapter has a prototype representation computed from its training data, and adapters are selected by the similarity between this prototype and the representation of the input.
  4. Arrow Routing: This novel zero-shot method applies a singular value decomposition (SVD) to each adapter's low-rank update, extracts its direction of maximum variance, and scores adapters by how strongly an input representation aligns with that direction.

Arrow Routing offers a lightweight, efficient way to dynamically select the best-suited adapters without requiring access to the training data, thus fitting well into decentralized, asynchronous learning environments.
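
A rough sketch of this idea, assuming each adapter is available as its LoRA factors A and B; the top-k selection, softmax weighting, and function names below are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch

def arrow_prototypes(adapters: dict[str, tuple[torch.Tensor, torch.Tensor]]) -> dict[str, torch.Tensor]:
    """For each adapter, take the top right singular vector of its update B @ A as a routing direction."""
    protos = {}
    for name, (A, B) in adapters.items():  # A: (r, d_in), B: (d_out, r)
        delta = B @ A                       # low-rank update, shape (d_out, d_in)
        _, _, Vh = torch.linalg.svd(delta, full_matrices=False)
        protos[name] = Vh[0]                # direction of maximum variance in input space
    return protos

def arrow_route(h: torch.Tensor, protos: dict[str, torch.Tensor], top_k: int = 2) -> dict[str, float]:
    """Score adapters by |h . prototype| and softmax over the top-k; no training data is needed."""
    names = list(protos)
    scores = torch.stack([torch.abs(h @ protos[name]) for name in names])
    values, indices = torch.topk(scores, k=min(top_k, len(names)))
    weights = torch.softmax(values, dim=-1)
    return {names[i]: float(w) for i, w in zip(indices.tolist(), weights.tolist())}
```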

Experimental Results

The experimental evaluation spans both zero-shot and supervised learning scenarios on LLMs like Phi-2 and Mistral. The results show:

  • Zero-Shot Performance: Libraries built with MBC consistently outperform other library-building methods, and Arrow Routing improves performance further, especially with a large library of adapters.
  • Supervised Adaptation: MBC combined with Poly routing achieves the best results, showing that effective clustering and effective routing are both key to exploiting the full adapter library.

Implications and Future Work

This research paves the way for more scalable and flexible use of LLMs built on decentralized, collaborative training. By combining and routing among many pre-trained adapters, a model can be adapted to new tasks without retraining from scratch.

Potential future directions include:

  • Extending the techniques to other types of adapters beyond LoRA.
  • Scaling the approach to larger models and more diverse datasets.
  • Investigating usage in continuous learning settings where new tasks continuously emerge.

By focusing on modular, parameter-efficient adaptation, this approach could significantly reduce the computational footprint of serving many tasks and increase accessibility for smaller research groups and resource-constrained applications.

Conclusion

This paper takes a significant step toward more modular and adaptable LLMs through the use of a library of LoRA adapters. By addressing both the construction of such a library (MBC) and its reuse (routing methods such as Arrow), it shows that modular approaches can match or outperform traditional joint training, pointing toward more scalable and collaborative development of LLMs.
