Towards Modular LLMs by Building and Reusing a Library of LoRAs (2405.11157v1)
Abstract: The growing number of parameter-efficient adaptations of a base LLM calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such a library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To reuse the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We take steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.
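The abstract only names the two mechanisms; below is a minimal, non-authoritative NumPy sketch of one plausible reading of them, not the authors' implementation. Everything here is an illustrative assumption: the toy dimensions, the names `lora_library`, `mbc_cluster`, `arrow_prototypes`, and `arrow_route`, and the concrete scoring rules (clustering flattened LoRA weights for MBC; scoring inputs against each adapter's top right singular direction with a top-k softmax for Arrow-style routing).

```python
# Illustrative sketch only (not the paper's code), assuming a library of
# rank-r LoRA adapters stored as (A, B) factor pairs with Delta_W = B @ A.
import numpy as np

rng = np.random.default_rng(0)
d, r, n_tasks = 64, 4, 12  # hidden size, LoRA rank, number of task adapters

# A toy "library": one LoRA update per task.
lora_library = [
    {"A": rng.normal(size=(r, d)), "B": rng.normal(size=(d, r))}
    for _ in range(n_tasks)
]

def mbc_cluster(library, n_clusters=3, n_iter=20):
    """MBC-style grouping: k-means over flattened adapter parameters,
    so tasks whose adapters look alike end up in the same cluster."""
    X = np.stack([np.concatenate([m["A"].ravel(), m["B"].ravel()]) for m in library])
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-8)  # cosine-like similarity
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def arrow_prototypes(library):
    """One routing direction per adapter: the top right singular vector of
    Delta_W = B @ A, i.e. the input direction the adapter amplifies most
    (computed from the weights alone, with no extra training)."""
    protos = []
    for m in library:
        _, _, vt = np.linalg.svd(m["B"] @ m["A"], full_matrices=False)
        protos.append(vt[0])
    return np.stack(protos)  # (n_tasks, d)

def arrow_route(x, protos, top_k=2):
    """Score each adapter by |<x, prototype>|, keep the top-k,
    and renormalize the scores into routing weights."""
    scores = np.abs(protos @ x)
    top = np.argsort(scores)[-top_k:]
    w = np.exp(scores[top] - scores[top].max())
    return top, w / w.sum()

labels = mbc_cluster(lora_library)
protos = arrow_prototypes(lora_library)
experts, weights = arrow_route(rng.normal(size=d), protos, top_k=2)
print("MBC clusters:", labels)
print("Routing picks adapters", experts, "with weights", np.round(weights, 2))
```

In this sketch the routing needs only the adapter weights and the incoming representation, which is why no retraining is required when new adapters are added to the library; the clustering step, by contrast, is run once over the multi-task training data before the cluster-level adapters are trained.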
- Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.
- Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Adapterhub playground: Simple and flexible few-shot learning with adapters. arXiv preprint arXiv:2108.08103, 2021.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- Belofsky, J. Token-level adaptation of lora adapters for downstream task generalization, 2023.
- Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, pp. 7432–7439, 2020.
- Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1385–1394, 2019.
- Breiman, L. Bagging predictors. Machine learning, 24:123–140, 1996.
- Multi-head adapter routing for cross-task generalization, 2023.
- Deep clustering for unsupervised learning of visual features. In Proceedings of the European conference on computer vision (ECCV), pp. 132–149, 2018.
- Caruana, R. Multitask learning. Machine learning, 28:41–75, 1997.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Mod-squad: Designing mixture of experts as modular multi-task learners, 2022.
- Efficient hierarchical domain adaptation for pretrained language models. arXiv preprint arXiv:2112.08786, 2021.
- Adaptersoup: Weight averaging to improve generalization of pretrained language models. arXiv preprint arXiv:2302.07027, 2023a.
- Language and task arithmetic with parameter-efficient layers for zero-shot summarization. arXiv preprint arXiv:2311.09344, 2023b.
- Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL, 2019.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- Model merging by uncertainty-based gradient matching. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=D7KJmfEDQP.
- Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
- Mitigating task interference in multi-task learning via explicit task routing with non-learnable primitives. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7756–7765, 2023.
- Enslm: Ensemble language model for data diversity by semantic clustering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2954–2967, 2021.
- EleutherAI. Multiple-choice normalization. https://blog.eleuther.ai/multiple-choice-normalization/, 2021. Accessed: 2024-05-12.
- A mathematical framework for transformer circuits. Transformer Circuits Thread, 1, 2021.
- Memory efficient continual learning with transformers. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=U07d1Y-x2E.
- Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. URL http://jmlr.org/papers/v23/21-0998.html.
- Efficiently identifying task groupings for multi-task learning. Advances in Neural Information Processing Systems, 34:27503–27516, 2021.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.
- Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.
- Hard mixtures of experts for large scale weakly supervised vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6865–6873, 2017.
- Sparsely activated mixture-of-experts are robust multi-task learners. arXiv preprint arXiv:2204.07689, 2022.
- Scaling expert language models with unsupervised domain discovery. arXiv preprint arXiv:2303.14177, 2023.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799, 2019. URL http://proceedings.mlr.press/v97/houlsby19a/houlsby19a.pdf.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269, 2023.
- Editing models with task arithmetic. arXiv preprint arXiv:2212.04089, 2022.
- Exploring the benefits of training expert language models over instruction tuning, 2023.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Mixtral of experts, 2024.
- Dataless knowledge fusion by merging weights of language models. arXiv preprint arXiv:2212.09849, 2022.
- Population parameter averaging (papa), 2023.
- Parameter-efficient multi-task fine-tuning for Transformers via shared hypernetworks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 565–576, August 2021. URL https://aclanthology.org/2021.acl-long.47.
- The power of scale for parameter-efficient prompt tuning, 2021. URL https://arxiv.org/pdf/2104.08691.pdf.
- Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022.
- Specializing word embeddings (for parsing) by information bottleneck. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2744–2754, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1276. URL https://www.aclweb.org/anthology/D19-1276.
- Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
- Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 human language technology conference of the North American chapter of the association for computational linguistics, pp. 150–157, 2003.
- Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022. URL https://arxiv.org/abs/2205.05638.
- The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023.
- Routing to the expert: Efficient reward-guided ensemble of large language models. arXiv preprint arXiv:2311.08692, 2023.
- Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
- Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716, 2022.
- Microsoft Research. Phi-2: The Surprising Power of Small Language Models, 2023.
- Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- Privacy in deep learning: A survey. arXiv preprint arXiv:2004.12254, 2020.
- Soft merging of experts with adaptive routing. arXiv preprint arXiv:2306.03745, 2023.
- Learning to route among specialized experts for zero-shot generalization. arXiv preprint arXiv: 2402.05859, 2024.
- Nakatsukasa, Y. The low-rank eigenvalue problem. arXiv preprint arXiv:1905.11490, 2019.
- Continual learning via local module composition. Advances in Neural Information Processing Systems, 34, 2021. URL https://proceedings.neurips.cc/paper/2021/file/fe5e7cb609bdbe6d62449d61849c38b0-Paper.pdf.
- A case study of instruction tuning with mixture of parameter-efficient experts. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023.
- AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pp. 487–503, April 2021. URL https://aclanthology.org/2021.eacl-main.39.
- Modular deep learning. arXiv preprint arXiv:2302.11529, 2023. URL https://arxiv.org/pdf/2302.11529.pdf.
- Combining parameter-efficient modules for task-level generalisation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 687–702, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.eacl-main.49.
- Adapters: A unified library for parameter-efficient and modular transfer learning. arXiv preprint arXiv:2311.11077, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020. URL https://www.jmlr.org/papers/volume21/20-074/20-074.pdf.
- Model ratatouille: Recycling diverse models for out-of-distribution generalization. In International Conference on Machine Learning, pp. 28656–28679. PMLR, 2023.
- Nevergrad - A gradient-free optimization platform. https://GitHub.com/FacebookResearch/Nevergrad, 2018.
- Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- Ziplora: Any subject in any style by effectively merging loras. arXiv preprint arXiv:2311.13600, 2023.
- Large language model routing with benchmark datasets. arXiv preprint arXiv:2309.15789, 2023.
- Many task learning with task routing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1375–1384, 2019.
- Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Merging by matching models in task subspaces. arXiv preprint arXiv:2312.04339, 2023.
- Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7882–7926, Online, November 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.635. URL https://aclanthology.org/2020.emnlp-main.635.
- Spot: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.
- Task adaptive parameter sharing for multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7561–7570, 2022.
- Adamix: Mixture-of-adaptations for parameter-efficient model tuning. arXiv preprint arXiv:2205.12410, 2022a.
- Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705, 2022b.
- How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
- Gradient vaccine: Investigating and improving multi-task optimization in massively multilingual models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=F1vEjWK-lH_.
- Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 139–149, 2022c.
- Batchensemble: An alternative approach to efficient ensemble and lifelong learning, 2020.
- Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022.
- π-tuning: Transferring multimodal foundation models with optimal multi-task interpolation. In International Conference on Machine Learning, pp. 37713–37727. PMLR, 2023.
- Mole: Mixture of lora experts. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=uWvKBCYh4S.
- Ties-merging: Resolving interference when merging models. Advances in Neural Information Processing Systems, 36, 2024.
- Adamerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575, 2023.
- Eliciting and understanding cross-task skills with task-level mixture-of-experts. arXiv preprint arXiv:2205.12701, 2022.
- Pushing mixture of experts to the limit: Extremely parameter efficient moe for instruction tuning. arXiv preprint arXiv:2309.05444, 2023.
- Adaptive knowledge sharing in multi-task learning: Improving low-resource neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 656–661, 2018.
- Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- Composing parameter-efficient modules with arithmetic operations. arXiv preprint arXiv:2306.14870, 2023a.
- Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- A modulation module for multi-task learning with applications in image retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 401–416, 2018.
- Efficiently tuned parameters are task embeddings. arXiv preprint arXiv:2210.11705, 2022.