Arcee's MergeKit: A Toolkit for Merging Large Language Models (2403.13257v3)
Abstract: The rapid expansion of the open-source LLM landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in a vast number of task-specific models, typically specialized for individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI, including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world's most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.
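To make the central idea concrete, the sketch below shows the simplest form of parameter-space merging the abstract alludes to: uniform linear weight averaging of two fine-tunes that share a base architecture (in the style of "model soups"). This is an illustrative example, not MergeKit's API; the model paths and the 50/50 interpolation weight are placeholder assumptions.

```python
# Minimal sketch: linear weight averaging of two fine-tunes of the same base
# architecture (placeholder paths and weights; not the MergeKit API).
import torch
from transformers import AutoModelForCausalLM

model_a = AutoModelForCausalLM.from_pretrained("path/to/finetune-a", torch_dtype=torch.bfloat16)
model_b = AutoModelForCausalLM.from_pretrained("path/to/finetune-b", torch_dtype=torch.bfloat16)

state_b = model_b.state_dict()
merged_state = {
    # Equal-weight interpolation of every parameter tensor; assumes both
    # checkpoints have identical architectures and parameter names.
    name: 0.5 * param + 0.5 * state_b[name]
    for name, param in model_a.state_dict().items()
}

model_a.load_state_dict(merged_state)
model_a.save_pretrained("path/to/merged-model")
```

MergeKit generalizes this idea to more sophisticated strategies (e.g., task arithmetic, TIES, SLERP) driven by declarative configuration, and streams tensors so merges can run on modest hardware.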