
Arcee's MergeKit: A Toolkit for Merging Large Language Models (2403.13257v3)

Published 20 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The rapid expansion of the open-source LLM landscape presents an opportunity to merge the competencies of these model checkpoints by combining their parameters. Advances in transfer learning, the process of fine-tuning pretrained models for specific tasks, have resulted in the development of vast numbers of task-specific models, typically specialized in individual tasks and unable to utilize each other's strengths. Model merging facilitates the creation of multitask models without the need for additional training, offering a promising avenue for enhancing model performance and versatility. By preserving the intrinsic capabilities of the original models, model merging addresses complex challenges in AI, including the difficulties of catastrophic forgetting and multitask learning. To support this expanding area of research, we introduce MergeKit, a comprehensive, open-source library designed to facilitate the application of model merging strategies. MergeKit offers an extensible framework to efficiently merge models on any hardware, providing utility to researchers and practitioners. To date, thousands of models have been merged by the open-source community, leading to the creation of some of the world's most powerful open-source model checkpoints, as assessed by the Open LLM Leaderboard. The library is accessible at https://github.com/arcee-ai/MergeKit.


Summary

  • The paper introduces MergeKit as a toolkit that efficiently merges LLMs using linear interpolation, Task Arithmetic, and advanced techniques like SLERP.
  • MergeKit employs distinct methods for merging models that share an architecture and initialization, models with identical architectures but different initializations, and even models with different architectures, boosting versatility.
  • The toolkit’s open-source design, compatibility with frameworks like HuggingFace Transformers, and scalability enable practical integration of LLM capabilities.

A Comprehensive Overview of Arcee's MergeKit: Enhancing LLMs Through Model Merging

Introduction to MergeKit

In the rapidly evolving landscape of LLMs, the ability to combine the competencies of different models is a significant development. The paper introduces MergeKit, an open-source toolkit for merging LLMs. The toolkit combines model parameters from different checkpoints, aiming to leverage the strengths of individual models and create enhanced, multitask models without additional training. A key feature of MergeKit is that merges can run efficiently even on hardware with limited resources, which has enabled the open-source community to produce thousands of merged models and broadens the scope for research and practical applications of LLMs.

Methodology and Implementation

Model Merging Strategies

MergeKit incorporates a diverse array of merging techniques, classified based on the similarities in architecture and initialization between the models being merged:

  • For models with identical architectures and initializations, MergeKit employs linear interpolation alongside strategies such as Task Arithmetic and SLERP (spherical linear interpolation). These methods require no additional training data or post-merge fine-tuning; a minimal sketch of these operations appears after this list.
  • For models with identical architectures but different initializations, MergeKit utilizes methods such as Git Re-Basin and Optimal Transport Fusion (OTFusion) to align hidden units before merging the weights; the alignment idea is sketched below.
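
The following is a minimal, framework-agnostic sketch of the operations named in the first bullet: linear interpolation, task arithmetic, and SLERP applied directly to model weights. It is an illustrative re-implementation that assumes both checkpoints share identical parameter names and shapes; it is not MergeKit's own API.

```python
# Illustrative sketches of linear merging, task arithmetic, and SLERP.
# These are NOT MergeKit's functions; the names and the assumption of
# identical state-dict keys/shapes are our own.
import torch


def linear_merge(state_a, state_b, weight=0.5):
    """Weighted average of two state dicts with identical keys and shapes."""
    return {k: (1 - weight) * state_a[k] + weight * state_b[k] for k in state_a}


def task_arithmetic(base_state, finetuned_states, scaling=1.0):
    """Add the sum of task vectors (finetuned minus base) back onto the base."""
    merged = {}
    for k in base_state:
        task_vector = sum(ft[k] - base_state[k] for ft in finetuned_states)
        merged[k] = base_state[k] + scaling * task_vector
    return merged


def slerp(tensor_a, tensor_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two weight tensors of equal shape."""
    a = tensor_a.flatten().float()
    b = tensor_b.flatten().float()
    # Angle between the two (normalized) weight vectors.
    cos_omega = torch.dot(a / (a.norm() + eps), b / (b.norm() + eps))
    omega = torch.acos(torch.clamp(cos_omega, -1.0, 1.0))
    if omega.abs() < eps:  # nearly parallel: fall back to linear interpolation
        return (1 - t) * tensor_a + t * tensor_b
    sin_omega = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / sin_omega) * a + (torch.sin(t * omega) / sin_omega) * b
    return out.reshape(tensor_a.shape).to(tensor_a.dtype)
```

In practice, SLERP is applied tensor by tensor across the two checkpoints, while task arithmetic additionally requires access to the common base model from which each fine-tune was derived.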

Additionally, the toolkit explores the fusion of models with different architectures through methods like CALM and FUSELLM, reflecting a broader ambition to integrate diverse LLM architectures seamlessly.
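
For the second family (shared architecture but different initializations), methods such as Git Re-Basin and OTFusion first align hidden units across the two networks before averaging. The sketch below illustrates only the core matching step for a single linear layer, posed as an assignment problem over a weight-similarity matrix; the published algorithms propagate the resulting permutation through every downstream layer, so this simplified illustration should not be read as MergeKit's implementation.

```python
# Simplified illustration of the neuron-alignment idea behind permutation-based
# merging (e.g. Git Re-Basin). Assumes two weight matrices of identical shape
# (out_features x in_features) from models trained from different initializations.
import torch
from scipy.optimize import linear_sum_assignment


def align_and_average(weight_a: torch.Tensor, weight_b: torch.Tensor) -> torch.Tensor:
    # Similarity between every output unit of A and every output unit of B.
    similarity = (weight_a.float() @ weight_b.float().T).cpu().numpy()
    # Find the one-to-one unit matching that maximizes total similarity.
    _, col_idx = linear_sum_assignment(similarity, maximize=True)
    permuted_b = weight_b[torch.as_tensor(col_idx)]  # reorder B's units to match A's
    return 0.5 * (weight_a + permuted_b)
```

A full merge would also permute the corresponding bias entries and the columns of the next layer's weight matrix so that the permuted network computes the same function as the original.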

Practical Applications and Efficiency

The utility of MergeKit extends to various practical scenarios, evidenced by its role in the development of powerful, domain-specific models like BioMistral. The toolkit's design emphasizes flexibility, interoperability with existing frameworks like HuggingFace Transformers, and scalability, ensuring that model merging can be executed effectively across a range of computational environments.
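
As an illustration of that interoperability, the sketch below loads two fine-tuned checkpoints through the HuggingFace Transformers API and writes out a uniform parameter average in the spirit of "model soups". The repository names are placeholders and the averaging loop is a minimal example, not MergeKit's merging pipeline.

```python
# Minimal sketch: uniform weight averaging of two checkpoints that share an
# architecture, using the HuggingFace Transformers API. "org/model-a" and
# "org/model-b" are placeholder model identifiers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_a = AutoModelForCausalLM.from_pretrained("org/model-a", torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained("org/model-b", torch_dtype=torch.float32)

state_a, state_b = model_a.state_dict(), model_b.state_dict()
merged = {k: 0.5 * (state_a[k] + state_b[k]) for k in state_a}

model_a.load_state_dict(merged)          # reuse model_a as the container for the merge
model_a.save_pretrained("merged-model")  # write the merged checkpoint to disk
AutoTokenizer.from_pretrained("org/model-a").save_pretrained("merged-model")
```

MergeKit itself drives merges through declarative configuration files and is engineered to run efficiently on limited hardware; the toy loop above makes no such provisions.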

Discussion and Analysis

The introduction of MergeKit marks a significant advancement in the field of LLMs, addressing critical challenges such as catastrophic forgetting and the limitations of multi-task learning. By streamlining the process of model merging, MergeKit not only enhances the performance and versatility of LLMs but also facilitates a more efficient use of computational resources. The toolkit's design principles, focusing on user-centricity, modularity, and community support, ensure that it remains accessible and adaptable to evolving research needs.

The paper provides a comprehensive evaluation of MergeKit's impact, highlighting its role in the creation of merged models that demonstrate superior performance across a range of benchmarks. These empirical results underscore the potential of model merging as a transformative approach to leveraging the capabilities of pre-existing LLMs, encouraging further exploration and innovation in this area.

Future Directions

The development of MergeKit represents a foundational step toward more sophisticated and effective model merging techniques. As the toolkit continues to evolve, there is a clear opportunity for the research community to contribute novel merging strategies and further refine the existing framework. The open-source nature of MergeKit ensures that it remains a collaborative project, open to contributions that can enhance its functionality and applicability.

Conclusion

MergeKit offers a promising solution to some of the inherent challenges in LLM research and application, providing a pathway to more comprehensive, efficient, and versatile LLMs. Its development reflects a significant contribution to the field, with potential ramifications that extend beyond the immediate horizon of current AI research. The continuing evolution of MergeKit and the exploration of new model merging techniques promise to further augment the capabilities of LLMs, paving the way for groundbreaking advancements in natural language processing and AI at large.