
Evolutionary Optimization of Model Merging Recipes (2403.13187v1)

Published 19 Mar 2024 in cs.NE

Abstract: We present a novel application of evolutionary algorithms to automate the creation of powerful foundation models. While model merging has emerged as a promising approach for LLM development due to its cost-effectiveness, it currently relies on human intuition and domain knowledge, limiting its potential. Here, we propose an evolutionary approach that overcomes this limitation by automatically discovering effective combinations of diverse open-source models, harnessing their collective intelligence without requiring extensive additional training data or compute. Our approach operates in both parameter space and data flow space, allowing for optimization beyond just the weights of the individual models. This approach even facilitates cross-domain merging, generating models like a Japanese LLM with Math reasoning capabilities. Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks. Furthermore, a culturally-aware Japanese VLM generated through our approach demonstrates its effectiveness in describing Japanese culture-specific content, outperforming previous Japanese VLMs. This work not only contributes new state-of-the-art models back to the open-source community, but also introduces a new paradigm for automated model composition, paving the way for exploring alternative, efficient approaches to foundation model development.

Automated Foundation Model Development through Evolutionary Optimization

Introduction to Evolutionary Model Merging

The development landscape for LLMs has been significantly energized by the advent of model merging techniques. These methodologies amalgamate the capabilities of multiple pre-existing models to forge a composite model that encompasses the strengths of its constituents. This paradigm of model development promises cost-effectiveness by circumventing the need for additional, resource-intensive training phases. However, the effectiveness of model merging hinges on the selection of appropriate source models and their integration strategies—tasks traditionally reliant on human expertise and intuition.
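
To make the basic mechanics of weight-space merging concrete, the sketch below shows the simplest possible recipe: element-wise linear interpolation of two checkpoints that share an architecture and parameter names. The checkpoint paths and the single mixing coefficient alpha are illustrative assumptions; practical recipes such as task arithmetic or TIES-Merging are more involved, and choosing such coefficients well is precisely the kind of decision the paper seeks to automate.

```python
# Minimal sketch of weight-space merging: linear interpolation of two
# homologous checkpoints (same architecture, same parameter names).
# "model_a.pt" and "model_b.pt" are hypothetical checkpoint paths.
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    """Blend two state dicts: alpha * sd_a[k] + (1 - alpha) * sd_b[k]."""
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

sd_a = torch.load("model_a.pt", map_location="cpu")
sd_b = torch.load("model_b.pt", map_location="cpu")
merged = interpolate_state_dicts(sd_a, sd_b, alpha=0.5)
torch.save(merged, "merged_model.pt")
```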

In contrast to this heuristic-based approach, the paper proposes a systematic, evolutionary algorithm-based method for model merging. This method automates the discovery of optimal merging configurations, both in the parameter space and data flow space, to yield foundation models with bespoke capabilities. Through evolutionary optimization, this work transcends the limitations of human intuition, unearthing novel, efficient pathways for model composition that can adaptively harness the distributed intelligence of existing models.
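
As a rough illustration of the parameter-space side of this search (a sketch under simplifying assumptions, not the authors' exact implementation), the snippet below evolves one mixing coefficient per parameter tensor with CMA-ES from the open-source cma package, scoring each candidate merge with a hypothetical evaluate_on_task function that builds a model from the merged state dict and returns a benchmark score.

```python
# Hedged sketch of evolutionary merge-recipe search in parameter space.
# Assumptions: sd_a and sd_b are state dicts of two homologous models, and
# evaluate_on_task(state_dict) -> float is a hypothetical evaluator that
# instantiates a model from the state dict and returns task accuracy.
import cma

param_names = list(sd_a.keys())  # one mixing coefficient per tensor

def merge_with_coeffs(coeffs):
    """Blend the two source models tensor by tensor."""
    return {name: c * sd_a[name] + (1.0 - c) * sd_b[name]
            for name, c in zip(param_names, coeffs)}

def fitness(coeffs):
    # CMA-ES minimizes, so return the negated task score.
    clipped = [min(max(c, 0.0), 1.0) for c in coeffs]
    return -evaluate_on_task(merge_with_coeffs(clipped))

# Start from uniform averaging (all coefficients at 0.5) and let the
# evolution strategy discover a better per-tensor mixing recipe.
es = cma.CMAEvolutionStrategy([0.5] * len(param_names), 0.2)
while not es.stop():
    candidates = es.ask()  # sample a population of candidate recipes
    es.tell(candidates, [fitness(c) for c in candidates])
best_coeffs = es.result.xbest
```

The recipe space explored in the paper is richer than raw interpolation, but the sample/evaluate/update loop above captures the overall shape of the search.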

Key Contributions

The paper delineates several pivotal contributions to the domain of foundation model development:

  • Automated Model Composition: It presents an evolutionary framework for automatically generating new foundation models through the merger of diverse open-source models. By navigating the combinatorial space in a structured manner, it unlocks the potential to create high-performance foundation models without necessitating extensive additional computational resources.
  • Cross-Domain Merging Proficiency: The framework demonstrates a capacity for merging models across disparate domains (e.g., language and mathematics, language and vision), resulting in composite models with enhanced, cross-functional capabilities.
  • Benchmark Performance: Applying this method yielded a Japanese LLM with math reasoning capability and a culturally aware Japanese Vision-Language Model (VLM), both of which achieved state-of-the-art results on established Japanese benchmarks, underscoring the method's efficacy.
  • Generalization Capability: Notably, a 7B parameter LLM surpassed the performance of models with an order of magnitude more parameters on numerous Japanese LLM benchmarks, signaling the approach's exceptional efficiency and generalization ability.
  • Impact on Open-Source Community: By contributing state-of-the-art models back to the community, this work not only enhances the public repository of AI tools but also sets a new precedent for collaborative model development.

Evolutionary Optimization: Beyond Intuition in Model Merging

The crux of evolutionary optimization in model merging lies in its dual exploration of parameter space (adjusting model weights) and data flow space (orchestrating the flow of information through model layers). This bifurcated approach permits a comprehensive reconfiguration of model architecture beyond mere weight adjustments, enabling the construction of more potent composite models. The evolutionary process iteratively refines layer assignments and weight configurations, guided by performance metrics specific to the target tasks, gradually converging towards an optimal model architecture.
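
To illustrate the data-flow-space side in similarly simplified terms (again a sketch, not the paper's exact scheme), the snippet below encodes an inference path as an ordered list of (source_model, layer_index) tokens and improves it with a simple mutation-based evolutionary loop; score_path is a hypothetical evaluator that assembles the corresponding layer stack and measures performance on the target task.

```python
# Simplified sketch of data-flow-space search: the genome is an ordered list
# of (source_model, layer_index) tokens describing which transformer blocks,
# taken from which source model, are visited during inference.
# score_path(path) -> float is a hypothetical evaluator that assembles the
# layer stack and measures task performance.
import random

NUM_MODELS = 2    # e.g. a Japanese LLM and a math-specialized LLM
NUM_LAYERS = 32   # transformer blocks per source model (illustrative)
PATH_LEN = 40     # the merged model may repeat or skip source layers

def random_path():
    return [(random.randrange(NUM_MODELS), random.randrange(NUM_LAYERS))
            for _ in range(PATH_LEN)]

def mutate(path, rate=0.1):
    """Resample a small fraction of the tokens in the inference path."""
    return [tok if random.random() > rate
            else (random.randrange(NUM_MODELS), random.randrange(NUM_LAYERS))
            for tok in path]

# (1 + lambda) evolutionary loop: keep the best path found so far.
best = random_path()
best_score = score_path(best)
for _ in range(200):
    children = [mutate(best) for _ in range(8)]
    scores = [score_path(c) for c in children]
    i = max(range(len(children)), key=lambda j: scores[j])
    if scores[i] > best_score:
        best, best_score = children[i], scores[i]
```

The two search spaces are complementary and can also be explored in combination, as the paper's cross-domain results illustrate.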

Implications and Future Directions

Looking ahead, the automation of model merging signals a significant shift towards more resource-efficient approaches to AI development. By enabling the rapid generation of specialized foundation models from an expansive pool of pre-trained models, evolutionary optimization positions itself as a linchpin in the drive towards democratized access to cutting-edge AI technologies. Moreover, the concept of cross-domain model merging, facilitated by evolutionary techniques, hints at untapped potential for creating highly versatile models that transcend conventional domain boundaries.

As the field progresses, evolutionary optimization is likely to extend to other facets of model development, including source-model selection from a wider pool and the evolution of model swarms with niche capabilities. These advances point towards an era of AI research characterized by collaborative, community-driven efforts that leverage collective intelligence to address complex, multifaceted challenges.

That said, while the demonstrated approach marks a significant advance in automated model development, challenges remain, particularly in mitigating logical inconsistencies and ensuring factual accuracy in the merged models' outputs. Nonetheless, the foundation laid by this work illuminates a path towards a future in which AI development is propelled by the synergistic combination of diverse models, fostering a landscape of innovation and discovery.

Conclusion

In sum, this paper posits evolutionary optimization as a transformative tool in the field of LLM development, offering a robust, systematic alternative to intuition-driven model merging. By automating the fusion of diverse capabilities inherent in existing models, it encapsulates a forward-thinking approach to foundation model development, one that promises to accelerate the pace of innovation in AI. As this field continues to evolve, the principles of evolutionary optimization will undoubtedly play a pivotal role in shaping the future of automated, efficient, and collaborative AI research.

Authors (5)
  1. Takuya Akiba
  2. Makoto Shing
  3. Yujin Tang
  4. Qi Sun
  5. David Ha