
What Matters for Model Merging at Scale? (2410.03617v1)

Published 4 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Model merging aims to combine multiple expert models into a more capable single model, offering benefits such as reduced storage and serving costs, improved generalization, and support for decentralized model development. Despite its promise, previous studies have primarily focused on merging a few small models. This leaves many unanswered questions about the effect of scaling model size and how it interplays with other key factors -- like the base model quality and number of expert models -- to affect the merged model's performance. This work systematically evaluates the utility of model merging at scale, examining the impact of these different factors. We experiment with merging fully fine-tuned models using 4 popular merging methods -- Averaging, Task Arithmetic, Dare, and TIES -- across model sizes ranging from 1B to 64B parameters and merging up to 8 different expert models. We evaluate the merged models on both held-in tasks, i.e., the experts' training tasks, and zero-shot generalization to unseen held-out tasks. Our experiments provide several new insights about model merging at scale and the interplay between different factors. First, we find that merging is more effective when experts are created from strong base models, i.e., models with good zero-shot performance. Second, larger models facilitate easier merging. Third, merging consistently improves generalization capabilities. Notably, when merging 8 large expert models, the merged models often generalize better compared to the multitask trained models. Fourth, we can better merge more expert models when working with larger models. Fifth, different merging methods behave very similarly at larger scales. Overall, our findings shed light on some interesting properties of model merging while also highlighting some limitations. We hope that this study will serve as a reference point on large-scale merging for upcoming research.


Summary

  • The paper demonstrates that instruction-tuned base models are critical for effective merging, leading to significantly improved zero-shot generalization.
  • The research finds that larger models merge expert models more seamlessly, often outperforming multitask-trained baselines.
  • The study shows that at scale, different merging methods converge in performance, suggesting that simpler strategies can be effectively deployed.

Insights into Large-Scale Model Merging

The paper "What Matters for Model Merging at Scale?" presents a rigorous exploration of model merging techniques applied to large-scale models, focusing on the interplay between model size, base model quality, merging methods, and the number of expert models. This paper evaluates merging processes ranging from 1B to 64B parameter models, presenting empirical assessments on both held-in and zero-shot generalization (held-out) tasks.

Core Findings

  1. Base Model Strength: Merging is notably more effective when using instruction-tuned (IT) base models compared to pretrained ones. The paper highlights that models with stronger zero-shot performance exhibit better weight disentanglement, which aids in the merging process.
  2. Model Size Impact: Larger models facilitate easier merging, consistently outperforming smaller counterparts in merging tasks. This suggests a promising path towards adaptive and modular post-training strategies, indicating a potential reduction in the efficacy gap between merged and multitask-trained models.
  3. Generalization Performance: Merging significantly enhances zero-shot generalization capabilities. For strong base models, increased numbers of merged experts correlate with improved generalization, often surpassing multitask-trained baselines.
  4. Number of Models: Larger models can effectively merge more expert models without notable performance loss. The paper finds that merging more than six large instruction-tuned experts frequently leads to better generalization than multitask counterparts.
  5. Method Convergence: At larger scales, especially with robust base models, different merging methods yield similar results. This convergence suggests that simpler merging approaches might suffice when dealing with well-instructed, large models.

Methodological Approach

The research employs four prominent merging methods: Averaging, Task Arithmetic, TIES-Merging, and Dare-TIES. It assesses these methods on both held-in tasks, on which the constituent expert models were trained, and held-out tasks that probe zero-shot generalization. By normalizing scores across tasks and evaluating multiple seeds, the paper ensures robustness and fairness in its comparisons.
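
All four methods build on the same ingredients: the base model's weights, the experts' weights, and the task vectors obtained by subtracting the former from the latter. The sketch below illustrates plain averaging, Task Arithmetic, and a TIES-style trim-and-sign-election step on generic PyTorch state dicts. The function names and hyperparameters (the trim fraction `k` and scaling factor `lam`) are illustrative assumptions rather than the paper's implementation; Dare-TIES would additionally drop and rescale task-vector entries at random before the TIES step.

```python
# Minimal sketch of the merging operations, written against generic PyTorch
# state dicts. Hyperparameters `k` (trim fraction) and `lam` (scaling factor)
# and the helper names are illustrative, not the paper's implementation.
import torch


def task_vector(expert: dict, base: dict) -> dict:
    """Task vector: the expert's weights minus the base model's weights."""
    return {name: expert[name] - base[name] for name in base}


def merge_averaging(experts: list[dict]) -> dict:
    """Averaging: element-wise mean of the expert checkpoints."""
    return {name: torch.stack([e[name] for e in experts]).mean(dim=0)
            for name in experts[0]}


def merge_task_arithmetic(base: dict, experts: list[dict], lam: float = 1.0) -> dict:
    """Task Arithmetic: add the scaled sum of task vectors back onto the base."""
    vectors = [task_vector(e, base) for e in experts]
    return {name: base[name] + lam * sum(v[name] for v in vectors)
            for name in base}


def merge_ties_like(base: dict, experts: list[dict], k: float = 0.2, lam: float = 1.0) -> dict:
    """TIES-style merge: trim small task-vector entries, elect a sign per
    parameter from the remaining mass, and average the agreeing entries."""
    vectors = [task_vector(e, base) for e in experts]
    merged = {}
    for name in base:
        shape = base[name].shape
        stacked = torch.stack([v[name].flatten() for v in vectors])   # [n_experts, n]
        keep = max(1, int(k * stacked.shape[1]))
        # Keep only the largest-magnitude `keep` entries of each task vector.
        thresh = stacked.abs().kthvalue(stacked.shape[1] - keep + 1, dim=1).values
        trimmed = stacked * (stacked.abs() >= thresh.unsqueeze(1))
        # Elect a per-parameter sign and drop entries that disagree with it.
        elected = torch.sign(trimmed.sum(dim=0))
        agree = (torch.sign(trimmed) == elected) & (trimmed != 0)
        counts = agree.sum(dim=0).clamp(min=1)
        delta = (trimmed * agree).sum(dim=0) / counts
        merged[name] = base[name] + lam * delta.reshape(shape)
    return merged
```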

The experimental design systematically varies the base model (PaLM and its instruction-tuned variant), the model size across four scales spanning 1B to 64B parameters, the merging method, and the number of expert models being merged. This comprehensive setup supports a detailed understanding of the factors that affect merging effectiveness.
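
To compare configurations across this grid, per-task scores are put on a common scale before averaging, as noted above. The sketch below shows one plausible form of that aggregation; the choice of reference scores (for example, each held-in task's corresponding expert) and the example numbers are assumptions for illustration, not the paper's exact protocol.

```python
# Hedged sketch of normalized evaluation: each task score is divided by a
# per-task reference score before averaging across tasks and seeds.
from statistics import mean


def normalized_score(raw: dict[str, float], reference: dict[str, float]) -> float:
    """Mean over tasks of (raw score / reference score for that task)."""
    return mean(raw[task] / reference[task] for task in raw)


def aggregate_over_seeds(per_seed_raw: list[dict[str, float]],
                         reference: dict[str, float]) -> float:
    """Mean normalized score across independent merging/evaluation seeds."""
    return mean(normalized_score(raw, reference) for raw in per_seed_raw)


# Hypothetical example: a merged model evaluated with two seeds on two held-in
# tasks, normalized against the corresponding experts' scores.
reference = {"summarization": 0.52, "qa": 0.68}
per_seed = [{"summarization": 0.49, "qa": 0.66},
            {"summarization": 0.50, "qa": 0.65}]
print(aggregate_over_seeds(per_seed, reference))  # ≈ 0.958
```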

Implications and Future Directions

This paper acts as a strong reference point for understanding model merging at scale. Its findings underscore the importance of using strong base models and highlight the potential for merging to serve as an alternative to multitask training, particularly for large-scale LLMs. As models grow and more open-weight resources become available, practical and scalable merging techniques will become increasingly vital.

From a theoretical perspective, this work prompts further exploration into why larger models merge more effectively and how merging methods can be refined for even greater efficacy. Practically, the convergence of merging methods at scale suggests opportunities to streamline processes, reducing computational costs without sacrificing performance.

In summary, this paper enriches the understanding of model merging at large scale, providing actionable insights and laying the groundwork for future research on modular model development and generalization.
