MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

Published 3 Feb 2025 in cs.CL and cs.AI | (2502.00997v3)

Abstract: The recent success of specialized LLMs in domains such as mathematical reasoning and coding has led to growing interest in methods for merging these expert LLMs into a unified Mixture-of-Experts (MoE) model, with the goal of enhancing performance in each domain while retaining effectiveness on general tasks. However, the effective merging of expert models remains an open challenge, especially for models with highly divergent weight parameters or different architectures. State-of-the-art MoE merging methods only work with homogeneous model architectures and rely on simple unweighted averaging to merge expert layers, which does not address parameter interference and requires extensive fine-tuning of the merged MoE to restore performance. To address these limitations, this paper introduces new MoE merging techniques, including strategies to mitigate parameter interference, routing heuristics to reduce the need for MoE fine-tuning, and a novel method for merging experts with different architectures. Extensive experiments across multiple domains demonstrate the effectiveness of our proposed methods, reducing fine-tuning costs, improving performance over state-of-the-art methods, and expanding the applicability of MoE merging.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces novel merging techniques that reduce parameter interference by replacing traditional unweighted averaging with Dare and Ties methods.
It employs sequence-level routing heuristics based on perplexity to effectively select experts and minimize fine-tuning needs.
Empirical evaluations on benchmarks like GSM8K and MATH demonstrate significant performance improvements for both homogeneous and heterogeneous model merging.

MergeME: Model Merging Techniques for Homogeneous and Heterogeneous MoEs

The paper introduces innovative model merging techniques designed for enhancing the performance of Mixture-of-Experts (MoE) models by effectively merging homogeneous and heterogeneous LLMs. This is done by addressing parameter interference and exploring new methods for integrating experts with differing architectures.

Introduction to Merging MoE Models

The task of merging specialized LLMs into a unified MoE model aims at leveraging domain-specific expertise while maintaining general performance. Traditional methods of merging rely on unweighted averaging, which often faces parameter interference issues, requiring extensive fine-tuning post-merging. This paper presents methodologies that mitigate these issues and offer new techniques for heterogeneous expert integration.

Homogeneous Model Merging

The paper proposes advanced methods to replace averaging in the merging process to reduce parameter interference, specifically focusing on the use of Dare and Ties approaches. These methods trim task vectors to resolve conflict between parameters during merging. Furthermore, the authors introduce novel routing heuristics based on perplexity, which enable enhanced performance without extensive fine-tuning.

Figure 1: Overview of the proposed MoE framework for homogeneous model merging. We replace averaging with Dare or Ties merging to reduce parameter interference. Additionally, we introduce novel routing heuristics to enhance performance without fine-tuning.

Figure 2: Different types of parameter interference and merged outputs produced by simple averaging.

Methodologies and Routing Heuristics

The proposed methodologies for homogeneous model merging include sequence-level routing heuristics based on perplexity (PPL) of expert models. This heuristic selects the most confident experts for token sequences, reducing the need for fine-tuning. Additionally, the paper emphasizes separating non-Feedforward Network (FFN) layers to address inconsistencies in the merging process.

Heterogeneous Model Merging

For models with differing architectures, the paper proposes merging techniques that employ projectors and a novel routing network for sequence-level decision making. These innovations allow for the integration of diverse expert models into a cohesive MoE framework.

Figure 3: Overview of the proposed MoE framework for heterogeneous experts. Each color represents one heterogeneous expert. $n_1, \cdots, n_4$ refers to the number of layers in each expert.

Empirical Evaluation

Experiments across benchmarks like GSM8K, MBPP, and MATH demonstrate the efficacy of the proposed methods. The merged models show improved performance over traditional merging techniques, evidenced by consistent enhancement in task routing and quick adaptation during fine-tuning stages. The results confirm the effectiveness of the proposed innovations in both homogeneous and heterogeneous merging scenarios.

Figure 4: Similarity of task vector for attention and FFNs layers for Math and Code Expert experts. We average the similarity of attentions or FFNs in one decoder layers as the overall similarity for each layer.

Conclusions and Future Directions

The techniques introduced for MoE merging significantly reduce parameter interference and optimize routing decisions, enhancing overall performance across various datasets. Future work could explore the inclusion of load-balancing techniques and extend these approaches to multimodal domains involving vision and audio, fostering more robust and comprehensive MoE model architectures.

Markdown Report Issue