FuseChat: Knowledge Fusion of Chat Models (2408.07990v1)

Published 15 Aug 2024 in cs.CL

Abstract: While training LLMs from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/FuseAI}.

An Expert Overview of "FuseChat: Knowledge Fusion of Chat Models"

"Knowledge Fusion of Chat Models," authored by Fanqi Wan, Longguang Zhong, Ziyi Yang, Ruijun Chen, and Xiaojun Quan, introduces FuseChat, an innovative framework for integrating multiple LLMs with varied architectures and capabilities into a single, more potent LLM. The framework aims to reduce the resource-intensive nature of training LLMs from scratch by leveraging the strengths and complementary knowledge of existing models through lightweight continual training.

Methodology

The FuseChat framework is composed of two principal stages:

  1. Pairwise Knowledge Fusion:
    • The process begins by selecting a pivot LLM from the available models.
    • Token alignment between different models is performed using a statistics-based approach to account for the diverse tokenization schemes.
    • Knowledge fusion is then conducted between the pivot LLM and each of the remaining models, generating multiple target LLMs with identical structure and size (see the fusion-objective sketch after this list).
  2. Model Merging:
    • The target LLMs are merged within the parameter space using a novel method termed SCE, which calculates merging coefficients based on the magnitude of parameter updates before and after fine-tuning.
    • The SCE approach involves selecting salient parameters, calculating their importance, and erasing minority parameter directions to efficiently combine the advantages of the various models (a sketch of this merging procedure also follows the list).
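
The overview above describes pairwise fusion only at a high level; a minimal sketch of how such an objective could look is given below, assuming the fusion stage combines a standard supervised fine-tuning loss with a divergence term that pulls the target model's token distribution toward a source model's distribution after token alignment. The weighting `lambda_sft`, the use of KL divergence, and the function name are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_fusion_loss(target_logits, source_logits, labels, lambda_sft=0.9):
    """Illustrative pairwise-fusion objective (assumed form, not the paper's exact loss).

    target_logits: [batch, seq, vocab] logits from the target (pivot-structured) model.
    source_logits: [batch, seq, vocab] source-model distributions already projected
                   onto the pivot vocabulary via token alignment.
    labels:        [batch, seq] ground-truth token ids (-100 = ignore).
    """
    vocab = target_logits.size(-1)

    # Standard supervised fine-tuning term on the instruction data.
    sft_loss = F.cross_entropy(
        target_logits.view(-1, vocab), labels.view(-1), ignore_index=-100
    )

    # Distribution-matching term over the aligned vocabularies.
    fusion_loss = F.kl_div(
        F.log_softmax(target_logits, dim=-1),
        F.softmax(source_logits, dim=-1),
        reduction="batchmean",
    )

    return lambda_sft * sft_loss + (1.0 - lambda_sft) * fusion_loss
```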
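The select-calculate-erase steps of SCE can likewise be pictured with a short sketch. The code below is an assumption-laden illustration: salience is approximated by the variance of parameter updates across target models, merging coefficients are normalized sums of squared selected updates, and sign conflicts are resolved toward the majority direction; the keep-fraction `tau` and the function name `sce_merge` are hypothetical.

```python
import torch

def sce_merge(pivot_params, target_params_list, tau=0.1):
    """Illustrative SCE-style merge of several target models into the pivot.

    pivot_params:       dict[name -> tensor] of the pivot model's weights.
    target_params_list: list of dicts with the same keys (fine-tuned target models).
    tau:                fraction of elements kept in the 'select' step (assumed).
    """
    merged = {}
    for name, pivot_w in pivot_params.items():
        # Parameter updates of each target model relative to the pivot.
        deltas = torch.stack([tp[name] - pivot_w for tp in target_params_list])  # [M, ...]

        # Select: keep the most salient elements, ranked here by the
        # variance of the updates across target models (assumed criterion).
        variance = deltas.var(dim=0)
        k = max(1, int(tau * variance.numel()))
        threshold = variance.flatten().topk(k).values.min()
        deltas = deltas * (variance >= threshold).float()

        # Calculate: per-model merging coefficients from squared update magnitude.
        sq = (deltas ** 2).flatten(1).sum(dim=1)        # [M]
        coeffs = sq / sq.sum().clamp_min(1e-12)         # [M], sums to 1

        # Erase: drop elements whose sign disagrees with the dominant direction.
        majority_sign = torch.sign(deltas.sum(dim=0))
        deltas = deltas * (torch.sign(deltas) == majority_sign).float()

        # Merge: weighted sum of the surviving updates, added back to the pivot.
        merged[name] = pivot_w + (coeffs.view(-1, *[1] * pivot_w.dim()) * deltas).sum(dim=0)
    return merged
```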

Experimental Results

The efficacy of FuseChat is demonstrated through extensive experimentation using six prominent chat LLMs, including OpenChat-3.5-7B, Starling-LM-7B-alpha, and InternLM2-Chat-20B. The evaluation is conducted on two instruction-following benchmarks: AlpacaEval 2.0 and MT-Bench.

Key Findings:

  • Performance Gains: FuseChat-7B outperforms the baseline models, including the individual source LLMs, notably achieving performance close to larger models such as Mixtral-8x7B-Instruct and almost reaching GPT-3.5-Turbo-1106 levels on MT-Bench.
  • Scalability: The framework exhibits scalable performance improvements with the inclusion of diverse and high-quality training data, as well as the incorporation of different numbers of target LLMs.
  • Token Alignment: The proposed token alignment strategy using mapping statistics (MS) significantly enhances the effectiveness of knowledge fusion compared to previous methods such as exact matching (EM) and minimum edit distance (MinED); a sketch of this alignment strategy appears below.
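
To make the mapping-statistics (MS) idea concrete, the following is a minimal sketch under simplifying assumptions: both tokenizers are run over the same corpus, tokens are paired by character-span overlap (an illustrative stand-in for the paper's alignment procedure), co-occurrence counts are accumulated, and each source token is then mapped to the pivot token it most frequently overlaps with. The Hugging Face-style offset-mapping API and the helper names are assumptions, not the authors' implementation.

```python
from collections import Counter, defaultdict

def build_mapping_statistics(corpus, source_tokenizer, pivot_tokenizer):
    """Accumulate source-token -> pivot-token co-occurrence counts over a corpus.

    Both tokenizers are assumed to expose a Hugging Face-style API that returns
    character offsets (return_offsets_mapping=True); span-overlap alignment is an
    illustrative stand-in for the paper's alignment procedure.
    """
    stats = defaultdict(Counter)
    for text in corpus:
        src = source_tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
        piv = pivot_tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
        for s_id, (s_start, s_end) in zip(src["input_ids"], src["offset_mapping"]):
            for p_id, (p_start, p_end) in zip(piv["input_ids"], piv["offset_mapping"]):
                # Count a mapping whenever the two tokens' character spans overlap.
                if max(s_start, p_start) < min(s_end, p_end):
                    stats[s_id][p_id] += 1
    return stats

def align_token(source_token_id, stats, fallback_id):
    """Map a source-vocabulary token to the pivot token it most frequently co-occurs with."""
    counts = stats.get(source_token_id)
    return counts.most_common(1)[0][0] if counts else fallback_id
```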

Theoretical and Practical Implications

Theoretical Implications:

  • Knowledge Fusion Efficiency: FuseChat introduces a novel two-stage framework that efficiently amalgamates knowledge from multiple models without requiring the extensive computational resources typically associated with training a new model from scratch.
  • Token Alignment: The introduction of an enhanced token alignment strategy leveraging dataset-wide mapping statistics presents a more accurate and comprehensive way to handle diverse tokenization schemes among models.

Practical Implications:

  • Resource Optimization: By optimizing the fusion of existing models, FuseChat significantly lowers the redundancy and cost of developing high-performing LLMs, making it more feasible for organizations with limited resources.
  • Performance Enhancement in Real-World Applications: The ability to merge multiple chat LLMs with diverse strengths into a single model enhances the practical applicability of LLMs in various domains, where nuanced and domain-specific knowledge integration is crucial.

Future Directions

The authors acknowledge the labor-intensive nature of constructing a diverse and high-quality knowledge fusion dataset, suggesting that future research should focus on automated, scalable data synthesis techniques. Moreover, extending FuseChat beyond instruction-following, for example to improve knowledge comprehension and reduce the propensity for hallucination, is identified as a key direction for future work.

Conclusion

FuseChat represents a significant advancement in the field of LLM knowledge fusion, offering an efficient, scalable, and practical methodology to combine the strengths of multiple chat models. By addressing both structural and functional diversity among LLMs through innovative techniques in token alignment and model merging, FuseChat successfully achieves a balance between computational efficiency and performance. The promising results and broad applications indicate a potential pathway for more sustainable and optimized development of advanced LLMs.

Authors
  1. Fanqi Wan
  2. Longguang Zhong
  3. Ziyi Yang
  4. Ruijun Chen
  5. Xiaojun Quan