Knowledge Fusion of Chat LLMs: A Preliminary Technical Report (2402.16107v6)

Published 25 Feb 2024 in cs.CL

Abstract: Recently, FuseLLM introduced the concept of knowledge fusion to transfer the collective knowledge of multiple structurally varied LLMs into a target LLM through lightweight continual training. In this report, we extend the scalability and flexibility of the FuseLLM framework to realize the fusion of chat LLMs, resulting in FuseChat. FuseChat comprises two main stages. First, we conduct knowledge fusion for structurally and scale-varied source LLMs to derive multiple target LLMs of identical structure and size via lightweight fine-tuning. Then, these target LLMs are merged within the parameter space, where we propose a novel method for determining the merging weights based on the variation ratio of parameter matrices before and after fine-tuning. We validate our approach using three prominent chat LLMs with diverse architectures and scales, namely NH2-Mixtral-8x7B, NH2-Solar-10.7B, and OpenChat-3.5-7B. Experimental results across various chat domains demonstrate the superiority of FuseChat-7B over a broad spectrum of chat LLMs at 7B and 34B scales, even surpassing GPT-3.5 (March) and approaching Mixtral-8x7B-Instruct.
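To make the second-stage merging concrete, below is a minimal sketch of how merging weights could be derived from how much each parameter matrix changed during fine-tuning. The function name, the use of mean squared parameter change as the variation statistic, and the per-matrix normalization are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch

def variation_ratio_merge(pretrained_state, finetuned_states):
    """Merge fine-tuned models of identical architecture in parameter space.

    For every parameter matrix, each fine-tuned model receives a merging weight
    proportional to how much that matrix changed relative to the shared
    pretrained backbone (a sketch of the variation-ratio idea; names and the
    exact statistic are assumptions, not the paper's code).
    """
    merged = {}
    for name, base in pretrained_state.items():
        # Change of each fine-tuned matrix with respect to the base weights.
        deltas = [sd[name].float() - base.float() for sd in finetuned_states]
        scores = torch.stack([d.pow(2).mean() for d in deltas])
        # Normalize the per-model variation magnitudes into merging weights.
        weights = scores / scores.sum().clamp_min(1e-12)
        merged[name] = sum(w * sd[name].float()
                           for w, sd in zip(weights, finetuned_states))
    return merged
```

In the approach described above, this merging is applied to target LLMs produced by the first-stage knowledge fusion, which already share an identical structure and size; the sketch accordingly assumes all state dicts have matching keys and shapes.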

References (42)
  1. Gkd: Generalized knowledge distillation for auto-regressive sequence models. arXiv preprint arXiv:2306.13649.
  2. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
  3. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  4. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  5. Specializing smaller language models towards multi-step reasoning. arXiv preprint arXiv:2301.12726.
  6. Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543.
  7. Stochastic weight averaging in parallel: Large-batch training that generalizes well. International Conference on Learning Representations.
  8. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
  9. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  10. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089.
  11. Mixtral of experts. arXiv preprint arXiv:2401.04088.
  12. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561.
  13. Tinybert: Distilling bert for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4163–4174.
  14. Dataless knowledge fusion by merging weights of language models. In The Eleventh International Conference on Learning Representations.
  15. Mergedistill: Merging language models using pre-trained distillation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2874–2887.
  16. Solar 10.7B: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166.
  17. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327.
  18. The weighted majority algorithm. Information and Computation, 108(2):212–261.
  19. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  20. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.
  21. Merging models with fisher-weighted averaging. Advances in Neural Information Processing Systems, 35:17703–17716.
  22. Turning bayesian model averaging into bayesian model combination. In The 2011 International Joint Conference on Neural Networks, pages 2657–2663. IEEE.
  23. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.
  24. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  25. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249.
  26. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  27. Shoemake, K. (1985). Animating rotation with quaternion curves. In Proceedings of the 12th annual conference on Computer graphics and interactive techniques, pages 245–254.
  28. Patient knowledge distillation for bert model compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4323–4332.
  29. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  30. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.
  31. Knowledge fusion of large language models. arXiv preprint arXiv:2401.10491.
  32. Explore-instruct: Enhancing domain-specific instruction coverage through active exploration. arXiv preprint arXiv:2310.09168.
  33. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
  34. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. Advances in Neural Information Processing Systems, 33:5776–5788.
  35. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120.
  36. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.
  37. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pages 23965–23998. PMLR.
  38. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  39. Ties-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems.
  40. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284.
  41. Language models are super mario: Absorbing abilities from homologous models as a free lunch. arXiv preprint arXiv:2311.03099.
  42. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
Authors (6)
  1. Fanqi Wan (20 papers)
  2. Ziyi Yang (77 papers)
  3. Longguang Zhong (8 papers)
  4. Xiaojun Quan (52 papers)
  5. Xinting Huang (36 papers)
  6. Wei Bi (62 papers)