
Knowledge Fusion of Large Language Models (2401.10491v2)

Published 19 Jan 2024 in cs.CL

Abstract: While training LLMs from scratch can generate models with distinct functionalities and strengths, it comes at significant costs and may result in redundant capabilities. Alternatively, a cost-effective and compelling approach is to merge existing pre-trained LLMs into a more potent model. However, due to the varying architectures of these LLMs, directly blending their weights is impractical. In this paper, we introduce the notion of knowledge fusion for LLMs, aimed at combining the capabilities of existing LLMs and transferring them into a single LLM. By leveraging the generative distributions of source LLMs, we externalize their collective knowledge and unique strengths, thereby potentially elevating the capabilities of the target model beyond those of any individual source LLM. We validate our approach using three popular LLMs with different architectures--Llama-2, MPT, and OpenLLaMA--across various benchmarks and tasks. Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation. Our code, model weights, and data are public at \url{https://github.com/fanqiwan/FuseLLM}.

Introduction

In the landscape of NLP, the development of LLMs represents a significant stride forward in the ability of machines to process and understand human language. Training such models, while it yields powerful tools, demands substantial computational resources. This paper introduces an alternative to building these complex models from the ground up: a technique called knowledge fusion, which merges the expertise of existing pre-trained LLMs to produce a more capable successor without the costs and environmental impact traditionally associated with training from scratch.

Methodology

The knowledge fusion strategy sidesteps the constraints of traditional approaches, which either require homogeneous model architectures (as in weight merging) or keep multiple models running in parallel (as in ensembling). Instead, it harnesses the predictive signal embedded in the generative distributions of the source LLMs. By focusing on the token-level probability distributions these models produce, the authors transfer the unique knowledge and strengths of each contributing LLM into a single target LLM through lightweight continual training. The fusion happens not by blending raw model parameters but by aligning and combining the token probabilities each source model assigns to the same text inputs.
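
The authors release their full implementation in the FuseLLM repository; the snippet below is only a minimal sketch of the core idea, assuming the per-token distributions of the source models have already been aligned to the target tokenizer's vocabulary (that alignment is a separate, non-trivial step of the method). The MinCE-style selection rule, the function names, and the weighting coefficient `lam` are illustrative assumptions here, not the exact published code.

```python
import torch
import torch.nn.functional as F

def fuse_distributions(source_probs, gold_ids):
    """Fuse per-token distributions from several source LLMs.

    source_probs: list of [seq_len, vocab] probability tensors, assumed to be
        already aligned to the target tokenizer's vocabulary.
    gold_ids: [seq_len] gold next-token ids.
    Illustrative MinCE-style rule: at each position, keep the source
    distribution that assigns the highest probability to the gold token.
    """
    stacked = torch.stack(source_probs)                   # [n_src, seq_len, vocab]
    idx = gold_ids.view(1, -1, 1).expand(stacked.size(0), -1, 1)
    gold_p = stacked.gather(-1, idx).squeeze(-1)          # [n_src, seq_len]
    best = gold_p.argmax(dim=0)                           # [seq_len]
    return stacked[best, torch.arange(stacked.size(1))]   # [seq_len, vocab]

def fusion_objective(target_logits, fused_probs, gold_ids, lam=0.9):
    """Weighted sum of the causal-LM loss and a divergence to the fused distribution."""
    clm = F.cross_entropy(target_logits, gold_ids)
    log_q = F.log_softmax(target_logits, dim=-1)
    div = F.kl_div(log_q, fused_probs, reduction="batchmean")
    return lam * clm + (1.0 - lam) * div
```

In continual training, an objective of this form would replace the plain language-modeling loss, so the target model is pulled both toward the gold tokens and toward the collective distribution of the source models.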

Evaluation

The authors put their method to the test using three distinct LLMs: Llama-2, MPT, and OpenLLaMA. Across multiple tasks and benchmarks related to reasoning, commonsense understanding, and code generation, knowledge fusion displays a marked improvement in performance over individual source models and a basic ensemble baseline. Importantly, the improvements are not just quantitative; the fused model exhibits gains in a broad array of capabilities, hinting at a qualitative enhancement of the model's knowledge base.
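
For context on the ensemble baseline: a probability-averaging ensemble must run every source model for every generated token, whereas the fused model is a single network at inference time. The sketch below is a hypothetical illustration of such a baseline, assuming HuggingFace-style causal LMs that share one tokenizer; heterogeneous models such as Llama-2, MPT, and OpenLLaMA would first require vocabulary alignment.

```python
import torch

@torch.no_grad()
def ensemble_next_token(models, input_ids):
    """Greedy next-token choice from a probability-averaging ensemble.

    models: HuggingFace-style causal LMs sharing one tokenizer (a simplifying
        assumption made here; the paper's source models do not share one).
    input_ids: [batch, seq_len] token ids.
    """
    avg_probs = None
    for model in models:
        logits = model(input_ids).logits[:, -1, :]   # logits at the last position
        probs = torch.softmax(logits, dim=-1)
        avg_probs = probs if avg_probs is None else avg_probs + probs
    avg_probs = avg_probs / len(models)
    return avg_probs.argmax(dim=-1)                  # [batch]
```

The relevant comparison is cost as well as accuracy: the ensemble multiplies inference compute by the number of source models, while knowledge fusion pays its extra cost once, during lightweight continual training.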

Implications and Conclusions

Concluding their findings, the researchers underline the potency and promise of knowledge fusion for LLMs, noting it as a fertile area for future work. Their experiments show that the fused model surpasses its individual parts, suggesting that the collective knowledge of distinct models, when harnessed appropriately, yields a whole greater than the sum of its parts. The work provides a foundation for more cost-effective, environmentally friendlier, and more capable advances in language modeling, opening the door to a range of applications that benefit from stronger LLMs.

Authors (6)
  1. Fanqi Wan (20 papers)
  2. Xinting Huang (36 papers)
  3. Deng Cai (181 papers)
  4. Xiaojun Quan (52 papers)
  5. Wei Bi (62 papers)
  6. Shuming Shi (126 papers)