Dataless Knowledge Fusion by Merging Weights of Language Models (2212.09849v5)

Published 19 Dec 2022 in cs.CL and cs.LG

Abstract: Fine-tuning pre-trained LLMs has become the prevalent paradigm for building downstream NLP models. Oftentimes fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data set domains and can generalize to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning in that it can preserve or sometimes improve over the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.
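The merging rule sketched in the abstract (parameter-space weights chosen to minimize prediction differences between the merged model and the individual models) has a simple closed form for a single linear layer. The snippet below is a minimal illustrative sketch, not the paper's exact algorithm: it assumes each model owner can share the Gram matrix of that layer's inputs instead of the raw training data, and the helper name `merge_linear_weights`, the ridge term, and the toy data are all hypothetical.

```python
# Illustrative sketch: merge per-model linear-layer weights W_i so the merged
# layer's outputs stay close to each individual model's outputs on its own
# inputs. Only Gram matrices G_i = X_i^T X_i are exchanged, not raw data.
import numpy as np

def merge_linear_weights(weights, grams, eps=1e-6):
    """Solve argmin_W sum_i ||X_i W - X_i W_i||_F^2, whose closed form is
    W = (sum_i G_i)^{-1} (sum_i G_i W_i), with G_i = X_i^T X_i."""
    d_in = weights[0].shape[0]
    gram_sum = np.zeros((d_in, d_in))
    weighted_sum = np.zeros_like(weights[0])
    for W, G in zip(weights, grams):
        gram_sum += G
        weighted_sum += G @ W
    # Small ridge term keeps the linear system well conditioned.
    gram_sum += eps * np.eye(d_in)
    return np.linalg.solve(gram_sum, weighted_sum)

# Toy usage with two hypothetical fine-tuned models sharing a d_in=8, d_out=4 layer.
rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(100, 8)), rng.normal(size=(100, 8))
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
W_merged = merge_linear_weights([W1, W2], [X1.T @ X1, X2.T @ X2])
print(W_merged.shape)  # (8, 4)
```

Applying such a per-layer solve to every linear module (with plain averaging for any remaining parameters) is one way the "dataless" constraint can be respected, since only inner-product statistics of layer inputs, rather than training examples, need to be exchanged.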
