Dataless Knowledge Fusion by Merging Weights of Language Models (2212.09849v5)
Abstract: Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models. Oftentimes, fine-tuned models are readily available but their training data is not, due to data privacy or intellectual property concerns. This creates a barrier to fusing knowledge across individual models to yield a better single model. In this paper, we study the problem of merging individual models built on different training data sets to obtain a single model that performs well across all data-set domains and generalizes to out-of-domain data. We propose a dataless knowledge fusion method that merges models in their parameter space, guided by weights that minimize prediction differences between the merged model and the individual models. Over a battery of evaluation settings, we show that the proposed method significantly outperforms baselines such as Fisher-weighted averaging or model ensembling. Further, we find that our method is a promising alternative to multi-task learning: it preserves, and sometimes improves on, the performance of the individual models without access to the training data. Finally, model merging is more efficient than training a multi-task model, making it applicable to a wider set of scenarios.
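The merging rule described in the abstract, parameters chosen to minimize prediction differences between the merged model and the individual models, has a simple closed form for a single linear layer. Below is a minimal PyTorch sketch of that least-squares idea, not necessarily the paper's exact algorithm: it assumes each model owner ships, alongside the weights, a Gram matrix of that layer's inputs, so the merger itself never touches training data. The names `merge_linear_weights` and `gram_matrices` are illustrative.

```python
import torch

def merge_linear_weights(weights, gram_matrices, eps=1e-6):
    """Merge the weight matrices of one linear layer taken from several models.

    weights:       list of (d_in, d_out) tensors, one per individual model
    gram_matrices: list of (d_in, d_in) tensors, G_i = X_i^T X_i computed
                   from model i's own layer inputs (the merger needs only
                   these statistics, not the raw data)

    Returns the W minimizing sum_i ||X_i W - X_i W_i||^2, i.e.
    W* = (sum_i G_i)^{-1} (sum_i G_i W_i).
    """
    sum_g = torch.zeros_like(gram_matrices[0])
    sum_gw = torch.zeros_like(weights[0])
    for w, g in zip(weights, gram_matrices):
        sum_g = sum_g + g
        sum_gw = sum_gw + g @ w
    # A small ridge term keeps the linear system well conditioned.
    sum_g = sum_g + eps * torch.eye(sum_g.shape[0])
    return torch.linalg.solve(sum_g, sum_gw)

# Toy usage: merge two linear layers "fine-tuned" on different (random) inputs.
torch.manual_seed(0)
x1, x2 = torch.randn(64, 8), torch.randn(64, 8)   # per-model layer inputs
w1, w2 = torch.randn(8, 4), torch.randn(8, 4)     # per-model weights
merged = merge_linear_weights([w1, w2], [x1.T @ x1, x2.T @ x2])
print(merged.shape)  # torch.Size([8, 4])
```

Setting every Gram matrix to the identity reduces this to plain parameter averaging, which makes explicit how the input statistics steer the merge; Fisher-weighted averaging, one of the baselines mentioned in the abstract, instead weights parameters by per-model Fisher information.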
- Git re-basin: Merging models modulo permutation symmetries. arXiv preprint arXiv:2209.04836, 2022.
- Emotions from text: machine learning for text-based emotion prediction. In Proceedings of human language technology conference and conference on empirical methods in natural language processing, pp. 579–586, 2005.
- Ensemble of averages: Improving model selection and boosting performance in domain generalization. arXiv preprint arXiv:2110.10832, 2021.
- SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055, 2017.
- SWAD: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34:22405–22418, 2021.
- Fusing finetuned models for better pretraining. arXiv preprint arXiv:2204.03044, 2022.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019.
- Automatically constructing a corpus of sentential paraphrases. In Third International Workshop on Paraphrasing (IWP2005), 2005.
- Essentially no barriers in neural network energy landscape. In International conference on machine learning, pp. 1309–1318. PMLR, 2018.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. PMLR, 2020.
- The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1–9, 2007.
- Stochastic weight averaging in parallel: Large-batch training that generalizes well. International Conference on Learning Representations, 2020.
- DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, 2021.
- OntoNotes: The 90% solution. In Proceedings of the human language technology conference of the NAACL, Companion Volume: Short Papers, pp. 57–60, 2006.
- Averaging weights leads to wider optima and better generalization. In UAI, 2018.
- MergeDistill: Merging language models using pre-trained distillation. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2874–2887, 2021.
- Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv preprint arXiv:2208.03306, 2022.
- On the convergence of FedAvg on non-IID data. In International Conference on Learning Representations, 2019.
- DailyDialog: A manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957, 2017.
- FedNLP: Benchmarking federated learning methods for natural language processing tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 157–175, Seattle, United States, July 2022.
- Grounded emotions. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 477–483. IEEE, 2017.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- Data-free knowledge distillation for deep neural networks. NIPS Workshop on Learning with Limited Data, 2017.
- Merging models with Fisher-weighted averaging. arXiv preprint arXiv:2111.09832, 2021.
- Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pp. 1273–1282. PMLR, 2017.
- Saif Mohammad. #Emotional tweets. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics–Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 246–255, 2012.
- WASSA-2017 shared task on emotion intensity. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 34–49, 2017.
- Sentiment, emotion, purpose, and style in electoral tweets. Information Processing & Management, 51(4):480–499, 2015.
- Zero-shot knowledge distillation in deep networks. In International Conference on Machine Learning, pp. 4743–4751. PMLR, 2019.
- What is being transferred in transfer learning? In Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 512–523, 2020.
- Model fusion of heterogeneous neural networks via cross-layer alignment. arXiv preprint arXiv:2110.15538, 2021.
- Laura Ana Maria Oberländer and Roman Klinger. An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2104–2119, 2018.
- Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11:169–198, 1999.
- PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
- Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
- What to pre-train on? Efficient intermediate task selection. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10585–10605, Online and Punta Cana, Dominican Republic, November 2021.
- Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5231–5247, Online, July 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67, 2020.
- SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383–2392, 2016.
- Diverse weight averaging for out-of-distribution generalization. arXiv preprint arXiv:2205.09739, 2022.
- Temporally-informed analysis of named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7605–7617, Online, July 2020.
- Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33(1):1–39, 2010.
- Erik Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147, 2003.
- Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology, 66(2):310, 1994.
- Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 13–23, 2017.
- Model fusion via optimal transport. Advances in Neural Information Processing Systems, 33:22045–22055, 2020.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.
- SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 70–74, 2007.
- GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2018.
- Federated learning with matched averaging. In International Conference on Learning Representations, 2020a.
- Multi-domain named entity recognition with genre-aware and agnostic inference. In Proceedings of the 58th annual meeting of the association for computational linguistics, pp. 8476–8488, 2020b.
- AdaMix: Mixture-of-Adapter for parameter-efficient tuning of large language models. arXiv preprint arXiv:2205.12410, 2022.
- Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
- When to use multi-task learning vs intermediate fine-tuning for pre-trained encoder transfer learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 272–282, Dublin, Ireland, May 2022.
- A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, 2018.
- HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, pp. 23965–23998. PMLR, 2022.