Merging by Matching Models in Task Parameter Subspaces (2312.04339v2)
Abstract: Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a "task parameter subspace" in which models are matched before being merged. We connect the task parameter subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the "task parameter subspace". We ultimately demonstrate that our merging framework, called "Matching Models in their Task Parameter Subspace" (MaTS), achieves state-of-the-art results in multitask and intermediate-task model merging. We release all of the code and checkpoints used in our work at https://github.com/r-three/mats.
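To make the "merging as a linear system" view concrete, below is a minimal, illustrative sketch (not the paper's implementation) that merges two models by solving (Σ_t F_t) θ* = Σ_t F_t θ_t with a plain conjugate gradient loop. For simplicity it assumes diagonal Fisher-style weights, a case where the closed-form solution (Fisher-weighted averaging) exists and can be used to check the CG answer; the paper's point is that CG also handles richer, non-diagonal systems where no closed form is available. All variable names, dimensions, and the diagonal-Fisher choice are assumptions made for this example.

```python
import numpy as np

def conjugate_gradient(matvec, b, x0, num_steps=100, tol=1e-8):
    """Solve A x = b for symmetric positive (semi-)definite A,
    accessed only through matrix-vector products."""
    x = x0.copy()
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs_old = r @ r
    for _ in range(num_steps):
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Hypothetical setup: two task-specific parameter vectors and
# diagonal Fisher-style importance weights (all synthetic).
rng = np.random.default_rng(0)
dim = 1000
theta_1, theta_2 = rng.normal(size=dim), rng.normal(size=dim)
fisher_1 = rng.uniform(0.1, 1.0, size=dim)
fisher_2 = rng.uniform(0.1, 1.0, size=dim)

# Linear system (F_1 + F_2) theta* = F_1 theta_1 + F_2 theta_2.
def matvec(v):
    return (fisher_1 + fisher_2) * v

b = fisher_1 * theta_1 + fisher_2 * theta_2
theta_init = 0.5 * (theta_1 + theta_2)  # simple parameter average as init
theta_merged = conjugate_gradient(matvec, b, theta_init)

# With diagonal weights the closed form is elementwise Fisher-weighted
# averaging; use it only to verify the CG solution here.
closed_form = b / (fisher_1 + fisher_2)
print(np.max(np.abs(theta_merged - closed_form)))  # should be ~0
```

In this sketch the merged parameters equal the weighted average because the system is diagonal; swapping `matvec` for a structured (e.g., Kronecker-factored) curvature product is what would make the closed form unavailable while CG still applies.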