Merging by Matching Models in Task Parameter Subspaces (2312.04339v2)

Published 7 Dec 2023 in cs.LG and cs.CL

Abstract: Model merging aims to cheaply combine individual task-specific models into a single multitask model. In this work, we view past merging methods as leveraging different notions of a "task parameter subspace" in which models are matched before being merged. We connect the task parameter subspace of a given model to its loss landscape and formalize how this approach to model merging can be seen as solving a linear system of equations. While past work has generally been limited to linear systems that have a closed-form solution, we consider using the conjugate gradient method to find a solution. We show that using the conjugate gradient method can outperform closed-form solutions, enables merging via linear systems that are otherwise intractable to solve, and flexibly allows choosing from a wide variety of initializations and estimates for the "task parameter subspace". We ultimately demonstrate that our merging framework, called "Matching Models in their Task Parameter Subspace" (MaTS), achieves state-of-the-art results in multitask and intermediate-task model merging. We release all of the code and checkpoints used in our work at https://github.com/r-three/mats.
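The abstract frames merging as matching models in a shared task parameter subspace and then solving a linear system, with the conjugate gradient method standing in when no closed-form solution is tractable. The sketch below illustrates that idea under simplifying assumptions: diagonal Fisher estimates of each model's task parameter subspace, flattened parameter vectors, and SciPy's conjugate gradient solver. It is not the paper's MaTS implementation (see the linked repository); the function `merge_with_cg` and all variable names here are illustrative.

```python
# Minimal sketch of merging-as-a-linear-system, assuming diagonal Fisher
# estimates of each task's parameter subspace. Matching-then-merging amounts
# to solving (sum_i F_i) theta* = sum_i F_i theta_i, which we solve with the
# conjugate gradient method rather than in closed form.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def merge_with_cg(thetas, fishers, init=None, maxiter=100):
    """thetas, fishers: lists of 1-D arrays (flattened parameters / diagonal Fishers)."""
    A_diag = np.sum(fishers, axis=0)                               # sum_i F_i (diagonal)
    b = np.sum([f * t for f, t in zip(fishers, thetas)], axis=0)   # sum_i F_i theta_i

    # Expose the system matrix only through a matrix-vector product, so the
    # same CG call also works for richer, matrix-free subspace estimates.
    A = LinearOperator((b.size, b.size), matvec=lambda v: A_diag * np.ravel(v))

    x0 = init if init is not None else np.mean(thetas, axis=0)     # e.g. parameter-average init
    merged, info = cg(A, b, x0=x0, maxiter=maxiter)
    return merged

# Toy example: merge two "task models" with 5 parameters each.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=5), rng.normal(size=5)]
fishers = [rng.uniform(0.1, 1.0, size=5), rng.uniform(0.1, 1.0, size=5)]
print(merge_with_cg(thetas, fishers))
```

Because the solver only needs matrix-vector products, the same pattern extends to non-diagonal subspace estimates (e.g. Kronecker-factored Fishers) whose linear systems have no tractable closed-form solve, which is the setting where the abstract argues conjugate gradient pays off.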
