
Merging Models with Fisher-Weighted Averaging (2111.09832v2)

Published 18 Nov 2021 in cs.LG

Abstract: Averaging the parameters of models that have the same architecture and initialization can provide a means of combining their respective capabilities. In this paper, we take the perspective that this "merging" operation can be seen as choosing parameters that approximately maximize the joint likelihood of the posteriors of the models' parameters. Computing a simple average of the models' parameters therefore corresponds to making an isotropic Gaussian approximation to their posteriors. We develop an alternative merging procedure based on the Laplace approximation where we approximate each model's posterior as a Gaussian distribution whose precision matrix corresponds to its Fisher information. We first show that our "Fisher merging" technique provides a performance boost in settings where simple parameter averaging is currently used -- specifically, robust fine-tuning and model ensembling. Then, we compare merging to standard gradient-based transfer learning and demonstrate that merging enables a fundamentally different method for transferring capabilities across models. Specifically, we show that Fisher merging is competitive with gradient-based transfer learning approaches (while being significantly cheaper) in intermediate-task training and domain-adaptive pre-training. We also show that our merging procedure makes it possible to combine models in previously unexplored ways. We release our code to facilitate future research into methods for merging models.

Citations (264)

Summary

  • The paper presents Fisher merging, a parameter-averaging method that weights each parameter by its Fisher information to improve merging fidelity.
  • The merge admits a closed-form solution, making it substantially cheaper than gradient-based alternatives while outperforming isotropic (unweighted) averaging.
  • Empirical evaluations in ensembling, fine-tuning, and domain adaptation demonstrate competitive performance and broader applicability in model transfer.

Overview of "Merging Models with Fisher-Weighted Averaging"

The paper "Merging Models with Fisher-Weighted Averaging" by Matena and Raffel presents a novel approach to model merging, enhancing traditional parameter averaging through the incorporation of Fisher Information. This work repositions model merging by employing a statistical rationale, which involves optimizing the joint likelihood of posteriors approximated via Gaussian distributions. It argues that simple parameter averaging, commonly used in federated learning and ensembling, represents a baseline method where each model’s posterior is assumed isotropic. The introduction of the Fisher Information Matrix as a weighting mechanism refines this process, allowing for more nuanced merging that accounts for parameter sensitivities.

The authors position merging not merely as an operational convenience but as a fundamentally different route for transferring capabilities across models, one that can replace or complement gradient-based transfer learning. This approach is empirically evaluated across multiple settings, including robust fine-tuning, model ensembling, intermediate-task training, and domain-adaptive pre-training, providing a comprehensive picture of its efficacy.

Key Contributions

  • Fisher Merging Methodology: The paper introduces parameter averaging weighted by the Fisher information, demonstrating superior performance over isotropic merging across numerous experimental settings.
  • Closed-Form Solution: Because the merge has a closed-form, per-parameter solution, it avoids iterative optimization entirely and is far cheaper than gradient-based training (see the sketch after this list).
  • Empirical Analysis: The method is tested extensively with pre-trained language models such as BERT and RoBERTa on benchmark datasets. The authors report that Fisher merging is statistically comparable to, and often better than, traditional transfer learning techniques at a fraction of the computational cost.
  • Statistical Framing: Viewing parameter averaging as maximizing the joint likelihood of the models' posteriors turns merging into a well-posed statistical optimization problem, opening avenues for more advanced forms of model merging.
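For illustration, here is a minimal PyTorch sketch of the two ingredients: estimating a diagonal Fisher and computing the Fisher-weighted merge. This is an illustrative reconstruction under simplifying assumptions, not the authors' released code; the function names are hypothetical, and it squares batch-averaged gradients for brevity where the paper accumulates per-example squared gradients.

```python
import torch
import torch.nn.functional as F

def estimate_diagonal_fisher(model, data_loader, n_batches=50):
    """Approximate each parameter's diagonal Fisher information as the
    average squared gradient of the log-likelihood, with labels sampled
    from the model's own predictive distribution."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for i, (x, _) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=-1)
        # Sample labels from the model's predictions (the "true" Fisher)
        # rather than using the dataset labels (the "empirical" Fisher).
        y = torch.multinomial(log_probs.exp(), num_samples=1).squeeze(-1)
        F.nll_loss(log_probs, y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                # Simplification: squares the batch-averaged gradient;
                # the paper averages per-example squared gradients.
                fisher[n] += p.grad.detach() ** 2
    return {n: f / n_batches for n, f in fisher.items()}

def fisher_merge(models, fishers, coeffs=None, eps=1e-8):
    """Merge parameters with the closed-form Fisher-weighted average."""
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    params = [dict(m.named_parameters()) for m in models]
    merged = {}
    for name in fishers[0]:
        num = sum(c * f[name] * p[name].detach()
                  for c, f, p in zip(coeffs, fishers, params))
        den = sum(c * f[name] for c, f in zip(coeffs, fishers))
        merged[name] = num / (den + eps)  # eps guards zero-Fisher entries
    return merged
```

The resulting dictionary can be loaded into a model of the same architecture via `load_state_dict` (with `strict=False`, since `named_parameters` omits buffers such as batch-norm statistics).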

Experimental Highlights

The authors present experimental results that underscore the viability of Fisher merging in diverse scenarios:

  • Model Ensembling: Fisher merging outperforms isotropic merging and approaches the performance of traditional prediction-averaging ensembles while requiring only a single model at inference time.
  • Robust Fine-Tuning: When used to merge models for robustness, Fisher-weighted averaging improves both in-distribution (IID) and out-of-distribution (OOD) performance relative to simple parameter averaging.
  • Intermediate-Task Transfer Learning: Fisher merging rivals conventional intermediate-task fine-tuning while requiring far less computation.
  • Domain Adaptation: In domain-adaptive pre-training, Fisher merging delivers competitive performance gains, pointing to a cheaper way of incorporating domain-specific data into existing models.

Implications and Future Prospects

The introduction of Fisher-weighted averaging as a merging technique has significant implications for the efficiency and scope of model reuse. It suggests a paradigm in which capabilities can be transferred between models without iterative fine-tuning, substantially lowering the cost of combining models at scale. Given its competitive performance at a fraction of the compute, Fisher merging could broaden access to model development in resource-constrained settings.

Further research might consider more sophisticated approximations of the Fisher matrix, such as moving beyond the diagonal approximation, or explore other posterior approximations for weighting parameter merges. The paper lays the groundwork for these explorations while adding a substantial tool to the methodological toolkit available to machine learning researchers. Future work could also extend Fisher merging beyond the NLP and vision tasks investigated in the paper, evaluating its utility across other domains and model architectures.