- The paper introduces a novel Bayesian framework that generates fast previews for selecting optimal task weights in multitask finetuning.
- It leverages flexible posterior approximations, such as Laplace and Gaussian mixtures, to produce more faithful previews and thereby help counter data imbalance and negative transfer.
- Experiments on ResNets, Vision Transformers, and LLMs show that the previews closely track full multitask finetuning at substantially reduced computational cost.
Overview of Multitask Finetuning via Bayesian Model-Merging
The paper presents a novel approach to the challenges inherent in multitask finetuning of large models. The growing popularity of multitask finetuning, driven by the success of large pretrained models, calls for an effective mechanism for weighting tasks to optimize performance and to mitigate complications such as data imbalance, task interference, and negative transfer. However, determining good weights is both difficult and computationally expensive. The authors propose leveraging Bayesian model-merging to generate fast previews, enabling rapid exploration of the weighting space without exhaustive retraining.
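To make the idea concrete, the loop below sketches how previews replace repeated finetuning runs. It is a minimal sketch with hypothetical helper names (eval_fn, candidate_weightings); the actual merging rules are described under Methodology. Per-task models are merged under each candidate weighting and scored on validation data, and only the winning weighting is used for the expensive multitask finetune.

```python
def weighted_average_merge(task_params, weights):
    """Simplest possible merge: convex combination of per-task parameter vectors."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, task_params)) / total


def preview_task_weightings(task_params, candidate_weightings, eval_fn):
    """Cheaply score candidate task weightings using merged models as previews."""
    best_weights, best_score = None, float("-inf")
    for weights in candidate_weightings:
        preview_model = weighted_average_merge(task_params, weights)
        score = eval_fn(preview_model)  # e.g. average validation metric over all tasks
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights  # only this weighting gets a full multitask finetune
```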
Key Contributions
The central contribution of this research lies in developing a Bayesian framework to design merging strategies via more flexible posterior distributions. Unlike previous methods that focus exclusively on selecting the best-performing weights, this method allows for a broader exploration of weights, potentially improving the quality of previews. The authors validate their approach across various benchmarks for both vision and natural-language transformers, demonstrating that flexible posteriors facilitate better previews and, consequently, enable more accurate multitask finetuning.
Methodology
The authors propose generating fast previews of performance across a range of task weights by reusing and averaging parameters of models trained on each task separately. A Bayesian approach is adopted to enhance preview quality by leveraging Gaussian and exponential-family posteriors to form new merging strategies.
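For the Gaussian (Laplace) case, merging reduces to an elementwise precision-weighted average of the per-task models, optionally anchored at the pretrained model. The sketch below assumes flattened parameter vectors and diagonal curvature estimates, and illustrates the standard precision-weighted product of Gaussians rather than the authors' exact formulation.

```python
import torch

def precision_weighted_merge(thetas, hessians, alphas, theta0=None, h0=None):
    """
    Hessian-weighted (Laplace) merge of per-task models.
    thetas[k], hessians[k]: flattened parameters and diagonal curvature for task k.
    alphas[k]: task weight; theta0/h0: optional pretrained anchor and prior precision.
    Returns merged parameters previewing the alpha-weighted multitask model.
    """
    num = torch.zeros_like(thetas[0])
    den = torch.zeros_like(thetas[0])
    if theta0 is not None and h0 is not None:
        num += h0 * theta0
        den += h0
    for alpha, theta, hess in zip(alphas, thetas, hessians):
        num += alpha * hess * theta
        den += alpha * hess
    return num / den.clamp_min(1e-12)  # guard against zero curvature entries
```

Setting all curvature terms to a constant recovers the plain weighted average, which is what makes the Hessian-weighted variants strictly more expressive.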
Three merging strategies are presented; a brief sketch of how their curvature estimates feed the merge follows the list:
- ADAMW-SG: Utilizing AdamW to derive Laplace posteriors and computing previews via Hessian-weighted merging.
- IVON-HESS: Applying variational learning through IVON, which requires fewer data passes and improves computational efficiency.
- MULTI IVON-HESS: Expanding upon IVON-HESS with multiple runs per task, yielding a richer posterior approximation through mixtures of Gaussians.
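The variants differ mainly in where the diagonal curvature comes from and how many posterior components are kept. Below is a hedged sketch: the AdamW state key exp_avg_sq is PyTorch's squared-gradient accumulator, and the mixture-handling step (merging each component set separately and averaging the preview scores) is an illustrative choice, not necessarily the authors' exact procedure.

```python
import torch

def curvature_from_adamw(optimizer):
    """ADAMW-SG-style proxy: reuse AdamW's squared-gradient EMA as diagonal curvature
    (available in optimizer.state after at least one optimization step)."""
    chunks = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            chunks.append(optimizer.state[p]["exp_avg_sq"].flatten())
    return torch.cat(chunks)


def mixture_preview(component_thetas, component_hessians, alphas, eval_fn, merge_fn):
    """
    MULTI IVON-HESS-style sketch: each task contributes several (theta, hess) pairs
    from independent IVON runs; each component set is merged separately and the
    resulting preview scores are averaged.
    component_thetas[m][k], component_hessians[m][k]: component m of task k.
    """
    scores = []
    for thetas, hessians in zip(component_thetas, component_hessians):
        merged = merge_fn(thetas, hessians, alphas)  # e.g. precision_weighted_merge
        scores.append(eval_fn(merged))
    return sum(scores) / len(scores)
```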
Experimental Validation
The authors demonstrate the efficacy of their method with experiments on multiple neural architectures, including ResNets, Vision Transformers, and LLMs such as GEMMA-2B. Results consistently indicate that more expressive posteriors lead to previews that closely match multitask-finetuned models. Moreover, more complex posterior forms, such as mixtures of Gaussians, can further refine preview quality, albeit at greater computational cost.
Implications and Future Directions
This work has significant potential to improve compute efficiency when training large models on multiple tasks. By providing a means to quickly assess weighting strategies, it steers the optimization process effectively, saving both time and computational resources. The proposed approach could make efficient multitask weighting practical across a wide range of machine-learning applications, from language translation to image classification.
Looking forward, there is an opportunity to reduce the cost of highly expressive posterior approximations, such as mixture-based ones, particularly for very large models. Other promising directions include modeling more complex task interactions and introducing adaptive weighting strategies that adjust dynamically based on intermediate task performance.