- The paper introduces a novel Bayesian framework that generates fast previews for selecting optimal task weights in multitask finetuning.
- It leverages flexible posterior approximations, such as Laplace and Gaussian mixtures, to produce more faithful previews and thereby help counter data imbalance and negative transfer.
- Experiments on ResNets, Vision Transformers, and LLMs show that the previews closely track full multitask finetuning at substantially reduced computational cost.
Overview of Multitask Finetuning via Bayesian Model-Merging
The paper presents a novel approach to the challenges inherent in multitask finetuning of large models. The growing popularity of multitask finetuning, driven by the success of large pretrained models, calls for an effective mechanism for weighting tasks to optimize performance and to mitigate complications such as data imbalance, task interference, and negative transfer. However, determining good weights is both difficult and computationally expensive. The authors propose leveraging Bayesian model-merging to generate fast previews, enabling rapid exploration of the weighting space without exhaustive retraining.
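To make the idea concrete, the loop below sketches how previews replace repeated finetuning runs. It is a minimal sketch with hypothetical helper names (eval_fn, candidate_weightings); the actual merging rules are described under Methodology. Per-task models are merged under each candidate weighting and scored on validation data, and only the winning weighting is used for the expensive multitask finetune.

```python
def weighted_average_merge(task_params, weights):
    """Simplest possible merge: convex combination of per-task parameter vectors."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, task_params)) / total


def preview_task_weightings(task_params, candidate_weightings, eval_fn):
    """Cheaply score candidate task weightings using merged models as previews."""
    best_weights, best_score = None, float("-inf")
    for weights in candidate_weightings:
        preview_model = weighted_average_merge(task_params, weights)
        score = eval_fn(preview_model)  # e.g. average validation metric over all tasks
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights  # only this weighting gets a full multitask finetune
```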
Key Contributions
The central contribution of this research lies in developing a Bayesian framework to design merging strategies via more flexible posterior distributions. Unlike previous methods that focus exclusively on selecting the best-performing weights, this method allows for a broader exploration of weights, potentially improving the quality of previews. The authors validate their approach across various benchmarks for both vision and natural-language transformers, demonstrating that flexible posteriors facilitate better previews and, consequently, enable more accurate multitask finetuning.
Methodology
The authors propose generating fast previews of performance across a range of task weights by reusing and averaging parameters of models trained on each task separately. A Bayesian approach is adopted to enhance preview quality by leveraging Gaussian and exponential-family posteriors to form new merging strategies.
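For the Gaussian (Laplace) case, merging reduces to an elementwise precision-weighted average of the per-task models, optionally anchored at the pretrained model. The sketch below assumes flattened parameter vectors and diagonal curvature estimates, and illustrates the standard precision-weighted product of Gaussians rather than the authors' exact formulation.

```python
import torch

def precision_weighted_merge(thetas, hessians, alphas, theta0=None, h0=None):
    """
    Hessian-weighted (Laplace) merge of per-task models.
    thetas[k], hessians[k]: flattened parameters and diagonal curvature for task k.
    alphas[k]: task weight; theta0/h0: optional pretrained anchor and prior precision.
    Returns merged parameters previewing the alpha-weighted multitask model.
    """
    num = torch.zeros_like(thetas[0])
    den = torch.zeros_like(thetas[0])
    if theta0 is not None and h0 is not None:
        num += h0 * theta0
        den += h0
    for alpha, theta, hess in zip(alphas, thetas, hessians):
        num += alpha * hess * theta
        den += alpha * hess
    return num / den.clamp_min(1e-12)  # guard against zero curvature entries
```

Setting all curvature terms to a constant recovers the plain weighted average, which is what makes the Hessian-weighted variants strictly more expressive.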
Three merging strategies are presented; a brief sketch of how their curvature estimates feed the merge follows the list:
- ADAMW-SG: Utilizing AdamW to derive Laplace posteriors and computing previews via Hessian-weighted merging.
- IVON-HESS: Applying variational learning through IVON, which requires fewer data passes and improves computational efficiency.
- MULTI IVON-HESS: Expanding upon IVON-HESS with multiple runs per task, yielding a richer posterior approximation through mixtures of Gaussians.
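The variants differ mainly in where the diagonal curvature comes from and how many posterior components are kept. Below is a hedged sketch: the AdamW state key exp_avg_sq is PyTorch's squared-gradient accumulator, and the mixture-handling step (merging each component set separately and averaging the preview scores) is an illustrative choice, not necessarily the authors' exact procedure.

```python
import torch

def curvature_from_adamw(optimizer):
    """ADAMW-SG-style proxy: reuse AdamW's squared-gradient EMA as diagonal curvature
    (available in optimizer.state after at least one optimization step)."""
    chunks = []
    for group in optimizer.param_groups:
        for p in group["params"]:
            chunks.append(optimizer.state[p]["exp_avg_sq"].flatten())
    return torch.cat(chunks)


def mixture_preview(component_thetas, component_hessians, alphas, eval_fn, merge_fn):
    """
    MULTI IVON-HESS-style sketch: each task contributes several (theta, hess) pairs
    from independent IVON runs; each component set is merged separately and the
    resulting preview scores are averaged.
    component_thetas[m][k], component_hessians[m][k]: component m of task k.
    """
    scores = []
    for thetas, hessians in zip(component_thetas, component_hessians):
        merged = merge_fn(thetas, hessians, alphas)  # e.g. precision_weighted_merge
        scores.append(eval_fn(merged))
    return sum(scores) / len(scores)
```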
Experimental Validation
The authors demonstrate the efficacy of their method with experiments on multiple neural architectures, including ResNets, Vision Transformers, and LLMs such as GEMMA-2B. Results consistently indicate that more expressive posteriors lead to previews that closely match multitask-finetuned models. Moreover, more complex posterior forms, such as mixtures of Gaussians, can further refine preview quality, albeit at greater computational cost.
Implications and Future Directions
This work has significant potential to improve compute efficiency when training large models on multiple tasks. By providing a means to quickly assess weighting strategies, it steers the optimization process effectively, saving both time and computational resources. The proposed approach could make efficient multitask weighting practical across a wide range of machine-learning applications, from language translation to image classification.
Looking forward, there is an opportunity to reduce the cost of highly expressive posterior approximations, such as mixture-based ones, particularly for very large models. Other promising directions include modeling more complex task interactions and introducing adaptive weighting strategies that adjust dynamically based on intermediate task performance.