Merging Multi-Task Models via Weight-Ensembling Mixture of Experts (2402.00433v2)

Published 1 Feb 2024 in cs.LG and cs.CV

Abstract: Merging various task-specific Transformer-based models trained on different tasks into a single unified model yields a model that can execute all the tasks concurrently. Previous methods, exemplified by task arithmetic, have been proven to be both effective and scalable. Existing methods have primarily focused on seeking a static optimal solution within the original model parameter space. A notable challenge is mitigating the interference between parameters of different models, which can substantially deteriorate performance. In this paper, we propose to merge most of the parameters while upscaling the MLP of the Transformer layers to a weight-ensembling mixture of experts (MoE) module, which can dynamically integrate shared and task-specific knowledge based on the input, thereby providing a more flexible solution that can adapt to the specific needs of each instance. Our key insight is that by identifying and separating shared knowledge and task-specific knowledge, and then dynamically integrating them, we can mitigate the parameter interference problem to a great extent. We conduct conventional multi-task model merging experiments and evaluate the generalization and robustness of our method. The results demonstrate the effectiveness of our method and provide a comprehensive understanding of it. The code is available at https://github.com/tanganke/weight-ensembling_MoE
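To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch (assuming PyTorch 2.x for torch.func.functional_call) of a weight-ensembling MoE layer: the pre-trained MLP weights act as shared knowledge, fine-tuned-minus-pre-trained task vectors act as task-specific knowledge, and a learned router mixes them per input. All names (MLP, WeightEnsemblingMoE, router, etc.) are illustrative assumptions, not the authors' implementation; routing granularity and the softmax normalization are simplifications.

```python
import torch
import torch.nn as nn
from torch.func import functional_call


class MLP(nn.Module):
    """Stand-in for a Transformer MLP block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.fc2(torch.nn.functional.gelu(self.fc1(x)))


class WeightEnsemblingMoE(nn.Module):
    """Mixes MLP *weights* per input: w(x) = w_pre + sum_t r_t(x) * tau_t."""
    def __init__(self, dim, pretrained_mlp, finetuned_mlps):
        super().__init__()
        self.template = pretrained_mlp  # used as a stateless function below
        # Shared knowledge: frozen pre-trained MLP parameters.
        self.base = {k: p.detach().clone() for k, p in pretrained_mlp.named_parameters()}
        # Task-specific knowledge: task vectors (fine-tuned minus pre-trained).
        self.task_vectors = [
            {k: (ft.state_dict()[k] - self.base[k]).clone() for k in self.base}
            for ft in finetuned_mlps
        ]
        # Learnable router producing one mixing coefficient per task.
        self.router = nn.Linear(dim, len(finetuned_mlps))

    def forward(self, x):  # x: (batch, seq, dim)
        # One routing vector per sample (mean-pooled over tokens), normalized.
        coeff = self.router(x.mean(dim=1)).softmax(dim=-1)  # (batch, n_tasks)
        outputs = []
        for b in range(x.shape[0]):
            # Assemble sample-specific MLP weights from shared + task parts.
            params = {
                k: self.base[k]
                + sum(coeff[b, t] * tv[k] for t, tv in enumerate(self.task_vectors))
                for k in self.base
            }
            outputs.append(functional_call(self.template, params, (x[b:b + 1],)))
        return torch.cat(outputs, dim=0)


# Toy usage: merge a pre-trained MLP with two task-specific fine-tuned MLPs.
dim, hidden = 16, 64
pre = MLP(dim, hidden)
fts = [MLP(dim, hidden), MLP(dim, hidden)]
moe = WeightEnsemblingMoE(dim, pre, fts)
print(moe(torch.randn(4, 8, dim)).shape)  # torch.Size([4, 8, 16])
```

This only illustrates the input-conditioned mixing of shared and task-specific MLP weights; in the paper, the non-MLP parameters are merged statically and the routing details differ.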

Authors (6)
  1. Anke Tang (14 papers)
  2. Li Shen (362 papers)
  3. Yong Luo (117 papers)
  4. Nan Yin (33 papers)
  5. Lefei Zhang (64 papers)
  6. Dacheng Tao (826 papers)
Citations (20)