Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging (2410.21804v1)
Abstract: Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interference. To address this challenge, in this paper, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after fine-tuning. Then, our WEMoE statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, whose core mechanism involves eliminating non-essential elements in the critical modules of WEMoE and sharing routing across multiple MoE modules, thereby significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the model merged by WEMoE. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
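To make the abstract's mechanism concrete, below is a minimal PyTorch sketch of the weight-ensembling MoE idea: non-critical weights are merged statically via task arithmetic, while a critical linear layer keeps per-task weight deltas (task vectors) as experts and uses a small router to produce input-dependent merging coefficients at inference time. The class and function names (`WeightEnsemblingLinear`, `static_task_arithmetic`), the router architecture, the scaling coefficient, and the batch-level routing are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of weight-ensembling MoE merging (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def static_task_arithmetic(pretrained: torch.Tensor,
                           finetuned: list[torch.Tensor],
                           scale: float = 0.3) -> torch.Tensor:
    """Statically merge a non-critical weight: W = W_0 + scale * sum_t (W_t - W_0)."""
    task_vectors = [w - pretrained for w in finetuned]
    return pretrained + scale * torch.stack(task_vectors).sum(dim=0)


class WeightEnsemblingLinear(nn.Module):
    """A critical linear layer upgraded to a weight-ensembling MoE.

    The experts are the per-task weight deltas (task vectors); a router maps
    the input features to per-task coefficients, and the effective weight is
    assembled on the fly for each batch (assumes the layer has a bias).
    """

    def __init__(self, pretrained: nn.Linear, finetuned: list[nn.Linear]):
        super().__init__()
        self.register_buffer("w0", pretrained.weight.data.clone())
        self.register_buffer("b0", pretrained.bias.data.clone())
        # Frozen task vectors acting as experts.
        self.register_buffer(
            "tau_w", torch.stack([ft.weight.data - self.w0 for ft in finetuned]))
        self.register_buffer(
            "tau_b", torch.stack([ft.bias.data - self.b0 for ft in finetuned]))
        num_tasks = len(finetuned)
        in_features = pretrained.in_features
        # The router is the only trainable component in this sketch.
        self.router = nn.Sequential(
            nn.Linear(in_features, max(in_features // 4, 1)),
            nn.ReLU(),
            nn.Linear(max(in_features // 4, 1), num_tasks),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One coefficient vector per batch (features mean-pooled over all
        # leading dimensions), keeping the weight assembly cheap.
        coeff = self.router(x.mean(dim=tuple(range(x.dim() - 1))))  # (num_tasks,)
        w = self.w0 + torch.einsum("t,tij->ij", coeff, self.tau_w)
        b = self.b0 + torch.einsum("t,ti->i", coeff, self.tau_b)
        return F.linear(x, w, b)


# Toy usage: merge a two-task setup for a single 16-dim linear layer.
pre = nn.Linear(16, 16)
fts = [nn.Linear(16, 16) for _ in range(2)]
layer = WeightEnsemblingLinear(pre, fts)
out = layer(torch.randn(4, 16))
```

Under the same assumptions, the E-WEMoE variant described in the abstract would additionally sparsify the stored task vectors (e.g., magnitude-based pruning of `tau_w`) and share a single router across all MoE-upgraded modules, which is what reduces the trainable and total parameter counts relative to WEMoE.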