
Learning More Generalized Experts by Merging Experts in Mixture-of-Experts (2405.11530v1)

Published 19 May 2024 in cs.LG

Abstract: We observe that incorporating a shared layer in a mixture-of-experts can lead to performance degradation. This leads us to hypothesize that learning shared features poses challenges in deep learning, potentially because the same feature ends up being learned as several distinct features. To address this issue, we track each expert's usage frequency and merge the two most frequently selected experts. We then update the least frequently selected expert with this merged combination. Together with the router's subsequent relearning of expert selection, this approach allows the model to determine whether the most frequently selected experts have learned the same feature in different ways. If they have, the merged expert can be trained further to learn a more general feature. Consequently, our algorithm enhances transfer learning and mitigates catastrophic forgetting when applied to multi-domain task incremental learning.
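
The merging step described in the abstract can be illustrated with a short sketch. This is not the authors' code: it assumes each expert is a simple `nn.Linear` module, that the router's selection counts are already tracked in a `usage` tensor, and that merging is a plain parameter average of the two most frequently selected experts, which then overwrites the least frequently selected one. The function name `merge_experts` is hypothetical.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_experts(experts: nn.ModuleList, usage: torch.Tensor) -> None:
    """Average the two most-used experts and write the result into the least-used one.

    Minimal sketch, not the paper's implementation: experts are assumed to be
    nn.Linear modules and 'merging' is a plain 0.5/0.5 parameter average.
    """
    order = torch.argsort(usage, descending=True)
    top1, top2 = order[0].item(), order[1].item()   # two most frequently selected experts
    least = order[-1].item()                        # least frequently selected expert

    for p_least, p1, p2 in zip(experts[least].parameters(),
                               experts[top1].parameters(),
                               experts[top2].parameters()):
        p_least.copy_(0.5 * (p1 + p2))              # overwrite with the merged weights

# Example usage with 4 hypothetical experts and recorded router selection counts.
experts = nn.ModuleList(nn.Linear(16, 16) for _ in range(4))
usage = torch.tensor([120.0, 95.0, 40.0, 10.0])
merge_experts(experts, usage)
```

After this overwrite, the router continues training and can route tokens either to the merged expert or back to the originals, which is how the method tests whether the two frequent experts had redundantly learned the same feature.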
