Towards an empirical understanding of MoE design choices (2402.13089v1)

Published 20 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: In this study, we systematically evaluate the impact of common design choices in Mixture-of-Experts (MoE) models on validation performance, uncovering distinct influences at the token and sequence levels. We also present empirical evidence showing comparable performance between a learned router and a frozen, randomly initialized router, suggesting that learned routing may not be essential. Our study further reveals that sequence-level routing can result in topic-specific weak expert specialization, in contrast to the syntax specialization observed with token-level routing.
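To make the two routing granularities and the learned-vs-frozen comparison concrete, below is a minimal PyTorch sketch. It is not the authors' implementation; the module name MoELayer, the route_level and freeze_router arguments, and the top-1/mean-pooling routing choices are illustrative assumptions.

```python
# Minimal sketch (assumed implementation, not the paper's code) of an MoE layer
# with token-level vs. sequence-level routing and an optionally frozen,
# randomly initialized router.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int,
                 route_level: str = "token", freeze_router: bool = False):
        super().__init__()
        self.route_level = route_level  # "token" or "sequence"
        # One small feed-forward network per expert.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        # Linear router producing one logit per expert.
        self.router = nn.Linear(d_model, n_experts)
        if freeze_router:
            # Frozen, randomly initialized router: routing weights keep their
            # random initialization and receive no gradient updates.
            for p in self.router.parameters():
                p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        if self.route_level == "sequence":
            # Sequence-level routing: one expert per sequence, chosen from the
            # mean-pooled sequence representation.
            logits = self.router(x.mean(dim=1))        # (batch, n_experts)
            expert_idx = logits.argmax(dim=-1)         # (batch,)
            return torch.stack([self.experts[i](seq)
                                for seq, i in zip(x, expert_idx.tolist())])
        # Token-level routing: an independent top-1 expert per token.
        logits = self.router(x)                        # (batch, seq, n_experts)
        expert_idx = logits.argmax(dim=-1)             # (batch, seq)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                     # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask])
        return out


if __name__ == "__main__":
    layer = MoELayer(d_model=64, n_experts=4,
                     route_level="sequence", freeze_router=True)
    y = layer(torch.randn(2, 16, 64))
    print(y.shape)  # torch.Size([2, 16, 64])
```

With freeze_router=True, only the experts are trained while the routing assignment stays fixed at its random initialization, which is the setting the abstract reports as performing comparably to a learned router.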
