
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models (2402.01739v2)

Published 29 Jan 2024 in cs.CL, cs.AI, cs.DC, and cs.LG

Abstract: To help the open-source community have a better understanding of Mixture-of-Experts (MoE) based LLMs, we train and release OpenMoE, a series of fully open-sourced and reproducible decoder-only MoE LLMs, ranging from 650M to 34B parameters and trained on up to over 1T tokens. Our investigation confirms that MoE-based LLMs can offer a more favorable cost-effectiveness trade-off than dense LLMs, highlighting the potential effectiveness for future LLM development. One more important contribution of this study is an in-depth analysis of the routing mechanisms within our OpenMoE models, leading to three significant findings: Context-Independent Specialization, Early Routing Learning, and Drop-towards-the-End. We discovered that routing decisions in MoE models are predominantly based on token IDs, with minimal context relevance. The token-to-expert assignments are determined early in the pre-training phase and remain largely unchanged. This imperfect routing can result in performance degradation, particularly in sequential tasks like multi-turn conversations, where tokens appearing later in a sequence are more likely to be dropped. Finally, we rethink our design based on the above-mentioned observations and analysis. To facilitate future MoE LLM development, we propose potential strategies for mitigating the issues we found and further improving off-the-shelf MoE LLM designs.

Introduction to OpenMoE

The open-source community recently gained a notable resource with the release of OpenMoE, a series of decoder-only mixture-of-experts (MoE) LLMs. The models range from 650M to 34B parameters and are trained on up to over 1 trillion tokens. The ambition behind OpenMoE is threefold: to document the process of training a decoder-only MoE LLM, to analyze the intricacies of MoE routing mechanisms, and to catalyze further MoE LLM development in the open-source ecosystem.

MoE Efficiency and Open Access

A central finding from OpenMoE is the efficiency of MoE-based LLMs relative to their dense counterparts: MoE LLMs offer a more favorable cost-effectiveness trade-off, indicating their viability for future LLM development. The paper details the strong performance of the OpenMoE-8B/32E models, comparing them with OpenLLaMA-3B and TinyLLaMA-1.1B, two dense models with a higher training cost. Particularly notable is that the OpenMoE-8B/32E-Chat model performed substantially better in single-turn conversations on MT-Bench, indicating its potential in conversational AI applications.
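To make the cost argument concrete, the sketch below compares the parameters touched per token by a dense feed-forward block and by a top-2 MoE feed-forward block. The dimensions, expert count, and top-k value are illustrative assumptions, not the actual OpenMoE or OpenLLaMA configurations.

```python
# Back-of-the-envelope comparison of parameters touched per token by a dense
# FFN block versus a top-2 MoE FFN block. All dimensions below are
# illustrative assumptions, not the real OpenMoE/OpenLLaMA configurations.
d_model = 2048        # hidden size (assumed)
d_ffn = 8192          # FFN inner size (assumed)
num_experts = 32      # experts per MoE layer (assumed)
top_k = 2             # experts activated per token (assumed)

dense_ffn_params = 2 * d_model * d_ffn             # up- and down-projections
moe_total_params = num_experts * dense_ffn_params  # parameters stored
moe_active_params = top_k * dense_ffn_params       # parameters used per token

print(f"dense FFN params per layer:      {dense_ffn_params / 1e6:.0f}M")
print(f"MoE FFN params stored per layer: {moe_total_params / 1e6:.0f}M")
print(f"MoE FFN params used per token:   {moe_active_params / 1e6:.0f}M")
```

The MoE layer stores far more parameters than the dense layer but activates only `top_k` experts per token, which is the source of its favorable compute-versus-quality trade-off.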

In-Depth Analysis of OpenMoE

Perhaps more compelling is the in-depth examination of the routing mechanisms within MoE models. Routing decisions appear to be based largely on token IDs, with little regard for context, and token-to-expert assignments are established early in pre-training and remain largely unchanged thereafter. This characteristic can degrade performance in tasks where sequential understanding is critical, such as multi-turn conversations, because tokens appearing later in a sequence are more likely to be dropped once expert capacity is exhausted.
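The Drop-towards-the-End effect follows directly from token-choice routing with a fixed expert capacity. The toy simulation below, with assumed sizes, a Zipf-like token frequency distribution (typical of natural text), and a made-up token-ID-to-expert mapping, shows how tokens late in a sequence are disproportionately dropped once the capacity buffers fill.

```python
import numpy as np

# Toy simulation of token-choice routing with a fixed expert capacity,
# illustrating the Drop-towards-the-End effect. All sizes, the Zipf-like
# token distribution, and the token-ID-to-expert mapping are illustrative
# assumptions, not OpenMoE's actual configuration.
num_experts = 4
seq_len = 64
capacity_factor = 1.25
capacity = int(capacity_factor * seq_len / num_experts)  # slots per expert

rng = np.random.default_rng(0)
token_ids = rng.zipf(2.0, size=seq_len)      # skewed token frequencies
assigned_expert = token_ids % num_experts    # context-independent routing

slots_used = np.zeros(num_experts, dtype=int)
dropped_positions = []
for pos, expert in enumerate(assigned_expert):
    if slots_used[expert] < capacity:
        slots_used[expert] += 1        # token is processed by its expert
    else:
        dropped_positions.append(pos)  # buffer full: token is dropped

print(f"capacity per expert: {capacity}")
print(f"dropped {len(dropped_positions)} of {seq_len} tokens")
print("dropped positions (skewed toward the end of the sequence):",
      dropped_positions)
```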

Recalibrating the Model Design

The paper does not shy away from acknowledging limitations, such as initially suboptimal design choices in the MoE architecture and an overly code-heavy dataset mix. Reflecting on these aspects yields lessons that could benefit model iteration and innovation in the community. To address the identified challenges, the authors suggest reducing the proportion of code in the training data mix and refining the MoE architecture to mitigate context-independent token routing.
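One simple way to check any MoE checkpoint for context-independent routing is to log (token ID, expert) pairs across many contexts and measure how often each token ID lands on the same expert. The sketch below assumes a hypothetical `routing_log` of such pairs; it is a diagnostic idea in the spirit of the paper's analysis, not an OpenMoE API.

```python
from collections import Counter, defaultdict

# Hypothetical diagnostic for context-independent routing: given logged
# (token_id, expert_id) routing decisions collected from many different
# contexts, measure how often each token ID lands on its most frequent
# expert. `routing_log` is an assumed data structure, not an OpenMoE API.
routing_log = [
    (101, 3), (101, 3), (101, 3),   # token 101 always routed to expert 3
    (205, 1), (205, 1), (205, 2),   # token 205 is routed less consistently
]

by_token = defaultdict(Counter)
for token_id, expert_id in routing_log:
    by_token[token_id][expert_id] += 1

for token_id, counts in by_token.items():
    expert_id, hits = counts.most_common(1)[0]
    consistency = hits / sum(counts.values())
    print(f"token {token_id}: expert {expert_id} chosen "
          f"{consistency:.0%} of the time")
```

Consistency values near 100% across the vocabulary would indicate routing that largely ignores context.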

Conclusion and Future Directions

In closing, OpenMoE marks an evolutionary step in LLM development. It delivers an improved understanding of MoE models, including both their strengths and their areas for improvement. The research articulates potential strategies to address the identified deficiencies, with particular emphasis on more balanced token routing. The initiative lays the groundwork for the open-source community to push the boundaries of LLM capabilities and chart the course for subsequent work in the generative AI landscape.

Authors
  1. Fuzhao Xue
  2. Zian Zheng
  3. Yao Fu
  4. Jinjie Ni
  5. Zangwei Zheng
  6. Wangchunshu Zhou
  7. Yang You