
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

Published 1 Nov 2024 in cs.CL, cs.AI, and cs.LG | arXiv:2411.00918v1

Abstract: Mixture of Experts (MoE) plays an important role in the development of more efficient and effective LLMs. Due to enormous resource requirements, studying large-scale MoE algorithms remains inaccessible to many researchers. This work develops LibMoE, a comprehensive and modular framework to streamline the research, training, and evaluation of MoE algorithms. Built upon three core principles, (i) modular design, (ii) efficient training, and (iii) comprehensive evaluation, LibMoE makes MoE in LLMs more accessible to a wide range of researchers by standardizing the training and evaluation pipelines. Using LibMoE, we extensively benchmarked five state-of-the-art MoE algorithms over three different LLMs and 11 datasets under the zero-shot setting. The results show that, despite their unique characteristics, all MoE algorithms perform roughly similarly when averaged across a wide range of tasks. With its modular design and extensive evaluation, we believe LibMoE will be invaluable for researchers seeking to make meaningful progress towards the next generation of MoE and LLMs. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io


Summary

  • The paper introduces LibMoE, a modular library that accelerates MoE research in large language models.
  • It benchmarks five MoE algorithms on three LLMs across 11 zero-shot benchmarks, revealing broadly similar average performance and potential gains from early stopping.
  • LibMoE democratizes access by enabling efficient training and rapid prototyping with reduced computational demands.

Comprehensive Benchmarking Framework for MoE Algorithms with LibMoE

The paper presents the design and evaluation of LibMoE, a library built to ease research on Mixture of Experts (MoE) algorithms within the domain of LLMs. Its aim is to close the accessibility gap created by the substantial computational resources that large-scale MoE experiments demand. By adhering to the core principles of modular design, efficient training, and comprehensive evaluation, LibMoE provides a streamlined toolkit for MoE research across various LLMs and diverse benchmarks.

Overview of LibMoE

LibMoE is structured to offer extensive support for researchers by including comprehensive tools for training and evaluating MoE algorithms in LLMs. The library integrates a modular architecture that supports distributed training and customizations such as expert-router interactions and balancing losses. This modularity not only aids in the evaluation of existing MoE algorithms but also allows for rapid prototyping and development of novel methodologies. LibMoE employs state-of-the-art sparse upcycling techniques, allowing researchers to transform dense LLM checkpoints into efficient MoE variants, thereby bypassing costly pre-training stages.
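To make the upcycling step concrete, here is a minimal PyTorch sketch of how a pretrained dense FFN can be cloned into experts alongside a freshly initialized router. The function name and signature are illustrative assumptions, not LibMoE's actual API.

```python
import copy
import torch.nn as nn

def upcycle_ffn_to_moe(dense_ffn: nn.Module, num_experts: int, hidden_dim: int) -> nn.ModuleDict:
    """Hypothetical sketch of sparse upcycling: clone a dense FFN into
    N identical experts plus a router learned from scratch."""
    # Each expert starts as an exact copy of the pretrained dense FFN,
    # so training resumes from a strong initialization rather than zero.
    experts = nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))
    # The router has no dense-model counterpart and is freshly initialized.
    router = nn.Linear(hidden_dim, num_experts, bias=False)
    return nn.ModuleDict({"experts": experts, "router": router})
```

Because every expert begins from the same pretrained weights, only the router and the experts' subsequent divergence must be learned, which is what makes upcycling far cheaper than pre-training an MoE model from scratch.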

Benchmarking and Evaluation

The paper highlights the application of LibMoE in conducting an exhaustive benchmarking study on five state-of-the-art MoE algorithms across various model configurations and multiple datasets. These algorithms include SMoE Router, Cosine Router, Sigmoid Router, Hyper Router, and Perturbed Cosine Router. The evaluation focuses on zero-shot settings spanning 11 benchmarks, ensuring a broad and comprehensive assessment of the MoEs' effectiveness. Surprisingly, the study reveals that, despite their unique characteristics, the overall performance of these MoE algorithms is quite similar when metrics are averaged across several tasks.
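For intuition on how these routers differ, the sketch below contrasts a standard SMoE gate (linear scores plus softmax over the selected top-k) with a cosine gate that scores tokens against unit-norm expert embeddings. Function names, the temperature value, and the renormalization scheme are illustrative assumptions rather than LibMoE's exact implementations.

```python
import torch
import torch.nn.functional as F

def smoe_gate(x: torch.Tensor, w_gate: torch.Tensor, k: int = 2):
    """Standard SMoE router: linear scores -> top-k -> softmax over the k."""
    logits = x @ w_gate                        # (tokens, num_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_vals, dim=-1)     # renormalize over selected experts
    return weights, topk_idx

def cosine_gate(x: torch.Tensor, expert_emb: torch.Tensor, k: int = 2, tau: float = 0.07):
    """Cosine router: score tokens against unit-norm expert embeddings.
    Normalization bounds the logits, which is reported to stabilize routing."""
    x_n = F.normalize(x, dim=-1)               # (tokens, hidden)
    e_n = F.normalize(expert_emb, dim=-1)      # (num_experts, hidden)
    logits = (x_n @ e_n.t()) / tau             # temperature tau is an assumed value
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    weights = F.softmax(topk_vals, dim=-1)
    return weights, topk_idx
```

The other benchmarked variants modify this scoring step in a similar spirit, for example replacing the softmax with a sigmoid or perturbing the cosine scores, while leaving the top-k dispatch unchanged.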

One noteworthy finding from the training process is that intermediate checkpoints can outperform the final checkpoint, suggesting potential benefits from early-stopping mechanisms. Furthermore, the expert selection analysis revealed distinct behavioral traits across algorithms, illuminating specialization patterns that depend on the complexity of sub-tasks.
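Checkpoint-level early stopping of this kind reduces to keeping periodic evaluation scores and selecting the best checkpoint by average benchmark performance. The helper below is a hypothetical sketch of that selection, not code from the paper, and the scores shown are illustrative placeholders.

```python
def select_best_checkpoint(checkpoint_scores: dict[str, dict[str, float]]) -> str:
    """Pick the checkpoint whose benchmark scores average highest.

    checkpoint_scores maps checkpoint name -> {benchmark: score}. The final
    training step is not guaranteed to win, which is the motivation for
    checkpoint-level early stopping."""
    def mean_score(scores: dict[str, float]) -> float:
        return sum(scores.values()) / len(scores)
    return max(checkpoint_scores, key=lambda ckpt: mean_score(checkpoint_scores[ckpt]))

# Illustrative usage with made-up numbers:
scores = {
    "step_2000": {"MMBench": 61.2, "GQA": 58.0},
    "step_4000": {"MMBench": 62.5, "GQA": 57.1},
    "final":     {"MMBench": 61.9, "GQA": 56.8},
}
print(select_best_checkpoint(scores))  # may well not be "final"
```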

Implications and Future Directions

The introduction of LibMoE as an accessible, scalable benchmark lays the groundwork for substantial advances in MoE research for LLMs. By making extensive experiments feasible with modest computational resources, LibMoE democratizes access to MoE research. Its modularity and adaptability should foster further work on MoE algorithm efficiency and generalization, carrying these advances into real-world applications.

Looking forward, the empirical insights gained could inform refinements in algorithm design, such as robust early-stopping methods or routing strategies that avoid the overconfidence effects observed in certain expert selections. The findings also motivate continued study of how architectural choices, such as alternative vision encoders like SigLIP, affect MoE performance.

In conclusion, the analysis and results presented in this work underscore LibMoE's potential to propel further inquiry into MoE algorithms and to broaden their application across emerging areas of AI research. As the landscape of LLMs evolves, frameworks of this kind will be pivotal for both theoretical exploration and practical implementation.
