LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Abstract: Mixture of Experts (MoE) plays an important role in the development of more efficient and effective LLMs. Due to the enormous resource requirements, studying large-scale MoE algorithms remains inaccessible to many researchers. This work develops \emph{LibMoE}, a comprehensive and modular framework to streamline the research, training, and evaluation of MoE algorithms. Built upon three core principles: (i) modular design, (ii) efficient training, and (iii) comprehensive evaluation, LibMoE makes MoE in LLMs more accessible to a wide range of researchers by standardizing the training and evaluation pipelines. Using LibMoE, we extensively benchmarked five state-of-the-art MoE algorithms over three different LLMs and 11 datasets under the zero-shot setting. The results show that, despite their unique characteristics, all MoE algorithms perform roughly similarly when averaged across a wide range of tasks. With its modular design and extensive evaluation, we believe LibMoE will be invaluable for researchers working toward the next generation of MoE and LLMs. Project page: \url{https://fsoft-aic.github.io/fsoft-LibMoE.github.io}.
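For readers unfamiliar with the building block the benchmarked algorithms share, the sketch below shows a generic top-k gated sparse MoE layer in PyTorch. It is a minimal illustration only: the class name, layer sizes, expert count, and routing loop are assumptions made for this example and do not reflect LibMoE's actual implementation or API.

```python
# Minimal sketch of a generic top-k gated sparse MoE layer (illustrative only,
# not LibMoE's code). Sizes and expert count are hypothetical choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); each token is routed to its top-k experts.
        logits = self.router(x)                             # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e                   # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE(d_model=32, d_hidden=64)
    tokens = torch.randn(10, 32)
    print(layer(tokens).shape)  # torch.Size([10, 32])
```

The dense loop over experts is written for clarity rather than speed; production MoE implementations typically dispatch tokens to experts in batches and add an auxiliary load-balancing loss on the router.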