XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection (2403.18926v2)
Abstract: Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency, since a significant number of parameters are unnecessarily involved in computations by being multiplied with zero or low activation values. To address this issue, we present XMoE, a novel MoE design that enhances both the efficacy and efficiency of sparse MoE models. XMoE leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that XMoE improves model quality while decreasing the computation load at MoE layers by over 50% without sacrificing performance. Furthermore, we demonstrate the versatility of XMoE by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://github.com/ysngki/XMoE.
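Below is a minimal PyTorch sketch of the idea described in the abstract: many small experts plus a threshold-based router, so that a token only activates experts whose gate weight is large enough to matter. The class names (`SmallExpert`, `ThresholdMoE`), the top-k-then-threshold rule, and all hyperparameters are illustrative assumptions rather than the paper's exact formulation; see the official repository for the authors' implementation.

```python
# Sketch of threshold-based routing over small experts (not the official XMoE code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallExpert(nn.Module):
    """A narrow feed-forward expert (reduced hidden size)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(x)


class ThresholdMoE(nn.Module):
    """Routes each token only to experts whose gate weight passes a threshold."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int,
                 top_k: int = 2, threshold: float = 0.1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [SmallExpert(d_model, d_hidden) for _ in range(n_experts)]
        )
        self.top_k = top_k
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)         # gating distribution
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # candidate experts
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize over top-k
        keep = top_p >= self.threshold                    # drop low-weight experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            mask = keep[:, slot]                          # tokens keeping this slot
            if not mask.any():
                continue
            idx = top_idx[mask, slot]                     # expert id per kept token
            weight = top_p[mask, slot].unsqueeze(-1)
            for e in idx.unique().tolist():
                sel = mask.clone()
                sel[mask] = idx == e                      # tokens routed to expert e
                out[sel] += weight[idx == e] * self.experts[e](x[sel])
        return out


# Example: 32 small experts over 512-dim tokens; only gate weights >= 0.1 fire.
moe = ThresholdMoE(d_model=512, d_hidden=128, n_experts=32)
y = moe(torch.randn(10, 512))   # y has shape (10, 512)
```

In this sketch the threshold is applied after an ordinary top-k gate, so a token may activate fewer than k experts when the extra gate weights are negligible; that is one way to realize the selective parameter engagement and compute savings the abstract describes.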
- Yuanhang Yang
- Shiyi Qi
- Wenchao Gu
- Chaozheng Wang
- Cuiyun Gao
- Zenglin Xu