SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget (2308.15030v4)
Abstract: Mixture of experts (MoE) is a popular technique to improve the capacity of LLMs with conditionally-activated parallel experts. However, serving MoE models on memory-constrained devices is challenging due to their large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based LLMs under tunable memory budgets. The main idea of SwapMoE is to keep a small, dynamic set of important experts, namely Virtual Experts, in main memory for inference, while seamlessly updating the mapping from Virtual Experts to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE reduces memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight Rouge-2 score drop of 0.041.
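The abstract describes the Virtual Experts mechanism only at a high level; the sketch below illustrates one plausible way the idea could look in PyTorch. It is not the authors' implementation: the names (`VirtualExpertLayer`, `update_mapping`, `weights_of`), the top-1 routing, and the slot-remapping policy are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code) of the Virtual Experts idea:
# inference always runs over a small, fixed number of memory-resident expert
# slots, while a slot -> actual-expert mapping is updated out of band.
import torch
import torch.nn as nn


class VirtualExpertLayer(nn.Module):
    def __init__(self, num_experts: int, num_virtual: int, d_model: int, d_ff: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # router still scores all experts
        self.num_virtual = num_virtual
        # Only `num_virtual` expert FFNs are kept in main memory at any time.
        self.virtual_experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_virtual)
        )
        # slot_of[e] = virtual slot currently holding expert e, or -1 if swapped out.
        self.register_buffer("slot_of", -torch.ones(num_experts, dtype=torch.long))
        self.slot_of[:num_virtual] = torch.arange(num_virtual)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Top-1 routing restricted to resident experts.
        scores = self.router(x)                                        # (tokens, num_experts)
        scores = scores.masked_fill(self.slot_of < 0, float("-inf"))   # mask swapped-out experts
        expert_idx = scores.argmax(dim=-1)                             # (tokens,)
        out = torch.zeros_like(x)
        for slot in range(self.num_virtual):
            mask = self.slot_of[expert_idx] == slot
            if mask.any():
                out[mask] = self.virtual_experts[slot](x[mask])
        return out

    @torch.no_grad()
    def update_mapping(self, important: torch.Tensor, weights_of):
        # Remap slots to the currently most important experts; `weights_of(e)` is
        # a hypothetical loader returning expert e's weights from disk/host memory.
        important = important[: self.num_virtual]
        self.slot_of.fill_(-1)
        for slot, e in enumerate(important.tolist()):
            self.virtual_experts[slot].load_state_dict(weights_of(e))
            self.slot_of[e] = slot
```

In a full serving system, the `update_mapping` step would presumably run asynchronously with inference (e.g., on a background thread, driven by expert-importance statistics profiled from recent requests), so expert weights can be swapped without blocking token generation; the sketch only shows the data structures involved.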
Authors: Rui Kong, Yuanchun Li, Qingtian Feng, Weijun Wang, Linghe Kong, Yunxin Liu, Xiaozhou Ye, Ye Ouyang