SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget (2308.15030v4)

Published 29 Aug 2023 in cs.AI

Abstract: Mixture of experts (MoE) is a popular technique to improve the capacity of LLMs with conditionally activated parallel experts. However, serving MoE models on memory-constrained devices is challenging due to the large parameter size. Typical solutions such as memory swapping or expert pruning may lead to significantly higher latency or severe accuracy loss. In this paper, we introduce SwapMoE, a framework for efficient serving of MoE-based LLMs with tunable memory budgets. The main idea of SwapMoE is to keep a small dynamic set of important experts, namely Virtual Experts, in the main memory for inference, while seamlessly maintaining how the Virtual Experts map to the actual experts. Experiments show that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy. For example, on text summarization tasks with Switch Transformer, SwapMoE can reduce the memory consumption from 14.2 GiB to 4.7 GiB, together with a 50% latency reduction and a slight Rouge-2 score drop of 0.041.
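
The abstract's central mechanism, keeping a small resident set of important experts ("Virtual Experts") in memory and remapping requests for non-resident experts onto resident ones, can be illustrated with a minimal PyTorch-style sketch. This is an illustrative sketch only, not the authors' implementation: the names (`VirtualExpertPool`, `update_residents`), the deterministic remapping rule, and the use of externally supplied importance scores to drive swapping are assumptions made for the example.

```python
# Minimal sketch of the Virtual-Experts idea, under stated assumptions.
# Not the paper's actual API; importance scores are assumed to come from
# some external profiling step.
import torch
import torch.nn as nn


class VirtualExpertPool(nn.Module):
    """Keep only `num_virtual` experts resident on the device.

    The full expert set stays in host (CPU) memory; requests routed to a
    non-resident expert are remapped to a resident one.
    """

    def __init__(self, experts, num_virtual, device="cuda"):
        super().__init__()
        self.experts = list(experts)          # full expert list, kept on CPU
        self.num_virtual = num_virtual
        self.device = device
        self.resident_ids = list(range(num_virtual))
        for i in self.resident_ids:
            self.experts[i].to(device)        # move the initial subset on-device

    def update_residents(self, importance):
        """Make the top-`num_virtual` experts (by importance tensor) resident."""
        top = torch.topk(importance, self.num_virtual).indices.tolist()
        for i in self.resident_ids:
            if i not in top:
                self.experts[i].to("cpu")        # evict: move back to host memory
        for i in top:
            if i not in self.resident_ids:
                self.experts[i].to(self.device)  # load: move onto the device
        self.resident_ids = top

    def forward(self, x, expert_id):
        # Remap a request for a non-resident expert to a resident one.
        # (A real system would pick the most suitable resident expert; the
        # modulo rule here is purely for illustration.)
        if expert_id not in self.resident_ids:
            expert_id = self.resident_ids[expert_id % len(self.resident_ids)]
        return self.experts[expert_id](x.to(self.device))


# Hypothetical usage:
#   experts = [nn.Linear(1024, 1024) for _ in range(64)]
#   pool = VirtualExpertPool(experts, num_virtual=8)
#   pool.update_residents(importance=torch.rand(64))  # e.g. profiled importance
#   y = pool(torch.randn(2, 1024), expert_id=17)
```

In this sketch, the tunable memory budget corresponds to `num_virtual`: lowering it evicts more experts from device memory at the cost of more remapped (approximated) expert computations.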

Authors (8)
  1. Rui Kong (9 papers)
  2. Yuanchun Li (37 papers)
  3. Qingtian Feng (3 papers)
  4. Weijun Wang (21 papers)
  5. Linghe Kong (44 papers)
  6. Yunxin Liu (58 papers)
  7. Xiaozhou Ye (18 papers)
  8. Ye Ouyang (16 papers)
Citations (5)