CoSMoEs: Compact Sparse Mixture of Experts (2503.00245v1)
Published 28 Feb 2025 in cs.LG and cs.CL
Abstract: Sparse Mixture of Experts (MoE) models are popular foundational architectures at large scale; however, they remain under-explored at smaller sizes. Here, we show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference. Specifically, we tackle the three main on-device dimensions: quality, memory, and latency. Along the quality axis, we show that in a fair evaluation (removing confounding factors), MoE architectures outperform FLOP-aligned dense models at on-device scale. We introduce weight-decomposed experts, further improving MoE model performance. Regarding model memory and latency, we significantly improve model offloading efficiency and, in turn, reduce model inference latency.
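To make the abstract's terms concrete, below is a minimal sketch of a top-k sparse MoE layer whose expert weights are factorized into low-rank pairs, as one plausible reading of "weight-decomposed experts." The class names (`LowRankExpert`, `SparseMoELayer`), the SiLU activation, and the `rank`/`top_k` hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankExpert(nn.Module):
    """Feed-forward expert with each weight matrix factorized into a
    low-rank pair (hypothetical reading of 'weight-decomposed experts')."""
    def __init__(self, d_model: int, d_ff: int, rank: int):
        super().__init__()
        # W_in (d_model x d_ff) approximated as two linear maps through `rank`
        self.in_a = nn.Linear(d_model, rank, bias=False)
        self.in_b = nn.Linear(rank, d_ff, bias=False)
        self.out_a = nn.Linear(d_ff, rank, bias=False)
        self.out_b = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_b(self.out_a(F.silu(self.in_b(self.in_a(x)))))


class SparseMoELayer(nn.Module):
    """Top-k sparse MoE layer: a router selects k experts per token and the
    expert outputs are combined with renormalized router probabilities."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int,
                 rank: int, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            LowRankExpert(d_model, d_ff, rank) for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # (tokens, n_experts)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)   # keep k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                          # positions routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out


# Tiny, on-device-scale usage example (sizes are arbitrary)
layer = SparseMoELayer(d_model=256, d_ff=1024, n_experts=8, rank=64, top_k=2)
tokens = torch.randn(16, 256)
print(layer(tokens).shape)  # torch.Size([16, 256])
```

The low-rank factorization shrinks each expert's parameter count (and hence the amount of data to offload and reload per expert), which is one way such a decomposition could help on the memory and latency axes the abstract mentions.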