
Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion (2310.04361v4)

Published 6 Oct 2023 in cs.LG

Abstract: Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
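
To make the dynamic-$k$ idea in the abstract concrete, the sketch below shows one way a per-token, variable-width expert selection rule can be implemented: a small router scores every expert for every token, and only experts whose normalized score exceeds a threshold are executed, so the number of active experts differs from token to token. The expert decomposition, router architecture, ReLU scoring, and threshold `tau` are illustrative assumptions for this sketch, not the exact D2DMoE formulation from the paper.

```python
import torch
import torch.nn as nn


class DynamicKMoE(nn.Module):
    """Minimal sketch of a dynamic-k MoE feed-forward layer.

    A dense FFN of width d_ff is split into `num_experts` narrow experts.
    A linear router produces a non-negative score per (token, expert);
    experts whose normalized score exceeds `tau` are run for that token,
    so the effective k varies per token instead of being fixed.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int, tau: float = 0.1):
        super().__init__()
        assert d_ff % num_experts == 0, "d_ff must be divisible by num_experts"
        d_expert = d_ff // num_experts
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.ReLU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(num_experts)
        )
        # Router: one score per expert for each token (assumed design).
        self.router = nn.Linear(d_model, num_experts)
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch and sequence dims beforehand.
        scores = torch.relu(self.router(x))                              # (T, E), non-negative
        norm = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # per-token normalization
        mask = norm > self.tau                                           # dynamic-k selection per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            idx = mask[:, e].nonzero(as_tuple=True)[0]                   # tokens routed to expert e
            if idx.numel() > 0:
                out[idx] += expert(x[idx])
        return out


# Usage example: tokens with few active experts cost proportionally less compute.
layer = DynamicKMoE(d_model=64, d_ff=256, num_experts=8)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

In an actual conversion setting, the expert weights would be taken from the dense FFN of a pretrained model rather than trained from scratch, and a similar routing scheme could be applied to the attention projections, as the abstract describes.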

Authors (4)
  1. Filip Szatkowski (9 papers)
  2. Bartosz Wójcik (15 papers)
  3. Mikołaj Piórczyński (1 paper)
  4. Simone Scardapane (79 papers)
Citations (3)
