Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models (2404.05567v1)

Published 8 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Mixture-of-Experts (MoE) LLMs can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.

Dense Training, Sparse Inference: Optimizing Mixture-of-Experts LLMs

Introduction

The dichotomy between the computational cost of training LLMs and the need for efficiency at inference time presents a significant challenge in deep learning. Mixture-of-Experts (MoE) models have emerged as a viable solution by facilitating selective parameter utilization, which increases computational efficiency while maintaining, or even enhancing, model performance. Nevertheless, the large parameter requirements of MoE models, often 2 to 4 times those of dense models, increase memory consumption and reduce efficiency in autoregressive generation. This paper introduces a hybrid approach, dense training coupled with sparse inference (DS-MoE), that aims to retain the computational benefits of MoE models while mitigating their parameter inefficiency.

Methodology

Dense Training

The cornerstone of the DS-MoE framework is the adoption of dense gradient propagation during training, involving all experts in the computation, as opposed to traditional sparse training methods. This full participation ensures efficient GPU utilization and balanced expert usage; imbalanced expert load is a common pitfall of sparse training. A Mutual Information (MI) loss is introduced to promote load balance among experts and an even distribution of the computational load. This loss complements the standard autoregressive language-modeling loss, balancing the primary modeling objective against even expert utilization.
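
To make the dense-training idea concrete, here is a minimal sketch of an MoE feed-forward layer in which every expert processes every token and an auxiliary balance term is returned alongside the output. The class name `DenseTrainingMoE`, the expert architecture, and the exact form of the mutual-information-style loss are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTrainingMoE(nn.Module):
    """Toy MoE feed-forward layer where every expert sees every token (dense training)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)                       # (tokens, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, n_experts, d_model)
        y = torch.einsum("te,ted->td", gate, expert_out)               # dense: all experts contribute

        # Mutual-information-style balance term (illustrative form, not necessarily the
        # paper's exact loss): minimizing it spreads load evenly across experts (high
        # entropy of the mean gate) while keeping per-token routing confident (low
        # entropy of each gate row).
        marginal = gate.mean(dim=0)
        balance_loss = (marginal * marginal.clamp_min(1e-9).log()).sum() \
                       - (gate * gate.clamp_min(1e-9).log()).sum(dim=-1).mean()
        return y, balance_loss
```

In training, the auxiliary term would be combined with the language-modeling objective with a small weight, e.g. `loss = lm_loss + alpha * balance_loss`, where `alpha` is a hypothetical hyperparameter.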

Sparse Inference

For inference, DS-MoE models revert to sparsity, activating only a subset of experts per token based on routing scores, either the top-scoring experts or those above a predefined threshold. This significantly reduces the computational load during the inference stage. The implementation also features Mixture of Attention Heads (MoA) blocks, which further reduce computational demand by routing among attention heads.
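
The sketch below, continuing the assumptions of the previous example, shows how the same router can be reused at inference time while only the selected experts are evaluated. The function name, the default threshold, the optional top-k path, and the renormalization of kept scores are illustrative choices rather than the paper's specification.

```python
import torch

@torch.no_grad()
def sparse_moe_forward(x, router, experts, threshold=0.05, top_k=None):
    """Run only the experts selected for each token (sparse inference).

    x: (tokens, d_model); router: nn.Linear producing one score per expert;
    experts: list of expert modules trained densely as above.
    """
    gate = torch.softmax(router(x), dim=-1)               # (tokens, n_experts)
    if top_k is not None:                                 # fixed top-k routing
        keep = torch.zeros_like(gate, dtype=torch.bool)
        keep.scatter_(1, gate.topk(top_k, dim=-1).indices, True)
    else:                                                 # threshold-based routing
        keep = gate >= threshold
    gate = torch.where(keep, gate, torch.zeros_like(gate))
    gate = gate / gate.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize kept scores
    y = torch.zeros_like(x)
    for e_idx, expert in enumerate(experts):              # experts no token selected are skipped
        mask = keep[:, e_idx]
        if mask.any():
            y[mask] += gate[mask, e_idx].unsqueeze(-1) * expert(x[mask])
    return y
```

The essential point is that experts with negligible routing scores are never evaluated, which is what reduces the fraction of parameters activated per token during generation.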

Results and Discussion

Empirical evaluations underscore the DS-MoE model's capability to closely rival dense models in performance while significantly outstripping traditional MoE models in parameter efficiency. Key findings include:

  • DS-MoE models require substantially fewer total parameters than standard sparse MoEs for comparable performance, effectively addressing the parameter inefficiency associated with MoEs.
  • The approach achieves a 30-40% activation rate of the model's parameters during inference, striking a balance between computational efficiency and model performance.
  • Enhanced throughput in both computation-bounded and I/O-bounded scenarios demonstrates the DS-MoE model's superior efficiency across diverse operational contexts.

These results underscore the utility of the DS-MoE framework in making MoE models more tractable and efficient, particularly in environments where computational and memory resources are at a premium.

Future Directions

This research opens promising avenues for further optimization and exploration in the training and inference paradigms of LLMs. Future work may delve into refining the mutual information loss to foster even greater efficiency and exploring the scalability of the DS-MoE approach for models beyond the scope of current experiments. Additionally, the dynamic nature of the sparse inference process offers a fertile ground for developing more adaptive and context-aware routing mechanisms, potentially tailoring computational efforts to the specific demands of given tasks or inputs.

Conclusion

The proposed DS-MoE framework marks a significant step toward resolving the intrinsic tension between the desire for large, expressive models and the imperative for computational efficiency. By merging dense training with sparse inference, this approach promises to make large-scale models more accessible and practical for a broader range of applications, advancing the state of the art in efficient language modeling.

Authors (8)
  1. Bowen Pan (16 papers)
  2. Yikang Shen (62 papers)
  3. Haokun Liu (26 papers)
  4. Mayank Mishra (38 papers)
  5. Gaoyuan Zhang (18 papers)
  6. Aude Oliva (42 papers)
  7. Colin Raffel (83 papers)
  8. Rameswar Panda (79 papers)