
Toward Inference-optimal Mixture-of-Expert Large Language Models (2404.02852v1)

Published 3 Apr 2024 in cs.LG

Abstract: Mixture-of-Expert (MoE) based LLMs, such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering the quadratic growth in training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs, relating model performance to model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe diminishing returns from increasing the number of experts. This might seem to suggest scaling the number of experts until saturation, since the training cost stays roughly constant, but doing so is problematic at inference time. We therefore propose to amend the MoE scaling law by introducing inference efficiency as a metric alongside validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train. On the other hand, training a (16/32)-expert MoE that is much smaller (70-85%) than the loss-optimal solution, but on a larger training dataset, is a promising setup under a fixed training budget.

Optimizing Training and Inference in MoE-based LLMs: A Novel Budget Allocation Perspective

Introduction

Mixture of Experts (MoE) models have increasingly become a focal point in the advancement of LLMs, offering a scalable alternative that promises significant performance improvements without a proportional increase in computational cost. The core of the MoE architecture is its ability to route each input token to a small subset of experts, exploiting a much larger network capacity while keeping per-token compute roughly fixed. Despite these apparent advantages, the optimal number of experts remains a critical yet not fully understood parameter. This paper contributes to the ongoing discourse by investigating how the number of experts interacts with model size and dataset size and, importantly, by considering inference efficiency alongside validation loss.
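
As a concrete reference point, the snippet below is a minimal, self-contained sketch of top-k expert routing. It is illustrative only: the gating scheme, shapes, and renormalization are assumptions, not the routing used by Mixtral, DeepSeek-MoE, or this paper.

```python
# Minimal sketch of top-k expert routing (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:    (n_tokens, d_model) activations entering the MoE layer
    gate_w:    (d_model, n_experts) router weights
    expert_ws: list of (d_model, d_model) weight matrices, one per expert
    """
    logits = tokens @ gate_w                      # (n_tokens, n_experts)
    probs = softmax(logits)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # indices of the chosen experts
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = top[i]
        weights = probs[i, chosen] / probs[i, chosen].sum()  # renormalize over chosen experts
        for w, e in zip(weights, chosen):
            out[i] += w * (tok @ expert_ws[e])    # only top_k experts do work per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=(4, d))
y = moe_layer(x, rng.normal(size=(d, n_experts)),
              [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(y.shape)  # (4, 16): same shape as the input, but only 2 of 8 experts ran per token
```

The property the scaling analysis relies on is visible here: adding experts grows the total parameter count, while the compute per token is set by top_k and the expert size.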

Scaling Behavior and Inference Efficiency of MoE Models

The research extends the existing scaling-law framework to MoE models, examining how validation loss scales with model size, dataset size, and the number of experts. It finds a power-law relationship between these factors and validation loss, in line with previous findings in the domain. Crucially, it identifies diminishing returns as the number of experts increases, a phenomenon that motivates looking at efficiency beyond loss optimization alone.
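
The paper's exact parameterization is not reproduced here. As a hedged sketch of what fitting such a law can look like, the snippet below assumes a Chinchilla-style log-sum-exp form with an added expert term and fits it to synthetic data with a robust (Huber) objective; the functional form, coefficients, and data are all assumptions for illustration.

```python
# Sketch of fitting a parametric scaling law with an expert term (assumed form).
import numpy as np
from scipy.optimize import least_squares

def predicted_log_loss(theta, N, D, E):
    # Hypothetical form: log L = logsumexp of power-law terms in model size N,
    # dataset size D, and expert count E, plus an irreducible-loss term.
    a, b, c, alpha, beta, gamma, l_inf = theta
    terms = np.stack([
        a - alpha * np.log(N),
        b - beta * np.log(D),
        c - gamma * np.log(E),
        np.full_like(N, l_inf),
    ])
    m = terms.max(axis=0)
    return m + np.log(np.exp(terms - m).sum(axis=0))

# Synthetic "observations" standing in for measured validation losses.
rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, 64)                          # model size
D = rng.uniform(1e9, 1e11, 64)                          # training tokens
E = rng.choice([1, 2, 4, 8, 16, 32], 64).astype(float)  # number of experts
true_theta = np.array([3.0, 5.0, 0.5, 0.34, 0.28, 0.15, np.log(1.7)])
obs = predicted_log_loss(true_theta, N, D, E) + rng.normal(0, 0.01, 64)

fit = least_squares(
    lambda th: predicted_log_loss(th, N, D, E) - obs,
    x0=np.array([1.0, 1.0, 1.0, 0.3, 0.3, 0.1, 0.5]),
    loss="huber", f_scale=0.05,   # robust objective, as in Chinchilla-style fits
)
print(np.round(fit.x, 3))         # recovered coefficients
```

In practice the fit would use measured validation losses from a sweep of trained models; the synthetic data above only stands in for such measurements.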

Inference efficiency, characterized through cost per token, emerges as a pivotal concern, altering the previously loss-centric view of model optimization. The analysis shows that while MoE models with a small number of experts (4 or 8) deliver superior serving efficiency at equivalent performance, they incur significantly higher training costs (roughly 2.5-3.5x). Conversely, configuring an MoE with a higher number of experts (16/32) but a size smaller than the loss-optimal one, and compensating with a larger training dataset, presents a cost-effective strategy under a fixed training budget.
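
To make the trade-off concrete, here is a back-of-the-envelope sketch comparing two hypothetical configurations under one fixed training budget, using the common ~6ND approximation for training FLOPs. Every number and the serving-cost proxy are assumptions for illustration, not values taken from the paper.

```python
# Back-of-the-envelope comparison of two hypothetical MoE configurations under a
# fixed training budget. All numbers are illustrative assumptions.

def training_tokens(budget_flops, active_params):
    # ~6 * N * D estimate of training FLOPs, where N counts only the parameters
    # activated per token (the router picks a fixed top-k of experts).
    return budget_flops / (6 * active_params)

def serving_cost_proxy(active_params, total_params, memory_weight=0.15):
    # Crude proxy: per-token compute scales with active parameters, plus an
    # assumed penalty for the memory footprint of holding every expert.
    return active_params + memory_weight * total_params

BUDGET = 1.0e21  # fixed training FLOPs (assumed)

configs = {
    # name:          (active params, total params) -- both assumed
    "8-expert MoE":  (2.0e9,  8.0e9),
    "32-expert MoE": (1.2e9, 14.0e9),  # smaller than loss-optimal, trained on more tokens
}

for name, (active, total) in configs.items():
    tokens = training_tokens(BUDGET, active)
    cost = serving_cost_proxy(active, total)
    print(f"{name:14s} tokens under budget: {tokens/1e9:7.1f}B  "
          f"serving-cost proxy: {cost/1e9:5.2f}")
```

The qualitative point, not the specific numbers, is the takeaway: under the same budget, shrinking the per-expert size frees compute for more training tokens, while serving cost depends on what is activated and stored at inference time.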

Implications and Future Directions

The paper underscores a strategic pivot in optimizing MoE models not just for performance but also for practical deployment efficiency. This dual metric approach to model optimization challenges the prevailing emphasis on scaling up the number of experts and highlights the nuanced trade-offs between training expenditure and inference cost. These findings suggest a need for a more holistic consideration of efficiency in model optimization practices, extending beyond the conventional focus on loss minimization.

Looking forward, the research opens several avenues for further exploration. One promising direction is a deeper investigation of the mechanisms behind the diminishing returns observed as the number of experts grows, which could yield new insights into efficient model scaling. Additionally, the focus on inference efficiency introduces a practical lens on model optimization that resonates with the operational realities of deploying LLMs at scale, warranting further work on cost-effective model architectures.

Conclusion

This paper enriches the discourse on MoE-based LLMs by meticulously analyzing the effects of the number of experts on both model performance and inference efficiency. In doing so, it proposes a novel perspective on training budget allocation that harmonizes model quality with operational efficiency, advocating for a balanced approach to model scaling. The insights offered not only contribute to the theoretical understanding of MoE models but also provide a pragmatic framework for optimizing these models in real-world applications. As the field of generative AI continues to evolve, such nuanced approaches to model optimization will be instrumental in harnessing the full potential of LLMs in a cost-effective manner.

Authors (5)
  1. Longfei Yun
  2. Yonghao Zhuang
  3. Yao Fu
  4. Hao Zhang
  5. Eric P. Xing
Citations (5)