
Toward Inference-optimal Mixture-of-Expert Large Language Models (2404.02852v1)

Published 3 Apr 2024 in cs.LG

Abstract: Mixture-of-Expert (MoE) based LLMs, such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering the quadratic growth in training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs, relating model performance to model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe diminishing returns from increasing the number of experts. This might seem to suggest scaling the number of experts until saturation, since the training cost stays roughly constant, but doing so is problematic at inference time. We therefore propose to amend the MoE scaling law by introducing inference efficiency as a metric alongside validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train. On the other hand, training a (16/32)-expert MoE that is much smaller (70-85%) than the loss-optimal solution, but on a larger training dataset, is a promising setup under a fixed training budget.

Optimizing Training and Inference in MoE-based LLMs: A Novel Budget Allocation Perspective

Introduction

Mixture of Experts (MoE) models have increasingly become a focal point in the advancement of LLMs, offering a scalable alternative that promises significant performance improvements without a proportional increase in computational cost. The core of the MoE architecture is its ability to route each input token to a small subset of experts, exploiting a much larger network capacity while keeping per-token compute roughly fixed. Despite these apparent advantages, the optimal number of experts remains a critical yet not fully understood parameter. This paper contributes to the ongoing discourse by investigating how the number of experts interacts with model size and dataset size and, importantly, by considering inference efficiency alongside validation loss.
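
As a concrete reference point, the snippet below is a minimal, self-contained sketch of top-k expert routing. It is illustrative only: the gating scheme, shapes, and renormalization are assumptions, not the routing used by Mixtral, DeepSeek-MoE, or this paper.

```python
# Minimal sketch of top-k expert routing (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    tokens:    (n_tokens, d_model) activations entering the MoE layer
    gate_w:    (d_model, n_experts) router weights
    expert_ws: list of (d_model, d_model) weight matrices, one per expert
    """
    logits = tokens @ gate_w                      # (n_tokens, n_experts)
    probs = softmax(logits)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # indices of the chosen experts
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        chosen = top[i]
        weights = probs[i, chosen] / probs[i, chosen].sum()  # renormalize over chosen experts
        for w, e in zip(weights, chosen):
            out[i] += w * (tok @ expert_ws[e])    # only top_k experts do work per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
x = rng.normal(size=(4, d))
y = moe_layer(x, rng.normal(size=(d, n_experts)),
              [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(y.shape)  # (4, 16): same shape as the input, but only 2 of 8 experts ran per token
```

The property the scaling analysis relies on is visible here: adding experts grows the total parameter count, while the compute per token is set by top_k and the expert size.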

Scaling Behavior and Inference Efficiency of MoE Models

The research extends the existing scaling-law framework to MoE models, examining how validation loss scales with model size, dataset size, and the number of experts. It finds a power-law relationship between these factors and validation loss, in line with previous findings in the domain. Crucially, it identifies diminishing returns as the number of experts increases, a phenomenon that motivates looking at efficiency beyond loss optimization alone.
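
The paper's exact parameterization is not reproduced here. As a hedged sketch of what fitting such a law can look like, the snippet below assumes a Chinchilla-style log-sum-exp form with an added expert term and fits it to synthetic data with a robust (Huber) objective; the functional form, coefficients, and data are all assumptions for illustration.

```python
# Sketch of fitting a parametric scaling law with an expert term (assumed form).
import numpy as np
from scipy.optimize import least_squares

def predicted_log_loss(theta, N, D, E):
    # Hypothetical form: log L = logsumexp of power-law terms in model size N,
    # dataset size D, and expert count E, plus an irreducible-loss term.
    a, b, c, alpha, beta, gamma, l_inf = theta
    terms = np.stack([
        a - alpha * np.log(N),
        b - beta * np.log(D),
        c - gamma * np.log(E),
        np.full_like(N, l_inf),
    ])
    m = terms.max(axis=0)
    return m + np.log(np.exp(terms - m).sum(axis=0))

# Synthetic "observations" standing in for measured validation losses.
rng = np.random.default_rng(0)
N = rng.uniform(1e8, 1e10, 64)                          # model size
D = rng.uniform(1e9, 1e11, 64)                          # training tokens
E = rng.choice([1, 2, 4, 8, 16, 32], 64).astype(float)  # number of experts
true_theta = np.array([3.0, 5.0, 0.5, 0.34, 0.28, 0.15, np.log(1.7)])
obs = predicted_log_loss(true_theta, N, D, E) + rng.normal(0, 0.01, 64)

fit = least_squares(
    lambda th: predicted_log_loss(th, N, D, E) - obs,
    x0=np.array([1.0, 1.0, 1.0, 0.3, 0.3, 0.1, 0.5]),
    loss="huber", f_scale=0.05,   # robust objective, as in Chinchilla-style fits
)
print(np.round(fit.x, 3))         # recovered coefficients
```

In practice the fit would use measured validation losses from a sweep of trained models; the synthetic data above only stands in for such measurements.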

Inference efficiency, characterized through cost per token, emerges as a pivotal concern, altering the previously loss-centric view of model optimization. The analysis shows that while MoE models with a small number of experts (4 or 8) deliver superior serving efficiency at equivalent performance, they incur significantly higher training costs (roughly 2.5-3.5x). Conversely, configuring an MoE with a higher number of experts (16/32) but a size smaller than the loss-optimal one, and compensating with a larger training dataset, presents a cost-effective strategy under a fixed training budget.
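
To make the trade-off concrete, here is a back-of-the-envelope sketch comparing two hypothetical configurations under one fixed training budget, using the common ~6ND approximation for training FLOPs. Every number and the serving-cost proxy are assumptions for illustration, not values taken from the paper.

```python
# Back-of-the-envelope comparison of two hypothetical MoE configurations under a
# fixed training budget. All numbers are illustrative assumptions.

def training_tokens(budget_flops, active_params):
    # ~6 * N * D estimate of training FLOPs, where N counts only the parameters
    # activated per token (the router picks a fixed top-k of experts).
    return budget_flops / (6 * active_params)

def serving_cost_proxy(active_params, total_params, memory_weight=0.15):
    # Crude proxy: per-token compute scales with active parameters, plus an
    # assumed penalty for the memory footprint of holding every expert.
    return active_params + memory_weight * total_params

BUDGET = 1.0e21  # fixed training FLOPs (assumed)

configs = {
    # name:          (active params, total params) -- both assumed
    "8-expert MoE":  (2.0e9,  8.0e9),
    "32-expert MoE": (1.2e9, 14.0e9),  # smaller than loss-optimal, trained on more tokens
}

for name, (active, total) in configs.items():
    tokens = training_tokens(BUDGET, active)
    cost = serving_cost_proxy(active, total)
    print(f"{name:14s} tokens under budget: {tokens/1e9:7.1f}B  "
          f"serving-cost proxy: {cost/1e9:5.2f}")
```

The qualitative point, not the specific numbers, is the takeaway: under the same budget, shrinking the per-expert size frees compute for more training tokens, while serving cost depends on what is activated and stored at inference time.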

Implications and Future Directions

The paper underscores a strategic pivot in optimizing MoE models not just for performance but also for practical deployment efficiency. This dual metric approach to model optimization challenges the prevailing emphasis on scaling up the number of experts and highlights the nuanced trade-offs between training expenditure and inference cost. These findings suggest a need for a more holistic consideration of efficiency in model optimization practices, extending beyond the conventional focus on loss minimization.

Looking forward, the research opens several avenues for further exploration. One promising direction is a deeper investigation of the mechanisms behind the diminishing returns observed as the number of experts grows, which could yield new insights into efficient model scaling. Additionally, the focus on inference efficiency introduces a practical lens on model optimization that resonates with the operational realities of deploying LLMs at scale, warranting further work on cost-effective model architectures.

Conclusion

This paper enriches the discourse on MoE-based LLMs by meticulously analyzing the effects of the number of experts on both model performance and inference efficiency. In doing so, it proposes a novel perspective on training budget allocation that harmonizes model quality with operational efficiency, advocating for a balanced approach to model scaling. The insights offered not only contribute to the theoretical understanding of MoE models but also provide a pragmatic framework for optimizing these models in real-world applications. As the field of generative AI continues to evolve, such nuanced approaches to model optimization will be instrumental in harnessing the full potential of LLMs in a cost-effective manner.

Authors (5)
  1. Longfei Yun
  2. Yonghao Zhuang
  3. Yao Fu
  4. Hao Zhang
  5. Eric P. Xing
Citations (5)