Scaling Laws for Fine-Grained Mixture of Experts (2402.07871v1)

Published 12 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of LLMs. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.


Unveiling the Efficiency of Fine-Grained Mixture of Experts in LLMs

Introduction to Mixture of Experts and Model Efficiency

The computational demands of training LLMs have escalated significantly, prompting a search for more efficient architectures. Mixture of Experts (MoE) models have emerged as a promising solution for reducing these demands. Our paper explores the scaling laws of fine-grained MoE by introducing granularity, a hyperparameter that controls the size of the experts and so lets the balance between computational cost and model performance be adjusted.

Granularity: A New Perspective on MoE Configuration

The core contribution of this research lies in the introduction of the granularity parameter, fundamentally altering the conventional MoE configuration. Traditional MoE models typically equate the size of experts to that of the feed-forward layer within the model. We propose a variable approach, where granularity enables the determination of the optimal size of experts, thus opening avenues for computational savings without compromising performance.
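To make the idea concrete, here is a minimal sketch of how granularity could reshape an MoE layer. It assumes granularity is the ratio of the dense feed-forward hidden size to the expert hidden size, and that both the number of experts and the number of experts activated per token are multiplied by the same factor so that total and per-token parameter counts stay the same. The names `MoEConfig` and `fine_grained`, the `expansion` argument, and the example dimensions are hypothetical and chosen for illustration only; router parameters are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical description of one MoE feed-forward layer."""
    d_model: int    # width of the residual stream
    d_expert: int   # hidden size of each expert
    n_experts: int  # total number of experts in the layer
    top_k: int      # experts activated per token

    @property
    def total_params(self) -> int:
        # Two weight matrices per expert (up and down projection).
        # Biases and the router are ignored in this sketch.
        return self.n_experts * 2 * self.d_model * self.d_expert

    @property
    def active_params(self) -> int:
        # Parameters actually used for a single token.
        return self.top_k * 2 * self.d_model * self.d_expert


def fine_grained(d_model: int, d_ff: int, expansion: int, granularity: int) -> MoEConfig:
    """Build a layer whose experts are `granularity` times smaller than d_ff.

    Assumption: granularity = d_ff / d_expert, with the expert count and the
    number of active experts both scaled by `granularity`.
    """
    if granularity < 1 or d_ff % granularity != 0:
        raise ValueError("granularity must be a positive divisor of d_ff")
    return MoEConfig(
        d_model=d_model,
        d_expert=d_ff // granularity,
        n_experts=expansion * granularity,
        top_k=granularity,
    )


# Granularity 1 matches the conventional setup (expert size equals d_ff).
standard = fine_grained(d_model=512, d_ff=2048, expansion=8, granularity=1)
# Granularity 4 uses four times as many experts, each a quarter of the size.
fine = fine_grained(d_model=512, d_ff=2048, expansion=8, granularity=4)
```

Under these assumptions the two configurations hold the same number of expert parameters in total and per token; only the number of ways experts can be combined changes.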

Derivation of Scaling Laws Incorporating Granularity

Our analytical work leads to the derivation of novel scaling laws that take into account not only the model size and number of training tokens but also the introduced granularity parameter. With these laws, the optimal training configuration can be derived for a given computational budget. Empirically, we demonstrate that MoE models, when optimally configured, consistently outperform dense Transformers in computational efficiency, and that the gap widens as model size and training budget increase.
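As a rough illustration of a law with these three inputs, the sketch below evaluates a loss of the form L(N, D, G) = c + (g / G^gamma + a) / N^alpha + b / D^beta, where N is the model size, D the number of training tokens, and G the granularity. This functional form is our reading of the paper and should be checked against it. The function name `moe_loss` and every coefficient in `PLACEHOLDER` are invented for demonstration; they are not fitted values from the paper.

```python
def moe_loss(n_params, n_tokens, granularity, *, a, b, c, g, alpha, beta, gamma):
    """Evaluate an assumed granularity-aware scaling law.

    L(N, D, G) = c + (g / G**gamma + a) / N**alpha + b / D**beta
    """
    if min(n_params, n_tokens, granularity) <= 0:
        raise ValueError("n_params, n_tokens and granularity must be positive")
    model_term = (g / granularity**gamma + a) / n_params**alpha
    data_term = b / n_tokens**beta
    return c + model_term + data_term


# Placeholder coefficients for demonstration only (NOT the paper's fit).
PLACEHOLDER = dict(a=20.0, b=25.0, c=0.5, g=2.0, alpha=0.1, beta=0.15, gamma=0.6)

# Same model size and token count, two different granularities.
loss_g1 = moe_loss(1e9, 2e10, 1, **PLACEHOLDER)
loss_g8 = moe_loss(1e9, 2e10, 8, **PLACEHOLDER)
```

With positive coefficients, raising G shrinks only the model-size term, so the predicted loss falls with diminishing returns and never drops below c. This form does not capture any extra routing cost from having more experts, which a compute-optimal analysis would also have to account for.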

Empirical Validation and Practical Implications

The practical aspects of our research are grounded in extensive experimental work, where we assess models ranging from 129M to 3.7B parameters and sweep granularity over logarithmically spaced values. The sweep shows that granularity plays a crucial role in training efficiency. The implications are clear: adopting fine-grained MoE models leads to notable computational savings and better model performance across a variety of configurations and computational budgets.

Future Horizons in MoE Model Optimization

Looking forward, our work lays the groundwork for further exploration into the intricate balance between model efficiency and performance within the domain of LLMs. The concept of granularity, alongside the newly established scaling laws, provides a framework for ongoing and future research efforts aimed at refining the architecture and training processes of MoE models.

Concluding Remarks

Our investigation into fine-grained MoE models establishes a nuanced understanding of their scaling properties and offers a methodological approach to optimizing their configuration. The overarching goal is to pave the way for more resource-efficient LLMs without sacrificing their capabilities. As the field of artificial intelligence evolves, the insights from this paper call for rethinking how we approach the computational challenges inherent in training large-scale models.