Exploring "Lory": A Fully-Differentiable Mixture-of-Experts for LLMs
Introducing the Lory Model
When we talk about scaling AI models, particularly LLMs, the challenge often lies in improving model quality without a matching blow-up in compute. This is where Mixture-of-Experts (MoE) architectures step in, allowing growth in parameter count without a proportional increase in the computation spent per token.
However, traditional MoE models route each token to a discrete subset of experts, and training this routing network means optimizing a non-differentiable, discrete objective. Enter "Lory", a novel approach that introduces a fully-differentiable MoE architecture suitable for autoregressive LLM pre-training: instead of dispatching tokens discretely, it merges the experts' parameters using soft routing weights, so the whole model can be trained with ordinary backpropagation.
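To make that concrete, here is a minimal sketch in PyTorch of soft expert merging, the mechanism that makes such a layer fully differentiable: the router's softmax weights form a convex combination of the expert weight matrices, so gradients flow through the routing decision itself. All names here (`SoftMergedFFN`, `router_input`) are my own illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftMergedFFN(nn.Module):
    """Differentiable MoE layer: experts are merged in parameter space.

    Instead of a discrete top-k dispatch, the router's softmax output
    forms a convex combination of the expert weight matrices, so the
    whole layer (router included) trains with plain backprop.
    Illustrative sketch, not the paper's implementation.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert parameters stacked along a leading expert dimension.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x: torch.Tensor, router_input: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); router_input: (batch, d_model)
        probs = F.softmax(self.router(router_input), dim=-1)    # (batch, n_experts)
        # Weighted average of expert parameters: one merged FFN per batch item.
        w_in = torch.einsum("be,edf->bdf", probs, self.w_in)    # (batch, d_model, d_ff)
        w_out = torch.einsum("be,efd->bfd", probs, self.w_out)  # (batch, d_ff, d_model)
        h = F.gelu(torch.einsum("bsd,bdf->bsf", x, w_in))
        return torch.einsum("bsf,bfd->bsd", h, w_out)
```

The trade-off is that a separate merged FFN must be materialized for every routing decision, which is exactly why Lory routes once per segment rather than once per token.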
Core Innovations in Lory
The Lory model introduces two pivotal techniques:
- Causal Segment Routing: The input sequence is split into fixed-length segments, and the expert routing weights for each segment are computed from the preceding segment, which preserves the autoregressive structure of language modeling. During inference, Lory simplifies the process further by making a single routing decision based on the input prompt, keeping decoding efficient. (Both techniques are sketched after this list.)
- Similarity-based Data Batching: Training batches group semantically similar documents, so consecutive segments tend to be topically related; this gives the router a meaningful signal to learn from and promotes expert specialization.
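Building on the merged-expert layer above, a rough sketch of causal segment routing might look as follows: the mean-pooled hidden states of segment t-1 produce the weights used to merge expert parameters for segment t. The handling of the first segment (uniform weights here) is an assumption on my part; the paper's exact choice may differ.

```python
import torch
import torch.nn.functional as F

def causal_segment_route(hidden: torch.Tensor, router: torch.nn.Linear,
                         seg_len: int) -> torch.Tensor:
    """Compute per-segment expert weights from the *previous* segment.

    hidden: (batch, seq, d_model) token representations.
    Returns (batch, n_segments, n_experts) routing weights; segment t is
    routed with information pooled from segment t-1, preserving the
    autoregressive property during training. First-segment handling
    (uniform weights) is an assumption, not the paper's exact choice.
    """
    b, s, d = hidden.shape
    # Truncate to a whole number of segments, then pool each segment.
    segs = hidden[:, : (s // seg_len) * seg_len].view(b, -1, seg_len, d)
    pooled = segs.mean(dim=2)                        # (b, n_seg, d)
    probs = F.softmax(router(pooled), dim=-1)        # (b, n_seg, n_experts)
    n_experts = probs.shape[-1]
    uniform = torch.full_like(probs[:, :1], 1.0 / n_experts)
    # Shift right: segment t uses weights computed from segment t-1.
    return torch.cat([uniform, probs[:, :-1]], dim=1)
```

Each row of the returned tensor would drive a per-segment parameter merge like `SoftMergedFFN` above, so all tokens in a segment share one merged expert.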
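And a toy approximation of similarity-based data batching: order documents by embedding similarity before packing them into training sequences, so neighboring segments tend to share a topic. This greedy nearest-neighbor chaining is a simplification for illustration, not the paper's actual pipeline.

```python
import numpy as np

def similarity_order(doc_embs: np.ndarray) -> list[int]:
    """Greedily chain documents so neighbors are semantically similar.

    doc_embs: (n_docs, dim) unit-normalized document embeddings.
    Packing consecutive documents from the returned ordering into the
    same training sequence places related text in adjacent segments.
    A simple approximation of the idea, not the paper's pipeline.
    """
    n = doc_embs.shape[0]
    unused = set(range(1, n))
    order = [0]
    while unused:
        last = doc_embs[order[-1]]
        # Pick the unused document most similar to the last one chosen.
        idx = max(unused, key=lambda i: float(doc_embs[i] @ last))
        order.append(idx)
        unused.remove(idx)
    return order
```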
Training and Performance
Lory was trained from scratch on 150 billion tokens, with model sizes of up to 30 billion parameters. The results are promising:
- On perplexity, a measure of how uncertain the model is about the next token (defined just after this list), Lory outperformed parameter-matched dense models by approximately 13.9%.
- For downstream tasks, which include a diverse set from reading comprehension to text classification, the performance boost ranged from 1.5% to 11.1%.
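For reference, perplexity is the exponentiated average negative log-likelihood over a held-out corpus of $N$ tokens, so lower is better:

$$
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\!\left(x_i \mid x_{<i}\right)\right)
$$

A 13.9% improvement thus means Lory assigns substantially higher probability to held-out text than a dense model of the same size.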
Importantly, despite using segment-level routing, Lory achieved competitive performance against state-of-the-art MoE models that use more granular (but computationally expensive) token-level routing.
Theoretical and Practical Implications
The research demonstrates several key implications:
- Specialization without Supervision: Lory's experts developed domain-level specialization on their own, a behavior not prominently seen in traditional MoE approaches, whose experts tend to latch onto superficial token-level patterns.
- Scalability with a Fully-Differentiable Architecture: By removing non-differentiable routing components, Lory trains end-to-end with standard backpropagation, simplifying the training process and opening up possibilities for more scalable and efficient training regimes.
- Efficiency in Inference: Making a single routing decision per prompt gives Lory the simplicity and computational profile of a dense model at decoding time, making it practical for real-world applications where resources are constrained (a sketch follows this list).
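As an illustration of that inference-time shortcut, reusing the hypothetical `SoftMergedFFN` from earlier: the prompt's pooled representation fixes the merge weights once, the experts are collapsed into a single dense FFN, and decoding then proceeds at dense-model cost. The interface is assumed for this sketch.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def collapse_to_dense(layer: "SoftMergedFFN", prompt_hidden: torch.Tensor):
    """Make one routing decision from the prompt, then bake the merged
    expert weights into plain tensors for dense-speed decoding.

    prompt_hidden: (seq, d_model) hidden states of the prompt.
    Reuses the hypothetical SoftMergedFFN sketch from earlier; the
    real model's interface will differ.
    """
    probs = F.softmax(layer.router(prompt_hidden.mean(dim=0)), dim=-1)  # (n_experts,)
    w_in = torch.einsum("e,edf->df", probs, layer.w_in)    # (d_model, d_ff)
    w_out = torch.einsum("e,efd->fd", probs, layer.w_out)  # (d_ff, d_model)
    return w_in, w_out  # use as a fixed FFN for every generated token
```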
Looking Forward
The success of Lory suggests a potent future for fully-differentiable MoE architectures in LLM pre-training. Natural next steps include combining Lory's segment-level routing with token-level strategies, or pushing its experts toward even more specialized tasks.
Moreover, Lory's principles could translate to domains beyond NLP, wherever MoE architectures are beneficial. Progress in these areas would further underline the versatility and utility of fully-differentiable MoE systems in modern AI.