Scaling Expert Language Models with Unsupervised Domain Discovery

(2303.14177)
Published Mar 24, 2023 in cs.CL and cs.AI

Abstract

Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training LLMs.

Introduction of C-BTM, a novel method for asynchronously scaling expert language models.

Overview

  • Expert Language Models (ELMs) are smaller, domain-specific models trained to reduce the complexity and resource requirements of LLMs.

  • The paper introduces Cluster-Branch-Train-Merge (C-BTM), a method for unsupervised domain discovery and training of ELMs from text clusters.

  • C-BTM uses k-means clustering for domain discovery, trains each ELM independently, and combines their outputs at inference time based on the context.

  • C-BTM scales well: performance improves with the number of experts and the size of the training data, and experts can be trained in parallel with reduced resource usage.

  • C-BTM offers advantages over dense training and Mixture-of-Experts models, including lower communication overhead, more flexible job scheduling, and more efficient resource management.

Introduction to Expert Language Models

Language models (LMs) have grown rapidly in size and capability, requiring extensive resources and sophisticated training techniques. One recent approach to managing this complexity is to train smaller, domain-specific models, known as Expert Language Models (ELMs), that work together during inference.

Unsupervised Domain Discovery

Imagine breaking up a vast text corpus into related document clusters without relying on pre-existing labels or metadata. This is precisely the idea behind Cluster-Branch-Train-Merge (C-BTM), a novel approach that clusters a corpus into sets of related documents. Each cluster then becomes the training ground for its own expert LM, trained independently of the others, which avoids the synchronization of billions of parameters across thousands of GPUs that dense training requires. At inference time, a sparse ensemble of these ELMs is activated.
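As a concrete illustration, here is a minimal sketch of the clustering step, assuming documents are embedded with tf-idf and grouped with standard k-means via scikit-learn; the paper's exact embedding choice and any cluster-balancing details may differ.

```python
# Illustrative sketch of C-BTM's clustering step (not the authors' code):
# embed documents and group them into discovered domains with k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_corpus(documents, num_clusters=8, seed=0):
    """Assign each document to one of `num_clusters` discovered domains."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    doc_embeddings = vectorizer.fit_transform(documents)      # sparse tf-idf vectors
    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    cluster_ids = kmeans.fit_predict(doc_embeddings)          # one domain id per document
    return cluster_ids, kmeans.cluster_centers_, vectorizer
```

Each cluster's documents then form the training corpus for one expert LM, which can be trained without any communication with the other experts.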

The Inner Workings of C-BTM

How does C-BTM function? It begins with unsupervised domain discovery using k-means clustering, which relies not on pre-labeled data but on patterns inherent in the text corpus to form clusters. Once clusters are established, each is assigned an ELM that is trained independently with a standard log-likelihood objective. At inference time, the outputs of the ELMs are weighted and combined based on the distances between the cluster centers and the context embedding, allowing the model to activate only the most relevant ELMs.
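A minimal sketch of that inference-time mixing is shown below. It assumes a hypothetical `embed_context` function that maps the current context into the same space as the cluster centers, and experts that expose a `next_token_probs` method; the temperature and the exact distance-to-weight mapping are illustrative choices, not the paper's implementation.

```python
import numpy as np

def ensemble_next_token_probs(context, experts, cluster_centers, embed_context,
                              temperature=0.1):
    """Weight each expert by how close its cluster center is to the context."""
    h = embed_context(context)                               # context embedding
    sq_dists = ((cluster_centers - h) ** 2).sum(axis=1)      # distance to each domain center
    logits = -sq_dists / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                                  # softmax: nearer cluster -> larger weight
    # Mix the experts' next-token distributions with those weights.
    mixed = sum(w * expert.next_token_probs(context)
                for w, expert in zip(weights, experts))
    return mixed, weights
```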

Scaling Efficiency and Model Performance

When evaluating the technique's scalability, the authors find that the number of clusters can grow with the size of the training data, improving performance. This scalability allows ELMs to be trained in parallel, effectively using fewer resources without sacrificing output quality. Furthermore, at inference time the model maintains strong performance even when the ensemble is sparsified so that only a few experts are activated per context, indicating that training with C-BTM yields an effective sparse LM.
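One way to picture that sparsification, continuing the sketch above: keep only the top-k expert weights for a given context and renormalize, so only those experts ever run a forward pass. The value of k here is an illustrative knob, not a prescription from the paper.

```python
import numpy as np

def sparsify_weights(weights, top_k=2):
    """Zero out all but the top-k expert weights and renormalize."""
    keep = np.argsort(weights)[-top_k:]                       # indices of the k nearest clusters
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep]
    return sparse / sparse.sum()                              # only k experts are evaluated
```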

Advantages Over Dense Training and MoE

C-BTM provides significant advantages over both dense models and other sparse models such as Mixture-of-Experts (MoE). Because experts are trained independently, it achieves strong performance with far less communication overhead and avoids much of the synchronization burden of traditional large-scale model training. Moreover, it offers flexibility in job scheduling on shared GPU clusters, a welcome feature given the growing need for efficient resource management.

Implications for Future Language Model Training

The introduction of C-BTM marks a step forward in training more capable language models without the traditional extensive resource requirements. Its cluster-based training and inference reduce computational demands while maintaining or improving model quality across a range of tasks. The technique holds promise for diverse applications and suggests an accessible path to training large LMs even with limited computing power.
