Scaling Expert Language Models with Unsupervised Domain Discovery

(2303.14177)
Published Mar 24, 2023 in cs.CL and cs.AI

Abstract

Large language models are typically trained densely: all parameters are updated with respect to all inputs. This requires synchronization of billions of parameters across thousands of GPUs. We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora. Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models. Our technique outperforms dense baselines on multiple corpora and few-shot tasks, and our analysis shows that specializing experts to meaningful clusters is key to these gains. Performance also improves with the number of experts and size of training data, suggesting this is a highly efficient and accessible approach to training LLMs.

Introduction of C-BTM, a novel method for asynchronously scaling expert language models.

Overview

  • Expert Language Models (ELMs) are smaller, domain-specific models trained to reduce the complexity and resource requirements of LLMs.

  • The paper introduces Cluster-Branch-Train-Merge (C-BTM), a method for unsupervised domain discovery and training of ELMs from text clusters.

  • C-BTM uses k-means clustering for domain discovery, trains each ELM independently, and combines their outputs at inference time based on the context.

  • C-BTM scales well: performance improves with the number of experts and the size of the training data, and experts can be trained in parallel with reduced resource usage.

  • C-BTM offers advantages over dense training and Mixture-of-Experts models, including lower communication overhead, more flexible job scheduling, and more efficient resource management.

Introduction to Expert Language Models

Language models (LMs) have grown rapidly in size and capability, requiring extensive resources and sophisticated training techniques. One recent approach to managing this complexity is to train smaller, domain-specific models, known as Expert Language Models (ELMs), that work together during inference.

Unsupervised Domain Discovery

Imagine breaking up a vast text corpus into related document clusters without relying on pre-existing labels or metadata. This is precisely the idea behind Cluster-Branch-Train-Merge (C-BTM), a novel approach that clusters a corpus into sets of related documents. Each cluster then becomes the training ground for its own expert LM, trained independently of the others, which avoids the synchronization of billions of parameters across thousands of GPUs that dense training requires. At inference time, a sparse ensemble of these ELMs is activated.
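As a concrete illustration, here is a minimal sketch of the clustering step, assuming documents are embedded with tf-idf and grouped with standard k-means via scikit-learn; the paper's exact embedding choice and any cluster-balancing details may differ.

```python
# Illustrative sketch of C-BTM's clustering step (not the authors' code):
# embed documents and group them into discovered domains with k-means.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_corpus(documents, num_clusters=8, seed=0):
    """Assign each document to one of `num_clusters` discovered domains."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    doc_embeddings = vectorizer.fit_transform(documents)      # sparse tf-idf vectors
    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    cluster_ids = kmeans.fit_predict(doc_embeddings)          # one domain id per document
    return cluster_ids, kmeans.cluster_centers_, vectorizer
```

Each cluster's documents then form the training corpus for one expert LM, which can be trained without any communication with the other experts.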

The Inner Workings of C-BTM

How does C-BTM function? It begins with unsupervised domain discovery using k-means clustering, which relies not on pre-labeled data but on patterns inherent in the text corpus to form clusters. Once clusters are established, each is assigned an ELM that is trained independently with a standard log-likelihood objective. At inference time, the outputs of the ELMs are weighted and combined based on the distances between the cluster centers and the context embedding, allowing the model to activate only the most relevant ELMs.
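A minimal sketch of that inference-time mixing is shown below. It assumes a hypothetical `embed_context` function that maps the current context into the same space as the cluster centers, and experts that expose a `next_token_probs` method; the temperature and the exact distance-to-weight mapping are illustrative choices, not the paper's implementation.

```python
import numpy as np

def ensemble_next_token_probs(context, experts, cluster_centers, embed_context,
                              temperature=0.1):
    """Weight each expert by how close its cluster center is to the context."""
    h = embed_context(context)                               # context embedding
    sq_dists = ((cluster_centers - h) ** 2).sum(axis=1)      # distance to each domain center
    logits = -sq_dists / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                                  # softmax: nearer cluster -> larger weight
    # Mix the experts' next-token distributions with those weights.
    mixed = sum(w * expert.next_token_probs(context)
                for w, expert in zip(weights, experts))
    return mixed, weights
```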

Scaling Efficiency and Model Performance

When evaluating the technique's scalability, the authors find that the number of clusters can grow with the size of the training data, improving performance. This scalability allows ELMs to be trained in parallel, effectively using fewer resources without sacrificing output quality. Furthermore, at inference time the model maintains strong performance even when the ensemble is sparsified so that only a few experts are activated per context, indicating that training with C-BTM yields an effective sparse LM.
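One way to picture that sparsification, continuing the sketch above: keep only the top-k expert weights for a given context and renormalize, so only those experts ever run a forward pass. The value of k here is an illustrative knob, not a prescription from the paper.

```python
import numpy as np

def sparsify_weights(weights, top_k=2):
    """Zero out all but the top-k expert weights and renormalize."""
    keep = np.argsort(weights)[-top_k:]                       # indices of the k nearest clusters
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep]
    return sparse / sparse.sum()                              # only k experts are evaluated
```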

Advantages Over Dense Training and MoE

C-BTM provides significant advantages over both dense models and other sparse models such as Mixture-of-Experts (MoE). Because experts are trained independently, it achieves strong performance with far less communication overhead and avoids much of the synchronization burden of traditional large-scale model training. Moreover, it offers flexibility in job scheduling on shared GPU clusters, a welcome feature given the growing need for efficient resource management.

Implications for Future Language Model Training

The introduction of C-BTM marks a step forward in training more capable language models without the traditional extensive resource requirements. Its cluster-based training and inference reduce computational demands while maintaining or improving model quality across a range of tasks. The technique holds promise for diverse applications and suggests an accessible path to training large LMs even with limited computing power.
