
Efficient Online Data Mixing For Language Model Pre-Training (2312.02406v2)

Published 5 Dec 2023 in cs.CL and cs.LG

Abstract: The data used to pretrain LLMs has a decisive impact on a model's downstream performance, which has led to a large body of work on data selection methods that aim to automatically determine the most suitable data to use for pretraining. Existing data selection methods suffer from slow and computationally expensive processes, a problem amplified by the increasing size of models and of pretraining datasets. Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together and determining sampling probabilities across entire groups. However, data mixing proportions are typically fixed before training and therefore cannot adapt to changing training dynamics. To address these limitations, we develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing. Based on multi-armed bandit algorithms, our online approach optimizes the data mixing proportions during training. Remarkably, our method trains a model that reaches the final perplexity of the next best method with 19% fewer training iterations, and improves performance on the 5-shot MMLU benchmark by 1.9% relative accuracy, while adding negligible wall-clock time during pretraining.

Citations (21)

Summary

  • The paper demonstrates that ODM dynamically adjusts data mixing proportions by treating data selection as a multi-armed bandit problem.
  • The use of loss as a dynamic reward signal enabled the model to reach similar performance in 19% fewer training iterations.
  • ODM improved downstream capabilities in classification tasks, showcasing its efficiency over static data mixing strategies.

The performance of LLMs is significantly influenced by the pretraining data used. A central challenge in optimizing pretraining is selecting high-quality data without incurring substantial computational cost. LLMs are typically pretrained on datasets drawn from multiple domains, and a crucial decision is how frequently to sample from each domain to achieve the best model performance; this is known as data mixing. Traditional fixed-proportion data mixing strategies cannot adapt to the model's evolving needs during training, and Online Data Mixing (ODM) is designed to close this gap.
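To make the static baseline concrete, here is a minimal sketch of fixed-proportion data mixing: each batch draws its documents from domains according to weights chosen before training and never updated. The domain names and weights are illustrative, not the actual proportions used for The Pile.

```python
import random

random.seed(0)

# Hypothetical fixed mixing weights over three domains (illustrative values).
mixing_weights = {"web": 0.5, "code": 0.3, "books": 0.2}

def sample_domain(weights):
    """Draw one domain according to the fixed mixing proportions."""
    domains = list(weights)
    return random.choices(domains, weights=[weights[d] for d in domains], k=1)[0]

# The proportions never change during training: every batch is sampled
# from the same static distribution.
counts = {d: 0 for d in mixing_weights}
for _ in range(10_000):
    counts[sample_domain(mixing_weights)] += 1
```

Because the weights are frozen, a domain keeps its sampling rate even after the model has extracted most of what that domain can teach it, which is exactly the limitation ODM targets.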

ODM adjusts the data mixing proportions in real time during pretraining. The method frames data selection as a multi-armed bandit problem: each data domain is an 'arm' that can be 'pulled' to obtain the most informative data. Using the loss associated with each domain as a dynamic reward signal, the algorithm is steered toward data offering the highest potential information gain. Remarkably, ODM was shown to reach the final perplexity of the next best method in 19% fewer training iterations, with negligible impact on total training time.
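The bandit view can be sketched with an EXP3-style update, which is the classic adversarial-bandit algorithm the paper builds on. This is a simplified illustration, assuming one arm per domain and the recent training loss on that domain as the reward; the exploration rate, domain names, and toy loss values are assumptions for the demo, not the paper's settings.

```python
import math
import random

random.seed(0)

domains = ["web", "code", "books"]
gamma = 0.1                      # exploration rate (illustrative value)
weights = {d: 1.0 for d in domains}

def mixing_probs():
    """EXP3 mixing: mostly proportional to weights, plus uniform exploration."""
    total = sum(weights.values())
    k = len(domains)
    return {d: (1 - gamma) * weights[d] / total + gamma / k for d in domains}

def update(domain, loss):
    """Reward the pulled arm with its importance-weighted loss, as in EXP3."""
    p = mixing_probs()[domain]
    reward = loss / p
    weights[domain] *= math.exp(gamma * reward / len(domains))

# Toy loop: pretend "code" currently yields the highest training loss,
# so its sampling probability should grow as training proceeds.
fake_loss = {"web": 0.2, "code": 0.9, "books": 0.4}
for _ in range(50):
    probs = mixing_probs()
    d = random.choices(domains, weights=[probs[x] for x in domains], k=1)[0]
    update(d, fake_loss[d])

probs = mixing_probs()
```

Using loss as the reward means high-loss (i.e., not-yet-learned) domains get sampled more often, and the gamma term guarantees every domain retains a nonzero sampling probability, so no arm is ever abandoned entirely.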

To test ODM's efficacy, the researchers trained a 1-billion-parameter LLM on The Pile, a dataset spanning 22 diverse domains. They compared their online mixing strategy against several static data mixing methods, including The Pile's default weights and the DoReMi algorithm, across multiple metrics. ODM not only improved validation perplexity but also enhanced downstream capabilities, yielding more accurate results on the 5-shot MMLU benchmark.

The analysis revealed some intriguing insights. Different domains contributed unevenly to model learning: some domains' data became less significant over time, while others gained importance as training progressed. This dynamic adjustment illustrates ODM's ability to respond to the model's learning trajectory, underscoring its potential to make LLM pretraining both more efficient and more effective.

In summary, the ODM algorithm offers a compelling way forward for training LLMs. By dynamically adjusting the sampling of training data to maximize the model's ongoing learning, ODM outperforms static data mixing strategies, producing better models trained more efficiently. This marks a significant step toward training advanced AI models more intelligently and cost-effectively.
