- The paper demonstrates that ODM dynamically adjusts data mixing proportions by treating data selection as a multi-armed bandit problem.
- Using loss as a dynamic reward signal let the model reach performance comparable to the next-best method in 19% fewer training iterations.
- ODM also improved downstream performance on a 5-shot classification task, demonstrating its advantage over static data mixing strategies.
The performance of LLMs is heavily influenced by their pretraining data. A central challenge in optimizing pretraining is selecting high-quality data without incurring substantial computational cost. LLMs are typically pretrained on datasets spanning many domains, and a crucial decision is how frequently to sample from each domain to achieve the best model performance, a choice known as data mixing. Traditional strategies fix these proportions up front and cannot adapt to the model's evolving needs during training; a new approach, Online Data Mixing (ODM), bridges this gap.
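To make the idea concrete, here is a minimal sketch of static data mixing under illustrative assumptions (the domain names and weights below are made up for illustration, not the actual Pile proportions):

```python
import random

# Hypothetical fixed mixing weights, chosen once before training starts.
STATIC_WEIGHTS = {"web": 0.5, "code": 0.2, "books": 0.2, "wiki": 0.1}

def sample_domain_static():
    """Pick the domain for the next batch with fixed probabilities."""
    domains = list(STATIC_WEIGHTS)
    weights = [STATIC_WEIGHTS[d] for d in domains]
    return random.choices(domains, weights=weights)[0]
```

However much the model learns, these proportions never change; ODM replaces this fixed table with sampling weights that evolve during training.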
ODM is a novel technique that adjusts data mixing proportions in real time during pretraining. It treats data selection as a multi-armed bandit problem: each data domain is an 'arm' that can be 'pulled', and the goal is to keep pulling the arms that yield the most informative data. The loss observed on each domain serves as a dynamic reward signal, steering the algorithm toward data with the highest potential information gain. Remarkably, ODM was shown to reach similar levels of model performance in 19% fewer training iterations than the next-best method, with minimal impact on total training time.
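A minimal sketch of how such a loss-as-reward bandit could look in practice is shown below. The class name, learning rate, exploration rate, and the EXP3-style exponential-weights update are illustrative assumptions; the paper's exact update rule and reward scaling may differ:

```python
import math
import random

class OnlineDataMixer:
    """Illustrative EXP3-style bandit over data domains.

    Each domain is an arm; the training loss observed on a batch from
    that domain is the reward, so domains the model still finds hard
    (and can learn the most from) get sampled more often.
    """

    def __init__(self, domains, lr=0.1, explore=0.05):
        self.domains = list(domains)
        self.lr = lr            # step size for the weight update
        self.explore = explore  # uniform-exploration mixing rate
        self.log_weights = {d: 0.0 for d in self.domains}

    def probabilities(self):
        # Softmax over log-weights, blended with a uniform distribution
        # so every domain keeps a nonzero sampling probability.
        m = max(self.log_weights.values())
        exps = {d: math.exp(w - m) for d, w in self.log_weights.items()}
        z = sum(exps.values())
        k = len(self.domains)
        return {d: (1 - self.explore) * e / z + self.explore / k
                for d, e in exps.items()}

    def sample_domain(self):
        # Draw the domain for the next training batch.
        probs = self.probabilities()
        return random.choices(self.domains,
                              weights=[probs[d] for d in self.domains])[0]

    def update(self, domain, loss):
        # Importance-weighted reward: dividing by the sampling probability
        # keeps the estimate unbiased across unevenly sampled domains.
        p = self.probabilities()[domain]
        self.log_weights[domain] += self.lr * loss / p
```

In a training loop, each step would call `sample_domain()` to pick where the next batch comes from, run the optimizer step, and feed the observed loss back through `update()`, so the mixing proportions continuously track the model's learning trajectory.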
To test ODM's efficacy, the researchers trained a 1-billion-parameter LLM on The Pile, a dataset comprising 22 diverse domains. They compared the online mixing strategy against several static data mixing baselines, including The Pile's default weights and the DoReMi algorithm, across various performance metrics. ODM not only improved validation perplexity but also enhanced downstream capabilities, producing more accurate results on a 5-shot multiple-choice classification task.
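As a rough illustration of how such an evaluation typically works (a hedged sketch; the paper's exact evaluation harness is not specified here, and `model.log_likelihood` is a hypothetical helper), each answer choice can be scored by the model's log-likelihood and the highest-scoring choice selected:

```python
def classify_multiple_choice(model, few_shot_prompt, question, choices):
    """Pick the answer choice the model assigns the highest log-likelihood.

    `model.log_likelihood(context, continuation)` is a hypothetical helper
    returning the log-probability of `continuation` given `context`.
    """
    context = few_shot_prompt + "\nQ: " + question + "\nA:"
    scores = [model.log_likelihood(context, " " + c) for c in choices]
    return choices[scores.index(max(scores))]
```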
The analysis revealed some intriguing insights. Domains contributed to model learning unevenly: some became less informative over time, while others gained importance as training progressed. ODM's dynamic adjustment tracks this shifting landscape, responding to the model's learning trajectory and underscoring its potential to make LLM pretraining both more efficient and more effective.
In summary, the ODM algorithm offers a compelling way forward for training LLMs. By dynamically adjusting the sampling of training data to maximize the model's ongoing learning, ODM outperforms static data mixing strategies, yielding better models trained more efficiently. This development marks a significant step toward training the advanced AI models that are increasingly central to technology applications more intelligently and cost-effectively.