Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance (2403.16952v1)

Published 25 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Pretraining data of LLMs composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to enable predicting the performance of large models trained on massive data under various mixtures with only small-scale training. Moreover, experimental results verify that our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama, reaching a performance comparable to the one trained for 48% more steps on the default mixture. Extending the application of data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and outlooks the potential for dynamic data schedules.

Authors (6)
  1. Jiasheng Ye (8 papers)
  2. Peiju Liu (5 papers)
  3. Tianxiang Sun (35 papers)
  4. Yunhua Zhou (27 papers)
  5. Jun Zhan (16 papers)
  6. Xipeng Qiu (257 papers)
Citations (28)

Summary

Data Mixing Laws: A Quantitative Framework for Optimizing Training Data Mixtures in LLMs

Introduction

The crafting of pretraining datasets for LLMs involves assembling text from diverse domains, each influencing the model's abilities in intricate ways. Existing approaches balance these mixtures with heuristics or qualitative strategies, leaving no precise, quantitative way to predict and optimize how mixture proportions affect model performance. This paper introduces such a quantitative framework, termed "data mixing laws," which predicts how variations in data mixture proportions affect LLM performance. Moreover, we propose a pipeline that nests existing scaling laws for training steps and model sizes with our data mixing laws, enabling accurate performance predictions for LLMs trained on massive data at a fraction of the computational cost.

Discovery of Data Mixing Laws

The paper begins by examining the relationship between training data mixture proportions and model validation loss, starting from two-domain mixtures and extending to multiple domains and multiple forms of validation sets. The key finding is that validation losses across domains and mixtures can be predicted by an exponential function of a linear combination of the mixture proportions, pointing to a quantifiable law governing this relationship. Fitting these laws on a handful of sample mixtures lets one predict LLM performance for unseen mixtures without exhaustive empirical training.
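
To make the functional form concrete, here is a minimal fitting sketch, assuming a law of the shape L(r) = c + k * exp(t1*r1 + ... + tM*rM) over domain proportions r. The toy mixtures, losses, and the use of scipy.optimize.curve_fit are illustrative choices, not the authors' exact fitting procedure.

```python
# Minimal sketch: fit an assumed mixing law L(r) = c + k * exp(R @ t)
# to a handful of (mixture proportions, validation loss) observations,
# then predict the loss of an unseen mixture before training on it.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(R, c, k, *t):
    """R: (n_samples, n_domains) mixture proportions; returns predicted loss."""
    t = np.asarray(t)
    return c + k * np.exp(R @ t)

# Hypothetical observations: proportions over 3 domains (rows sum to 1) and measured losses.
R_obs = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.30, 0.50, 0.20],
    [0.20, 0.30, 0.50],
    [0.10, 0.60, 0.30],
    [0.40, 0.40, 0.20],
])
loss_obs = np.array([2.95, 2.90, 2.88, 3.02, 2.97, 2.89])

n_domains = R_obs.shape[1]
p0 = [2.0, 1.0] + [0.0] * n_domains                 # rough initial guess for (c, k, t_1..t_M)
params, _ = curve_fit(mixing_law, R_obs, loss_obs, p0=p0, maxfev=20000)

# Predict the loss of a mixture that was never trained on.
r_new = np.array([[0.50, 0.25, 0.25]])
print("predicted loss:", mixing_law(r_new, *params))
```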

Nested Application of Scaling Laws

Fitting a data mixing law directly would require multiple training runs over varied mixtures at the target model scale. To sidestep this cost, we nest the data mixing law with established scaling laws over model sizes and training steps: experiments conducted at affordable scales are extrapolated to the target scale, and the mixing law is then fit on the extrapolated losses. This pipeline predicts the outcomes of large-scale training on diverse mixtures while sharply reducing the need for computationally expensive runs.
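
The sketch below illustrates the extrapolation half of this idea under an assumed power-law form L(S) = E + B / S^beta for loss versus training steps; an analogous fit across model sizes supplies the other half of the nested prediction. The logged losses, initial guesses, and target step count are hypothetical.

```python
# Sketch of extrapolating a cheap small-scale run to a target training scale,
# assuming a power-law step scaling law L(S) = E + B / S**beta.
# One such extrapolated loss per candidate mixture would then feed the mixing-law fit.
import numpy as np
from scipy.optimize import curve_fit

def step_law(S, E, B, beta):
    """Validation loss as a function of training steps: L(S) = E + B / S**beta."""
    return E + B / np.power(S, beta)

def extrapolate_loss(steps, losses, target_steps):
    """Fit the step law on an early, affordable segment of training and
    predict the loss the run would reach at target_steps."""
    p, _ = curve_fit(step_law, steps, losses, p0=[3.0, 30.0, 0.5], maxfev=20000)
    return step_law(target_steps, *p)

# Hypothetical losses logged during a short run on one candidate mixture.
steps = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
losses_small = np.array([3.90, 3.62, 3.41, 3.27, 3.17])
print("predicted loss at 1e5 steps:", extrapolate_loss(steps, losses_small, 1e5))
```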

Experiments and Results

Our experiments support the reliability and utility of the data mixing laws and the nested scaling-law pipeline. Optimizing the training mixture of a 1B-parameter model trained on 100B tokens of RedPajama yields performance comparable to a model trained for 48% more steps on the default mixture. Moreover, applying data mixing laws to continual pretraining accurately predicts the critical mixture proportion that avoids catastrophic forgetting, suggesting a promising path toward dynamic data schedules and broader impacts on LLM training strategies.
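
As a rough illustration of the selection step, one can score candidate mixtures under an already-fitted law and keep the minimizer. The parameter values and the Dirichlet sampling of candidates below are hypothetical placeholders, not the paper's exact procedure.

```python
# Illustrative mixture selection with an assumed, already-fitted 3-domain mixing law
# L(r) = c + k * exp(r @ t); the parameter values here are hypothetical placeholders.
import numpy as np

c, k = 2.50, 0.40
t = np.array([0.20, -0.10, 0.50])

rng = np.random.default_rng(0)
candidates = rng.dirichlet(alpha=np.ones(3), size=4096)   # candidate proportions summing to 1
predicted = c + k * np.exp(candidates @ t)                # predicted validation losses
best = candidates[np.argmin(predicted)]
print("selected mixture:", np.round(best, 3), "| predicted loss:", round(float(predicted.min()), 4))
```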

Implications and Future Directions

This research takes a step towards a quantitative understanding of how training data mixtures influence LLM performance. The data mixing laws, complemented by a practical prediction pipeline, enable more informed, efficient, and strategic LLM pretraining. Looking ahead, the framework opens the door to refined data curation methods, including dynamic data scheduling, and invites deeper theoretical study of how data mixtures interact with learning dynamics in LLMs. More fine-grained, operationally defined notions of domains, together with a theoretical account of the empirical findings, are key avenues for future work.

Conclusion

The quantitative framework developed in this paper offers a new lens through which the impact of data mixtures on LLM pretraining can be predicted and optimized. By establishing data mixing laws and demonstrating their practical application through a nested scaling-law pipeline, this work supports a more informed approach to LLM data curation and points towards more efficient, purposeful model training strategies.
