RegMix: Data Mixture as Regression for Language Model Pre-training
The paper "RegMix: Data Mixture as Regression for LLM Pre-training," introduces an innovative approach, RegMix, to determine an effective data mixture automatically in LLM pre-training. The selection of an optimal data mixture has long been recognized as crucial to achieving strong performance in LLMs, yet previous methods have struggled with scalability and efficiency issues. This paper outlines the methodology of RegMix, analyzes its empirical performance, and discusses the broader implications of these findings.
Methodology
The RegMix approach rests on a rank-invariance hypothesis: the relative ranking of data mixtures remains stable across model sizes and numbers of training tokens. Under this assumption, the authors cast data mixture selection as a regression task, predicting the performance of unseen data mixtures with a regression model fitted to the results of small-scale proxy training runs.
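Formally (in notation of our own choosing, not necessarily the paper's), this amounts to fitting a regressor to observed proxy-run losses and then minimizing the fitted model over the probability simplex of domain weights:

```latex
% w = (w_1, \dots, w_K): per-domain mixture weights with w_k \ge 0, \sum_k w_k = 1
% L(w): validation loss on the target domain after training on mixture w
\hat{f} = \arg\min_{f} \sum_{i=1}^{N} \bigl( L(w^{(i)}) - f(w^{(i)}) \bigr)^2,
\qquad
w^{\star} = \arg\min_{w \in \Delta^{K-1}} \hat{f}(w)
```

The method proceeds in four steps: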
- Training Proxy Models: A set of small-scale proxy models is trained on different data mixtures whose domain weights are sampled from a Dirichlet distribution. The sampling is designed to cover both diverse and extreme weight configurations, ensuring broad coverage of the mixture space.
- Fitting the Regression Model: The results of these training runs are used to fit a regression model: each mixture's domain weights act as the input features, while the validation loss on a target domain serves as the regression target. Ridge regression and LightGBM regression are both evaluated for this task.
- Simulation and Prediction: The fitted regression model then predicts performance across a much larger simulated space of candidate mixtures. The top-ranked mixtures are averaged to produce a robust data mixture for large-scale training (see the end-to-end sketch after this list).
- Large-scale Model Training: A large-scale LLM, orders of magnitude larger than the proxy models, is then trained on the selected mixture.
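The following is a minimal end-to-end sketch of steps 1-3, not the authors' released code. The domain list, hyperparameters, and the synthetic stand-in for proxy training are illustrative assumptions; only the overall flow (Dirichlet sampling, LightGBM regression, large-scale simulation, top-k averaging) follows the paper.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
DOMAINS = ["common_crawl", "wikipedia", "books", "github", "arxiv"]  # assumed
K = len(DOMAINS)

# 1) Sample candidate mixtures from a Dirichlet distribution; a small
#    concentration parameter pushes samples toward the simplex corners,
#    covering extreme as well as balanced mixtures.
n_proxy = 512
train_mixtures = rng.dirichlet(np.full(K, 0.5), size=n_proxy)

def train_proxy_model(weights: np.ndarray) -> float:
    """Stand-in for a cheap small-scale training run. A real implementation
    trains a ~1M-parameter model on the mixture and returns its measured
    validation loss; here a synthetic function keeps the sketch runnable."""
    return (3.0 - 0.8 * weights[0] - 0.2 * weights[2]
            + 0.3 * weights[0] * weights[1] + rng.normal(0.0, 0.02))

# 2) Fit a regression model: mixture weights -> target-domain loss.
losses = np.array([train_proxy_model(w) for w in train_mixtures])
reg = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
reg.fit(train_mixtures, losses)

# 3) Predict over a much larger simulated mixture space and average the
#    top-k predicted mixtures into one robust mixture.
candidates = rng.dirichlet(np.ones(K), size=100_000)
predicted = reg.predict(candidates)
top_k = candidates[np.argsort(predicted)[:128]]
best_mixture = top_k.mean(axis=0)
best_mixture /= best_mixture.sum()  # renormalize onto the simplex

print(dict(zip(DOMAINS, best_mixture.round(4))))
```

In the paper, the simulated candidate space is far larger, and ridge regression is evaluated alongside LightGBM; the small sizes above are chosen only to keep the sketch lightweight.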
Empirical Validation
To validate RegMix empirically, the authors conducted extensive experiments involving models with 1M, 60M, and 1B parameters. These include training 512 models with 1M parameters on 1B tokens each, then using the fitted regression model to predict top-performing mixtures for models with 1B parameters trained on 25B tokens.
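A natural way to quantify whether small-scale rankings carry over to large models, in the spirit of the rank-invariance hypothesis, is a Spearman rank correlation between losses of the same mixtures at the two scales. A minimal sketch with fabricated numbers (these are not results from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Target-domain losses of the *same* five mixtures at two model scales.
# Values are invented for illustration only.
proxy_losses = np.array([3.12, 3.40, 2.98, 3.25, 3.05])  # e.g., 1M-parameter runs
large_losses = np.array([2.41, 2.66, 2.30, 2.52, 2.37])  # e.g., 1B-parameter runs

rho, pvalue = spearmanr(proxy_losses, large_losses)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")
# rho close to 1.0 supports rank invariance: mixtures that rank well at
# small scale also rank well at large scale.
```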
Key Findings:
- Data Mixture Impact: The choice of data mixture significantly affects performance, with single-task performance varying by as much as 14.6% across mixtures.
- Web Corpora Effectiveness: Contrary to conventional expectations, web corpora (e.g., CommonCrawl) exhibited a stronger positive correlation with downstream performance than traditionally high-quality sources like Wikipedia (see the correlation sketch after this list).
- Complex Domain Interactions: The paper revealed intricate and non-intuitive interactions between different data domains, highlighting the necessity of automated approaches like RegMix for data selection.
- Transcending Scaling Laws: Data mixture effects were shown to transcend scaling laws, with the regression approach capturing complex interactions among multiple domains that simple scaling laws miss.
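The web-corpora finding is, at heart, a correlation analysis across the many proxy runs. A minimal sketch of that analysis with synthetic data (the domain set and scores are placeholders, not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
domains = ["common_crawl", "wikipedia", "books"]  # placeholder domain set
n_models = 512

# Mixture weights for each trained proxy model (rows sum to 1).
weights = rng.dirichlet(np.ones(len(domains)), size=n_models)
# Placeholder downstream scores; real ones come from benchmark evaluations.
scores = 0.5 + 0.3 * weights[:, 0] + rng.normal(0.0, 0.02, size=n_models)

for j, name in enumerate(domains):
    r = np.corrcoef(weights[:, j], scores)[0, 1]
    print(f"{name:>12}: correlation with downstream score r = {r:+.3f}")
```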
Performance Comparison
RegMix's performance was benchmarked against human selection and against DoReMi, a prominent data mixture selection method. Remarkably, RegMix matched and in some cases surpassed DoReMi's performance while using only 10% of DoReMi's compute budget.
Results:
- Superior Average Performance: On average, models trained with RegMix outperformed those trained using human-selected mixtures and DoReMi mixtures.
- Robustness Across Tasks: RegMix demonstrated robust performance across a wide range of benchmarks, beating human selection in 8 out of 14 tasks.
Implications and Future Directions
- Efficiency in LLM Training: By cutting the compute cost of identifying good data mixtures, RegMix makes data selection far more efficient, with practical implications for scaling LLM training under constrained computational resources.
- Rethinking Data Quality: The findings suggest re-evaluating the perceived quality of data sources, given that web corpora correlate more strongly with downstream performance than traditionally favored sources.
- Beyond Manual Curation: The complexity of domain interactions exposed by this research indicates that manual curation of data mixtures often yields suboptimal performance, making automated methods like RegMix indispensable for building high-performing LLMs.
- Future Work: Given the promising results, future research may explore extending RegMix to larger models, diverse datasets, and potentially integrating it with other optimization strategies for even more efficient and robust data selection.
Conclusion
RegMix constitutes a significant advancement in the automated selection of data mixtures for LLM training, balancing scalability, efficiency, and performance. By treating data mixture selection as a regression task, RegMix opens directions for further optimizing LLM pre-training and achieving superior downstream performance with reduced computational costs.
The code and methodology presented in this paper offer a pathway for researchers to build upon these findings, potentially advancing the state-of-the-art in LLM pre-training strategies.