RegMix: Data Mixture as Regression for Language Model Pre-training
The paper "RegMix: Data Mixture as Regression for LLM Pre-training," introduces an innovative approach, RegMix, to determine an effective data mixture automatically in LLM pre-training. The selection of an optimal data mixture has long been recognized as crucial to achieving strong performance in LLMs, yet previous methods have struggled with scalability and efficiency issues. This paper outlines the methodology of RegMix, analyzes its empirical performance, and discusses the broader implications of these findings.
Methodology
The RegMix approach rests on a rank-invariance hypothesis: the relative ranking of data mixtures remains stable across model sizes and numbers of training tokens. Under this assumption, the authors cast data mixture selection as a regression task, predicting the performance of unseen data mixtures with a regression model fitted to the results of small-scale proxy training runs.
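Formally (in notation of our own choosing, not necessarily the paper's), this amounts to fitting a regressor to observed proxy-run losses and then minimizing the fitted model over the probability simplex of domain weights:

```latex
% w = (w_1, \dots, w_K): per-domain mixture weights with w_k \ge 0, \sum_k w_k = 1
% L(w): validation loss on the target domain after training on mixture w
\hat{f} = \arg\min_{f} \sum_{i=1}^{N} \bigl( L(w^{(i)}) - f(w^{(i)}) \bigr)^2,
\qquad
w^{\star} = \arg\min_{w \in \Delta^{K-1}} \hat{f}(w)
```

The method proceeds in four steps: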
- Training Proxy Models: A set of small-scale proxy models is trained on different data mixtures whose domain weights are sampled from a Dirichlet distribution. The sampling is designed to cover both diverse and extreme weight configurations, ensuring broad coverage of the mixture space.
- Fitting the Regression Model: The results of these training runs are used to fit a regression model: each mixture's domain weights act as the input features, while the validation loss on a target domain serves as the regression target. Ridge regression and LightGBM regression are both evaluated for this task.
- Simulation and Prediction: The fitted regression model then predicts performance across a much larger simulated space of candidate mixtures. The top-ranked mixtures are averaged to produce a robust data mixture for large-scale training (see the end-to-end sketch after this list).
- Large-scale Model Training: A large-scale LLM, orders of magnitude larger than the proxy models, is then trained on the selected mixture.
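The following is a minimal end-to-end sketch of steps 1-3, not the authors' released code. The domain list, hyperparameters, and the synthetic stand-in for proxy training are illustrative assumptions; only the overall flow (Dirichlet sampling, LightGBM regression, large-scale simulation, top-k averaging) follows the paper.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
DOMAINS = ["common_crawl", "wikipedia", "books", "github", "arxiv"]  # assumed
K = len(DOMAINS)

# 1) Sample candidate mixtures from a Dirichlet distribution; a small
#    concentration parameter pushes samples toward the simplex corners,
#    covering extreme as well as balanced mixtures.
n_proxy = 512
train_mixtures = rng.dirichlet(np.full(K, 0.5), size=n_proxy)

def train_proxy_model(weights: np.ndarray) -> float:
    """Stand-in for a cheap small-scale training run. A real implementation
    trains a ~1M-parameter model on the mixture and returns its measured
    validation loss; here a synthetic function keeps the sketch runnable."""
    return (3.0 - 0.8 * weights[0] - 0.2 * weights[2]
            + 0.3 * weights[0] * weights[1] + rng.normal(0.0, 0.02))

# 2) Fit a regression model: mixture weights -> target-domain loss.
losses = np.array([train_proxy_model(w) for w in train_mixtures])
reg = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
reg.fit(train_mixtures, losses)

# 3) Predict over a much larger simulated mixture space and average the
#    top-k predicted mixtures into one robust mixture.
candidates = rng.dirichlet(np.ones(K), size=100_000)
predicted = reg.predict(candidates)
top_k = candidates[np.argsort(predicted)[:128]]
best_mixture = top_k.mean(axis=0)
best_mixture /= best_mixture.sum()  # renormalize onto the simplex

print(dict(zip(DOMAINS, best_mixture.round(4))))
```

In the paper, the simulated candidate space is far larger, and ridge regression is evaluated alongside LightGBM; the small sizes above are chosen only to keep the sketch lightweight.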
Empirical Validation
To validate RegMix empirically, the authors conducted extensive experiments involving models with 1M, 60M, and 1B parameters. These include training 512 models with 1M parameters on 1B tokens each, then using the fitted regression model to predict top-performing mixtures for models with 1B parameters trained on 25B tokens.
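A natural way to quantify whether small-scale rankings carry over to large models, in the spirit of the rank-invariance hypothesis, is a Spearman rank correlation between losses of the same mixtures at the two scales. A minimal sketch with fabricated numbers (these are not results from the paper):

```python
import numpy as np
from scipy.stats import spearmanr

# Target-domain losses of the *same* five mixtures at two model scales.
# Values are invented for illustration only.
proxy_losses = np.array([3.12, 3.40, 2.98, 3.25, 3.05])  # e.g., 1M-parameter runs
large_losses = np.array([2.41, 2.66, 2.30, 2.52, 2.37])  # e.g., 1B-parameter runs

rho, pvalue = spearmanr(proxy_losses, large_losses)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")
# rho close to 1.0 supports rank invariance: mixtures that rank well at
# small scale also rank well at large scale.
```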
Key Findings:
- Data Mixture Impact: The choice of data mixture significantly affects performance, with single-task performance varying by as much as 14.6% across mixtures.
- Web Corpora Effectiveness: Contrary to conventional expectations, web corpora (e.g., CommonCrawl) exhibited a stronger positive correlation with downstream performance than traditionally high-quality sources like Wikipedia (see the correlation sketch after this list).
- Complex Domain Interactions: The paper revealed intricate and non-intuitive interactions between different data domains, highlighting the necessity of automated approaches like RegMix for data selection.
- Transcending Scaling Laws: Data mixture effects were shown to transcend scaling laws, with the regression approach capturing complex interactions among multiple domains that simple scaling laws miss.
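The web-corpora finding is, at heart, a correlation analysis across the many proxy runs. A minimal sketch of that analysis with synthetic data (the domain set and scores are placeholders, not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
domains = ["common_crawl", "wikipedia", "books"]  # placeholder domain set
n_models = 512

# Mixture weights for each trained proxy model (rows sum to 1).
weights = rng.dirichlet(np.ones(len(domains)), size=n_models)
# Placeholder downstream scores; real ones come from benchmark evaluations.
scores = 0.5 + 0.3 * weights[:, 0] + rng.normal(0.0, 0.02, size=n_models)

for j, name in enumerate(domains):
    r = np.corrcoef(weights[:, j], scores)[0, 1]
    print(f"{name:>12}: correlation with downstream score r = {r:+.3f}")
```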
Performance Comparison
RegMix's performance was benchmarked against human selection and against DoReMi, a prominent data mixture selection method. Remarkably, RegMix matched and in some cases surpassed DoReMi's performance while using only 10% of DoReMi's compute budget.
Results:
- Superior Average Performance: On average, models trained with RegMix outperformed those trained using human-selected mixtures and DoReMi mixtures.
- Robustness Across Tasks: RegMix demonstrated robust performance across a wide range of benchmarks, beating human selection in 8 out of 14 tasks.
Implications and Future Directions
- Efficiency in LLM Training: By cutting the compute cost of identifying good data mixtures, RegMix makes data selection far more efficient, with practical implications for scaling LLM training under constrained computational resources.
- Rethinking Data Quality: The findings suggest re-evaluating the perceived quality of data sources, given that web corpora correlate more strongly with downstream performance than traditionally favored sources.
- Beyond Manual Curation: The complexity of domain interactions exposed by this research indicates that manual curation of data mixtures often yields suboptimal performance, making automated methods like RegMix indispensable for building high-performing LLMs.
- Future Work: Given the promising results, future research may explore extending RegMix to larger models, diverse datasets, and potentially integrating it with other optimization strategies for even more efficient and robust data selection.
Conclusion
RegMix constitutes a significant advancement in the automated selection of data mixtures for LLM training, balancing scalability, efficiency, and performance. By treating data mixture selection as a regression task, RegMix opens directions for further optimizing LLM pre-training and achieving superior downstream performance with reduced computational costs.
The code and methodology presented in this paper offer a pathway for researchers to build upon these findings, potentially advancing the state-of-the-art in LLM pre-training strategies.