AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs (2407.20177v3)
Abstract: Domain reweighting is an emerging research area aimed at adjusting the relative weights of different data sources to improve the effectiveness and efficiency of LLM pre-training. This paper demonstrates that the optimal composition of training data from different domains is scale-dependent, challenging the existing practice of determining optimal mixtures through small-scale experiments and directly applying them at larger scales. We derive an analytical model for the dependence of optimal weights on data scale and introduce AutoScale, a novel, practical approach for optimizing data compositions at potentially large training data scales. AutoScale first uses a principled optimization framework to find optimal compositions at smaller, feasible scales, then predicts optimal compositions at larger scales using our derived model. Our evaluation on GPT-2 Large and BERT pre-training demonstrates AutoScale's effectiveness in improving training convergence and downstream performance. In particular, for GPT-2 Large on RedPajama, AutoScale decreases validation perplexity 28% faster than baselines, with up to 38% speed-up over unweighted training, and achieves the best performance across downstream tasks. This work provides insights into the varying benefits of data sources across training scales for LLMs, contributing to the burgeoning research on scale-dependent data curation. Code is open-sourced.
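To make the two-step idea in the abstract concrete, below is a minimal Python sketch of the extrapolation step, assuming (purely for illustration) that each domain's optimal token count follows a power law in the total training budget. The domain names, budgets, and token counts are hypothetical placeholders, and the paper's actual analytical model and small-scale optimization framework may differ; this is not the authors' released code.

```python
# Illustrative sketch of scale-dependent composition prediction.
# Step 1 (assumed done elsewhere): optimal per-domain token counts are found
#   at several small, feasible training budgets.
# Step 2 (shown here): fit a per-domain power law in log space and extrapolate
#   the optimal quantities to a larger target budget, then renormalize into
#   domain weights.
import numpy as np

# Hypothetical small-scale results: total token budgets and the optimal
# per-domain token counts found at each budget.
small_budgets = np.array([1e8, 2e8, 4e8])
optimal_tokens = {
    "web":   np.array([6.0e7, 1.1e8, 2.0e8]),
    "code":  np.array([2.5e7, 5.5e7, 1.3e8]),
    "books": np.array([1.5e7, 3.4e7, 7.0e7]),
}

target_budget = 1e10  # larger scale at which we want the composition
predicted = {}
for domain, tokens in optimal_tokens.items():
    # Fit log(tokens) = b * log(budget) + log(a), i.e. tokens ≈ a * budget**b.
    b, log_a = np.polyfit(np.log(small_budgets), np.log(tokens), 1)
    predicted[domain] = np.exp(log_a) * target_budget ** b

# Renormalize the predicted quantities into a data composition (weights sum to 1).
total = sum(predicted.values())
weights = {domain: value / total for domain, value in predicted.items()}
print(weights)
```

Because the assumed growth exponents differ across domains, the predicted weights at the larger budget shift away from the small-scale optimum, which is the scale-dependence the paper highlights.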
- Feiyang Kang
- Yifan Sun
- Bingbing Wen
- Si Chen
- Dawn Song
- Rafid Mahmood
- Ruoxi Jia