Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance (2403.16952v1)
Abstract: The pretraining data of LLMs comprises multiple domains (e.g., web text, academic papers, code), whose mixture proportions crucially impact the competence of the resulting models. While existing efforts rely on heuristics or qualitative strategies to tune the proportions, we discover that model performance is quantitatively predictable as a function of the mixture proportions, a relationship we refer to as the data mixing laws. Fitting such functions on a small set of sampled mixtures reveals model performance on unseen mixtures before any actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to predict the performance of large models trained on massive data under various mixtures using only small-scale training. Experimental results verify that our method effectively optimizes the training mixture of a 1B-parameter model trained on 100B tokens of RedPajama, reaching performance comparable to a model trained for 48% more steps on the default mixture. Extending data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and points to the potential of dynamic data schedules.
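The core workflow, fitting a functional form on a handful of sampled mixtures and then predicting the loss of an unseen mixture before training on it, can be illustrated with a short sketch. The snippet below assumes an exponential mixing-law form L(r) = c + k·exp(Σ_j t_j r_j) over the domain proportions r; the function names, the three-domain setup, and all numbers are illustrative assumptions with synthetic data, not the paper's released code or results.

```python
# A minimal sketch of fitting an assumed exponential data mixing law
#   L(r) = c + k * exp(sum_j t_j * r_j)
# on sampled mixtures, then predicting the loss of an unseen mixture.
# All values below are synthetic and purely illustrative.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def mixing_law(R, c, k, *t):
    """Predicted validation loss for mixture proportions R (n_samples x n_domains)."""
    return c + k * np.exp(R @ np.asarray(t))

# Synthetic "observed" losses from small-scale runs on 20 sampled mixtures
# over three domains (e.g., web, academic, code); proportions sum to 1.
R_train = rng.dirichlet(np.ones(3), size=20)
true_params = (1.5, 1.0, -0.8, -0.3, -1.2)      # c, k, t_1..t_3 (made up)
L_train = mixing_law(R_train, *true_params) + rng.normal(0, 0.01, size=20)

# Fit the law's parameters on the sampled mixtures.
p0 = (1.0, 1.0, 0.0, 0.0, 0.0)
params, _ = curve_fit(mixing_law, R_train, L_train, p0=p0, maxfev=10000)

# Predict performance on an unseen candidate mixture before any actual run.
candidate = np.array([[0.6, 0.1, 0.3]])
print("predicted loss:", mixing_law(candidate, *params))
```

In the full pipeline described in the abstract, such fits would be nested with scaling laws over training steps and model sizes, so that laws fitted on small-scale runs extrapolate to the target model and token budget.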
Authors: Jiasheng Ye, Peiju Liu, Tianxiang Sun, Yunhua Zhou, Jun Zhan, Xipeng Qiu