Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance (2403.16952v1)

Published 25 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Pretraining data of LLMs composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing law to enable predicting the performance of large models trained on massive data under various mixtures with only small-scale training. Moreover, experimental results verify that our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama, reaching a performance comparable to the one trained for 48% more steps on the default mixture. Extending the application of data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and outlooks the potential for dynamic data schedules.

Authors (6)
  1. Jiasheng Ye (8 papers)
  2. Peiju Liu (5 papers)
  3. Tianxiang Sun (35 papers)
  4. Yunhua Zhou (27 papers)
  5. Jun Zhan (16 papers)
  6. Xipeng Qiu (257 papers)
Citations (28)

Summary

Data Mixing Laws: A Quantitative Framework for Optimizing Training Data Mixtures in LLMs

Introduction

The crafting of pretraining datasets for LLMs involves assembling text from diverse domains, each influencing the model's abilities in intricate ways. Existing approaches balance these mixtures with heuristics or qualitative strategies, leaving no precise, quantitative way to predict and optimize how mixture proportions affect model performance. This paper introduces such a quantitative framework, termed "data mixing laws," which predicts how variations in data mixture proportions affect LLM performance. Moreover, we propose a pipeline that nests existing scaling laws for training steps and model sizes with our data mixing laws, enabling accurate performance predictions for LLMs trained on massive data at a fraction of the computational cost.

Discovery of Data Mixing Laws

The paper begins by examining the relationship between training data mixture proportions and model validation loss, starting from two-domain mixtures and extending to multiple domains and multiple forms of validation sets. The key finding is that validation losses across domains and mixtures can be predicted by an exponential function of a linear combination of the mixture proportions, pointing to a quantifiable law governing this relationship. Fitting these laws on a handful of sample mixtures lets one predict LLM performance for unseen mixtures without exhaustive empirical training.
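
To make the functional form concrete, here is a minimal fitting sketch, assuming a law of the shape L(r) = c + k * exp(t1*r1 + ... + tM*rM) over domain proportions r. The toy mixtures, losses, and the use of scipy.optimize.curve_fit are illustrative choices, not the authors' exact fitting procedure.

```python
# Minimal sketch: fit an assumed mixing law L(r) = c + k * exp(R @ t)
# to a handful of (mixture proportions, validation loss) observations,
# then predict the loss of an unseen mixture before training on it.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(R, c, k, *t):
    """R: (n_samples, n_domains) mixture proportions; returns predicted loss."""
    t = np.asarray(t)
    return c + k * np.exp(R @ t)

# Hypothetical observations: proportions over 3 domains (rows sum to 1) and measured losses.
R_obs = np.array([
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.30, 0.50, 0.20],
    [0.20, 0.30, 0.50],
    [0.10, 0.60, 0.30],
    [0.40, 0.40, 0.20],
])
loss_obs = np.array([2.95, 2.90, 2.88, 3.02, 2.97, 2.89])

n_domains = R_obs.shape[1]
p0 = [2.0, 1.0] + [0.0] * n_domains                 # rough initial guess for (c, k, t_1..t_M)
params, _ = curve_fit(mixing_law, R_obs, loss_obs, p0=p0, maxfev=20000)

# Predict the loss of a mixture that was never trained on.
r_new = np.array([[0.50, 0.25, 0.25]])
print("predicted loss:", mixing_law(r_new, *params))
```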

Nested Application of Scaling Laws

Fitting a data mixing law directly would require multiple training runs over varied mixtures at the target model scale. To sidestep this cost, we nest the data mixing law with established scaling laws over model sizes and training steps: experiments conducted at affordable scales are extrapolated to the target scale, and the mixing law is then fit on the extrapolated losses. This pipeline predicts the outcomes of large-scale training on diverse mixtures while sharply reducing the need for computationally expensive runs.
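
The sketch below illustrates the extrapolation half of this idea under an assumed power-law form L(S) = E + B / S^beta for loss versus training steps; an analogous fit across model sizes supplies the other half of the nested prediction. The logged losses, initial guesses, and target step count are hypothetical.

```python
# Sketch of extrapolating a cheap small-scale run to a target training scale,
# assuming a power-law step scaling law L(S) = E + B / S**beta.
# One such extrapolated loss per candidate mixture would then feed the mixing-law fit.
import numpy as np
from scipy.optimize import curve_fit

def step_law(S, E, B, beta):
    """Validation loss as a function of training steps: L(S) = E + B / S**beta."""
    return E + B / np.power(S, beta)

def extrapolate_loss(steps, losses, target_steps):
    """Fit the step law on an early, affordable segment of training and
    predict the loss the run would reach at target_steps."""
    p, _ = curve_fit(step_law, steps, losses, p0=[3.0, 30.0, 0.5], maxfev=20000)
    return step_law(target_steps, *p)

# Hypothetical losses logged during a short run on one candidate mixture.
steps = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
losses_small = np.array([3.90, 3.62, 3.41, 3.27, 3.17])
print("predicted loss at 1e5 steps:", extrapolate_loss(steps, losses_small, 1e5))
```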

Experiments and Results

Our experiments support the reliability and utility of the data mixing laws and the nested scaling-law pipeline. Optimizing the training mixture of a 1B-parameter model trained on 100B tokens of RedPajama yields performance comparable to a model trained for 48% more steps on the default mixture. Moreover, applying data mixing laws to continual pretraining accurately predicts the critical mixture proportion that avoids catastrophic forgetting, suggesting a promising path toward dynamic data schedules and broader impacts on LLM training strategies.
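
As a rough illustration of the selection step, one can score candidate mixtures under an already-fitted law and keep the minimizer. The parameter values and the Dirichlet sampling of candidates below are hypothetical placeholders, not the paper's exact procedure.

```python
# Illustrative mixture selection with an assumed, already-fitted 3-domain mixing law
# L(r) = c + k * exp(r @ t); the parameter values here are hypothetical placeholders.
import numpy as np

c, k = 2.50, 0.40
t = np.array([0.20, -0.10, 0.50])

rng = np.random.default_rng(0)
candidates = rng.dirichlet(alpha=np.ones(3), size=4096)   # candidate proportions summing to 1
predicted = c + k * np.exp(candidates @ t)                # predicted validation losses
best = candidates[np.argmin(predicted)]
print("selected mixture:", np.round(best, 3), "| predicted loss:", round(float(predicted.min()), 4))
```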

Implications and Future Directions

This research takes a step towards a quantitative understanding of how training data mixtures influence LLM performance. The data mixing laws, complemented by a practical prediction pipeline, enable more informed, efficient, and strategic LLM pretraining. Looking ahead, the framework opens the door to refined data curation methods, including dynamic data scheduling, and invites deeper theoretical study of how data mixtures interact with learning dynamics in LLMs. More fine-grained, operationally defined notions of domains, together with a theoretical account of the empirical findings, are key avenues for future work.

Conclusion

The quantitative framework developed in this paper offers a new lens through which the impact of data mixtures on LLM pretraining can be predicted and optimized. By establishing data mixing laws and demonstrating their practical application through a nested scaling-law pipeline, this work supports a more informed approach to LLM data curation and points towards more efficient, purposeful model training strategies.
