Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws (2410.11820v1)

Published 15 Oct 2024 in cs.LG

Abstract: The composition of pretraining data is a key determinant of foundation models' performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches either rely on extensive experiments with smaller models or dynamic data adjustments that also require proxy models, both of which significantly increase the workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO can achieve comparable or better performance than prior methods while maintaining computational efficiency across different computation scales, offering a practical solution for dynamically adjusting data distribution without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.

References (75)
  1. Semdedup: Data-efficient learning at web-scale through semantic deduplication. arXiv preprint arXiv:2303.09540, 2023.
  2. Ranking via sinkhorn propagation. arXiv preprint arXiv:1106.1925, 2011.
  3. Efficient online data mixing for language model pre-training. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models, 2023.
  4. A survey on data selection for language models. arXiv preprint arXiv:2402.16827, 2024.
  5. Discrimination transfer along a pitch continuum. Journal of Experimental Psychology, 48(4):241, 1954.
  6. Deep learning through the lens of example difficulty. Advances in Neural Information Processing Systems, 34:10876–10889, 2021.
  7. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp.  41–48, 2009.
  8. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
  9. Piqa: Reasoning about physical commonsense in natural language. In AAAI Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:208290939.
  10. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745, 2022.
  11. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.
  12. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
  13. Skill-it! a data-driven skills framework for understanding and training language models. Advances in Neural Information Processing Systems, 36, 2024.
  14. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457, 2018. URL https://api.semanticscholar.org/CorpusID:3922816.
  15. The DeepMind JAX Ecosystem, 2020. URL http://github.com/google-deepmind.
  16. The road less scheduled. arXiv preprint arXiv:2405.15682, 2024.
  17. Dsdm: Model-aware dataset selection with datamodels. arXiv preprint arXiv:2401.12926, 2024.
  18. Bad students make great teachers: Active learning accelerates large-scale visual understanding. arXiv preprint arXiv:2312.05328, 2023.
  19. Doge: Domain reweighting with generalization estimation. ArXiv, abs/2310.15393, 2023. URL https://api.semanticscholar.org/CorpusID:264439382.
  20. Reverse curriculum generation for reinforcement learning. ArXiv, abs/1707.05300, 2017. URL https://api.semanticscholar.org/CorpusID:19181872.
  21. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
  22. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  23. A framework for few-shot language model evaluation, 07 2024. URL https://zenodo.org/records/12608602.
  24. Data mixing made efficient: A bivariate scaling law for language model pretraining. arXiv preprint arXiv:2405.14908, 2024.
  25. Charles Albert Eric Goodhart. Monetary control—the british experience. In Monetary Conditions for Economic Recovery, pp.  59–84. Springer, 1985.
  26. Scaling laws for data filtering–data curation cannot be compute agnostic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22702–22711, 2024.
  27. Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. PMLR, 2017.
  28. Towards optimal learning of language models. arXiv preprint arXiv:2402.17759, 2024.
  29. Scaling laws and compute-optimal training beyond fixed training durations. arXiv preprint arXiv:2405.18392, 2024.
  30. Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487, 2022.
  31. Training compute-optimal large language models. ArXiv, abs/2203.15556, 2022. URL https://api.semanticscholar.org/CorpusID:247778764.
  32. Diversified batch selection for training acceleration. arXiv preprint arXiv:2406.04872, 2024.
  33. Marcus Hutter. Learning curve theory. arXiv preprint arXiv:2102.04074, 2021.
  34. CASED: curriculum adaptive sampling for extreme data imbalance. CoRR, abs/1807.10819, 2018. URL http://arxiv.org/abs/1807.10819.
  35. No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models, July 2023. URL http://arxiv.org/abs/2307.06440. arXiv:2307.06440 [cs].
  36. Sgd on neural networks learns functions of increasing complexity. Advances in neural information processing systems, 32, 2019.
  37. Get more for less: Principled data selection for warming up fine-tuning in llms. In ICLR, 2024. URL https://openreview.net/forum?id=QmYNBVukex.
  38. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  39. Equinox: neural networks in JAX via callable PyTrees and filtered transformations. Differentiable Programming workshop at Neural Information Processing Systems 2021, 2021.
  40. Douglas H Lawrence. The transfer of a discrimination along a continuum. Journal of Comparative and Physiological Psychology, 45(6):511, 1952.
  41. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499, 2021.
  42. Logiqa 2.0—an improved dataset for logical reasoning in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2947–2962, 2023. URL https://api.semanticscholar.org/CorpusID:259515154.
  43. I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  44. At which training stage does code data help LLMs reasoning? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=KIPJKST4gw.
  45. Tilting the odds at the lottery: the interplay of overparameterisation and curricula in neural networks. arXiv preprint arXiv:2406.01589, 2024.
  46. Learning latent permutations with gumbel-sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018.
  47. Prioritized training on points that are learnable, worth learning, and not yet learnt. In International Conference on Machine Learning, pp. 15630–15649. PMLR, 2022.
  48. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.
  49. The lambada dataset: Word prediction requiring a broad discourse context. ArXiv, abs/1606.06031, 2016. URL https://api.semanticscholar.org/CorpusID:2381275.
  50. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  51. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
  52. Infobatch: Lossless training speed up by unbiased dynamic data pruning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=C61sk5LsK6.
  53. An adversarial winograd schema challenge at scale. 2019. URL https://api.semanticscholar.org/CorpusID:199370376.
  54. The cost of training NLP models: A concise overview. CoRR, abs/2004.08900, 2020. URL https://arxiv.org/abs/2004.08900.
  55. Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  56. Richard Sinkhorn. A relationship between arbitrary positive matrices and doubly stochastic matrices. The annals of mathematical statistics, 35(2):876–879, 1964.
  57. Burrhus F Skinner. Reinforcement today. American Psychologist, 13(3):94, 1958.
  58. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama. https://cerebras.ai/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama, 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.
  59. Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35:19523–19536, 2022.
  60. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
  61. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. ArXiv, abs/1807.07532, 2018. URL https://api.semanticscholar.org/CorpusID:49882848.
  62. Scaling law with learning rate annealing, 2024. URL https://arxiv.org/abs/2408.11029.
  63. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  64. Attention is all you need. In Neural Information Processing Systems, 2017. URL https://api.semanticscholar.org/CorpusID:13756489.
  65. Rethinking data shapley for data selection tasks: Misleads and merits. arXiv preprint arXiv:2405.03875, 2024.
  66. Crowdsourcing multiple choice science questions. ArXiv, abs/1707.06209, 2017. URL https://api.semanticscholar.org/CorpusID:1553193.
  67. When do curricula work? In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=tW4QEInpni.
  68. Sheared LLaMA: Accelerating language model pre-training via structured pruning. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=09iOdaeOzp.
  69. LESS: Selecting influential data for targeted instruction tuning. In Forty-first International Conference on Machine Learning, 2024b. URL https://openreview.net/forum?id=PG5fV50maR.
  70. Reparameterizable subset sampling via continuous relaxations, 2021. URL https://arxiv.org/abs/1901.10517.
  71. Doremi: Optimizing data mixtures speeds up language model pretraining. Advances in Neural Information Processing Systems, 36, 2024.
  72. Data mixing laws: Optimizing data mixtures by predicting language modeling performance. arXiv preprint arXiv:2403.16952, 2024.
  73. Hellaswag: Can a machine really finish your sentence? In Annual Meeting of the Association for Computational Linguistics, 2019. URL https://api.semanticscholar.org/CorpusID:159041722.
  74. Root mean square layer normalization. Advances in Neural Information Processing Systems, 32, 2019.
  75. midGPT: a simple and hackable repository for llm pretraining. 2023. URL https://github.com/AllanYangZhou/midGPT.

Summary

  • The paper introduces ADO, a dynamic algorithm that optimizes data sampling in real time by leveraging per-domain scaling laws.
  • It employs domain-specific scaling laws to predict loss trajectories, emphasizing data with higher learning potential during training.
  • Empirical tests on up to 1.3B parameter models show ADO enhances zero-shot accuracy and achieves performance on par with or better than existing methods.

Overview of Adaptive Data Optimization in Foundation Model Training

The paper presents Adaptive Data Optimization (ADO), an approach to choosing the data mixture used to pretrain foundation models. Because foundation models are large neural networks pretrained on extensive datasets, the composition of their pretraining data is a key determinant of performance. The proposed method sidesteps the computational overhead of existing data selection strategies, which typically rely on proxy models or multi-stage training.

Key Concepts and Methodology

The primary innovation of this work is ADO, which optimizes the data distribution online, concurrently with model training. Unlike conventional methods that require extensive preliminary experiments with smaller models or depend on proxy models, ADO uses per-domain scaling laws to estimate the learning potential of each data domain in real time. The algorithm is streamlined and scalable, integrating into existing training workflows without requiring any change to the model update.
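To make the integration point concrete, the sketch below shows a hypothetical training loop in which an ADO-style controller only decides which domain the next batch is drawn from and receives the resulting loss; the optimizer step itself is untouched. The DummyModel, stream callables, and hook names are illustrative stand-ins, not the paper's actual interfaces.

```python
# Minimal sketch (assumed interfaces, not the paper's code): the data-mixture
# controller sits entirely outside the model update.
import random

class DummyModel:
    """Stand-in for a real model; train_step would normally run forward/backward/optimizer."""
    def train_step(self, batch):
        return random.random()  # pretend loss

def training_loop(model, domain_streams, sample_domain, record_loss, num_steps):
    """domain_streams: dict name -> callable returning the next batch.
    sample_domain / record_loss: hooks an ADO-style controller would supply."""
    for step in range(num_steps):
        k = sample_domain()                           # pick a domain under the current mixture
        loss = model.train_step(domain_streams[k]())  # ordinary update, unchanged by the sampler
        record_loss(k, loss, step)                    # feed the per-domain loss back to the controller

# Toy usage: a fixed uniform choice stands in for the adaptive policy.
streams = {"web": lambda: "web batch", "code": lambda: "code batch"}
training_loop(DummyModel(), streams,
              sample_domain=lambda: random.choice(list(streams)),
              record_loss=lambda k, loss, step: None,
              num_steps=5)
```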

Domain Scaling Laws: ADO fits a domain-specific scaling law, in power-law form, to predict the model's loss trajectory on each domain. These fits estimate how much each domain can still contribute to learning, accounting for its remaining reducible loss and the speed at which that loss is decreasing.
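As a rough illustration of what such a per-domain fit might look like, the sketch below assumes a power law of the form L_k(t) = eps_k + beta_k * t^(-alpha_k) (irreducible loss plus a decaying term) and uses its time derivative as a proxy for remaining learning potential. The parameterization, fitting routine, and toy data are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: fit a per-domain power law to a loss history and read off its slope.
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, eps, beta, alpha):
    """Assumed per-domain scaling law: irreducible loss eps plus a decaying power-law term."""
    return eps + beta * np.power(t, -alpha)

def fit_domain_scaling_law(steps, losses):
    """Fit (eps, beta, alpha) for one domain from its observed loss curve."""
    p0 = (losses.min(), losses[0] - losses.min() + 1e-3, 0.5)  # rough initial guess
    params, _ = curve_fit(power_law, steps, losses, p0=p0, maxfev=10_000)
    return params

def learning_potential(t, eps, beta, alpha):
    """Negative time-derivative of the fitted law: how fast the loss is still falling."""
    return alpha * beta * np.power(t, -(alpha + 1.0))

# Toy usage: synthetic loss curve for one domain.
steps = np.arange(1, 2001, dtype=float)
losses = 2.0 + 3.0 * steps**-0.4 + 0.01 * np.random.default_rng(0).standard_normal(len(steps))
eps, beta, alpha = fit_domain_scaling_law(steps, losses)
print(eps, beta, alpha, learning_potential(steps[-1], eps, beta, alpha))
```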

Adaptive Distribution: The method continuously refines the data sampling distribution, placing more emphasis on domains with greater learning potential, that is, higher learning speed and more remaining reducible loss. The approach is akin to curriculum learning, but the curriculum is constructed online and automatically throughout training.
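A simplified sketch of this reweighting step is shown below: per-domain learning-potential estimates are combined with a prior mixture and smoothed so that no domain is starved. The specific weighting, temperature, and smoothing scheme are illustrative choices, not the paper's exact update rule.

```python
# Hedged sketch: turn learning-potential estimates into a sampling distribution.
import numpy as np

def mixture_from_potentials(potentials, prior, temperature=1.0, smoothing=0.1):
    """potentials: array of learning-potential estimates, one per domain.
    prior: the natural/original token distribution over domains."""
    potentials = np.asarray(potentials, dtype=float)
    prior = np.asarray(prior, dtype=float)
    scores = prior * np.power(np.clip(potentials, 1e-12, None), 1.0 / temperature)
    policy = scores / scores.sum()
    # Interpolate with the prior so no domain's probability collapses to zero.
    return (1.0 - smoothing) * policy + smoothing * prior

# Toy usage with three domains; potentials could come from the fitted scaling laws above.
prior = np.array([0.5, 0.3, 0.2])
potentials = np.array([1e-4, 5e-4, 2e-4])
print(mixture_from_potentials(potentials, prior))
```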

Empirical Results

ADO was evaluated on the Pile dataset with LLMs of up to 1.3 billion parameters. Across various benchmarks, it achieved performance comparable or superior to prior techniques such as DoReMi and ODM, and it did so with minimal additional computational time, remaining efficient and scalable across model sizes.

ADO's ability to enhance model performance was apparent in multiple dimensions:

  • Validation Loss: While ADO slightly underperforms on the Pile validation set, it improves the validation loss on SlimPajama and FineWeb subsets, suggesting a tilt towards higher-quality data selection even without explicit curation.
  • Zero-shot Performance: ADO's dynamic data mixture significantly improves zero-shot accuracy across diverse benchmarks, indicating better generalization capabilities.

Implications and Future Directions

This work has both practical and theoretical implications. Practically, ADO integrates cleanly into existing training pipelines, offering an accessible mechanism for improving model performance without substantial additional compute. Theoretically, it highlights the potential of online adaptive data selection and raises new questions about its application across broader AI domains.

Future research could extend ADO to larger models and datasets, testing its adaptability and efficacy in more complex settings. Incorporating richer scaling laws that account for inter-domain interactions, or for adaptive learning-rate schedules, could further refine the data selection. Overall, ADO is a notable step toward automating and optimizing data selection for large-scale pretraining.