Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws (2410.11820v1)
Abstract: The composition of pretraining data is a key determinant of foundation models' performance, but there is no standard guideline for allocating a limited computational budget across different data sources. Most current approaches rely either on extensive experiments with smaller models or on dynamic data adjustments that likewise require proxy models, both of which significantly increase workflow complexity and computational overhead. In this paper, we introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions online, concurrently with model training. Unlike existing techniques, ADO does not require external knowledge, proxy models, or modifications to the model update. Instead, ADO uses per-domain scaling laws to estimate each domain's learning potential during training and adjusts the data mixture accordingly, making it more scalable and easier to integrate. Experiments demonstrate that ADO achieves performance comparable to or better than prior methods while maintaining computational efficiency across different compute scales, offering a practical way to adjust the data distribution dynamically without sacrificing flexibility or increasing costs. Beyond its practical benefits, ADO also provides a new perspective on data collection strategies via scaling laws.
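The core mechanism described in the abstract, fitting a per-domain scaling law online and converting its slope into sampling weights, can be illustrated with a minimal sketch. The power-law form L_k(n) ≈ ε_k + β_k n^{-α_k}, the least-squares fit in log space, the learning-potential derivative, and the names `OnlineDomainScalingLaw`, `ado_mixture`, and the `smoothing` prior mix are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

class OnlineDomainScalingLaw:
    """Hypothetical per-domain scaling-law tracker: fits L(n) ~ beta * n^{-alpha}
    (plus an irreducible floor) to the running training loss of one domain,
    where n is the number of tokens drawn from that domain so far."""

    def __init__(self):
        self.tokens = []   # cumulative token counts at which loss was logged
        self.losses = []   # smoothed training loss at those points

    def update(self, n_tokens, loss):
        self.tokens.append(n_tokens)
        self.losses.append(loss)

    def learning_potential(self):
        """Approximate -dL/dn at the current token count: expected loss
        reduction per additional token from this domain."""
        if len(self.tokens) < 3:
            return 1.0  # too little history; treat the domain as promising
        L = np.array(self.losses, dtype=float)
        n = np.array(self.tokens, dtype=float)
        eps = 1e-3
        # Fit log(L - L_min + eps) ~ log(beta) - alpha * log(n) by least squares.
        slope, log_beta = np.polyfit(np.log(n), np.log(L - L.min() + eps), 1)
        alpha, beta = -slope, np.exp(log_beta)
        # Derivative of beta * n^{-alpha}, sign flipped; clip at zero.
        return max(alpha * beta * n[-1] ** (-alpha - 1), 0.0)


def ado_mixture(laws, prior, smoothing=0.1):
    """Turn per-domain learning potentials into a sampling distribution,
    mixed with a prior so no domain is starved entirely."""
    potentials = np.array([law.learning_potential() for law in laws])
    prior = np.array(prior, dtype=float)
    weights = prior if potentials.sum() == 0 else potentials / potentials.sum()
    weights = (1 - smoothing) * weights + smoothing * prior
    return weights / weights.sum()


if __name__ == "__main__":
    domains = ["web", "code", "books"]
    laws = {name: OnlineDomainScalingLaw() for name in domains}
    logged = {  # pretend loss was logged every 1M domain tokens
        "web":   [3.20, 3.10, 3.05, 3.02],  # flattening out -> low potential
        "code":  [3.50, 3.10, 2.80, 2.55],  # still improving -> high potential
        "books": [3.00, 2.90, 2.83, 2.78],
    }
    for name in domains:
        for step, loss in enumerate(logged[name], start=1):
            laws[name].update(step * 1_000_000, loss)
    weights = ado_mixture([laws[n] for n in domains], prior=[1 / 3] * 3)
    print(dict(zip(domains, weights.round(3))))
```

In this toy run, the domain whose fitted loss curve is still dropping steeply receives a larger sampling weight at the next step, while the smoothing term keeps every domain represented; the actual ADO update also incorporates additional terms described in the paper.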