DsDm: Model-Aware Dataset Selection with Datamodels (2401.12926v1)

Published 23 Jan 2024 in cs.LG and stat.ML

Abstract: When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can directly solve for: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks. Our resulting method greatly improves LLM (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2x compute multiplier over baseline methods.

Overview of "DsDm: Model-Aware Dataset Selection with Datamodels"

The paper "DsDm: Model-Aware Dataset Selection with Datamodels" by Engstrom, Feldmann, and Madry introduces a novel approach for dataset selection that promises to enhance the training of large-scale LLMs (LMs) by focusing on maximizing model performance rather than merely relying on traditional quality metrics. Through a detailed exploration of datamodels, this research presents a robust framework for training data optimization which diverges from standard methodologies that prioritize data similarity to preselected high-quality sources.

Core Contributions

The authors start by challenging the common practice of selecting training data based on similarity to high-quality datasets, such as Wikipedia, which they argue does not necessarily lead to improved model performance. They offer an innovative alternative by framing dataset selection as an optimization problem aimed at improving model outcomes across various target tasks. This process is realized through the development of "Dataset Selection with Datamodels" (DsDm).
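Concretely, this framing can be written as a subset-selection objective. The notation below paraphrases the setup described above rather than the paper's exact formulation: D is the candidate pool, A the learning algorithm, k the selection budget, and L_target the trained model's loss on the target tasks.

```latex
S^{\star} \;=\; \arg\min_{S \subseteq D,\; |S| = k} \; \mathbb{E}\!\left[\, L_{\mathrm{target}}\big(\mathcal{A}(S)\big) \,\right]
```

Solving this objective exactly would require retraining on every candidate subset, which is what motivates the datamodel approximation described next.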

DsDm stands out by explicitly modeling how the learning process uses candidate training subsets to predict on the target tasks. The approach centers on datamodels, which efficiently approximate the relationship between the choice of training subset and the resulting model performance. This makes it tractable to select the data subsets predicted to be most beneficial for performance on the target tasks.
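As a rough illustration (not the paper's implementation), a linear datamodel surrogate turns subset selection into a simple ranking rule: estimate, for each candidate datapoint, a coefficient giving its predicted effect on average target loss, then keep the k most loss-reducing candidates. The Python sketch below assumes such coefficients have already been estimated; the function name and numbers are hypothetical.

```python
import numpy as np

def select_with_linear_datamodel(weights, k):
    """Pick the k candidates predicted to most reduce target-task loss.

    `weights[i]` is assumed to be an estimated linear-datamodel coefficient:
    the predicted change in average target loss from including candidate i
    in the training set. Under a linear surrogate, minimizing predicted loss
    over fixed-size subsets reduces to taking the k most negative coefficients.
    """
    order = np.argsort(weights)  # ascending: most loss-reducing candidates first
    return order[:k]

# Toy usage with made-up numbers: 10 candidates, keep 4.
rng = np.random.default_rng(0)
theta = rng.normal(size=10)  # stand-in for estimated datamodel coefficients
chosen = select_with_linear_datamodel(theta, k=4)
print("selected candidate indices:", chosen)
```

This ranking view is what makes selection over web-scale candidate pools feasible: the expensive part is estimating the coefficients, not choosing the subset.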

Evaluation and Results

In a rigorous experimental evaluation, the authors demonstrate the efficacy of DsDm across multiple LM tasks, including SQuAD, LAMBADA, Jeopardy, and CS-Algorithms. Their method consistently outperformed traditional selection methods, which often failed to surpass randomly selected subsets of the data. Specifically, DsDm provided what the authors refer to as a "2x compute multiplier," meaning that LMs trained on DsDm-selected datasets performed as if they had been trained with twice the computational resources under random selection.
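One way to state the multiplier more precisely (notation mine, not the paper's): if perf_method(C) denotes downstream performance after training with compute budget C on data chosen by a given method, the reported result corresponds to

```latex
\mathrm{perf}_{\mathrm{DsDm}}(C) \;\approx\; \mathrm{perf}_{\mathrm{random}}(m \cdot C), \qquad m \approx 2
```

i.e., roughly matching the random-selection baseline's results with half the training compute.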

Furthermore, when tasked with improving broader model generalization, DsDm enhanced performance on a wide range of unseen benchmarks by choosing target tasks aligned with anticipated deployment scenarios. This indicates considerable potential for DsDm in real-world applications where model versatility across unknown future tasks is required.

Implications and Future Directions

The implications of this research are twofold. Practically, DsDm has the potential to significantly reduce computational requirements and resource expenditures while maintaining, or even improving, model quality. Theoretically, it provides insights into the importance of the model training process itself in data selection, challenging assumptions that high textual similarity equates to high utility.

Future work could extend to refining datamodel approximations or exploring their applications in diversely structured data environments, including reinforcement learning or complex multi-modal datasets. Additionally, extending this framework beyond LMs into other domains could reveal further efficiencies and improvements.

The paper opens a promising avenue in AI research, suggesting that a deeper understanding and integration of the model training process into dataset selection criteria can yield powerful results, moving the field towards more intelligent and resource-efficient AI development strategies.

Authors (3)
  1. Logan Engstrom (27 papers)
  2. Axel Feldmann (4 papers)
  3. Aleksander Madry (86 papers)