DsDm: Model-Aware Dataset Selection with Datamodels (2401.12926v1)
Abstract: When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior. However, in practice the opposite can often happen: we find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data. To develop better methods for selecting data, we start by framing dataset selection as an optimization problem that we can directly solve for: given target tasks, a learning algorithm, and candidate data, select the subset that maximizes model performance. This framework thus avoids handpicked notions of data quality, and instead explicitly models how the learning process uses training datapoints to predict on the target tasks. Our resulting method greatly improves language model (LM) performance on both pre-specified tasks and previously unseen tasks. Specifically, choosing target tasks representative of standard LM problems and evaluating on diverse held-out benchmarks, our selected datasets provide a 2x compute multiplier over baseline methods.
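To make the framing concrete: with the target tasks fixed, the goal is to pick the training subset $S$ of a given size that minimizes expected target loss after training, roughly $S^\star = \arg\min_{S \subseteq D,\,|S|=k} \mathbb{E}\,[\mathcal{L}_{\text{target}}(\mathcal{A}(S))]$. If that loss is approximated by a datamodel that is linear in the subset's indicator vector, selection reduces to taking the $k$ candidates with the most favorable estimated weights. The sketch below illustrates only that reduction; the function name, weight convention, and toy data are our assumptions, not the paper's reference implementation.

```python
import numpy as np

def select_topk_by_datamodel(weights: np.ndarray, k: int) -> np.ndarray:
    """Pick the k candidate datapoints predicted to help the target tasks most,
    under a linear-datamodel approximation (illustrative sketch, not DsDm's code).

    weights: shape (num_candidates,); weights[i] approximates the change in
        expected target-task loss from including candidate i in the training
        set (more negative = more helpful, by assumption here).
    Returns the indices of the selected subset.
    """
    # With a linear surrogate for the target loss, minimizing predicted loss
    # over fixed-size subsets reduces to taking the k smallest weights.
    return np.argsort(weights)[:k]

# Toy usage: 10 candidates, select the 3 with the best estimated contribution.
rng = np.random.default_rng(0)
toy_weights = rng.normal(size=10)
print(select_topk_by_datamodel(toy_weights, k=3))
```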
Authors: Logan Engstrom, Axel Feldmann, Aleksander Madry