How to Train Data-Efficient LLMs (2402.09668v1)

Published 15 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The training of LLMs is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.

Optimizing LLM Training: Advances in Data Efficiency

Introduction to Data Efficiency in LLMs

Training LLMs is computationally expensive, largely because of the volume of data consumed during pre-training. This paper studies strategies for improving the data efficiency of LLM pre-training, i.e., optimizing the trade-off between model quality and the data and compute spent to reach it. The researchers introduce two primary techniques: Ask-LLM, which assesses the quality of individual training examples, and Density sampling, which promotes diversity in the training data. Through a comprehensive comparison of 19 data samplers across hundreds of downstream evaluation tasks and pre-training runs, the paper shows that each method is the strongest in its respective category.

Key Contributions

The paper's contributions are manifold, presenting novel sampling methods and providing deep insights into the trade-offs and considerations in data-efficient LLM training:

  • Ask-LLM Sampling emerges as a remarkably effective technique, capable of enhancing model performance even when discarding up to 90% of the training data. This method involves using a smaller proxy LLM to evaluate and prioritize high-quality training examples.
  • Exhaustive Benchmarking of 19 sampling strategies offers a comprehensive overview of their comparative efficacy across a spectrum of downstream tasks, bringing valuable insights into the varying roles of coverage, quality, and sampling cost in LLM pre-training.
  • New Insights into the coverage-versus-quality trade-off in data selection: coverage-oriented sampling can recover the performance of training on the full dataset, whereas quality-based filtering with Ask-LLM can exceed it, clarifying under which circumstances each approach yields the most benefit.

Methodological Overview

Ask-LLM Sampling

The Ask-LLM technique leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to judge the quality of individual training examples: a smaller proxy model is asked directly whether an example would be useful for pre-training, and examples are prioritized by the resulting score. This approach not only identifies high-impact training examples but also speeds up convergence by up to 70%.
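A minimal sketch of how such a quality score could be computed with an off-the-shelf instruction-tuned model is shown below. The prompt wording, the choice of a Flan-T5 proxy, and the helper name score_example are illustrative assumptions rather than the authors' released code; the key idea is to read off the probability the proxy assigns to a "yes" answer.

```python
# Sketch of Ask-LLM-style quality scoring (illustrative, not the authors' code).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/flan-t5-large"  # assumed proxy: any instruction-tuned LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Assumed prompt wording; the paper uses a similar yes/no quality question.
PROMPT = (
    "###\n{text}\n###\n"
    "Does the previous paragraph contain informative signal for pre-training "
    "a large language model? Answer yes or no."
)

def score_example(text: str) -> float:
    """Return the proxy LM's probability of answering 'yes' for this example."""
    inputs = tokenizer(PROMPT.format(text=text), return_tensors="pt", truncation=True)
    # Decode a single step and inspect the distribution over the first output token.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    return probs[yes_id].item()

# Examples would then be ranked by this score and only the top fraction
# (e.g. the top 10%) retained for pre-training.
```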

Density Sampling

Density sampling targets coverage rather than quality: by modeling the distribution of the training data, it selects a diverse sample that broadens coverage of the latent topics within the training corpus.
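The paper implements this at web scale with sketch-based density estimation; the snippet below is a simplified illustration under stated assumptions: an exact Gaussian kernel density estimate over precomputed document embeddings, with inverse-propensity weighting as one plausible way to favour under-covered regions. The function name, bandwidth, and weighting scheme are assumptions for the sketch, not the paper's exact recipe.

```python
# Sketch of density-based diversity sampling over document embeddings.
import numpy as np
from sklearn.neighbors import KernelDensity

def density_sample(embeddings: np.ndarray, k: int, bandwidth: float = 1.0,
                   seed: int = 0) -> np.ndarray:
    """Pick k document indices, favouring low-density (under-covered) regions."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(embeddings)
    log_density = kde.score_samples(embeddings)            # log p(x_i) per document
    # Inverse-propensity weights, shifted for numerical stability
    # (proportional to 1 / p(x_i)).
    weights = np.exp(-(log_density - log_density.min()))
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(embeddings), size=k, replace=False, p=probs)

# Usage (assumed): doc_embeddings from any pre-trained sentence encoder.
# selected = density_sample(doc_embeddings, k=1000)
```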

Experimental Insights

The experimental findings are revealing, suggesting distinct advantages in employing LLM-based quality rating for data selection:

  • Performance Benefits: Models trained on Ask-LLM selected data consistently outperform those trained on the entirety of the dataset, showcasing the effectiveness of quality-focused data pruning.
  • Data Reduction without Performance Loss: Remarkably, the Ask-LLM method enables training LLMs with significantly reduced datasets—rejecting up to 90% of the data—while maintaining or even improving model performance.
  • Rapid Convergence: The rate of model convergence is notably accelerated when training on Ask-LLM filtered data, presenting a compelling case for its practical application in LLM training routines.

Implications and Future Directions

This research advances data-efficient LLM pre-training and opens avenues for more sustainable and cost-effective model development by showing that data requirements can be reduced without compromising model quality. Future work may refine LLM-based quality scoring mechanisms and extend these techniques to broader AI training settings. The strong results of the Ask-LLM and Density sampling methods point to substantial potential not only for reducing the computational cost of LLM training but also for improving the overall quality and efficiency of generative AI models.

Conclusions

This paper asserts the substantial benefits of targeted data selection strategies in training more efficient and potent LLMs. By prioritizing data quality and diversity through advanced sampling techniques, it is possible to significantly improve the efficiency of the training process. The success of the Ask-LLM and Density sampling methods presents an exciting frontier in the quest for more sustainable and effective AI model training, promising considerable reductions in computational demands while elevating model performance.

Acknowledgements and Impact

The paper concludes by acknowledging the collaborative efforts and contributions to its research, while also contemplating the broader impact of data-efficient LLM pre-training. The improvements in training efficiency not only hold potential for economic and environmental benefits but also chart a course towards more accessible and scalable AI technologies.

Authors (9)
  1. Noveen Sachdeva
  2. Benjamin Coleman
  3. Wang-Cheng Kang
  4. Jianmo Ni
  5. Lichan Hong
  6. Ed H. Chi
  7. James Caverlee
  8. Julian McAuley
  9. Derek Zhiyuan Cheng