
Data Management For Training Large Language Models: A Survey (2312.01700v3)

Published 4 Dec 2023 in cs.CL and cs.AI

Abstract: Data plays a fundamental role in training LLMs. Efficient data management, particularly in formulating a well-suited training dataset, is significant for enhancing model performance and improving training efficiency during pretraining and supervised fine-tuning stages. Despite the considerable importance of data management, the underlying mechanisms of current prominent practices are still unknown. Consequently, the exploration of data management has attracted more and more attention among the research community. This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs, covering various aspects of data management strategy design. Looking into the future, we extrapolate existing challenges and outline promising directions for development in this field. Therefore, this survey serves as a guiding resource for practitioners aspiring to construct powerful LLMs through efficient data management practices. The collection of the latest papers is available at https://github.com/ZigeW/data_management_LLM.

Summary

  • The paper details how data curation techniques, including deduplication and toxicity filtering, critically impact both pretraining and fine-tuning performance.
  • It demonstrates that assembling diverse, high-quality datasets is essential for enhancing LLM capabilities and achieving effective model generalization.
  • The study highlights future directions, such as developing adaptable multimodal data management frameworks to further advance LLM performance.

The evolution of LLMs has been marked by significant advances in natural language processing capabilities. Crucial to both the training and fine-tuning of these models is data management, a task that is essential yet poses notable challenges.

The training of LLMs involves two primary stages: pretraining and supervised fine-tuning. In the pretraining phase, the goal is to assemble high-quality, heterogeneous datasets that span diverse domains, which is key to equipping models with broad capabilities. However, detailed documentation of how such pretraining data is constructed remains scarce for many leading LLMs. The supervised fine-tuning phase, in turn, relies on carefully curated instruction datasets to enhance LLMs' performance on specific tasks.

Emerging research concentrates on the data management choices that affect model performance: data quantity, data quality, domain and task composition, and end-to-end management systems. For instance, while scaling laws relate model size to data quantity, the performance impact of repeatedly reusing data remains debated. Deduplication and quality filtering form crucial parts of data management pipelines, with toxicity filtering particularly important for avoiding undesired text generation. Diverse domain composition is also crucial, as it contributes to broader functional abilities in LLMs.
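To make this pipeline concrete, below is a minimal, self-contained Python sketch that combines exact and near-duplicate removal with simple heuristic quality rules. It is an illustration only: the word-level shingling, the 0.8 Jaccard threshold, and the `passes_quality_heuristics` rules (document length and mean word length) are assumptions loosely inspired by Gopher-style filters, not the exact procedures of any system covered by the survey.

```python
import hashlib
import re

def shingles(text: str, n: int = 5) -> set:
    """Word-level n-gram shingles used for near-duplicate detection."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def passes_quality_heuristics(text: str) -> bool:
    """Toy quality rules: reject very short or oddly tokenized documents."""
    words = re.findall(r"\w+", text)
    if len(words) < 50:
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    return 3 <= mean_word_len <= 10

def deduplicate_and_filter(docs: list[str], sim_threshold: float = 0.8) -> list[str]:
    """Keep documents that pass the quality filter and are neither exact
    nor near-duplicates of an already kept document."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate of a kept document
        if not passes_quality_heuristics(doc):
            continue  # fails heuristic quality rules
        sh = shingles(doc)
        if any(jaccard(sh, prev) >= sim_threshold for prev in kept_shingles):
            continue  # near-duplicate of a kept document
        seen_hashes.add(digest)
        kept_shingles.append(sh)
        kept.append(doc)
    return kept
```

Note that the pairwise Jaccard comparison is quadratic in the number of kept documents; web-scale pipelines instead rely on MinHash/LSH or suffix-array matching, plus learned quality and toxicity classifiers, to keep filtering tractable.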

LLMs' fine-tuning performance is closely tied to the quality of instruction data. Studies show that high-quality instruction datasets with diverse, complex prompts lead to better fine-tuning outcomes, and that the task composition used during fine-tuning is key to generalization. Nevertheless, the effects of instruction datasets on model performance are often unclear, which makes it difficult for practitioners to choose an appropriate data management strategy for fine-tuning.
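The "quality over quantity" selection idea behind approaches such as LIMA and AlpaGasus can be illustrated with a toy quality-then-diversity loop. Everything here is a hedged sketch: `quality_score` is a hypothetical placeholder for whatever rating signal is actually used (an LLM grader, a learned quality model, or human annotation), and the bag-of-words cosine check merely stands in for the embedding-based diversity measures applied in practice.

```python
import math
import re
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector used as a rough lexical-similarity proxy."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def quality_score(example: dict) -> float:
    """Hypothetical placeholder scorer: longer, non-trivial responses score
    higher. In practice this would be an LLM grader or a learned model."""
    return min(len(example["response"].split()) / 100.0, 1.0)

def select_instructions(pool: list[dict], budget: int, max_sim: float = 0.7) -> list[dict]:
    """Greedily keep high-scoring examples, skipping any instruction that is
    too lexically similar to one already selected."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    selected, selected_bows = [], []
    for ex in ranked:
        vec = bow(ex["instruction"])
        if any(cosine(vec, prev) > max_sim for prev in selected_bows):
            continue  # too similar to an already selected instruction
        selected.append(ex)
        selected_bows.append(vec)
        if len(selected) == budget:
            break
    return selected
```

A call such as `select_instructions(pool, budget=1000)` would return up to 1,000 high-scoring, mutually dissimilar (instruction, response) pairs; real curation pipelines replace both the scorer and the similarity check with far stronger models.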

A comprehensive overview such as this survey offers valuable guidance for practitioners building powerful LLMs, particularly in navigating the complexities of pretraining data management, instruction data curation, and future research directions in this field.

One notable future direction is the development of a general data management framework that adapts across diverse LLM applications. As LLMs extend beyond text processing to multimodal settings involving visual and audio data, the need for multimodal data management strategies is also increasing.

In summary, as the research community delves deeper into the intricacies of data management for LLMs, the prospect of significantly enhanced model performance and efficiency becomes increasingly tangible. This continuous pursuit of improved strategies and methodologies promises to further the capabilities and applications of LLMs in the field of artificial intelligence.
