
Balanced Data Sampling for Language Model Training with Clustering (2402.14526v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: Data plays a fundamental role in the training of LLMs. While attention has been paid to the collection and composition of datasets, determining the data sampling strategy in training remains an open question. Most LLMs are trained with a simple strategy: random sampling. However, this strategy ignores the unbalanced nature of the training data distribution, which can be sub-optimal. In this paper, we propose ClusterClip Sampling to balance the text distribution of training data for better model training. Specifically, ClusterClip Sampling utilizes data clustering to reflect the data distribution of the training set and balances common and rare samples during training based on the clustering results. A repetition clip operation is introduced to mitigate the overfitting caused by samples from certain clusters. Extensive experiments validate the effectiveness of ClusterClip Sampling, which outperforms random sampling and other cluster-based sampling variants across various training datasets and LLMs.
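To make the sampling idea concrete, below is a minimal sketch of a ClusterClip-style sampler, not the authors' implementation. It assumes documents have already been embedded with some text encoder, clusters them with k-means, draws samples uniformly over clusters rather than over documents, and retires a document from its cluster once it has been drawn a fixed number of times. The function names and the `n_clusters` and `clip_threshold` values are illustrative choices, not the paper's reported hyperparameters.

```python
# Sketch of a ClusterClip-style balanced sampler (illustrative, not the paper's code).
import random
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans


def cluster_documents(embeddings: np.ndarray, n_clusters: int = 100) -> np.ndarray:
    """Assign each document to a cluster; the clusters stand in for the data distribution."""
    return KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)


def clusterclip_sampler(labels: np.ndarray, n_samples: int, clip_threshold: int = 4):
    """Yield document indices, balancing clusters and clipping per-document repetitions."""
    pools = defaultdict(list)          # cluster id -> remaining document indices
    for idx, c in enumerate(labels):
        pools[c].append(idx)
    counts = defaultdict(int)          # document index -> times drawn so far

    for _ in range(n_samples):
        if not pools:
            break
        c = random.choice(list(pools))     # uniform over clusters: rare clusters are up-weighted
        idx = random.choice(pools[c])
        counts[idx] += 1
        if counts[idx] >= clip_threshold:  # repetition clip: retire over-sampled documents
            pools[c].remove(idx)
            if not pools[c]:
                del pools[c]
        yield idx
```

A training loop would feed the yielded indices to a data loader in place of uniform random sampling; the clip threshold trades off how aggressively rare clusters are repeated against the overfitting risk the paper targets.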
