SlimPajama-DC: Understanding Data Combinations for LLM Training (2309.10818v3)

Published 19 Sep 2023 in cs.CL and cs.AI

Abstract: This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of LLMs using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We have termed our research as SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of LLMs. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within the single source of dataset) deduplications affect the performance of trained models. (2) Proportions of highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations on SlimPajama dataset and train individual ones using 1.3B Cerebras-GPT model with Alibi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as increasing data diversity is crucial after global deduplication) on a 7B model with large batch-size training. Our SlimPajama-DC models are available at: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC and the separate SlimPajama-DC datasets are available at: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC.


Summary

  • The paper demonstrates that global deduplication and diverse data combinations significantly enhance LLM training outcomes.
  • It details rigorous preprocessing using MinHashLSH to efficiently deduplicate over 1 trillion tokens for optimal dataset quality.
  • Results show that configurations with varied data sources achieve more stable training dynamics and superior benchmark accuracy.

Empirical Analysis of Data Combinations in LLM Training Using SlimPajama

The paper investigates the implications of data combinations for pretraining LLMs by leveraging the SlimPajama dataset. SlimPajama is derived from the larger RedPajama corpus through rigorous deduplication that targets both global and local redundancies to improve training-data efficacy. The study, termed SlimPajama-DC, evaluates how different data combinations affect LLM training outcomes, providing both practical insights and broader guidance for dataset construction in machine learning.

Dataset Composition and Preprocessing

SlimPajama comprises 627 billion tokens drawn from sources such as CommonCrawl, Wikipedia, GitHub, and books. It is produced by comprehensive deduplication of the initial 1.2 trillion RedPajama tokens, removing redundancies that can impede model generalization and training efficiency. Two primary preprocessing steps are highlighted: the removal of low-length documents and a global deduplication pass that eliminates duplicate entries across the constituent sources. The paper specifies MinHashLSH as the deduplication technique, scaled to trillion-token corpora through careful memory management and parallel processing.
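The authors' deduplication pipeline operates at trillion-token scale; the snippet below is only a minimal sketch of MinHashLSH-style near-duplicate filtering using the open-source datasketch library, with an assumed shingle size and Jaccard threshold rather than the paper's exact settings.

```python
# Minimal sketch of MinHashLSH near-duplicate filtering (illustrative only;
# the paper's pipeline is a heavily optimized, parallelized variant of this idea).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, shingle: int = 13) -> MinHash:
    """Build a MinHash signature from character shingles of one document."""
    sig = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - shingle + 1)):
        sig.update(text[i:i + shingle].encode("utf-8"))
    return sig

# Jaccard threshold for calling two documents near-duplicates (assumed value).
lsh = MinHashLSH(threshold=0.8, num_perm=128)

documents = {"doc-0": "the quick brown fox jumps over the lazy dog",
             "doc-1": "the quick brown fox jumps over the lazy dog!"}  # toy corpus
kept = []
for doc_id, text in documents.items():
    sig = minhash_signature(text)
    if lsh.query(sig):            # near-duplicate of an already-kept document: drop it
        continue
    lsh.insert(doc_id, sig)       # otherwise keep it and index its signature
    kept.append(doc_id)
print(kept)
```

In practice the same idea is applied over sharded corpora with signatures computed in parallel and stored, rather than document-by-document as shown here.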

Data Combination Configurations

The empirical setup assesses six distinct data configurations drawn from SlimPajama, each varying the proportions of its constituent sources. Training uses a 1.3B-parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activations. Configurations incorporating diverse datasets generally outperform those heavily weighted towards a single source, underscoring the benefits of data diversification. Global deduplication in SlimPajama also consistently yields better training efficiency and model quality than training on locally deduplicated data.
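To make the notion of a data-combination configuration concrete, the sketch below samples training documents from several sources according to fixed mixture weights; the source names and proportions are illustrative placeholders, not the paper's actual six configurations.

```python
import random
from typing import Dict, Iterator, List

def mix_sources(sources: Dict[str, List[str]], weights: Dict[str, float],
                seed: int = 0) -> Iterator[str]:
    """Yield documents by picking a source with the configured probability,
    then drawing a document from that source (sampling with replacement)."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        src = rng.choices(names, weights=probs, k=1)[0]
        yield rng.choice(sources[src])

# Illustrative proportions only -- not one of the paper's actual configurations.
weights = {"commoncrawl": 0.60, "wikipedia": 0.15, "github": 0.15, "books": 0.10}
sources = {name: [f"{name}-doc-{i}" for i in range(3)] for name in weights}
stream = mix_sources(sources, weights)
batch = [next(stream) for _ in range(8)]
print(batch)
```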

Evaluation Metrics and Outcomes

The paper employs several benchmarks from the Eleuther LLM Evaluation Harness to gauge model performance, including ARC, HellaSwag, MMLU, and TruthfulQA. Among the configurations, DC-6, which integrates a diverse range of datasets, achieves the highest average accuracy. This underlines the importance of comprehensive data amalgamation as a strategy for improving model generalization and predictive accuracy.
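A hedged sketch of this style of evaluation through the harness's Python API follows; it assumes a recent lm-eval release (v0.4 or later), where task names and result keys can differ across versions, and the checkpoint path is a placeholder.

```python
# Sketch of benchmark evaluation with the EleutherAI lm-evaluation-harness.
# Assumes lm-eval >= 0.4; task names and result keys can vary across versions.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                          # Hugging Face causal-LM backend
    model_args="pretrained=path/to/slimpajama-dc-1.3b",  # placeholder checkpoint path
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    batch_size=8,
)

# Average whatever accuracy metrics the harness reports for these tasks.
per_task = results["results"]
accs = [m["acc,none"] for m in per_task.values() if "acc,none" in m]
print("mean accuracy:", sum(accs) / max(1, len(accs)))
```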

Training Dynamics and Loss Analysis

Loss curves indicate that greater data diversity within a configuration is associated with more stable training loss and smoother convergence. This supports the hypothesis that data diversity influences training dynamics and may buffer models against the overfitting tendencies associated with less diverse datasets.
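As a rough, illustrative proxy for this stability claim (not a metric defined in the paper), one can compare the rolling standard deviation of training-loss traces across configurations:

```python
import numpy as np

def rolling_loss_std(losses, window: int = 100) -> np.ndarray:
    """Rolling standard deviation of a loss trace; lower values suggest a
    smoother, more stable curve (a rough proxy, not the paper's analysis)."""
    losses = np.asarray(losses, dtype=np.float64)
    return np.array([losses[max(0, i - window + 1): i + 1].std()
                     for i in range(len(losses))])

# Toy comparison of two hypothetical loss traces with different noise levels.
rng = np.random.default_rng(0)
steps = np.arange(2000)
smooth = 2.5 * np.exp(-steps / 800) + 0.01 * rng.standard_normal(2000)
noisy = 2.5 * np.exp(-steps / 800) + 0.05 * rng.standard_normal(2000)
print(rolling_loss_std(smooth)[-1], rolling_loss_std(noisy)[-1])
```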

Implications and Future Directions

The findings emphasize that a diversified set of training data with global deduplication strategies is pivotal in optimizing LLM training. This approach not only contributes to reducing computational overhead but also facilitates superior model generalization and robustness. As the field advances towards increasingly large-scale models, the SlimPajama-DC framework presents a valuable methodology for dataset construction aimed at efficient and effective AI training.

Practical refinements such as varying weight decay further illustrate how overfitting can be mitigated, opening opportunities for future research into adaptive training regimes and data-centric methodologies suited to large-batch neural network training. Furthermore, the incorporation of specialized datasets into mainstream training workflows signals avenues for continued research into refining and expanding the adaptability of LLMs.
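As a concrete illustration of the weight-decay lever mentioned above, the snippet below configures decoupled weight decay via PyTorch's AdamW; the learning rate and decay values are placeholders, not the hyperparameters reported in the paper.

```python
import torch
from torch import nn

# Stand-in module; in practice this would be the 1.3B or 7B language model.
model = nn.Linear(4096, 4096)

# Decoupled weight decay (AdamW). The lr and weight_decay values below are
# placeholders; raising weight_decay is one lever for curbing overfitting
# in large-batch training.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.6e-4, weight_decay=0.1)
```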
