SlimPajama-DC: Understanding Data Combinations for LLM Training (2309.10818v3)

Published 19 Sep 2023 in cs.CL and cs.AI

Abstract: This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the pretraining of LLMs using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T token RedPajama dataset contributed by Together. We have termed our research as SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of LLMs. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different sources of datasets) and local (within the single source of dataset) deduplications affect the performance of trained models. (2) Proportions of highly-deduplicated multi-source datasets in the combination. To study this, we construct six configurations on SlimPajama dataset and train individual ones using 1.3B Cerebras-GPT model with Alibi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as increasing data diversity is crucial after global deduplication) on a 7B model with large batch-size training. Our SlimPajama-DC models are available at: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC and the separate SlimPajama-DC datasets are available at: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC.


Summary

  • The paper demonstrates that global deduplication and diverse data combinations significantly enhance LLM training outcomes.
  • It details rigorous preprocessing using MinHashLSH to efficiently deduplicate over 1 trillion tokens for optimal dataset quality.
  • Results show that configurations with varied data sources achieve more stable training dynamics and superior benchmark accuracy.

Empirical Analysis of Data Combinations in LLM Training Using SlimPajama

The paper investigates the implications of data combinations for pretraining LLMs by leveraging the SlimPajama dataset. SlimPajama is derived from the larger RedPajama corpus through rigorous deduplication that targets both global and local redundancies to improve training-data efficacy. The study, termed SlimPajama-DC, evaluates how different data combinations affect LLM training outcomes, providing both practical insights and broader guidance for dataset construction in machine learning.

Dataset Composition and Preprocessing

SlimPajama comprises 627 billion tokens drawn from sources such as CommonCrawl, Wikipedia, GitHub, and books. It is produced by comprehensive deduplication of the initial 1.2 trillion RedPajama tokens, removing redundancies that can impede model generalization and training efficiency. Two primary preprocessing steps are highlighted: the removal of low-length documents and a global deduplication pass that eliminates duplicate entries across the constituent sources. The paper specifies MinHashLSH as the deduplication technique, scaled to trillion-token corpora through careful memory management and parallel processing.
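The authors' deduplication pipeline operates at trillion-token scale; the snippet below is only a minimal sketch of MinHashLSH-style near-duplicate filtering using the open-source datasketch library, with an assumed shingle size and Jaccard threshold rather than the paper's exact settings.

```python
# Minimal sketch of MinHashLSH near-duplicate filtering (illustrative only;
# the paper's pipeline is a heavily optimized, parallelized variant of this idea).
from datasketch import MinHash, MinHashLSH

def minhash_signature(text: str, num_perm: int = 128, shingle: int = 13) -> MinHash:
    """Build a MinHash signature from character shingles of one document."""
    sig = MinHash(num_perm=num_perm)
    for i in range(max(1, len(text) - shingle + 1)):
        sig.update(text[i:i + shingle].encode("utf-8"))
    return sig

# Jaccard threshold for calling two documents near-duplicates (assumed value).
lsh = MinHashLSH(threshold=0.8, num_perm=128)

documents = {"doc-0": "the quick brown fox jumps over the lazy dog",
             "doc-1": "the quick brown fox jumps over the lazy dog!"}  # toy corpus
kept = []
for doc_id, text in documents.items():
    sig = minhash_signature(text)
    if lsh.query(sig):            # near-duplicate of an already-kept document: drop it
        continue
    lsh.insert(doc_id, sig)       # otherwise keep it and index its signature
    kept.append(doc_id)
print(kept)
```

In practice the same idea is applied over sharded corpora with signatures computed in parallel and stored, rather than document-by-document as shown here.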

Data Combination Configurations

The empirical setup assesses six distinct data configurations drawn from SlimPajama, each varying the proportions of its constituent sources. Training uses a 1.3B-parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activations. Configurations incorporating diverse datasets generally outperform those heavily weighted towards a single source, underscoring the benefits of data diversification. Global deduplication in SlimPajama also consistently yields better training efficiency and model quality than training on locally deduplicated data.
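To make the notion of a data-combination configuration concrete, the sketch below samples training documents from several sources according to fixed mixture weights; the source names and proportions are illustrative placeholders, not the paper's actual six configurations.

```python
import random
from typing import Dict, Iterator, List

def mix_sources(sources: Dict[str, List[str]], weights: Dict[str, float],
                seed: int = 0) -> Iterator[str]:
    """Yield documents by picking a source with the configured probability,
    then drawing a document from that source (sampling with replacement)."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    while True:
        src = rng.choices(names, weights=probs, k=1)[0]
        yield rng.choice(sources[src])

# Illustrative proportions only -- not one of the paper's actual configurations.
weights = {"commoncrawl": 0.60, "wikipedia": 0.15, "github": 0.15, "books": 0.10}
sources = {name: [f"{name}-doc-{i}" for i in range(3)] for name in weights}
stream = mix_sources(sources, weights)
batch = [next(stream) for _ in range(8)]
print(batch)
```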

Evaluation Metrics and Outcomes

The paper employs several benchmarks from the Eleuther LLM Evaluation Harness to gauge model performance, including ARC, HellaSwag, MMLU, and TruthfulQA. Among the configurations, DC-6, which integrates a diverse range of datasets, achieves the highest average accuracy. This underlines the importance of comprehensive data amalgamation as a strategy for improving model generalization and predictive accuracy.
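A hedged sketch of this style of evaluation through the harness's Python API follows; it assumes a recent lm-eval release (v0.4 or later), where task names and result keys can differ across versions, and the checkpoint path is a placeholder.

```python
# Sketch of benchmark evaluation with the EleutherAI lm-evaluation-harness.
# Assumes lm-eval >= 0.4; task names and result keys can vary across versions.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",                                          # Hugging Face causal-LM backend
    model_args="pretrained=path/to/slimpajama-dc-1.3b",  # placeholder checkpoint path
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    batch_size=8,
)

# Average whatever accuracy metrics the harness reports for these tasks.
per_task = results["results"]
accs = [m["acc,none"] for m in per_task.values() if "acc,none" in m]
print("mean accuracy:", sum(accs) / max(1, len(accs)))
```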

Training Dynamics and Loss Analysis

Loss curves indicate that greater data diversity within a configuration is associated with more stable training loss and smoother convergence. This supports the hypothesis that data diversity influences training dynamics and may buffer models against the overfitting tendencies associated with less diverse datasets.
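As a rough, illustrative proxy for this stability claim (not a metric defined in the paper), one can compare the rolling standard deviation of training-loss traces across configurations:

```python
import numpy as np

def rolling_loss_std(losses, window: int = 100) -> np.ndarray:
    """Rolling standard deviation of a loss trace; lower values suggest a
    smoother, more stable curve (a rough proxy, not the paper's analysis)."""
    losses = np.asarray(losses, dtype=np.float64)
    return np.array([losses[max(0, i - window + 1): i + 1].std()
                     for i in range(len(losses))])

# Toy comparison of two hypothetical loss traces with different noise levels.
rng = np.random.default_rng(0)
steps = np.arange(2000)
smooth = 2.5 * np.exp(-steps / 800) + 0.01 * rng.standard_normal(2000)
noisy = 2.5 * np.exp(-steps / 800) + 0.05 * rng.standard_normal(2000)
print(rolling_loss_std(smooth)[-1], rolling_loss_std(noisy)[-1])
```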

Implications and Future Directions

The findings emphasize that a diversified set of training data with global deduplication strategies is pivotal in optimizing LLM training. This approach not only contributes to reducing computational overhead but also facilitates superior model generalization and robustness. As the field advances towards increasingly large-scale models, the SlimPajama-DC framework presents a valuable methodology for dataset construction aimed at efficient and effective AI training.

Practical refinements such as varying weight decay further illustrate how overfitting can be mitigated, opening opportunities for future research into adaptive training regimes and data-centric methodologies suited to large-batch neural network training. Furthermore, the incorporation of specialized datasets into mainstream training workflows signals avenues for continued research into refining and expanding the adaptability of LLMs.
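As a concrete illustration of the weight-decay lever mentioned above, the snippet below configures decoupled weight decay via PyTorch's AdamW; the learning rate and decay values are placeholders, not the hyperparameters reported in the paper.

```python
import torch
from torch import nn

# Stand-in module; in practice this would be the 1.3B or 7B language model.
model = nn.Linear(4096, 4096)

# Decoupled weight decay (AdamW). The lr and weight_decay values below are
# placeholders; raising weight_decay is one lever for curbing overfitting
# in large-batch training.
optimizer = torch.optim.AdamW(model.parameters(), lr=1.6e-4, weight_decay=0.1)
```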
