SlimPajama-DC Framework
- SlimPajama-DC Framework is a systematic empirical study and dataset release investigating how data composition and deduplication strategies impact Large Language Model (LLM) quality and generalization.
- The framework empirically demonstrates that global deduplication across diverse data sources, rather than local deduplication or sheer data volume, significantly improves LLM performance on downstream benchmarks.
- All SlimPajama-DC models, the 627B-token deduplicated dataset, and the data preparation code are openly released to foster reproducibility and further research in LLM data curation.
SlimPajama-DC Framework refers to a systematic empirical investigation and release designed to elucidate how data composition and deduplication strategies within large-scale pretraining corpora affect the quality and generalization capabilities of LLMs. The framework centers on the SlimPajama dataset, a rigorously deduplicated, multi-source corpus distilled from RedPajama, and formalizes best practices for dataset mixture, deduplication, training infrastructure, and reproducible benchmarking—culminating in the publication of both data and open models for further research and application (SlimPajama-DC: Understanding Data Combinations for LLM Training, 2023).
1. Data Deduplication Strategies and Their Impact
A critical aspect of the framework is the comparison between local and global deduplication. Local deduplication refers to the elimination of duplicate documents within each dataset source (e.g., Wikipedia, GitHub) in isolation; it does not remove duplicates appearing across different sources. Global deduplication removes duplicate documents across the entire multi-source corpus, targeting redundancy both within and between all constituent datasets.
This is operationalized with MinHashLSH-based deduplication: document signatures are computed from lowercase 13-grams and compared at a Jaccard similarity threshold of 0.8. The process reduces the original 1.2T+ token RedPajama corpus to 627B tokens in SlimPajama, achieving nearly 50% overall byte-level deduplication (e.g., 63.76% in CommonCrawl, 46.16% in GitHub, 2.24% in Wikipedia). Empirically, global deduplication leads to higher LLM downstream evaluation metrics by limiting repeated exposure to identical or highly similar content, thus mitigating overfitting and enhancing generalization.
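The minimal sketch below illustrates this style of global MinHashLSH deduplication in Python using the `datasketch` library; the word-level shingling, permutation count, and streaming layout are illustrative assumptions rather than the exact SlimPajama preprocessing code (linked in Section 5).

```python
# Hedged sketch of global near-duplicate removal with MinHashLSH, assuming
# the `datasketch` library; shingle granularity and corpus layout are
# illustrative, not the published SlimPajama pipeline.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # number of MinHash permutations (assumption)
JACCARD_THRESHOLD = 0.8  # similarity threshold reported for SlimPajama

def signature(text: str) -> MinHash:
    """Fingerprint a document from lowercase 13-gram shingles."""
    tokens = text.lower().split()
    shingles = {" ".join(tokens[i:i + 13]) for i in range(max(1, len(tokens) - 12))}
    m = MinHash(num_perm=NUM_PERM)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

def global_dedup(docs):
    """docs: iterable of (doc_id, text) pooled across ALL sources, making the pass global."""
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs:
        sig = signature(text)
        if lsh.query(sig):       # near-duplicate of a document already kept (from any source)
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```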
Local deduplication alone is insufficient for thorough redundancy reduction, as much overlap exists between sources (e.g., code snippets in both CommonCrawl and GitHub). Global deduplication, when coupled with increased data diversity, consistently improves model accuracy on multi-task benchmarks.
2. Composition of Data Mixtures and Performance Analysis
Six distinct dataset mixture configurations were constructed to evaluate the effects of varying data proportions. Each configuration was trained from scratch on data drawn from the deduplicated 627B-token SlimPajama corpus, using a 330B-token budget per model and the Cerebras-GPT 1.3B architecture with ALiBi positional encoding and SwiGLU activation.
The configurations (termed DC-1 through DC-6) span from single-source (e.g., 100% CommonCrawl, DC-1) to highly diversified mixtures (e.g., DC-6: CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange). Performance on standard LLM benchmarks (ARC, HellaSwag, MMLU, and TruthfulQA) demonstrates that higher diversity in data composition after global deduplication yields notable gains in generalization and downstream performance. DC-6, the most diversified mix, achieved an average score of 40.0, outperforming a RedPajama baseline by more than 2 absolute points under identical training budgets.
| Config | Datasets Included | Average Score |
|---|---|---|
| DC-1 | 100% CommonCrawl | 38.5 |
| DC-2 | 90.9% CommonCrawl, 9.1% GitHub | 38.4 |
| DC-3 | 75.8% CommonCrawl, 24.2% GitHub | 38.5 |
| DC-4 | 75.8% CommonCrawl, 24.2% Wikipedia | 37.6 |
| DC-5 | Balanced: CommonCrawl, GitHub, Books, Wikipedia | 38.6 |
| DC-6 | CommonCrawl, C4, GitHub, Books, ArXiv, Wikipedia, StackExchange | 40.0 |
Thus, the framework underscores that after global deduplication, maximizing diversity across high-quality sources (including code, encyclopedic, technical, and conversational subsets) is fundamental to producing high-performing, generalizable LLMs.
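As a concrete illustration of how such mixture proportions can be operationalized, the sketch below samples training documents from weighted sources; the weights shown are placeholder values, not the published DC-6 proportions.

```python
# Minimal sketch of weighted multi-source sampling for a pretraining data
# loader; the MIXTURE_WEIGHTS below are hypothetical placeholders.
import random

MIXTURE_WEIGHTS = {        # illustrative proportions, must sum to 1.0
    "commoncrawl": 0.50,
    "c4": 0.20,
    "github": 0.10,
    "books": 0.08,
    "arxiv": 0.05,
    "wikipedia": 0.04,
    "stackexchange": 0.03,
}

def sample_source(rng: random.Random) -> str:
    """Pick which deduplicated source the next training document comes from."""
    sources, weights = zip(*MIXTURE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Example: draw 1,000 documents and inspect the empirical mixture.
rng = random.Random(0)
counts = {}
for _ in range(1000):
    src = sample_source(rng)
    counts[src] = counts.get(src, 0) + 1
print(counts)
```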
3. Training Infrastructure and Methodological Features
The models within SlimPajama-DC were trained on a Cerebras 16× CS-2 cluster, collectively furnishing 80 PFLOP/s of AI compute. All experiments used bf16 mixed-precision for efficiency, with training batches reaching up to 2M tokens per step for 1.3B-parameter models. This infrastructure permitted full corpus sweeps and rapid empirical ablations across data mixtures.
For scaling experiments with 7B-parameter models, training migrated to 232 NVIDIA A100 (80GB) GPUs, reaching batch sizes of up to 14.3M tokens through techniques such as FSDP (Fully Sharded Data Parallel) and activation checkpointing. To manage overfitting under these large-batch conditions, a Progressive Training on Weight Decay (PTWD) schedule was introduced, consisting of a regime of zero weight decay, followed by heavy, then normal weight decay.
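The sketch below shows what a three-stage PTWD-style schedule could look like in code; the stage boundaries and decay magnitudes are illustrative assumptions, not the values used in the paper.

```python
# Hedged sketch of a zero -> heavy -> normal weight-decay schedule in the
# spirit of PTWD; boundaries (30%/60%) and decay values are assumptions.
def ptwd_weight_decay(step: int, total_steps: int,
                      heavy_wd: float = 0.5, normal_wd: float = 0.1) -> float:
    """Return the weight decay to apply at a given training step."""
    if step < 0.3 * total_steps:     # stage 1: no weight decay early on
        return 0.0
    elif step < 0.6 * total_steps:   # stage 2: heavy weight decay
        return heavy_wd
    else:                            # stage 3: settle to a normal value
        return normal_wd

# Usage inside a training loop (optimizer assumed to expose param_groups,
# e.g. torch.optim.AdamW):
# for group in optimizer.param_groups:
#     group["weight_decay"] = ptwd_weight_decay(step, total_steps)
```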
The backbone architecture throughout was a GPT-style decoder-only transformer with GPT-NeoX BPE tokenization (~50.3K vocabulary), ALiBi positional encoding for effective long-context scaling, and SwiGLU non-linearity.
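For reference, the sketch below shows a standard SwiGLU feed-forward block in PyTorch of the kind referenced above; the hidden-dimension sizing convention is an assumption, and ALiBi biasing of attention scores is not shown.

```python
# Minimal PyTorch sketch of a SwiGLU feed-forward block:
#   SwiGLU(x) = W_down( SiLU(W_gate x) * (W_up x) )
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, hidden: int = None):
        super().__init__()
        # Common convention: ~2/3 of a 4x expansion to keep parameter count
        # close to a standard MLP (an assumption, not a SlimPajama-DC detail).
        hidden = hidden or int(2 * 4 * d_model / 3)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: x = torch.randn(2, 128, 1024); y = SwiGLU(1024)(x)  # shape (2, 128, 1024)
```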
4. Empirical Discoveries Beyond Data Deduplication
A principal empirical finding is that simply lowering validation loss or increasing token count is less significant for generalization than enhancing dataset diversity after comprehensive deduplication. Lower loss does not guarantee higher accuracy on benchmarks such as MMLU, ARC, or TruthfulQA; instead, a diversified training mixture that reduces redundancy is more predictive of robust performance.
Findings from the 1.3B-parameter ablations carried over to 7B-parameter training, confirming that the principles of deduplication and diversity transfer across model sizes. Large-batch training with the PTWD schedule enabled efficient, stable convergence and competitive post-instruction-tuning results (e.g., a 46.4 average on the Hugging Face Open LLM Leaderboard benchmark suite).
The framework also introduced the RRGS (risk of random guessing score) metric to assess whether MMLU results are distinguishable from chance, given the 25% random baseline for four-choice questions.
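The sketch below illustrates the underlying idea of measuring how far per-task MMLU accuracies sit above the 25% chance floor; it is not the paper's exact RRGS formula, only a hedged approximation of the concept.

```python
# Illustrative check of MMLU scores against the 25% random-choice baseline;
# NOT the paper's exact RRGS definition, just the underlying idea.
def chance_adjusted_summary(per_task_accuracy: dict, chance: float = 0.25) -> dict:
    """Summarize how far per-task accuracies sit above the chance floor."""
    margins = {task: acc - chance for task, acc in per_task_accuracy.items()}
    above = sum(1 for m in margins.values() if m > 0)
    return {
        "mean_margin_over_chance": sum(margins.values()) / len(margins),
        "fraction_of_tasks_above_chance": above / len(margins),
    }

# Example with hypothetical per-task scores:
print(chance_adjusted_summary({"abstract_algebra": 0.27, "anatomy": 0.31, "astronomy": 0.22}))
```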
5. Open Data and Reproducibility
All SlimPajama-DC model weights (including 1.3B and 7B checkpoints), as well as the deduplicated data mixtures and the documented preprocessing codebase, are openly released and maintained via HuggingFace and GitHub:
- Models: https://huggingface.co/MBZUAI-LLM/SlimPajama-DC
- Datasets: https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC
- Data Preprocessing Pipeline: https://github.com/Cerebras/modelzoo/tree/main/src/cerebras/modelzoo/data_preparation/nlp/slimpajama
These resources are explicitly available for research, benchmarking, and extension, enabling transparent comparison and reproduction by the research community.
6. Broader Implications for LLM Development
The SlimPajama-DC Framework provides robust empirical evidence supporting the conclusion that, for current LLM pretraining, prioritizing thoroughly deduplicated, highly diverse multi-source corpora—with architectures capable of leveraging long-range context—is essential for maximizing downstream accuracy and generalization. These insights are directly actionable for practitioners constructing new LLM pretraining pipelines, informing both data engineering and architectural choices. The explicit publication of datasets and models reinforces the framework as a foundation for reproducible, open research within the LLM ecosystem.
A plausible implication is that future corpus construction for open LLMs should pivot from sheer scale to conscientious curation, deduplication, and balancing of source diversity, especially as data reuse and model scaling continue to intensify within the field.