SlimPajama Dataset
- SlimPajama is a 627 billion token, globally deduplicated dataset derived from RedPajama, used for efficient and robust LLM pre-training.
- Models trained on SlimPajama show improved performance and generalization on benchmark tasks compared to those using less processed datasets.
- Featuring rigorous global deduplication and diverse sources, the open SlimPajama dataset supports reproducible research for high-performing LLMs.
SlimPajama is a rigorously deduplicated, large-scale, multi-source English-language corpus designed for the efficient and robust pre-training of LLMs. Derived from the RedPajama dataset, SlimPajama comprises 627 billion tokens curated from a broad array of sources, including web documents, code, books, academic papers, encyclopedic material, and Q&A forums. Its balanced construction, global deduplication, open availability, and empirical validation in multiple LLMs make it a reference dataset within the contemporary LLM research landscape.
1. Data Construction and Deduplication Strategy
SlimPajama is constructed by aggressive cleaning and deduplication of the original 1.2T-token RedPajama dataset, leading to a corpus of 627B tokens. The deduplication process is two-stage:
- Low-length Document Filtering: Documents containing fewer than 200 characters are removed after preprocessing, which targets metadata and low-value entries (about 1.86% of documents).
- Global Deduplication with MinHashLSH: Using MinHash locality-sensitive hashing (with a Jaccard similarity threshold of 0.8 on preprocessed, lowercased 13-grams), near-duplicate documents are identified and removed across all source domains rather than within each source. This global deduplication is distinct from the local deduplication used for prior corpora (such as those behind LLaMA, OPT, RedPajama, and the Pile), where deduplication is restricted to individual sources before the sources are merged.
The implementation is optimized to scale to corpora with trillions of tokens, running on 64 CPU cores with up to 1.4 TB of RAM. After global deduplication, byte duplication rates drop dramatically: for example, CommonCrawl's 63.8% and GitHub's 46.2% duplication fall to less than half those values in the resulting corpus (2309.10818).
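As a rough illustration of this two-stage process, the sketch below applies the length filter and a single global MinHashLSH index using the open-source `datasketch` library; the tokenization, n-gram construction, and signature parameters are simplified assumptions rather than the released pipeline.

```python
# Minimal sketch of SlimPajama-style filtering + global MinHashLSH deduplication
# using the open-source `datasketch` library. Preprocessing details are
# simplified assumptions, not the exact released pipeline.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128           # number of MinHash permutations (assumed value)
MIN_CHARS = 200          # low-length filter threshold from the paper
JACCARD_THRESHOLD = 0.8  # near-duplicate threshold from the paper

def ngrams(text: str, n: int = 13):
    """Lowercase, whitespace-tokenized 13-grams (simplified preprocessing)."""
    tokens = text.lower().split()
    return (" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1)))

def minhash(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for gram in ngrams(text):
        m.update(gram.encode("utf-8"))
    return m

def deduplicate(documents: dict) -> list:
    """Return IDs of documents kept after length filtering and global dedup."""
    # A single LSH index shared across all sources makes the dedup "global".
    lsh = MinHashLSH(threshold=JACCARD_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in documents.items():
        if len(text) < MIN_CHARS:   # stage 1: low-length document filter
            continue
        sig = minhash(text)
        if lsh.query(sig):          # stage 2: near-duplicate of anything already kept
            continue
        lsh.insert(doc_id, sig)
        kept.append(doc_id)
    return kept
```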
2. Domain Composition and Proportions
SlimPajama’s token distribution across domains is designed to maximize linguistic and task diversity, which is critical for generalization in downstream tasks. The main composition (by token percentage) is:
| Data Source | Proportion (%) |
|---|---|
| CommonCrawl | 52.2 |
| C4 | 26.7 |
| GitHub | 5.2 |
| Books | 4.2 |
| ArXiv | 4.6 |
| Wikipedia | 3.8 |
| StackExchange | 3.3 |
This multi-source mixture is the outcome of empirical studies showing that maximizing data diversity after global deduplication improves generalization and yields robust performance on standardized LLM benchmarks (2309.10818).
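Read as sampling weights, these proportions can drive a simple domain-weighted sampler during pre-training. The sketch below is purely illustrative: it treats the token proportions as per-document sampling probabilities and uses a hypothetical `iter_domain_docs` reader in place of real data loaders.

```python
# Illustrative domain-weighted sampling using the SlimPajama token proportions
# as sampling weights. `iter_domain_docs` is a hypothetical helper standing in
# for per-domain data readers.
import random

DOMAIN_WEIGHTS = {
    "CommonCrawl": 0.522, "C4": 0.267, "GitHub": 0.052, "Books": 0.042,
    "ArXiv": 0.046, "Wikipedia": 0.038, "StackExchange": 0.033,
}

def iter_domain_docs(domain: str):
    """Hypothetical per-domain document iterator (replace with a real reader)."""
    while True:
        yield f"<doc from {domain}>"

def mixture_sampler(weights=DOMAIN_WEIGHTS, seed=0):
    """Yield documents with each domain chosen in proportion to its weight."""
    rng = random.Random(seed)
    domains = list(weights)
    probs = [weights[d] for d in domains]
    readers = {d: iter_domain_docs(d) for d in domains}
    while True:
        domain = rng.choices(domains, weights=probs, k=1)[0]
        yield domain, next(readers[domain])

# Example: inspect the first few sampled domains
sampler = mixture_sampler()
for _ in range(5):
    domain, doc = next(sampler)
    print(domain)
```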
3. Impact on LLM Pretraining and Performance
SlimPajama was purpose-built to address redundancy, overfitting, and poor generalization in LLM training. Studies with models in the 1.3B–7B parameter range (e.g., Cerebras-GPT and BTLM-3B-8K) demonstrate:
- Superior Benchmark Performance: Models trained on globally deduplicated SlimPajama (specifically diversified mixtures like configuration DC-6) outperform RedPajama-trained baselines (average accuracy 40.0 vs. 38.0 on HuggingFace leaderboard tasks) (2309.10818).
- Generalization and Task Transfer: Models trained on SlimPajama exhibit improved downstream accuracy, even at smaller parameter scales. For instance, BTLM-3B-8K achieves or surpasses several 7B-parameter model baselines on long-context and reasoning tasks using only 3B parameters, attributed to the quality and diversity of SlimPajama (2309.11568).
A key discovery is that lower training loss does not always translate to better downstream performance; diverse, globally deduplicated data is necessary to support broad generalist capabilities rather than overfitting to single domains.
4. Technical Innovations: Deduplication at Scale
SlimPajama’s deduplication pipeline was a notable advance for trillion-scale corpora construction. However, subsequent work has further improved deduplication throughput:
- FED Framework (GPU-Accelerated Deduplication): The Fast and Efficient Dataset Deduplication (FED) framework realizes up to 58.3× speedups over the SlimPajama CPU tool and is up to 8.6× faster than NVIDIA NeMo Curator on 100GB benchmarks, with hash signature generation up to 1,800× faster. FED uses a custom, low-precision, partially reusable hash function and exhaustively compares all pairs in each hash bucket, reducing false negatives in duplicate detection (2501.01046).
- Scalability: FED deduplicates 1.2T tokens in under 5.1 hours on a 16-GPU cluster, compared to multi-day runs with traditional pipelines. This enables rapid curation and updating of SlimPajama-scale datasets.
FED employs techniques such as double buffering, custom GPU kernels, and out-of-core processing, and it is publicly available for use in further dataset curation.
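FED's GPU kernels are not reproduced here, but the bucket-then-exhaustive-pairs idea it relies on can be illustrated on CPU. The sketch below groups documents by one band of their MinHash signatures and compares every pair inside each bucket; the signature shapes, banding scheme, and similarity estimate are assumptions for illustration only.

```python
# CPU illustration of bucket-then-all-pairs duplicate detection: group candidate
# documents by a hash band, then compare every pair inside each bucket. This is
# only a conceptual sketch, not FED's GPU implementation.
from collections import defaultdict
from itertools import combinations
import numpy as np

def find_duplicate_pairs(signatures: np.ndarray, band_width: int = 8,
                         threshold: float = 0.8):
    """signatures: (num_docs, num_perm) integer MinHash signatures."""
    num_docs, num_perm = signatures.shape
    pairs = set()
    for start in range(0, num_perm, band_width):
        band = signatures[:, start:start + band_width]
        buckets = defaultdict(list)
        for doc_id, row in enumerate(band):
            buckets[row.tobytes()].append(doc_id)   # bucket by band hash
        for members in buckets.values():
            for a, b in combinations(members, 2):   # exhaustive in-bucket pairs
                if (a, b) in pairs:
                    continue
                # estimated Jaccard = fraction of matching signature slots
                if np.mean(signatures[a] == signatures[b]) >= threshold:
                    pairs.add((a, b))
    return pairs

# Tiny demo with random signatures plus one injected duplicate
rng = np.random.default_rng(0)
sigs = rng.integers(0, 2**32, size=(5, 128), dtype=np.uint64)
sigs[4] = sigs[1]                  # make doc 4 a duplicate of doc 1
print(find_duplicate_pairs(sigs))  # expected to contain (1, 4)
```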
5. Data Mixture Optimization and Mixture-of-Expert Approaches
Optimizing the sampling proportions of domains in multi-source datasets like SlimPajama is crucial for downstream LLM performance. The Mixture of Data Experts (MDE) method provides a theoretically justified and sample-efficient approach:
- Expert Model Aggregation: Each domain is assigned an expert LLM trained on that domain alone. For any candidate mixture of domain weights, per-token predictions are formed as the weighted average of the experts' output probabilities, and the ensemble's cross-entropy loss approximates the loss of a model trained on that mixture.
- Empirical Validation: MDE enables accurate loss prediction and mixture selection with far fewer proxy mixtures and outperforms both parameter-interpolation and default mixture schemes on SlimPajama (2502.15950).
- Downstream Task Gains: Mixtures optimized with MDE regression features achieve superior few-shot performance across ten diverse downstream tasks (+1.3% over prior best averages).
Theoretical justification is provided for the optimality of the MDE approach under certain conditions, with further improvements possible by incorporating MDE estimates as regression model features.
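A minimal sketch of the core MDE estimate follows: given the per-token probabilities that each domain expert assigns to a held-out set, a candidate mixture is scored by the cross-entropy of the weighted-average ensemble, and the lowest-loss candidate is selected. The array shapes and the search over random candidates are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of the Mixture-of-Data-Experts loss estimate: a candidate mixture is
# scored by the cross-entropy of the weighted average of the per-domain
# experts' token probabilities on held-out text.
import numpy as np

def mde_cross_entropy(expert_token_probs: np.ndarray, weights: np.ndarray) -> float:
    """expert_token_probs: (num_experts, num_tokens) probabilities each expert
    assigns to the observed next token on a held-out set."""
    ensemble = weights @ expert_token_probs          # (num_tokens,) averaged probs
    return float(-np.mean(np.log(ensemble + 1e-12)))

def best_mixture(expert_token_probs: np.ndarray, candidates: np.ndarray):
    """Pick the candidate weight vector with the lowest estimated loss."""
    losses = [mde_cross_entropy(expert_token_probs, w) for w in candidates]
    return candidates[int(np.argmin(losses))], min(losses)

# Toy example: 3 domain experts, 1000 held-out tokens, random simplex candidates
rng = np.random.default_rng(0)
probs = rng.uniform(0.01, 1.0, size=(3, 1000))
candidates = rng.dirichlet(np.ones(3), size=50)
weights, loss = best_mixture(probs, candidates)
print(weights, loss)
```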
6. Application in Continual and Efficient Pretraining
SlimPajama supports powerful continual pre-training workflows, a practice increasingly favored over full retraining for efficiency:
- Learning Rate Rewarming: Empirical studies using SlimPajama as downstream data demonstrate that models benefit from a renewed learning rate warmup schedule when switching pre-training datasets. This yields improved adaptation to SlimPajama while incurring a temporary loss spike, and produces models that outperform those trained from scratch (2308.04014).
- Checkpoint Selection and Optimization Artifacts: Using fully trained checkpoints as a start for continual pre-training is optimal; performance instability after rewarming is due mostly to optimization effects rather than dataset distribution shift alone.
Recommendations from these studies include routinely re-applying learning rate warmup and decay, tuning the maximum learning rate to balance adaptation against forgetting, and choosing a pre-training checkpoint whose data distribution overlaps as much as possible with the new domain.
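The rewarming recipe amounts to restarting an ordinary warmup-plus-decay schedule at the dataset switch. The sketch below shows one common form (linear warmup followed by cosine decay); the hyperparameter values are placeholders rather than those used in the cited study.

```python
# Minimal sketch of "rewarming": when continual pre-training switches to
# SlimPajama, the linear-warmup + cosine-decay schedule is restarted from the
# loaded checkpoint's step. Hyperparameters here are placeholders.
import math

def warmup_cosine_lr(step: int, max_lr: float, warmup_steps: int,
                     total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * min(progress, 1.0)))

# Phase 2: after loading the checkpoint and switching to SlimPajama, the step
# counter is reset so the schedule warms up again; a smaller max_lr trades
# adaptation to the new data against forgetting of the old.
for step in range(0, 10_000, 2_500):
    lr = warmup_cosine_lr(step, max_lr=1e-4, warmup_steps=1_000, total_steps=10_000)
    print(f"phase-2 lr at step {step}: {lr:.2e}")
```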
7. Access, Availability, and Practical Adoption
SlimPajama and its variants (including SlimPajama-DC with various domain mixes) are fully open, along with preprocessing and deduplication implementations. This supports transparency and reproducibility in LLM research.
| Resource | Access URL |
|---|---|
| SlimPajama dataset | https://huggingface.co/datasets/cerebras/SlimPajama-627B |
| SlimPajama-DC datasets | https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC |
| Deduplication pipeline | https://github.com/Cerebras/modelzoo/tree/main/modelzoo/transformers/data_processing/slimpajama |
| FED deduplication tool | https://github.com/mcrl/FED |
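For quick experimentation, the dataset can be streamed from the Hugging Face Hub without downloading the full corpus; the snippet below uses the `datasets` library, and the record field names should be checked against the dataset card.

```python
# Stream a few SlimPajama samples from the Hugging Face Hub with the `datasets`
# library (avoids downloading the full corpus up front).
from datasets import load_dataset

stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

for i, example in enumerate(stream):
    print(sorted(example.keys()))         # expected to include the document text and source metadata
    print(example.get("text", "")[:200])  # first 200 characters, if the field is named "text"
    if i == 2:
        break
```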
Empirical and practical analyses alike indicate that globally deduplicated, maximally diversified data mixtures such as SlimPajama provide a highly effective and efficient foundation for modern LLM pre-training, continual learning, and transfer research.