
SlimPajama: Efficient LLM Pretraining Corpus

Updated 2 July 2026
  • SlimPajama is a rigorously deduplicated, multi-domain corpus of 627B tokens designed for efficient large language model pretraining.
  • Its processing pipeline uses both local and global deduplication strategies to remove redundancy and maximize content diversity from sources like web text, code, and scientific literature.
  • Empirical results show that models trained on SlimPajama improve by 2–5.5% on average across tasks relative to prior baselines, highlighting its effect on model efficiency and generalization.

SlimPajama is a rigorously cleaned and deduplicated English-language text corpus consisting of 627 billion tokens, derived from the RedPajama dataset. It is designed as a high-quality, open-source resource for LLM pretraining, with specific emphasis on minimizing redundancy while maximizing content diversity across major domains including web text, code, books, scientific papers, encyclopedic entries, and community Q&A. Owing to its careful curation and large scale, SlimPajama has become a foundational dataset for both compact and high-parameter open-source LLMs, as well as for research into data efficiency, diversity, and optimal corpus construction.

1. Corpus Composition and Source Domains

SlimPajama was assembled by Cerebras as a “slimmed down” version of RedPajama, which is itself an open replication of the LLaMA-style pretraining mix. The final corpus contains 627 billion tokens drawn from seven sources, each of which undergoes extensive filtering and deduplication:

Source        | Share after deduplication | Tokens (B)
CommonCrawl   | 52.2%                     | 333
C4            | 26.7%                     | 167
GitHub        | 5.2%                      | 33
Books         | 4.2%                      | 27
arXiv         | 4.6%                      | 29
Wikipedia     | 3.8%                      | 24
StackExchange | 3.3%                      | 21

This domain mix was selected to ensure coverage of general web content, technical writing, long-form prose, computer code, scientific literature, encyclopedic knowledge, and structured Q&A, providing a broad base for learning both generic and domain-specific language patterns (Dey et al., 2023, Shen et al., 2023, Agarwalla et al., 2024, Fan et al., 29 Apr 2025, Gupta et al., 2023).
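As a rough illustration of how such a mix might be consumed, the sketch below treats the post-deduplication percentages as sampling weights and draws source names in proportion to them. The function name `sample_sources` and the idea of sampling per document are assumptions made for illustration; the source does not describe a specific sampler.

```python
import random

# Post-deduplication shares (percent) for the seven SlimPajama sources.
MIX = {
    "CommonCrawl": 52.2,
    "C4": 26.7,
    "GitHub": 5.2,
    "Books": 4.2,
    "arXiv": 4.6,
    "Wikipedia": 3.8,
    "StackExchange": 3.3,
}


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their share.

    Hypothetical helper: a real data loader would sample documents or
    token blocks, not just names.
    """
    rng = random.Random(seed)
    return rng.choices(list(MIX), weights=list(MIX.values()), k=n)


if __name__ == "__main__":
    draws = sample_sources(10_000)
    for source in MIX:
        print(f"{source:14s} {draws.count(source) / len(draws):.3f}")
```

With enough draws, the empirical frequencies should approach the listed shares.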

2. Preprocessing, Deduplication, and Filtering Pipeline

SlimPajama implements a multi-stage preprocessing protocol to remove noise and redundancy:

  1. Low-length Document Filtering: Any document below 200 characters is dropped, removing 1.86% of candidate documents.
  2. Local Deduplication: Byte-level deduplication is performed independently within each data source, removing within-source duplicates (for example, 63.8% of CommonCrawl bytes and 46.2% of GitHub bytes).
  3. Global Deduplication via MinHashLSH: Locality-sensitive hashing is then applied across all sources. Each document is split into k-gram shingles, and one document from every pair whose Jaccard similarity exceeds a tuned threshold is removed. Global deduplication is critical: it nearly halves the dataset, cutting total bytes by 49.6%.
  4. Tokenization: After the above stages, data are tokenized using a GPT-2 or Llama-compatible Byte Pair Encoding (BPE) vocabulary.
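The filtering and near-duplicate steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the released pipeline: the shingle width, signature length, band count, and the 0.8 threshold are assumed values, and verifying candidates with an exact Jaccard check is a simplification chosen for clarity. Exact byte-level deduplication and tokenization are omitted.

```python
import hashlib
import re

MIN_CHARS = 200   # length cutoff quoted in step 1
SHINGLE_K = 5     # assumed shingle width, in words
NUM_PERM = 64     # assumed number of hash functions per signature
BANDS = 16        # assumed number of LSH bands (4 rows each)
THRESHOLD = 0.8   # assumed Jaccard cutoff; the source only says "tuned"


def shingles(text, k=SHINGLE_K):
    """Return the set of lowercased word k-grams, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(shingle_set, num_perm=NUM_PERM):
    """Keep the minimum digest under each salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingle_set
        ))
    return signature


def jaccard(a, b):
    return len(a & b) / len(a | b)


def deduplicate(docs):
    """Drop short documents, then keep the first copy of each near-duplicate."""
    rows = NUM_PERM // BANDS
    buckets = {}  # (band index, band values) -> indices of kept documents
    kept_docs, kept_sets = [], []
    for doc in docs:
        if len(doc) < MIN_CHARS:
            continue
        sh = shingles(doc)
        sig = minhash(sh)
        keys = [(b, tuple(sig[b * rows:(b + 1) * rows])) for b in range(BANDS)]
        # Documents sharing any band bucket are candidate near-duplicates.
        candidates = {i for key in keys for i in buckets.get(key, ())}
        if any(jaccard(sh, kept_sets[i]) >= THRESHOLD for i in candidates):
            continue
        for key in keys:
            buckets.setdefault(key, []).append(len(kept_docs))
        kept_docs.append(doc)
        kept_sets.append(sh)
    return kept_docs
```

Running `deduplicate` over documents pooled from all sources corresponds to the global setting; running it separately per source would correspond to a local near-duplicate pass.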

The resulting corpus is both compact and highly diverse, with systematic cross-source overlaps removed, greatly reducing the risk of overfitting to repeated material or inadvertent memorization of training content (Dey et al., 2023, Shen et al., 2023).

3. Data Access, Splits, and Contextual Usage

SlimPajama is released publicly under the Apache 2.0 license via Hugging Face. It includes explicit train, validation, and test splits (each of the latter containing 0.5 billion tokens with zero overlap with training) to facilitate reproducible research and standardized benchmarking. All token counts and sampling ratios are encoded in the dataset metadata.
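A minimal sketch of reading one split with the Hugging Face `datasets` library follows. It assumes the repository id `cerebras/SlimPajama-627B` and that each record carries a `meta` dictionary with a `redpajama_set_name` field naming its source; both should be checked against the dataset card before use. The helper names are hypothetical.

```python
from collections import Counter

DATASET_ID = "cerebras/SlimPajama-627B"  # assumed repository id


def stream_split(split="validation"):
    """Lazily iterate over one split without downloading the full corpus."""
    # Third-party dependency, imported here so the helpers below work without it.
    from datasets import load_dataset
    return load_dataset(DATASET_ID, split=split, streaming=True)


def count_by_source(records, limit=1000):
    """Tally the first `limit` records by their (assumed) source field."""
    counts = Counter()
    for i, record in enumerate(records):
        if i >= limit:
            break
        source = record.get("meta", {}).get("redpajama_set_name", "unknown")
        counts[source] += 1
    return counts
```

A typical call would be `count_by_source(stream_split("validation"))`, which requires network access and the `datasets` package.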

For model pretraining workflows, SlimPajama supports advanced sampling protocols such as multi-phase context-length sampling. In BTLM-3B-8K model training, the dataset was consumed in two phases (75% of tokens at context length 2,048, 25% at 8,192), supporting both standard and long-context learning (Dey et al., 2023).
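The arithmetic behind such a two-phase schedule can be sketched as follows, assuming a single pass over the full 627B-token corpus. The function `plan_phases` is a hypothetical helper, and the sequence counts ignore packing and padding details.

```python
TOTAL_TOKENS = 627_000_000_000

# (context length, percent of training tokens) for each phase.
PHASES = [(2048, 75), (8192, 25)]


def plan_phases(total_tokens=TOTAL_TOKENS, phases=PHASES):
    """Split a token budget across phases and count whole sequences per phase."""
    plan = []
    for context_length, percent in phases:
        tokens = total_tokens * percent // 100
        plan.append({
            "context_length": context_length,
            "tokens": tokens,
            "sequences": tokens // context_length,
        })
    return plan


if __name__ == "__main__":
    for phase in plan_phases():
        print(phase)
```

Under these assumptions the first phase covers about 470B tokens and the second about 157B, so the long-context phase sees far fewer (but four times longer) sequences.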

4. Empirical Impact and Benchmark Results

SlimPajama’s rigorously deduplicated, multi-domain content directly underpins improvements in LLM efficiency and downstream generalization:

  • Models trained on SlimPajama, such as BTLM-3B-8K, exhibit 2–5.5% average gains across tasks relative to prior 3B-parameter baselines and are competitive with 7B-parameter models, despite significant reductions in pretraining compute and data requirements (Dey et al., 2023).
  • SlimPajama-DC ablations show that increasing domain diversity after global deduplication results in strictly superior average performance compared to single-domain or locally deduplicated mixes; for example, a 1.3B model on the full SlimPajama mix outperforms the same-size RedPajama model by 2.0 points, and yields +5.2 gain on HellaSwag and +2.0 on MMLU (Shen et al., 2023).
  • Data efficiency studies reveal that using a small, carefully diversified subset (e.g., 1.5% of SlimPajama files selected by the DiSF algorithm) can match or outperform full-dataset training, saving ≈98.5% of files and providing ≈1.5× training efficiency and 5× data efficiency, with consistent gains across Harness, MMLU, and BBH tasks (Fan et al., 29 Apr 2025).

5. SlimPajama as a Case Study in Corpus Construction

Several recent studies use SlimPajama to probe best practices in corpus curation and LLM pretraining:

  • Deduplication Strategies: Direct empirical comparison shows global deduplication (cross-domain) is essential to remove costly overlaps and concentrate compute on novel content. This is computationally intensive at trillion-token scale, typically requiring 1.4 TB RAM and extensive parallelism (Shen et al., 2023).
  • Data Diversity: After deduplication, models trained on purely web-crawl data can excel on some benchmarks, but multi-source mixtures maintain balanced performance across a much broader range of tasks.
  • Sparse LLM Pretraining: Augmenting SlimPajama with a small quantity of deduplicated code tokens (e.g., Python from The Stack) enables high-sparsity foundational LLM pretraining with no loss of accuracy up to 70% sparsity, and recovers 91.8–96.1% of baseline model performance, enabling efficient LLM deployment via sparsity and quantization (Agarwalla et al., 2024).
  • Continual Pretraining: SlimPajama serves as the downstream corpus in continual pretraining protocols, where models first pretrained on the Pile and then “rewarmed” on SlimPajama outperform scratch-trained models on both downstream (SlimPajama) and upstream (Pile) validation loss. Raising the learning rate again at the start of the new stage (“rewarming”) is essential to stabilize adaptation and balance positive transfer against catastrophic forgetting (Gupta et al., 2023).
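To make the rewarming idea concrete, the sketch below builds a two-stage learning-rate schedule: linear warmup followed by cosine decay on the first corpus, then a second warmup from the decayed rate back to the peak on the second corpus. The schedule shape and all hyperparameter values are assumptions chosen for illustration, not settings reported in the cited work.

```python
import math


def lr_schedule(step, total_steps, warmup_steps, max_lr, min_lr, start_lr=0.0):
    """Linear warmup from start_lr to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return start_lr + (max_lr - start_lr) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def two_stage_lr(step, stage_steps=1000, warmup_steps=10, max_lr=3e-4, min_lr=3e-5):
    """Stage 1 on the upstream corpus, stage 2 rewarmed on the downstream corpus."""
    if step < stage_steps:
        return lr_schedule(step, stage_steps, warmup_steps, max_lr, min_lr)
    # Rewarming: climb from the decayed rate back to the peak before decaying again.
    return lr_schedule(step - stage_steps, stage_steps, warmup_steps,
                       max_lr, min_lr, start_lr=min_lr)


if __name__ == "__main__":
    for step in (0, 10, 999, 1000, 1010, 1999):
        print(step, f"{two_stage_lr(step):.2e}")
```

The learning rate at the start of the second stage equals the decayed value from the first stage and then rises again, which is the "jump" that rewarming refers to.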

6. Practical Recommendations and Limitations

Best practices identified with SlimPajama include:

  • Early execution of global deduplication to ensure data efficiency.
  • Assembly of at least 5–7 distinct, high-quality sources with balanced weights post-deduplication for optimal task generalization.
  • Use of context-length sampling schedules to enable long-context modeling without explosion in sequence lengths.
  • Careful selection or diversification of training files under data or compute budgets using algorithms such as DiSF.
  • For continual or sparse pretraining, combination of SlimPajama with targeted supplemental data (e.g., code) yields robust recovery and transferability.

Identified limitations include the high resource requirements for global deduplication at trillion-token scale and SlimPajama’s present English-language focus. Expansion to multilingual content and more scalable deduplication methods remain open directions (Shen et al., 2023, Fan et al., 29 Apr 2025).

7. Licensing, Public Availability, and Downstream Use

SlimPajama is openly available under Apache 2.0 and hosted via Hugging Face (https://huggingface.co/datasets/cerebras/SlimPajama-627B). The preprocessing pipeline (filtering, MinHashLSH deduplication, tokenization) is distributed with full source code. There are no further legal or usage restrictions beyond the license. Derivative evaluations, models (e.g., BTLM-3B-8K, SlimPajama-DC series), and specialized subsets (e.g., for sparse pretraining) are also openly released, enabling broad adoption for both academic and applied LLM research (Dey et al., 2023, Shen et al., 2023, Agarwalla et al., 2024).


In summary, SlimPajama represents a modern, multi-domain, low-redundancy, large-scale corpus supporting both efficient baseline LLM pretraining and advanced research into data-centric NLP, efficiency, and transfer learning. Its rigorous design and empirical validation have positioned it as a core dataset for open-source LLM development and for the study of data-centric algorithmic advances.
