
RedPajama Dataset for LLM Research

Updated 25 October 2025
  • RedPajama dataset is an open-source, large-scale text corpus designed for pretraining LLMs, featuring rigorous documentation and quality signals.
  • It aggregates multi-domain data—from CommonCrawl to Wikipedia—using tailored pipelines and advanced filtering to ensure reproducibility and high data quality.
  • Its modular versions, including a curated V1 and a massive V2 with deduplication metadata, enable flexible use and critical evaluation in LLM training and deployment.

The RedPajama dataset is an open-source resource designed as a foundation for pretraining LLMs. It provides large-scale text corpora spanning multiple domains, rigorous documentation, extensive metadata, and sophisticated filtering signals to address transparency, data quality, and reproducibility challenges in LLM development. RedPajama is available in two primary versions: RedPajama-V1, an open reproduction of the LLaMA training dataset aggregating several canonical sources, and RedPajama-V2, a massive, web-only corpus with accompanying quality signals and deduplication artifacts. With over 100 trillion tokens in its largest version, RedPajama is one of the most expansive publicly accessible datasets for language modeling research.

1. Dataset Versions and Composition

RedPajama-V1 aggregates seven primary data sources: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange, yielding approximately 1.2 trillion tokens. Each domain is processed with tailored pipelines:

  • CommonCrawl data is filtered using CCNet, which utilizes perplexity bucketing ("head," "middle," "tail") via a Kneser-Ney 5-gram model trained on Wikipedia. Only the "head" and "middle" buckets are retained for higher quality.
  • Wikipedia, ArXiv, and Books undergo specialized cleaning and reference classification to eliminate low-quality entries.
  • GitHub data is restricted to projects under permissive licenses (Apache, BSD, MIT), with additional heuristic filtering (e.g., maximum line length, file extension whitelisting).

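The CCNet bucketing step above can be sketched as follows. This is a minimal illustration, assuming perplexities under the Wikipedia-trained Kneser-Ney 5-gram model have already been computed; the cutoff values here are hypothetical placeholders, not CCNet's actual per-language thresholds.

```python
# Sketch of CCNet-style perplexity bucketing. Assumption: each document's
# perplexity under a Wikipedia-trained 5-gram LM is precomputed; lower
# perplexity is treated as higher quality. Cutoffs below are illustrative.

def bucket_documents(docs, low_cut, high_cut):
    """Assign each (text, perplexity) pair to head/middle/tail buckets."""
    buckets = {"head": [], "middle": [], "tail": []}
    for text, ppl in docs:
        if ppl <= low_cut:
            buckets["head"].append(text)
        elif ppl <= high_cut:
            buckets["middle"].append(text)
        else:
            buckets["tail"].append(text)
    return buckets

docs = [("clean encyclopedic text", 120.0),
        ("average web page", 310.0),
        ("keyword spam spam spam", 900.0)]
buckets = bucket_documents(docs, low_cut=200.0, high_cut=500.0)
# RedPajama-V1 retains only the head and middle buckets.
kept = buckets["head"] + buckets["middle"]
```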
RedPajama-V2 forgoes engineered curation in favor of scale and flexibility. It consists of raw, unfiltered CommonCrawl web data from 84 monthly snapshots spanning 2014–2023, yielding over 100 trillion tokens. Rather than excluding noisy or low-quality samples by default, V2 preserves the full corpus and augments each document with nearly 40 quality signals and deduplication metadata, enabling downstream researchers to define custom, principled filtering criteria. RedPajama-V2 partitions its corpus by language (e.g., English, German, French, Spanish, Italian) and subdivides by quality and perplexity buckets.
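Consuming the V2 quality signals to define a custom filter might look like the sketch below. The signal names and thresholds are illustrative assumptions, not the exact keys used in the released metadata files.

```python
# Minimal sketch: applying custom thresholds to per-document quality
# signals, in the spirit of RedPajama-V2's "filter it yourself" design.
# Field names and cutoffs below are hypothetical placeholders.

def keep_document(signals: dict) -> bool:
    """Return True if a document passes an example quality filter."""
    return (
        signals["word_count"] >= 50
        and signals["frac_all_caps_words"] < 0.1
        and signals["frac_unique_words"] > 0.3
        and signals["wikipedia_perplexity"] < 500.0
    )

good = {"word_count": 320, "frac_all_caps_words": 0.02,
        "frac_unique_words": 0.55, "wikipedia_perplexity": 240.0}
spam = {"word_count": 12, "frac_all_caps_words": 0.4,
        "frac_unique_words": 0.2, "wikipedia_perplexity": 1800.0}
```

Because the raw corpus is preserved, different research groups can apply different threshold sets to the same snapshots and still reproduce each other's subsets exactly.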

2. Quality Signals, Metadata, and Deduplication

RedPajama datasets include elaborate metadata to enable reproducible curation and analysis. Key signals and artifacts:

  • Natural language heuristics: Measures such as fraction of all-caps words, lines ending in ellipses, fraction of unique words, average word length, unigram entropy, and lexical diversity.
  • Repetitiveness indicators: Frequency and coverage of duplicated n-grams at various window sizes (e.g., character-level for 2–4-grams, phrase-level for 5–10 word n-grams).
  • Content filters: Flags for harmful or off-topic content based on blocklists (e.g., LDNOOBW), URL filtering, and curated domain blocklists.
  • ML-based signals: FastText classifiers measuring similarity to high-quality domains (Wikipedia, Books, OpenWebText), and DSIR importance weights computed as log-likelihood ratios across model distributions.
  • Deduplication artifacts: Each document is annotated with MinHash and Bloom-filter flags, supporting both fuzzy (near-duplicate) and exact deduplication. For example, MinHash deduplication in RedPajama-V2 uses 128 hash functions, partitioned into bands and rows for scalable Jaccard similarity thresholding (e.g., a threshold of 0.8).
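The MinHash-with-banding scheme described in the last bullet can be sketched in a self-contained way. The 128 hash functions, lowercase word n-gram shingling, and ~0.8 Jaccard target come from the text above; the specific band/row split (9 bands of 14 rows, which yields a collision threshold of roughly (1/9)^(1/14) ≈ 0.85) and the use of blake2b as the hash family are illustrative assumptions.

```python
import hashlib

# Sketch of MinHash + LSH banding for fuzzy deduplication. A 128-slot
# signature is split into b bands of r rows; two documents become duplicate
# candidates if any band matches exactly. The collision threshold is roughly
# (1/b)**(1/r). b=9, r=14 (using 126 of 128 slots) is an illustrative choice.

NUM_PERM, BANDS, ROWS = 128, 9, 14

def shingles(text, n=13):
    """Lowercase word 13-grams (the shingling scheme cited for SlimPajama)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash(text):
    """One min-hash value per seed, simulating 128 independent hash functions."""
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles(text))
        for seed in range(NUM_PERM)
    ]

def band_keys(sig):
    """LSH bucket keys; documents sharing any key are dedup candidates."""
    return {(b, tuple(sig[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)}

sig_a = minhash("the quick brown fox jumps over the lazy dog " * 20)
sig_b = minhash("the quick brown fox jumps over the lazy dog " * 20)  # duplicate
sig_c = minhash("completely different content about astronomy and stars " * 20)
dup_candidates = band_keys(sig_a) & band_keys(sig_b)  # non-empty: flagged
```

In production pipelines this banding is what makes deduplication scale: candidate pairs are found by hash-bucket collision rather than by comparing all document pairs.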

3. Data Quality, Biases, and Content Analysis

Systematic analyses reveal several important aspects of RedPajama data quality:

  • Duplication: Approximately 50% of RedPajama documents are exact duplicates, with ~218 million duplicate clusters. This redundancy results from oversampling, templated web content, and machine-generated pages (Elazar et al., 2023).
  • Synthetic and low-quality text: High frequency of boilerplate, algorithmic, or synthetic content is evident in anomalous n-gram and document length statistics.
  • PII and toxicity: The dataset contains significant quantities of personally identifiable information—on the order of 35 million email addresses, 70 million phone numbers, and 1.1 million IP addresses—identified via regular expressions with post-processing filters. Toxicity is present in up to 10.3% of documents as measured by classifiers (Elazar et al., 2023).
  • Benchmark contamination: Evaluation sets (e.g., GLUE, SuperGLUE, COPA) are heavily contaminated, with test data appearing verbatim in the training corpus. COPA is fully contaminated, and eight of fifteen evaluated benchmarks show >50% contamination rates, leading to inflated model performance when models are assessed on these tasks.
  • Biases and classifier detectability: Despite similar extraction and filtering pipelines to other CommonCrawl-based datasets, RedPajama-V2 displays a distinct bias "fingerprint." Transformer-based classifiers can accurately distinguish RedPajama-V2 from C4 or RefinedWeb with ~80% accuracy in multi-class settings and >90% accuracy in binary settings. These dataset-specific biases persist after LLM training and show up in generated outputs, affecting downstream generalization and mixture proportion estimation (Mansour et al., 3 Dec 2024).
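The regex-based PII scan referenced above can be illustrated with a simplified sketch. The patterns below are deliberately basic placeholders; the cited audits use more elaborate expressions plus post-processing filters to reduce false positives.

```python
import re

# Simplified sketch of regex-based PII scanning. Patterns are illustrative;
# real audits combine stricter expressions with post-processing filters.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_pii(text):
    """Return counts of each PII type detected in a document."""
    return {kind: len(pat.findall(text)) for kind, pat in PII_PATTERNS.items()}

sample = "Contact jane.doe@example.com or 555-123-4567; server at 192.168.0.1."
counts = scan_pii(sample)
```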

4. Filtering, Deduplication, and Best Practices

The corpus is designed for downstream flexibility; researchers may apply custom filtering and deduplication using distributed hash-based methods and thresholded quality signals.

  • Global deduplication: Advanced deduplication strategies such as MinHashLSH are applied globally across data sources. For SlimPajama (a rigorously deduplicated subset of RedPajama), document signatures are computed using lowercase 13-grams, and a Jaccard similarity threshold of 0.8 is used for deduplication (Shen et al., 2023). For trillion-token scale, deduplication is parallelized over 64 CPU cores with a peak memory usage of ~1.4TB.
  • Filtering heuristics: Subsets are often filtered to mimic procedures used in Gopher or RefinedWeb, e.g., imposing thresholds on language ID scores, word count, mean word length (3 < mean word length < 10), and symbol-to-word ratios (Herold et al., 17 Jun 2024).
  • Content curation and ablation: Analyses demonstrate that leveraging quality signals, deduplication, and mixing multiple high-quality sources substantially improves downstream LLM performance across natural language understanding tasks, multiple-choice reasoning benchmarks, and coreference resolution (Weber et al., 19 Nov 2024).
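Computing Gopher/RefinedWeb-style heuristics directly from raw text can be sketched as below. The mean-word-length window mirrors the 3–10 rule mentioned above; the minimum word count, symbol set, and symbol-ratio cutoff are illustrative assumptions.

```python
# Sketch of Gopher/RefinedWeb-style heuristic filtering on raw text.
# The 3 < mean word length < 10 window follows the rule cited above;
# min_words, the symbol set, and max_symbol_ratio are assumptions.
SYMBOLS = set("#{}[]<>|\\")

def passes_heuristics(text, min_words=50, max_symbol_ratio=0.1):
    words = text.split()
    if len(words) < min_words:
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 < mean_len < 10:
        return False
    symbol_ratio = sum(text.count(s) for s in SYMBOLS) / len(words)
    return symbol_ratio <= max_symbol_ratio

good = "plain readable sentence with ordinary vocabulary " * 10
bad = "x " * 200  # mean word length of 1 fails the 3-10 window
```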

5. Impact on LLM Training and Deployment

RedPajama datasets have already been used for training production-scale LLMs (e.g., Snowflake Arctic, Salesforce XGen, AI2 OLMo (Weber et al., 19 Nov 2024), and eBay’s LiLiuM (Herold et al., 17 Jun 2024)). Empirical studies show:

  • Sample efficiency and generalization: Eliminating redundancy and balancing domain representation improves the effectiveness of each training token and enhances model generalization (Shen et al., 2023).
  • Multilingual breadth: RedPajama-V2 is notably multilingual (36% non-English after filtering), supporting competitive results in both English and non-English tasks, including machine translation (Herold et al., 17 Jun 2024).
  • Benchmark results: Models trained on deduplicated and diverse mixtures from RedPajama (e.g., SlimPajama-DC) outperform those trained on the original dataset across standardized NLU benchmarks (Shen et al., 2023).
  • Filtering’s effect on performance: Applying quality signal filtering and deduplication produces results comparable to or slightly better than curated datasets like RefinedWeb, especially when tailored for target application domains (Herold et al., 17 Jun 2024).

6. Dataset Inference and Provenance Verification

Recent research introduces dataset inference techniques to determine whether a given model was trained on RedPajama or its subsets (Maini et al., 10 Jun 2024). The key approach:

  • Aggregates multiple Membership Inference Attack (MIA) signals for a set of candidate samples.
  • Uses linear regression to combine features and performs statistical hypothesis testing (e.g., t-test on membership scores) to distinguish between training and non-training sets, with p-values < 0.1 indicating likely training set inclusion and low rates of false positives.
  • This methodology enables creators and regulators to verify unauthorized use of proprietary data, informing copyright enforcement and fair use debates.
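The hypothesis-testing step above can be sketched with stdlib-only code. The aggregated membership scores are hypothetical data; a one-sided Welch t-test is used, and because SciPy's t distribution is not available in the standard library, the tail probability is approximated with the normal distribution, which is reasonable at realistic sample sizes (thousands of candidate documents).

```python
import math
import statistics

# Sketch of the statistical test in dataset inference: aggregated MIA scores
# for a suspect set vs. a known non-member set, compared via a one-sided
# Welch t-test. Normal tail used as a large-sample approximation (stdlib only).

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def one_sided_p(t):
    """P(T > t) under a standard normal approximation."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# Hypothetical aggregated membership scores (higher = more member-like).
suspect = [0.62, 0.71, 0.66, 0.69, 0.73, 0.65, 0.70, 0.68, 0.64, 0.72]
holdout = [0.48, 0.55, 0.51, 0.46, 0.53, 0.50, 0.47, 0.54, 0.49, 0.52]
p = one_sided_p(welch_t(suspect, holdout))
likely_member = p < 0.1  # decision threshold cited above
```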

7. Open Challenges, Future Directions, and Recommendations

The dataset’s vast scale and flexible filtering pose ongoing challenges:

  • Transparency: Comprehensive documentation of sources, pipelines, filtering signals, and deduplication processes is emphasized to promote reproducibility and openness (Weber et al., 19 Nov 2024).
  • Bias mitigation and analysis: Studies underscore the persistence of dataset-specific formatting, vocabulary, and content biases in both raw data and generated outputs, encouraging improved pipeline standardization and more robust generalization strategies (Mansour et al., 3 Dec 2024).
  • Continued refinement: Developing adaptive filtering, dynamic threshold setting, and advanced deduplication may further improve data quality and model performance.
  • Ethical risks: Presence of PII and toxic content requires stringent curation and filtering before model deployment (Elazar et al., 2023).
  • Evaluation and reporting: Contamination of evaluation benchmarks necessitates explicit reporting and analysis of training/test splits to uphold fair model comparison (Elazar et al., 2023).

Table: Summary of Key RedPajama Dataset Characteristics

Characteristic        | RedPajama-V1            | RedPajama-V2
Domain Sources        | 7 (CC, C4, GH, Wiki…)   | Web-only (CC, 84 monthly snapshots)
Token Count           | ~1.2T                   | >100T
Deduplication         | CCNet, source-level     | MinHash, Bloom filters
Metadata              | Domain, pipeline fields | ~40 signals, dedup hashes
Filtering Flexibility | Predefined heuristics   | User-defined via signals

In summary, RedPajama datasets constitute a foundational resource for LLM research: vast in scale, meticulously documented, and designed for transparency and reproducibility. The inclusion of extensive quality signals and deduplication metadata, together with rigorous ablation studies and real-world model deployments, underscores its significance for open, high-performance, and accountable LLM training.
