Pile Dataset: A Diverse English Corpus

Updated 13 May 2026
  • The Pile dataset is a large-scale, 825 GiB curated corpus of English text featuring high domain diversity and rigorous documentation.
  • It aggregates data from 22 varied sources including academic, code, legal, and conversational texts, enabling reproducible research.
  • The MiniPile 6 GiB subset allows for efficient experimentation while maintaining the domain balance of the full dataset.

The Pile dataset is an 825 GiB English-language corpus of human-authored text, designed and assembled by EleutherAI for training large language models (LLMs). Its defining features are high domain diversity, rigorous documentation, and public availability for reproducible research. A 6 GiB stratified subset, “MiniPile”, has also been curated to enable data-efficient experimentation with comparable domain coverage. The Pile and its derivatives underpin a significant fraction of contemporary open-source language modeling research due to their breadth, quality, and comprehensive documentation (Gao et al., 2020, Biderman et al., 2022, Kaddour, 2023).

1. Motivation and Goals

The Pile addresses the limitations of earlier corpora—most notably the predominant use of filtered Common Crawl (CC)—by providing increased diversity, cross-domain coverage, and transparent provenance. Its objectives are: (1) to serve as a unified, high-quality training dataset supporting state-of-the-art and baseline transformer models; (2) to function as a robust evaluation benchmark for in-distribution and cross-domain generalization; and (3) to facilitate controlled ablations and domain-specific studies via well-documented, reproducible preprocessing and component splits (Gao et al., 2020, Biderman et al., 2022).

2. Source Composition and Structure

The Pile is assembled from 22 constituent sources, each drawn from distinct genres and publication contexts. This composition spans academic (arXiv, PubMed Central, PhilPapers), code (GitHub, StackExchange), legal (FreeLaw), conversational (OpenSubtitles, Ubuntu IRC, HackerNews), literary (Books3, Project Gutenberg, BookCorpus2), governmental documents (USPTO, NIH ExPORTER), web pages (Pile-CC, OpenWebText2, Wikipedia), and more. The largest sources by volume are Pile-CC (27.5%), Books3 (12.2%), GitHub (11.5%), PubMed Central (10.9%), and OpenWebText2 (7.6%). Each domain exhibits distinct statistical, syntactic, and topical distributions, contributing to robust cross-domain modeling capacities (Biderman et al., 2022, Gao et al., 2020).

Source Category         | Example Subsets                | Share (%)
General Web             | Pile-CC, OpenWebText2          | 35.1
Books & Literature      | Books3, Gutenberg, BookCorpus2 | 20.9
Academic/Scientific     | arXiv, PubMed, PhilPapers      | 16.2
Code/Q&A                | GitHub, StackExchange          | 15.5
Legal/Government        | FreeLaw, USPTO, NIH            | 8.5
Dialogue/Conversational | OpenSubtitles, IRC, HackerNews | 2.5

Share percentages are approximate and non-exclusive because sources are up- or down-sampled in the epoch mix.
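The category shares in the table can be read as mixture weights for sampling. The following is a minimal sketch under that assumption; the function names are illustrative and not part of any Pile tooling. Because the listed shares sum to 98.7 rather than 100, the sketch renormalizes them before sampling.

```python
import random

# Category shares (%) copied from the table above; they sum to 98.7, not 100.
CATEGORY_SHARE = {
    "General Web": 35.1,
    "Books & Literature": 20.9,
    "Academic/Scientific": 16.2,
    "Code/Q&A": 15.5,
    "Legal/Government": 8.5,
    "Dialogue/Conversational": 2.5,
}


def normalized_weights(shares):
    """Rescale percentage shares so that they sum to 1.0."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def sample_categories(shares, n, seed=0):
    """Draw n category labels with probability proportional to share."""
    rng = random.Random(seed)
    weights = normalized_weights(shares)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


if __name__ == "__main__":
    weights = normalized_weights(CATEGORY_SHARE)
    draws = sample_categories(CATEGORY_SHARE, 10_000)
    for name in CATEGORY_SHARE:
        observed = draws.count(name) / len(draws)
        print(f"{name}: target={weights[name]:.3f} observed={observed:.3f}")
```

Sampling at the category level is a simplification; a real data loader would more likely weight the 22 individual sources.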

3. Data Collection, Preprocessing, and Quality Control

The Pile’s construction pipeline includes scraping or bulk downloading source data, format normalization (removing HTML, metadata, or XML, and converting to plain text), language detection and filtering (retaining primarily English text), deduplication within major web-based sources (via MinHashLSH), and quality-based pruning (removal of boilerplate, short or malformed texts, and excessive duplicates). Tokenization uses the GPT-2 byte-pair encoding (BPE) vocabulary of approximately 50,000 tokens, and the final dataset is randomly shuffled and split into shards for scalable downstream use. The primary dataset comprises ~211 million documents averaging ~4 KiB apiece, with a long right tail toward megabyte-length items (especially in books and code) (Gao et al., 2020, Biderman et al., 2022).
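The deduplication step can be illustrated with a small MinHash sketch. This is not the Pile's actual implementation: it omits the LSH banding that makes the method scale and compares signatures pairwise instead. The shingle size, number of hash functions, and 0.8 threshold are assumptions chosen for illustration, and all function names are hypothetical.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(seed, item):
    """Deterministic 64-bit hash of a string under a given integer seed."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=64, k=5):
    """One minimum hash value per seed; similar shingle sets share many minima."""
    items = shingles(text, k)
    return [min(seeded_hash(seed, s) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8, num_perm=64):
    """Keep the first document of each near-duplicate group (quadratic scan)."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_perm)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

The pairwise scan is only practical for small collections. At corpus scale, signatures are split into bands and hashed into buckets so that only documents sharing a bucket are compared.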

4. Subsampling and the MiniPile Subset

Recognizing the prohibitive computational cost and environmental footprint of full-scale pre-training, a stratified 6 GiB “MiniPile” subset of approximately 1 million documents was introduced (Kaddour, 2023). This subset is designed to preserve the Pile’s cross-domain balance while being accessible for smaller-scale experiments. Its curation employs embedding-based clustering with E5-Large transformer encodings, followed by k-means clustering (K = 220, cosine distance) and manual exclusion of clusters identified as including low-quality, duplicative, or non-informative texts. The domain shares in MiniPile are maintained at: code (38%), books (30%), dialogue (10%), science (10%), and web (8%), reflecting the original full Pile composition.
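The cluster-then-exclude procedure can be sketched in plain Python. This is an illustration under stated assumptions, not the MiniPile code: the input vectors stand in for E5-Large embeddings (which are not computed here), k-means is run in its spherical form so that assignment uses cosine similarity, and the set of excluded cluster ids is supplied by hand to mimic the manual review step. All names are hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_kmeans(vectors, k, iters=20, seed=0):
    """Spherical k-means: assign each point to the centroid with highest cosine similarity."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: on unit vectors, max dot product = min cosine distance.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update step: re-normalized mean of each cluster's members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels


def drop_clusters(docs, labels, excluded):
    """Remove every document whose cluster id was flagged for exclusion."""
    return [doc for doc, lab in zip(docs, labels) if lab not in excluded]
```

In the setting described above, `k` would be 220 and the excluded ids would come from inspecting sample documents in each cluster.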

5. Benchmarking and Model Evaluation

Extensive downstream evaluations validate the Pile’s utility as a training and testbed corpus:

  • Perplexity and Bits-per-Byte (bpb) Benchmarks: LMs trained on the Pile achieve substantially lower bpb and perplexity on challenging academic and code subdomains than models trained on CC-only or less diverse datasets.
  • Downstream Fine-tuning: MiniPile-based models (BERT, T5) incur only modest drops (1.9–2.5 percentage points on GLUE and Super-Natural Instructions, SNI) compared to reference models trained on up to 745× more data, indicating that coverage and curation, not just dataset size, drive generalization (Kaddour, 2023).
  • Ablations: Removing or replacing key subdomains directly impacts in-domain and zero-shot test performance, particularly on unfamiliar or technical tasks.
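The bits-per-byte metric mentioned in the first item normalizes language-model loss by text length in bytes rather than tokens, which makes models with different tokenizers comparable. A minimal sketch follows, assuming the common definition bpb = (tokens / bytes) × loss / ln 2, with the loss given as mean negative log-likelihood in nats per token. The function names are illustrative.

```python
import math


def bits_per_byte(nll_nats_per_token, num_tokens, num_bytes):
    """Convert mean per-token negative log-likelihood (nats) to bits per byte."""
    if num_tokens <= 0 or num_bytes <= 0:
        raise ValueError("num_tokens and num_bytes must be positive")
    tokens_per_byte = num_tokens / num_bytes
    return tokens_per_byte * nll_nats_per_token / math.log(2)


def perplexity(nll_nats_per_token):
    """Token-level perplexity is the exponential of the mean loss in nats."""
    return math.exp(nll_nats_per_token)
```

As a sanity check, a loss of ln 2 nats per token with exactly one token per byte corresponds to 1 bit per byte. Token-level perplexity depends on the tokenizer, whereas bpb does not.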

6. Licensing, Ethics, and Distribution

The Pile is released under an MIT License, with the critical caveat that individual subsets retain their original licensing. These licenses range from public domain (e.g., Gutenberg, FreeLaw) through MIT-like (DM Math) and CC-BY-SA (Wikipedia, StackExchange) to more ambiguous terms (Books3, GitHub). EleutherAI asserts that its use of included copyrighted works is compatible with U.S. fair use, but copyright law varies by jurisdiction; the dataset includes guidance and code to reconstruct subsets or exclude specific components. Societal biases, personally identifiable information (PII), and toxic language are present to varying degrees, particularly in conversational and web-mined subsets. Users are advised to conduct downstream audits and employ filtering or re-weighting when necessary (Biderman et al., 2022, Gao et al., 2020).
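Excluding specific components can be sketched as a filter over JSON Lines records. The sketch assumes each record carries its component label under `meta.pile_set_name`, and that labels such as "Books3" match the names actually stored in the data; both should be verified against the copy being used. The function name is hypothetical and this is not the official tooling.

```python
import json


def filter_components(lines, excluded):
    """Yield parsed records whose component label is not in `excluded`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Assumed layout: {"text": "...", "meta": {"pile_set_name": "..."}}
        name = record.get("meta", {}).get("pile_set_name")
        if name not in excluded:
            yield record


if __name__ == "__main__":
    sample = [
        json.dumps({"text": "a novel", "meta": {"pile_set_name": "Books3"}}),
        json.dumps({"text": "a court opinion", "meta": {"pile_set_name": "FreeLaw"}}),
    ]
    for rec in filter_components(sample, excluded={"Books3"}):
        print(rec["meta"]["pile_set_name"], "->", rec["text"])
```

Because the function accepts any iterable of lines, it can be applied to an open file handle and streamed shard by shard without loading a shard into memory. Records with no label are kept, which a stricter audit might want to change.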

7. Impact, Limitations, and Future Directions

The Pile has become a de facto benchmark and pretraining resource for open-source LLMs such as GPT-Neo, GPT-J, and GPT-NeoX, facilitating research into scaling laws, robustness, and data curation strategies. Its construction highlights the necessity and challenge of balancing breadth, quality, and ethical considerations. Remaining limitations include residual domain and topical imbalances, persistence of societal biases, and uncertainties regarding downstream generalization outside English. Potential future directions involve data pruning for maximal efficiency, active learning-driven dataset expansion, targeted domain up-sampling, and the development of multilingual Pile-style corpora (Kaddour, 2023, Gao et al., 2020, Biderman et al., 2022).
