The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
(2506.05209v1)
Published 5 Jun 2025 in cs.CL and cs.LG
Abstract: LLMs are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
The paper "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text" (Kandpal et al., 5 Jun 2025) introduces and releases a large-scale text dataset specifically curated to contain only public domain and openly licensed content. The primary motivation is to address the intellectual property (IP) infringement and ethical concerns surrounding the common practice of training LLMs on vast amounts of unlicensed text scraped from the internet.
The Common Pile v0.1 is an 8TB collection of text gathered from 30 diverse sources, spanning domains such as research papers, code, books, encyclopedias, educational materials, audio transcripts, and government/legal documents. The authors define "openly licensed" text based on the Open Knowledge Foundation's Open Definition 2.1, which includes licenses like Creative Commons BY (CC BY), CC BY-SA, CC0, and Blue Oak Council-approved software licenses (like MIT), while explicitly excluding non-commercial (NC) and no-derivatives (ND) licenses.
A significant effort was dedicated to license due diligence. The authors highlight challenges like "license laundering" (where copyrighted work is redistributed with an incorrect license), the distinction between collection licenses (like ODC-By, which licenses the dataset compilation, not necessarily its contents) and the licenses of individual documents, and the ambiguity surrounding LLM-generated synthetic data. To mitigate these risks, they adopted strict sourcing standards, prioritizing content where the license was confidently provided by the copyright holder, leading to the exclusion of sources like OpenAlex and certain web scrapes with unclear status.
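To make the licensing criteria concrete, the snippet below sketches an allowlist-style filter in the spirit of the Open Definition 2.1 and Blue Oak Council criteria described above. The license identifiers and metadata field names are illustrative assumptions; the paper's actual curation combined source-specific checks and manual review rather than a single automated filter.

```python
# Minimal sketch of an allowlist-style license filter. The metadata schema and
# license identifiers are hypothetical, not the paper's actual pipeline.
ALLOWED_LICENSES = {
    # Open Definition 2.1-conformant content licenses
    "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0", "public-domain",
    # Examples of Blue Oak Council-approved software licenses
    "mit", "apache-2.0", "bsd-3-clause",
}

def is_openly_licensed(doc: dict) -> bool:
    """Keep a document only if its own license (not its collection's) is allowed."""
    license_id = str(doc.get("license", "")).strip().lower()
    # NC/ND variants are excluded even though they are Creative Commons licenses.
    if "-nc" in license_id or "-nd" in license_id:
        return False
    return license_id in ALLOWED_LICENSES

docs = [
    {"id": 1, "license": "CC-BY-4.0"},
    {"id": 2, "license": "CC-BY-NC-4.0"},   # excluded: non-commercial
    {"id": 3, "license": "ODC-By-1.0"},     # collection-level license only; excluded
]
print([d["id"] for d in docs if is_openly_licensed(d)])  # -> [1]
```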
The composition of the Common Pile includes:
Scientific and Scholarly Text: A filtered version of the peS2o corpus, openly licensed articles from PubMed Central, and openly licensed papers and CC0 abstracts from arXiv.
Online Discussions: Content from StackExchange dumps (CC BY-SA), GitHub Archive (issues, PRs, comments from Blue Oak Council-licensed repos), and public domain logs from Ubuntu IRC.
Government and Legal Texts: Public domain documents from the US Government Publishing Office (e.g., the Federal Register and government reports), USPTO patents via the Google Patents Public Data dataset, public domain US court decisions (Caselaw Access Project, CourtListener), UK Hansard records released under the Open Parliament Licence, and US federal agency regulations from Regulations.gov.
Curated Task Data: Openly licensed datasets (matching specific criteria to avoid license laundering and ensure original content) sourced via the Data Provenance Initiative.
Public Domain Books: Public domain books identified from the Biodiversity Heritage Library, pre-1929 books via the Hathifiles/Internet Archive, the Library of Congress's "Selected Digitized Books" collection, and selected public domain books from Project Gutenberg.
Open Educational Resources: Openly licensed content from Directory of Open Access Books (DOAB), PressBooks, OERCommons, and LibreTexts.
Wikis: Content from official Wikimedia wikis (Wikipedia, Wikinews, etc. - CC BY-SA) and a filtered subset of Wikiteam unofficial dumps (CC BY, CC BY-SA, public domain) after checking for license laundering.
Source Code: The openly licensed subset of The Stack v2 and public domain Python Enhancement Proposals (PEPs).
Transcribed Audio Content: Whisper transcriptions of manually curated, speech-heavy YouTube videos explicitly uploaded under a CC BY license, reducing license-laundering risk.
Web Text: Filtered and deduplicated content from 52 Common Crawl snapshots containing explicit CC BY, CC BY-SA, or CC0 markers, after manually verifying top domains. Also includes specific sites like Foodista, Open Newswire news sites, and the Public Domain Review.
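As a rough illustration of the web-text step, the sketch below checks a page's HTML for an explicit link to an allowed Creative Commons license, loosely in the spirit of rel="license" markers. The regex, accepted license paths, and function name are assumptions; they do not reproduce the Common Pile's actual Common Crawl pipeline or its manual domain verification.

```python
import re

# Illustrative detection of explicit CC BY / CC BY-SA / CC0 markers in a web page.
# The accepted paths and the regex itself are assumptions for this sketch.
CC_LINK = re.compile(
    r'href="https?://creativecommons\.org/'
    r'(?:licenses/(by|by-sa)/\d\.\d|publicdomain/zero/1\.0)/?"',
    re.IGNORECASE,
)

def detect_open_cc_marker(html: str) -> str | None:
    """Return 'by', 'by-sa', or 'cc0' if an allowed CC marker is present, else None."""
    match = CC_LINK.search(html)
    if not match:
        return None
    return match.group(1).lower() if match.group(1) else "cc0"

page = '<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY</a>'
print(detect_open_cc_marker(page))  # -> "by"
```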
Before pretraining on the raw Common Pile, the authors created a processed dataset called the "Comma dataset." This involved extensive preprocessing and filtering with the Dolma toolkit, including language identification with FastText, text-quality filtering with a DataComp-LM classifier, removal of documents with heavy OCR errors via likelihood-based filtering, toxicity filtering with FastText classifiers, regex-based PII redaction, and source-specific boilerplate removal (the paper's appendix lists the full per-source filtering configuration). Document-level fuzzy deduplication used a Bloom filter, deeming two documents duplicates if they shared more than 90% of their 20-grams. Code data received additional language-specific filtering and quality classification. Finally, the data sources were heuristically mixed: the performance of models trained on 28B tokens from each source guided up- or down-weighting, targeting specific repetition rates (at most 6 repeats over a 1-trillion-token budget) and acknowledging that a source's size did not track its perceived quality (e.g., the very large USPTO patent text). The final Comma dataset is approximately 1.8TB after filtering, with an effective size of about 4TB after mixing and before repetition for the 1-trillion-token budget.
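The deduplication rule can be illustrated with a small sketch: a document is dropped when more than 90% of its 20-grams have already been seen. A plain Python set stands in for the Bloom filter that the Dolma-based pipeline uses for memory efficiency, so this toy version is exact rather than probabilistic.

```python
# Toy version of 20-gram fuzzy deduplication: drop a document if >90% of its
# 20-grams were already observed. A set replaces the Bloom filter used at scale.
def ngrams(text: str, n: int = 20):
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))]

def deduplicate(documents, threshold: float = 0.9, n: int = 20):
    seen = set()          # stand-in for a Bloom filter over (hashed) n-grams
    kept = []
    for doc in documents:
        grams = ngrams(doc, n)
        overlap = sum(g in seen for g in grams) / len(grams)
        if overlap <= threshold:   # keep unless >90% of n-grams are repeats
            kept.append(doc)
        seen.update(grams)         # insert all n-grams (an implementation choice)
    return kept
```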
To validate the quality of the Common Pile, the authors conducted two sets of experiments:
Controlled Dataset Quality Experiments: They trained 1.7 billion parameter models on 28 billion tokens drawn from the Comma dataset and compared them against models trained on existing openly licensed corpora (OLC, Common Corpus, KL3M) and prominent unlicensed datasets (The Pile, OSCAR, FineWeb). Models were evaluated on "early signal" benchmarks (ARC, MMLU, HellaSwag, OpenBookQA, CommonSenseQA, PIQA, SocialIQA); an illustrative evaluation-harness sketch appears after these experiment summaries. The results (reported in the paper's figures and appendix tables) showed that models trained on the Comma dataset consistently outperformed those trained on the other openly licensed corpora and on The Pile, performed comparably to models trained on OSCAR, and slightly lagged FineWeb overall, while excelling on scientific and scholarly benchmarks (MMLU, ARC). An ablation removing the curated task data from the Common Pile showed minimal impact on performance.
Large-Scale Training with Comma v0.1 Models: They trained two 7 billion parameter LLMs, Comma v0.1-1T and Comma v0.1-2T, on 1 trillion and 2 trillion tokens of the Comma dataset, respectively. A custom BPE tokenizer with a 64,000-token vocabulary was trained on the Comma dataset. Training used a Llama-like architecture in the lingua framework, with a two-stage schedule (a cosine learning-rate phase followed by a cool-down phase on higher-quality data) and checkpoint averaging; a minimal sketch of checkpoint averaging is given below. Performance on a broad benchmark suite (covering knowledge, reasoning, and code tasks) was compared against compute-matched models trained on unlicensed data (Llama 1 and 2 7B, MPT-7B, etc.), with Qwen3 8B included as a reference point. The results (reported in the paper's figures and appendix tables) demonstrated that Comma v0.1-1T and -2T achieve performance competitive with similarly budgeted models trained on unlicensed data, and are particularly strong on knowledge-based and coding tasks, though weaker on HellaSwag and PIQA. The authors note that the 2T training run involved significant data repetition and that performance might improve with a 2T-specific curriculum.
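For the benchmark evaluations described above, a run with EleutherAI's lm-evaluation-harness might look like the sketch below. The checkpoint path is hypothetical and the task names may vary across harness versions; the paper does not specify its exact evaluation tooling.

```python
# Sketch of running "early signal" benchmarks with lm-evaluation-harness (v0.4+ API).
# The checkpoint path is a placeholder; task names may differ by harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                  # Hugging Face causal LM backend
    model_args="pretrained=path/to/1.7b-ablation-checkpoint",    # hypothetical path
    tasks=["arc_easy", "arc_challenge", "mmlu", "hellaswag",
           "openbookqa", "commonsense_qa", "piqa", "social_iqa"],
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```

The checkpoint averaging mentioned for the Comma v0.1 runs can likewise be sketched minimally in PyTorch: load several late-training checkpoints and average their parameters element-wise. The file names and the number of checkpoints are assumptions, since the paper does not state the averaging interval.

```python
import torch

# Minimal checkpoint-averaging sketch over raw parameter state_dicts.
# Checkpoint file names and count are hypothetical.
checkpoint_paths = ["step_097000.pt", "step_098000.pt", "step_099000.pt", "step_100000.pt"]

avg_state = None
for path in checkpoint_paths:
    state = torch.load(path, map_location="cpu")
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(checkpoint_paths) for k, v in avg_state.items()}
torch.save(avg_state, "comma_v0.1_averaged.pt")
```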
The paper concludes that it is indeed possible to train performant LLMs using only public domain and openly licensed text. The Common Pile v0.1, as the largest such dataset to date, demonstrates this potential. By releasing the dataset, the training code, and the Comma v0.1 model checkpoints, the authors aim to contribute to a more ethical, transparent, and reproducible LLM ecosystem. They highlight the growing availability of openly licensed data (charted in the paper's appendix) as a promising sign for future scaling efforts in this domain.