Common Pile v0.1: Open LLM Dataset
- Common Pile v0.1 is a large, open-license text dataset designed for training language models while ensuring strict legal and ethical sourcing.
- It aggregates approximately 233 million documents from 30 diverse sources, including research articles, code, books, and legal texts.
- The curation process features rigorous cleaning, deduplication, and manual validation to deliver a high-quality, transparent training corpus for LLM research.
Common Pile v0.1 is an openly licensed, large-scale text dataset explicitly curated for training LLMs. Developed and released in 2025, it addresses prevalent concerns regarding intellectual property and ethical sourcing in LLM pretraining by exclusively collecting content in the public domain or under permissive open licenses. The dataset is accompanied by benchmark 7B-parameter LLMs (Comma v0.1-1T and 2T), as well as detailed curation code and mixture recipes, setting a new standard for transparency, quality, and legal clarity in the field.
1. Composition and Domain Coverage
Common Pile v0.1 aggregates text from 30 distinct sources, aiming for maximal diversity while maintaining strict licensing standards. The sources and content types reflected in the mixture include:
- Research Literature: arXiv, PubMed Central, and peS2o provide scientific papers across a broad range of disciplines.
- Code: BigCode/Stack v2, open-license subsets of GitHub, and Python Enhancement Proposals provide rich programming language data.
- Books: Project Gutenberg, Biodiversity Heritage Library, Library of Congress (with a focus on pre-1929 public domain works).
- Wikis: English Wikipedia, other Wikimedia content, and third-party wikis archived by WikiTeam.
- Legal, Government, and Patents: US GPO, USPTO patents, Caselaw Access Project, CourtListener, UK Hansard, Regulations.gov.
- Online Communities: StackExchange, Ubuntu IRC, open issues and comments from GitHub.
- Education: Directory of Open Access Books, PressBooks, OERCommons, LibreTexts.
- Supervised Datasets: Data Provenance Initiative, Flan, Open Assistant data.
- Multimedia Transcripts: Openly licensed YouTube channels, with high-quality speech transcriptions produced via Whisper (a minimal transcription sketch follows this source list).
- Web Text: English pages from Common Crawl with explicit CC BY, CC BY-SA, or CC0 licensing, Foodista, news portals with open licensing, Public Domain Review.
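As a concrete illustration of the transcription step mentioned above, here is a minimal sketch using the open-source `openai-whisper` package; the model size and input file name are assumptions, not the project's actual configuration.

```python
# Minimal transcription sketch with openai-whisper (pip install openai-whisper;
# requires ffmpeg on the system path). Model choice and file name are illustrative.
import whisper

model = whisper.load_model("large-v2")        # any Whisper checkpoint works
result = model.transcribe("talk_cc_by.mp3")   # hypothetical openly licensed audio file
print(result["text"])                         # plain-text transcript
```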
Statistics:
- Raw size: 8 TB (text).
- Filtered/deduped size: ~1.8 TB.
- Number of documents: Approximately 233 million.
- Effective training mixture: approximately 1 trillion (~999 billion) tokens per epoch for model pretraining.
The dataset emphasizes linguistic, thematic, and modal diversity, covering science, technical writing, law, fiction, news, technical discussion, and formal and informal prose. The primary language is English.
2. Data Acquisition and Curation Methodology
Strict Licensing Enforcement:
- All included content must meet the Open Knowledge Foundation's Open Definition 2.1.
- Excludes CC BY-NC, CC BY-ND, and other licenses with non-commercial or no-derivative restrictions.
- Only datasets and web domains where all documents possess explicit, open licensing are accepted—sources with ambiguous, inferred, or possibly "laundered" licenses are excluded.
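A minimal sketch of how such an allowlist check might operate is shown below; the license identifiers and the record format are illustrative assumptions, not the project's actual schema.

```python
# Illustrative license gate: keep only documents whose license string is on an
# explicit allowlist consistent with the Open Definition; reject NC/ND variants
# and anything ambiguous or missing.
OPEN_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0", "bsd-3-clause",
}

def is_openly_licensed(doc: dict) -> bool:
    """Return True only for an explicit, allowlisted license identifier."""
    license_id = (doc.get("license") or "").strip().lower()
    if not license_id:
        return False                      # missing/ambiguous licenses are excluded
    if "-nc" in license_id or "-nd" in license_id:
        return False                      # non-commercial / no-derivatives
    return license_id in OPEN_LICENSES

docs = [
    {"text": "...", "license": "CC-BY-4.0"},
    {"text": "...", "license": "CC-BY-NC-4.0"},
]
kept = [d for d in docs if is_openly_licensed(d)]   # keeps only the first document
```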
Curation and Processing Pipeline:
- Cleaning: Quality heuristics (DataComp-LM classifier for web, code-specific filtering for Stack v2), OCR error removal, language identification (FastText), and toxic content filtering (FastText/Jigsaw).
- PII Redaction: Regular expressions remove personally identifiable information (see the redaction sketch after this list).
- Deduplication: Aggressive, global, document-level fuzzy deduplication using Bloom filters; documents whose 20-grams overlap previously seen text by more than 90% are removed (a simplified sketch follows this list).
- Manual Review: Highest-volume domains and web text samples are manually validated for compliance.
- Mixture Rebalancing: High-quality or smaller-volume sources may be upsampled (up to 6×/epoch), while web and low-quality sources are downweighted; mixture weights are determined based on model ablation performance and empirical data quality.
- Synthetic Data: LLM-generated text is explicitly excluded, since the provenance of a generating model's training data may be unclear or legally ambiguous.
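To make the PII step above concrete, here is a minimal redaction sketch; the patterns are generic illustrations, not the exact regular expressions used in the pipeline.

```python
import re

# Illustrative PII patterns: email addresses, phone-like numbers, IPv4 addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(text: str) -> str:
    """Replace matched spans with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    text = IPV4.sub("<IP>", text)
    return text

print(redact_pii("Contact jane.doe@example.org or +1 (555) 123-4567."))
# -> Contact <EMAIL> or <PHONE>.
```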
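The deduplication step can likewise be sketched in simplified form: a document is dropped when more than 90% of its 20-grams have already been seen. The set used here is a stand-in; a Bloom filter replaces it at terabyte scale to keep membership checks memory-bounded, and the real pipeline differs in detail.

```python
# Simplified fuzzy-dedup sketch over whitespace-tokenized 20-grams.
NGRAM = 20
THRESHOLD = 0.90

def ngrams(text: str, n: int = NGRAM):
    tokens = text.split()
    if len(tokens) < n:
        return [tuple(tokens)]                    # short docs yield one n-gram
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dedup(docs):
    seen = set()                                  # stand-in for a Bloom filter
    kept = []
    for doc in docs:
        grams = ngrams(doc)
        overlap = sum(g in seen for g in grams) / len(grams)
        if overlap <= THRESHOLD:                  # keep docs with <=90% overlap
            kept.append(doc)
        seen.update(grams)                        # record n-grams either way
    return kept
```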
Significance:
This approach ensures a high-quality, legally transparent dataset at terabyte scale—something not achieved by prior open-license efforts, which either included at-risk content or failed to match coverage and size.
3. Model Training and Empirical Validation
Comma v0.1 Models:
- Architecture: 7B-parameter, Llama-style decoder-only Transformers (Comma v0.1-1T and v0.1-2T).
- Training Regime: Trained on 1T and 2T tokens, respectively, using the Comma dataset—a filtered and balanced version of Common Pile v0.1.
- Tokenizer: Custom 64k-vocabulary BPE trained on a 600 GB Comma sample; Unicode is not normalized, and the pre-tokenization regex matches the Llama-3.2/Hugging Face ByteLevel behavior (a tokenizer-training sketch follows this list).
- Optimizer/Framework: Trained with Meta Lingua using AdamW and a cosine learning-rate schedule with a final cooldown, with batch sizes up to 2048 sequences × 4096 tokens.
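The tokenizer setup can be illustrated with the Hugging Face `tokenizers` library; the training file, special tokens, and pre-tokenizer options below are assumptions rather than the exact Comma v0.1 recipe.

```python
# Sketch: training a 64k-vocabulary byte-level BPE tokenizer.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    special_tokens=["<|endoftext|>"],             # illustrative special token
)
tokenizer.train(files=["comma_sample.txt"], trainer=trainer)  # hypothetical text sample
tokenizer.save("comma_bpe_64k.json")
```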
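The learning-rate schedule mentioned above (cosine decay followed by a cooldown) can be written as a simple function; the warmup length, peak rate, and cooldown fraction are placeholder values, not the published hyperparameters.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup=2000, cooldown_frac=0.1):
    """Illustrative schedule: linear warmup, cosine decay to 10% of peak,
    then a linear cooldown to zero over the final fraction of training."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup:                                       # linear warmup
        return peak_lr * step / warmup
    if step >= cooldown_start:                              # linear cooldown to 0
        remaining = total_steps - step
        return 0.1 * peak_lr * remaining / (total_steps - cooldown_start)
    progress = (step - warmup) / (cooldown_start - warmup)  # cosine segment
    return peak_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))
```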
Performance Benchmarks:
- Controlled Ablation: 1.7B-parameter models were trained under matched settings on Comma, OLC, Common Corpus, KL3M, The Pile, OSCAR, and FineWeb. The Comma-trained model outperformed those trained on the other open-license datasets and came close to The Pile, which contains substantial non-open content.
- Direct Comparison: Against LLMs trained on unlicensed text (Llama 1/2, MPT-7B, OpenLLaMA, StableLM, RedPajama), the Comma v0.1 models met or exceeded performance on knowledge-intensive and coding benchmarks (e.g., ARC, MMLU, and code generation), though they lagged on some commonsense and conversational tests, reflecting gaps in open-license data coverage.
- Robustness: Additional experiments demonstrate stability with respect to hyperparameter and curriculum changes.
Performance Table (indicative, as per paper):
Model | Training Corpus | Raw Size | License Status | MMLU | ARC | Code |
---|---|---|---|---|---|---|
Comma 1T | Common Pile v0.1 | 8 TB | Open | Competitive | Competitive | Leading |
Llama 2 7B | Unlicensed (Meta) | 2 TB+ | Mixed | Baseline | Baseline | Baseline |
A plausible implication is that performant LLMs can be trained using strictly open text if curation and coverage are sufficient.
4. Legal, Ethical, and Practical Considerations
Licensing and Reuse:
- All data and derived model artifacts adhere to the Open Knowledge Foundation definition; explicit provenance and source-level licensing metadata are maintained.
- Models and data are fully distributable, auditable, and reusable by third parties, freeing both research and commercial users from legal uncertainty typical of web-scraped datasets.
Ethical Controls:
- Exclusion of synthetic (LLM-generated) text prevents ambiguous downstream rights.
- Content with possible license laundering is systematically screened out.
- PII removal is enforced as much as feasible, though perfect removal is not guaranteed.
Gaps and Cautions:
- The dataset is presently English-centric.
- Strict licensing requirements mean some domains (e.g., certain forums, recent copyrighted books) are underrepresented compared to less restrictive corpora.
- Some document-level licensing errors may pass through automated screening, but downstream users have access to full provenance to audit and curate subsets as needed.
5. Community Release and Resources
Open Distribution:
- Data: All raw, filtered, and deduplicated versions are available via Hugging Face (a minimal loading sketch follows this list).
- Curation Code: Complete source code for acquisition, filtering, deduplication, mixture balancing, and validation via GitHub.
- Model Artifacts: Comma v0.1-1T and 2T checkpoints, mixture definitions, and tokenizer available on Hugging Face.
- Documentation: Mixture recipes, license metainformation, and configuration files accompany the release.
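For the data release, a typical access pattern is streaming individual sources with the `datasets` library; the repository id below is a hypothetical placeholder and should be replaced with the actual dataset name from the release page.

```python
# Sketch: streaming one Common Pile source from Hugging Face.
from datasets import load_dataset

ds = load_dataset("common-pile/stackexchange", split="train", streaming=True)  # hypothetical repo id
for record in ds.take(3):
    print(record.keys())                       # inspect the available fields
```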
Legal and Ethical Leadership:
- By strictly enforcing open licensing at scale, Common Pile v0.1 provides a clear migration path for researchers and industry wary of legal or ethical uncertainty in LLM deployment.
- The dataset’s transparency enables novel research in dataset auditability, attribution tracing, and license-compliant LLM output.
- The project is positioned as a "first step" in establishing best practices for scalable, legal, and ethically sound data curation for future AI research.
6. Impact and Future Prospects
Sectoral Impact:
- Common Pile v0.1 is the largest, highest-quality open-license LLM pretraining dataset released to date. It is directly competitive in scale, diversity, and utility with widely used (but risk-laden) corpora like The Pile and MassiveText, providing a new reference for ethical LLM research and commercial deployment.
Open Research Directions:
- As the pool of openly licensed content continues to expand through initiatives in publishing, education, and government transparency, further improvements in coverage and quality are expected.
- Potential exists for inclusion of more languages, richer dialog/conversation sources, and more dynamic domain-responsiveness via mixture rebalancing.
- A plausible implication is that future LLMs trained exclusively on open data may attain or surpass the performance of current models on all benchmarks as open content grows.
Summary Table: Main Features of Common Pile v0.1
Feature | Description |
---|---|
Size (raw/filtered) | 8 TB / ~1.8 TB |
Sources | 30 (science, code, books, law, education, multimedia, web) |
Licensing | Public domain / open license only (CC BY, CC0, etc.) |
Deduplication | Global, fuzzy, aggressive |
Mixture Rebalancing | Based on empirical ablation and data utility |
Model performance | Competitive with Llama 1/2 7B (unlicensed) |
Code/recipes released | Yes (full data curation and model training) |
Legal/ethical focus | Strict license compliance; no synthetic/uncertain data |
Common Pile v0.1 sets a new benchmark for open, scalable, and legally unambiguous LLM pretraining, supporting both high-performance research and compliance-conscious deployment across sectors.