Common Pile v0.1: Open LLM Dataset
- Common Pile v0.1 is a large, open-license text dataset designed for training language models while ensuring strict legal and ethical sourcing.
- It aggregates approximately 233 million documents from 30 diverse sources, including research articles, code, books, and legal texts.
- The curation process features rigorous cleaning, deduplication, and manual validation to deliver a high-quality, transparent training corpus for LLM research.
Common Pile v0.1 is an openly licensed, large-scale text dataset explicitly curated for training LLMs. Developed and released in 2025, it addresses prevalent concerns regarding intellectual property and ethical sourcing in LLM pretraining by exclusively collecting content in the public domain or under permissive open licenses. The dataset is accompanied by benchmark 7B-parameter LLMs (Comma v0.1-1T and 2T), as well as detailed curation code and mixture recipes, setting a new standard for transparency, quality, and legal clarity in the field.
1. Composition and Domain Coverage
Common Pile v0.1 aggregates text from 30 distinct sources, aiming for maximal diversity while maintaining strict licensing standards. The sources and content types reflected in the mixture include:
- Research Literature: arXiv, PubMed Central, and peS2o provide scientific papers across a broad range of disciplines.
- Code: BigCode/Stack v2, open-license subsets of GitHub, and Python Enhancement Proposals provide rich programming language data.
- Books: Project Gutenberg, Biodiversity Heritage Library, Library of Congress (with a focus on pre-1929 public domain works).
- Wikis: English Wikipedia, other Wikimedia content, and third-party wikis archived by WikiTeam.
- Legal, Government, and Patents: US GPO, USPTO patents, Caselaw Access Project, CourtListener, UK Hansard, Regulations.gov.
- Online Communities: StackExchange, Ubuntu IRC, open issues and comments from GitHub.
- Education: Directory of Open Access Books, PressBooks, OERCommons, LibreTexts.
- Supervised Datasets: Data Provenance Initiative, Flan, Open Assistant data.
- Multimedia Transcripts: Openly licensed YouTube channels, with high-quality speech transcriptions produced via Whisper (a minimal transcription sketch follows this source list).
- Web Text: English pages from Common Crawl with explicit CC BY, CC BY-SA, or CC0 licensing, Foodista, news portals with open licensing, Public Domain Review.
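As a concrete illustration of the transcription step mentioned above, here is a minimal sketch using the open-source `openai-whisper` package; the model size and input file name are assumptions, not the project's actual configuration.

```python
# Minimal transcription sketch with openai-whisper (pip install openai-whisper;
# requires ffmpeg on the system path). Model choice and file name are illustrative.
import whisper

model = whisper.load_model("large-v2")        # any Whisper checkpoint works
result = model.transcribe("talk_cc_by.mp3")   # hypothetical openly licensed audio file
print(result["text"])                         # plain-text transcript
```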
Statistics:
- Raw size: 8 TB (text).
- Filtered/deduped size: ~1.8 TB.
- Number of documents: Approximately 233 million.
- Effective training mixture: approximately 1 trillion (~999 billion) tokens per epoch for model pretraining.
The dataset emphasizes linguistic, thematic, and modal diversity, covering science, technical writing, law, fiction, news, technical discussion, and formal and informal prose. The primary language is English.
2. Data Acquisition and Curation Methodology
Strict Licensing Enforcement:
- All included content must meet the Open Knowledge Foundation's Open Definition 2.1.
- Excludes CC BY-NC, CC BY-ND, and other licenses with non-commercial or no-derivative restrictions.
- Only datasets and web domains where all documents possess explicit, open licensing are accepted—sources with ambiguous, inferred, or possibly "laundered" licenses are excluded.
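A minimal sketch of how such an allowlist check might operate is shown below; the license identifiers and the record format are illustrative assumptions, not the project's actual schema.

```python
# Illustrative license gate: keep only documents whose license string is on an
# explicit allowlist consistent with the Open Definition; reject NC/ND variants
# and anything ambiguous or missing.
OPEN_LICENSES = {
    "public-domain", "cc0-1.0", "cc-by-4.0", "cc-by-sa-4.0",
    "mit", "apache-2.0", "bsd-3-clause",
}

def is_openly_licensed(doc: dict) -> bool:
    """Return True only for an explicit, allowlisted license identifier."""
    license_id = (doc.get("license") or "").strip().lower()
    if not license_id:
        return False                      # missing/ambiguous licenses are excluded
    if "-nc" in license_id or "-nd" in license_id:
        return False                      # non-commercial / no-derivatives
    return license_id in OPEN_LICENSES

docs = [
    {"text": "...", "license": "CC-BY-4.0"},
    {"text": "...", "license": "CC-BY-NC-4.0"},
]
kept = [d for d in docs if is_openly_licensed(d)]   # keeps only the first document
```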
Curation and Processing Pipeline:
- Cleaning: Quality heuristics (DataComp-LM classifier for web, code-specific filtering for Stack v2), OCR error removal, language identification (FastText), and toxic content filtering (FastText/Jigsaw).
- PII Redaction: Regular expressions remove personally identifiable information (see the redaction sketch after this list).
- Deduplication: Aggressive, global, document-level fuzzy deduplication using Bloom filters; documents whose 20-grams overlap previously seen text by more than 90% are removed (a simplified sketch follows this list).
- Manual Review: Highest-volume domains and web text samples are manually validated for compliance.
- Mixture Rebalancing: High-quality or smaller-volume sources may be upsampled (up to 6×/epoch), while web and low-quality sources are downweighted; mixture weights are determined based on model ablation performance and empirical data quality.
- Synthetic Data: LLM-generated text is explicitly excluded, since the provenance of a generating model's training data may be unclear or legally ambiguous.
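To make the PII step above concrete, here is a minimal redaction sketch; the patterns are generic illustrations, not the exact regular expressions used in the pipeline.

```python
import re

# Illustrative PII patterns: email addresses, phone-like numbers, IPv4 addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def redact_pii(text: str) -> str:
    """Replace matched spans with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    text = IPV4.sub("<IP>", text)
    return text

print(redact_pii("Contact jane.doe@example.org or +1 (555) 123-4567."))
# -> Contact <EMAIL> or <PHONE>.
```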
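The deduplication step can likewise be sketched in simplified form: a document is dropped when more than 90% of its 20-grams have already been seen. The set used here is a stand-in; a Bloom filter replaces it at terabyte scale to keep membership checks memory-bounded, and the real pipeline differs in detail.

```python
# Simplified fuzzy-dedup sketch over whitespace-tokenized 20-grams.
NGRAM = 20
THRESHOLD = 0.90

def ngrams(text: str, n: int = NGRAM):
    tokens = text.split()
    if len(tokens) < n:
        return [tuple(tokens)]                    # short docs yield one n-gram
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def dedup(docs):
    seen = set()                                  # stand-in for a Bloom filter
    kept = []
    for doc in docs:
        grams = ngrams(doc)
        overlap = sum(g in seen for g in grams) / len(grams)
        if overlap <= THRESHOLD:                  # keep docs with <=90% overlap
            kept.append(doc)
        seen.update(grams)                        # record n-grams either way
    return kept
```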
Significance:
This approach ensures a high-quality, legally transparent dataset at terabyte scale—something not achieved by prior open-license efforts, which either included at-risk content or failed to match coverage and size.
3. Model Training and Empirical Validation
Comma v0.1 Models:
- Architecture: 7B-parameter, Llama-style decoder-only Transformers (Comma v0.1-1T and v0.1-2T).
- Training Regime: Trained on 1T and 2T tokens, respectively, using the Comma dataset—a filtered and balanced version of Common Pile v0.1.
- Tokenizer: Custom 64k-vocabulary BPE trained on a 600 GB Comma sample; Unicode is not normalized, and the pre-tokenization regex matches the Llama-3.2/Hugging Face ByteLevel behavior (a tokenizer-training sketch follows this list).
- Optimizer/Framework: Trained with Meta Lingua using AdamW and a cosine learning-rate schedule with a final cooldown, with batch sizes up to 2048 sequences × 4096 tokens.
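The tokenizer setup can be illustrated with the Hugging Face `tokenizers` library; the training file, special tokens, and pre-tokenizer options below are assumptions rather than the exact Comma v0.1 recipe.

```python
# Sketch: training a 64k-vocabulary byte-level BPE tokenizer.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    special_tokens=["<|endoftext|>"],             # illustrative special token
)
tokenizer.train(files=["comma_sample.txt"], trainer=trainer)  # hypothetical text sample
tokenizer.save("comma_bpe_64k.json")
```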
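The learning-rate schedule mentioned above (cosine decay followed by a cooldown) can be written as a simple function; the warmup length, peak rate, and cooldown fraction are placeholder values, not the published hyperparameters.

```python
import math

def lr_at_step(step, total_steps, peak_lr=3e-4, warmup=2000, cooldown_frac=0.1):
    """Illustrative schedule: linear warmup, cosine decay to 10% of peak,
    then a linear cooldown to zero over the final fraction of training."""
    cooldown_start = int(total_steps * (1 - cooldown_frac))
    if step < warmup:                                       # linear warmup
        return peak_lr * step / warmup
    if step >= cooldown_start:                              # linear cooldown to 0
        remaining = total_steps - step
        return 0.1 * peak_lr * remaining / (total_steps - cooldown_start)
    progress = (step - warmup) / (cooldown_start - warmup)  # cosine segment
    return peak_lr * (0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress)))
```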
Performance Benchmarks:
- Controlled Ablation: 1.7B-parameter models were trained under matched settings on Comma, OLC, Common Corpus, KL3M, The Pile, OSCAR, and FineWeb. The Comma-trained model outperformed those trained on the other open-license datasets and came close to The Pile, which contains substantial non-open content.
- Direct Comparison: Against LLMs trained on unlicensed text (Llama 1/2, MPT-7B, OpenLLaMA, StableLM, RedPajama), the Comma v0.1 models met or exceeded performance on knowledge-intensive and coding benchmarks (e.g., ARC, MMLU, and code generation), though they lagged on some commonsense and conversational tests, reflecting gaps in open-license data coverage.
- Robustness: Additional experiments demonstrate stability with respect to hyperparameter and curriculum changes.
Performance Table (indicative, as per paper):
Model | Training Corpus | Raw Size | License Status | MMLU | ARC | Code |
---|---|---|---|---|---|---|
Comma 1T | Common Pile v0.1 | 8 TB | Open | Competitive | Competitive | Leading |
Llama 2 7B | Unlicensed (Meta) | 2 TB+ | Mixed | Baseline | Baseline | Baseline |
A plausible implication is that performant LLMs can be trained using strictly open text if curation and coverage are sufficient.
4. Legal, Ethical, and Practical Considerations
Licensing and Reuse:
- All data and derived model artifacts adhere to the Open Knowledge Foundation definition; explicit provenance and source-level licensing metadata are maintained.
- Models and data are fully distributable, auditable, and reusable by third parties, freeing both research and commercial users from legal uncertainty typical of web-scraped datasets.
Ethical Controls:
- Exclusion of synthetic (LLM-generated) text prevents ambiguous downstream rights.
- Content with possible license laundering is systematically screened out.
- PII removal is enforced as much as feasible, though perfect removal is not guaranteed.
Gaps and Cautions:
- The dataset is presently English-centric.
- Strict licensing requirements mean some domains (e.g., certain forums, recent copyrighted books) are underrepresented compared to less restrictive corpora.
- Some document-level licensing errors may pass through automated screening, but downstream users have access to full provenance to audit and curate subsets as needed.
5. Community Release and Resources
Open Distribution:
- Data: All raw, filtered, and deduplicated versions are available via Hugging Face (a minimal loading sketch follows this list).
- Curation Code: Complete source code for acquisition, filtering, deduplication, mixture balancing, and validation via GitHub.
- Model Artifacts: Comma v0.1-1T and 2T checkpoints, mixture definitions, and tokenizer available on Hugging Face.
- Documentation: Mixture recipes, license metainformation, and configuration files accompany the release.
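For the data release, a typical access pattern is streaming individual sources with the `datasets` library; the repository id below is a hypothetical placeholder and should be replaced with the actual dataset name from the release page.

```python
# Sketch: streaming one Common Pile source from Hugging Face.
from datasets import load_dataset

ds = load_dataset("common-pile/stackexchange", split="train", streaming=True)  # hypothetical repo id
for record in ds.take(3):
    print(record.keys())                       # inspect the available fields
```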
Legal and Ethical Leadership:
- By strictly enforcing open licensing at scale, Common Pile v0.1 provides a clear migration path for researchers and industry wary of legal or ethical uncertainty in LLM deployment.
- The dataset’s transparency enables novel research in dataset auditability, attribution tracing, and license-compliant LLM output.
- The project is positioned as a "first step" in establishing best practices for scalable, legal, and ethically sound data curation for future AI research.
6. Impact and Future Prospects
Sectoral Impact:
- Common Pile v0.1 is the largest, highest-quality open-license LLM pretraining dataset released to date. It is directly competitive in scale, diversity, and utility with widely used (but risk-laden) corpora like The Pile and MassiveText, providing a new reference for ethical LLM research and commercial deployment.
Open Research Directions:
- As the pool of openly licensed content continues to expand through initiatives in publishing, education, and government transparency, further improvements in coverage and quality are expected.
- Potential exists for inclusion of more languages, richer dialog/conversation sources, and more dynamic domain-responsiveness via mixture rebalancing.
- A plausible implication is that future LLMs trained exclusively on open data may attain or surpass the performance of current models on all benchmarks as open content grows.
Summary Table: Main Features of Common Pile v0.1
Feature | Description |
---|---|
Size (raw/filtered) | 8 TB / ~1.8 TB |
Sources | 30 (science, code, books, law, education, multimedia, web) |
Licensing | Public domain / open license only (CC BY, CC0, etc.) |
Deduplication | Global, fuzzy, aggressive |
Mixture Rebalancing | Based on empirical ablation and data utility |
Model performance | Competitive with Llama 1/2 7B (unlicensed) |
Code/recipes released | Yes (full data curation and model training) |
Legal/ethical focus | Strict license compliance; no synthetic/uncertain data |
Common Pile v0.1 sets a new benchmark for open, scalable, and legally unambiguous LLM pretraining, supporting both high-performance research and compliance-conscious deployment across sectors.