Common Corpus: Ethical AI Pre-training Data
- Common Corpus is a massive, multilingual dataset compiled exclusively from open, ethically sourced content for large language model pre-training.
- It comprises six diverse collections—government, culture, science, code, web, and semantic—covering hundreds of languages and domains.
- Its rigorous curation and permissive licensing ensure compliance with legal and ethical standards, supporting reproducible AI research and commercial deployment.
The term "Common Corpus" refers, in its most influential recent usage, to a massive, multilingual, and ethically constructed dataset designed explicitly for the pre-training of LLMs and related AI systems. The Common Corpus presented by Bolla et al. (2025) marks a significant milestone in open data infrastructure: at nearly two trillion tokens, it is the largest publicly available collection meeting stringent legal, ethical, and technical requirements for use in AI research and deployment. Unlike earlier collections that incorporated substantial copyrighted or proprietary content, the Common Corpus is assembled exclusively from uncopyrighted or permissively licensed sources and is extensively documented for provenance, license, and curation procedures. Its design underpins both scientific and commercial AI efforts constrained by current and forthcoming data security and intellectual property regulations.
1. Dataset Composition and Structure
Common Corpus encompasses six principal, internally diverse collections:
- Open Government: Administrative, legal, and fiscal texts from international and national public domain sources (e.g., SEC, WTO, EU documents), spanning numerous languages.
- Open Culture: Digitized monographs, newspapers, and periodicals—including historic and heritage materials—across at least 13 major languages (French, English, German, Spanish, Portuguese, Italian, Dutch, Luxembourgish, Danish, Swedish, Serbian, Czech, Greek) plus smaller but significant representation in languages such as Arabic, Bengali, Latin, Persian, Russian, Sanskrit, and Urdu.
- Open Science: Openly licensed scientific publications (OpenAlex, arXiv), theses, book reviews, and clinical trials with a primary focus on English but substantial French, Spanish, and German content.
- Open Code: Permissively licensed code from The Stack (v1 and v2), spanning over 600 programming languages. Example token counts: Java (35.7B), JavaScript (28.9B), Python (26.7B), C++ (25.5B).
- Open Web: Curated resources such as Wikipedia, Wikisource, StackExchange forums, and Creative Commons-licensed YouTube transcripts.
- Open Semantic: Structured factual data, notably fully natural language renditions of Wikidata in 300 languages.
The full corpus exceeds 2 trillion tokens across more than 517 million documents. Token distribution by collection illustrates its breadth: Open Government (406B), Open Culture (886B), Open Science (281B), Open Code (283B), Open Web (73B), and Open Semantic (68B).
A summary table:

| Collection | Tokens | Main Domains | Core Languages |
|---|---|---|---|
| Open Government | 406B | Legal, admin, finance | EN, FR, DE, PL, others |
| Open Culture | 886B | Books, newspapers | EN, FR, DE, ES, IT, ... |
| Open Science | 281B | Scholarly publications | EN (85%), FR, ES, DE |
| Open Code | 283B | Software/code | 600+ programming languages |
| Open Web | 73B | Wikipedia, etc. | Multilingual |
| Open Semantic | 68B | Wikidata triples (NL text) | Multilingual (300 langs) |
The tokenization standard is provided by the Pleias base tokenizer, ensuring compatibility for large-scale AI pre-training.
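For illustration, the corpus can be streamed directly from HuggingFace with the `datasets` library. This is a minimal sketch: the repository identifier `PleIAs/common_corpus` and the field names `collection`, `license`, and `text` are assumptions about the public release and may differ between versions.

```python
from datasets import load_dataset

# Stream a few records without downloading the full ~2T-token corpus.
# The dataset ID "PleIAs/common_corpus" and the field names ("text",
# "collection", "license") are assumptions and may differ by release.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for i, doc in enumerate(corpus):
    print(doc.get("collection"), doc.get("license"), len(doc.get("text", "")))
    if i >= 4:  # inspect only the first few records
        break
```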
2. Licensing, Legal, and Ethical Compliance
All dataset components are either public domain or distributed under explicitly permissive licenses (e.g., CC-By, CC0, MIT, Apache, BSD, Open Data Commons). The curation process meticulously tracks license data, provenance, and other metadata for every document, facilitating compliance with regional regulations (e.g., GDPR, EU AI Act). Token counts by license:
- Public Domain: 1.1T tokens.
- Other licenses: CC-By (287B), MIT (143B), Apache 2.0 (69B), etc.
Systematic legal review, including checks on the author's date of death (Berne Convention, life plus 70 years), publication timing, and digitization rights, ensures that only legitimately open content is retained. Sensitive data is automatically processed to remove or mask personally identifiable information (PII) using Microsoft Presidio and custom regex tools. In addition, toxicity filtering is enforced through Celadon, an open-source, multilingual classifier targeting race, gender, religion, ability, and violence/abuse dimensions. All major processing, filtering, and metadata-enrichment pipelines are fully documented and open-sourced.
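The following sketch shows the general shape of such a PII-masking pass using Presidio's analyzer and anonymizer engines. It is an illustrative use of the library, not the corpus's actual pipeline; the entity list and example text are chosen for demonstration only.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "Contact Jane Doe at jane.doe@example.org or +1 555 010 9999."

# Detect common PII entity types in English text (entity list is illustrative).
analyzer = AnalyzerEngine()
findings = analyzer.analyze(
    text=text,
    entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    language="en",
)

# Replace detected spans with entity-type placeholders (Presidio's default).
anonymizer = AnonymizerEngine()
result = anonymizer.anonymize(text=text, analyzer_results=findings)
print(result.text)  # e.g. "Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>."
```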
This strict approach obviates the copyright and privacy issues that have affected other widely used datasets (e.g., Books3, LAION, MATH), enabling pre-trained models to be distributed, deployed, or commercialized with far less legal uncertainty.
3. Provenance, Curation, and Data Quality
Curation in Common Corpus is characterized by:
- Source validation: Partnerships with libraries, governmental data portals, and open repositories ensure all data is traceable and verifiable.
- Metadata completeness: Each record is annotated for license, language, document origin, and additional quality indicators.
- Text segmentation: The Segmentext tool robustly segments raw and OCR’d texts, supporting multilingual, multi-domain ingestion.
- OCR correction: Statistical (OCRoscope, cld2) and ML-based (OCRerrcr, DeBERTa) detectors quantify and annotate errors; the Llama 3-based OCRonos model performs correction across languages (a simple statistical noise heuristic is sketched after this list).
- Additional cleaning: Code files are filtered for language and structure, and only files with high utility and open licenses are included.
- Noise and bias filtering: Celadon is applied throughout to purge toxic content, especially in historical or noisy data.
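The sketch below illustrates the kind of lightweight statistical signal OCR-noise detectors rely on: a crude score based on the share of malformed tokens. It is a toy heuristic for intuition only, not the OCRoscope or DeBERTa-based metrics used in the actual pipeline.

```python
import re

def ocr_noise_score(text: str) -> float:
    """Crude OCR-noise indicator: fraction of tokens that are not plain
    alphabetic words. Illustrative only; not the corpus's actual metrics."""
    tokens = text.split()
    if not tokens:
        return 0.0
    noisy = sum(
        1 for t in tokens
        if not re.fullmatch(r"[A-Za-zÀ-ÿ'\-]+[.,;:!?]?", t)
    )
    return noisy / len(tokens)

# Documents above a chosen threshold could be routed to OCR correction.
print(ocr_noise_score("Th3 qu1ck brOwn f0x iumps ov3r"))                 # high score
print(ocr_noise_score("The quick brown fox jumps over the lazy dog."))   # low score
```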
Data are distributed as 10,000 Parquet files on HuggingFace, with each object’s metadata supporting complex subsetting (by language, license, domain, etc.).
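As a sketch of metadata-driven subsetting over the Parquet shards, the following uses pyarrow's dataset API with filter pushdown. The local directory path and the column names `license`, `language`, and `text` are assumptions and may not match the release exactly.

```python
import pyarrow.dataset as ds

# Assumes the Parquet shards have been downloaded to a local directory;
# the path and column names are illustrative assumptions.
dataset = ds.dataset("common_corpus_parquet/", format="parquet")

# Keep only permissively licensed French documents, e.g. for a
# license-restricted or jurisdiction-specific training subset.
table = dataset.to_table(
    columns=["text", "license", "language"],
    filter=(ds.field("license").isin(["CC-By", "CC0", "Public Domain"]))
    & (ds.field("language") == "fr"),
)
print(table.num_rows)
```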
4. Applications and Scientific Impact
Common Corpus underpins both research and industrial LLM development:
- European and Multilingual LLMs: Directly used for training major European open-source models (Pleias 350M/1.2B/3B, Salamandra, Lucie, Nvidia NeKo) and in Anthropic’s interpretability research.
- Open benchmarks and datasets: Components have seeded new multimedia datasets (FineVideo, Mosel) and open evaluation sets.
- Open science infrastructure: Acts as a legal and reproducible foundation for public laboratories, startups, and non-profits, reducing risk and enhancing sustainability.
- Multilingual and cultural coverage: Provides unprecedented depth for low- and medium-resource languages, directly addressing prior biases in AI training data.
Mixture-based pre-training strategies leveraging the corpus can be expressed as a weighted combination of component distributions:

$$P_{\text{train}} = \sum_{i} w_i \, P_{D_i}, \qquad \sum_{i} w_i = 1,$$

where each $D_i$ is a corpus component (e.g., Open Science, Open Culture), and the weights $w_i$ together with metadata filters define the training distribution.
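A minimal sketch of such a weighted mixture using the `datasets` library's `interleave_datasets` is shown below. The dataset identifier, the `collection` field, and the weights are assumptions chosen for illustration; the actual release may expose collections through separate configurations instead.

```python
from datasets import load_dataset, interleave_datasets

# Assumed dataset ID and "collection" field; weights are illustrative.
BASE = "PleIAs/common_corpus"
science = load_dataset(BASE, split="train", streaming=True).filter(
    lambda d: d.get("collection") == "Open Science"
)
culture = load_dataset(BASE, split="train", streaming=True).filter(
    lambda d: d.get("collection") == "Open Culture"
)

# P_train = 0.3 * P_science + 0.7 * P_culture, sampled on the fly.
mixture = interleave_datasets(
    [science, culture],
    probabilities=[0.3, 0.7],
    seed=42,
)

for i, doc in enumerate(mixture):
    if i >= 2:  # peek at the first couple of mixed records
        break
```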
5. Technical Infrastructure and Tools
The corpus is structured for scalability and reproducibility:
- HuggingFace distribution: Efficient download and sharding across compute clusters.
- Tokenization: Consistent with the Pleias tokenizer for multilingual modeling.
- Open-source processing tools: Segmentext (segmentation), OCRonos (OCR correction), Celadon (toxicity detection), and code filtering scripts are released for community auditing and extension.
- Flexible subsetting: Metadata (license, language, curation history) enables users to extract corpus subsets compatible with varied licensing regimes or regulatory requirements.
For code filtering, only files in the target programming language that meet content criteria (e.g., at least 25% alphabetic tokens and appropriate line lengths) are retained, as in the sketch below.
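A minimal sketch of such a filter is given below, using the simple thresholds named in the text; the production pipeline's exact checks and thresholds may differ.

```python
def keep_code_file(source: str) -> bool:
    """Illustrative code-filter heuristic in the spirit of the criteria above;
    not the corpus's exact implementation."""
    lines = source.splitlines()
    if not lines:
        return False
    # Reject files dominated by extremely long lines (often minified/generated).
    if max(len(line) for line in lines) > 1000:
        return False
    if sum(len(line) for line in lines) / len(lines) > 200:
        return False
    # Require at least 25% alphabetic characters.
    alpha = sum(ch.isalpha() for ch in source)
    return alpha / len(source) >= 0.25
```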
6. Significance and Role in AI Research and Industry
Common Corpus represents the first truly large-scale, multilingual, legally unencumbered dataset for the pre-training, release, and commercialization of both large and small language models. Its architecture supports compliance with the open-source AI definition ("any purpose and without having to ask for permission"), thereby enabling open science and industry collaboration. Its scale and diversity address earlier datasets' limitations, notably monolingual/code-dominant bias, legal and ethical risks, unreproducible curation, and insufficient coverage of many languages, coding domains, and knowledge areas.
The corpus is recognized as critical research infrastructure and is already in use by leading industry and academic modeling groups. Its documentation and licensing stand as a template for future responsible AI corpus design.
References for further exploration:
- Common Corpus on HuggingFace
- Open source tooling and additional documentation are provided via links in the main paper and in the metadata of the corpus itself.