Comma v0.1-2T: Open-License LLM

Updated 30 June 2025
  • Comma v0.1-2T is a 7B-parameter large language model trained solely on curated open-license data from the Common Pile v0.1 corpus.
  • It employs a two-stage training process with explicit source weighting and checkpoint averaging to achieve stable, scalable performance.
  • The model attains competitive benchmark scores against Llama 1/2 7B while ensuring full auditability and copyright compliance.

Comma v0.1-2T is a 7-billion-parameter LLM trained exclusively on public domain and openly licensed text, serving both as a validation of the Common Pile v0.1 corpus and as a technical step toward fair, transparent, and scalable LLMs. It is one of two reference models, Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively, with the Common Pile as their sole data source. Comma v0.1-2T is notable for reaching performance competitive with similarly sized LLMs trained on unlicensed web scrapes, including Llama 1 7B and Llama 2 7B, within a fully auditable and copyright-compliant framework.

1. Data Foundation: The Common Pile v0.1 Corpus

Comma v0.1-2T is trained on a curated subset of Common Pile v0.1, an 8 terabyte corpus drawn from 30 sources spanning varied domains including scientific publications, legal documents, books, encyclopedic resources, code, government records, educational materials, online discussions, and CC-licensed audio transcripts.

  • The corpus emphasizes license clarity: all CC-NC/CC-ND and ambiguously licensed collections are excluded, with manual and automated verification to guard against "license laundering" and ambiguous attribution (a minimal allow-list sketch follows this list).
  • Scientific and technical content includes peS2o (filtered S2ORC), ArXiv with explicit license curation, and PubMed Central.
  • Legal and government data sources comprise the US GovInfo API, US Patents, UK Hansard, Regulations.gov, CaseLaw Access Project, and Court Listener, most of which are public domain.
  • Wiki and reference material derive from Wikimedia Foundation projects and >300,000 independently run wikis, all confirmed open-licensed.
  • All included code comes from The Stack v2 (open-license subset only), PEPs, and GitHub projects with Blue Oak Council-approved licenses.
  • The collection also integrates more than 1 million transcribed YouTube videos (CC BY), and a wide variety of open educational resources (e.g., LibreTexts, DOAB).
  • Curated task-format data (for supervised fine-tuning) are included only where explicitly open-licensed and documented via the Data Provenance Initiative.
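
To make the license-clarity rule concrete, here is a minimal allow-list-style check. It is a sketch only: the license identifiers, the `license` metadata field, and the helper name are illustrative assumptions, not the project's actual schema or pipeline.

```python
# Illustrative license allow-list check (assumed metadata schema).
# Documents whose license is not explicitly on the allow list are dropped
# rather than guessed at.
ALLOWED_LICENSES = {
    "public-domain",
    "cc0-1.0",
    "cc-by-4.0",
    "cc-by-sa-4.0",
    "mit",          # example of a Blue Oak Council-approved code license
    "apache-2.0",
}

def keep_document(doc: dict) -> bool:
    """Return True only if the document carries an unambiguous open license."""
    license_id = (doc.get("license") or "").strip().lower()
    # Anything non-commercial, no-derivatives, or unknown is excluded outright.
    parts = license_id.split("-")
    if "nc" in parts or "nd" in parts:
        return False
    return license_id in ALLOWED_LICENSES

docs = [
    {"text": "...", "license": "CC-BY-4.0"},
    {"text": "...", "license": "CC-BY-NC-4.0"},   # excluded: non-commercial
    {"text": "...", "license": ""},               # excluded: ambiguous
]
kept = [d for d in docs if keep_document(d)]
```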

Each data stream is filtered for language, quality, toxicity, and personally identifying information. Duplicates are globally removed, and each document maintains explicit provenance metadata.
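
A hedged sketch of this per-document processing is shown below. The filter heuristic, the exact-hash deduplication, and the metadata field names are placeholders standing in for the real classifiers, near-duplicate detection, and schema used by the actual pipeline.

```python
import hashlib

def passes_filters(doc: dict) -> bool:
    """Placeholder quality gate: in the real pipeline, language ID, quality,
    toxicity, and PII screens would each be dedicated classifiers/heuristics."""
    text = doc["text"]
    return bool(text.strip()) and len(text.split()) >= 20  # stand-in heuristic

def deduplicate(docs):
    """Global exact deduplication by content hash (a stand-in for whatever
    duplicate-detection scheme the real pipeline applies)."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def attach_provenance(doc: dict, source: str, license_id: str) -> dict:
    """Every retained document keeps explicit provenance metadata."""
    doc["meta"] = {"source": source, "license": license_id}
    return doc

# Example usage (source and license values are illustrative):
# clean = [attach_provenance(d, "example_source", "cc-by-4.0")
#          for d in deduplicate(docs) if passes_filters(d)]
```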

2. Training Procedure and Model Configuration

Comma v0.1-2T is a 7B-parameter decoder-only LLaMA-family transformer with a 4096-token context window, tokenized using a custom 64k BPE vocabulary trained on 600GB of representative corpus data.
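
As a rough configuration sketch, the stated properties map onto a LLaMA-style hyperparameter block like the one below. Only the parameter count, context window, and vocabulary size come from the description above; the remaining dimensions are typical LLaMA-7B-class values assumed purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class CommaConfigSketch:
    # Stated in the description above.
    n_params_approx: str = "7B"
    max_seq_len: int = 4096        # context window
    vocab_size: int = 64_000       # custom BPE vocabulary
    # Typical LLaMA-7B-class dimensions, assumed here for illustration only.
    d_model: int = 4096
    n_layers: int = 32
    n_heads: int = 32

config = CommaConfigSketch()
```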

  • Models are trained under the lingua framework, an efficient, open-source LLM training toolkit.
  • The model is trained for 2 trillion tokens with a two-stage procedure:

    1. Main training: On a quality-weighted, source-balanced mixture, with per-source repetition ratios determined by pilot modeling to mitigate overfitting to small or high-quality components (e.g., key sources repeated up to 16× for the 2T run).
    2. Cool-down phase: The learning rate is decayed to zero while training continues on a higher-quality subset (predominantly research, code, and encyclopedic sources); the final model is the average of the last 10 cool-down checkpoints. A schematic sketch of the source weighting follows this list.
  • Optimization uses AdamW with a weight decay of 0.2 and cosine learning-rate decay, mirroring Llama 3 best practices.

  • Checkpoints from every phase, the full training mixture specification, filters, and the final model are released in full.
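
The source-weighted sampling referenced above can be sketched schematically as follows. The source names are drawn from the corpus description, but the token counts, repetition caps, and weighting rule are illustrative placeholders rather than the published mixture specification.

```python
import random

# Illustrative mixture spec: per-source available tokens (billions) and
# repetition caps. Real weights are published with the model release.
MIXTURE = {
    # source: (available_tokens_B, max_repetitions)
    "pes2o":    (50.0, 4),
    "stack_v2": (200.0, 2),
    "wiki":     (10.0, 16),   # small, high-quality sources repeated up to 16x
    "caselaw":  (30.0, 4),
}

def effective_weights(mixture):
    """Sampling weight per source = available tokens x allowed repetitions,
    normalised; a stand-in for the published source-weighting procedure."""
    budget = {s: toks * reps for s, (toks, reps) in mixture.items()}
    total = sum(budget.values())
    return {s: b / total for s, b in budget.items()}

def sample_source(weights, rng=random.Random(0)):
    """Draw the source of the next training document according to the weights."""
    sources, probs = zip(*weights.items())
    return rng.choices(sources, weights=probs, k=1)[0]

weights = effective_weights(MIXTURE)
print(weights)
print(sample_source(weights))
```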

3. Performance and Benchmarking

Comma v0.1-2T is evaluated on a suite of standard LLM benchmarks aligned with the OLMES framework, including:

  • Reasoning and knowledge: ARC (Challenge/Easy), MMLU, CommonsenseQA, OpenBookQA, SocialIQA, PIQA, HellaSwag, BoolQ.
  • Code generation: HumanEval (pass@10), MBPP.
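
HumanEval pass@10 is conventionally computed with the unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021). A minimal sketch is shown below; the sampling counts in the example are illustrative, not the evaluation settings used for these models.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generated solutions of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of them pass the unit tests.
print(round(pass_at_k(n=20, c=5, k=10), 3))
```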

Benchmark results are directly compared to Llama 1 and Llama 2 7B (trained on ~1T and 2T tokens of unlicensed web corpus data), OLMo-Twin, DeepSeek LLM, StableLM, MPT-7B, OpenLLaMA-7B, and the compute-heavy Qwen3 8B as an upper bound.

  • Knowledge and domain-specific tasks: Comma v0.1-2T is on par with or marginally superior to its unlicensed-data LLaMA counterparts on ARC and MMLU and performs well on OpenBookQA and SocialIQA.
  • Code generation: Comma v0.1-2T is notably strong, improving over other open-data LLMs and matching or beating Llama 2 7B.
  • Commonsense tasks: The model lags slightly in HellaSwag and PIQA, plausibly due to a relative lack of informal and blog-style data in the corpus.
  • Data ablation: Repeating critical high-quality sources up to 16× at 2T tokens yields minimal degradation, indicating open-license mixtures are viable for larger-scale pretraining.

Model         ARC-C   MMLU  HellaSwag   OBQA   CSQA   SIQA  HumanEval  MBPP
LLaMA-1 7B     44.5   34.8       76.2   51.2   61.8   50.3       19.9  27.9
Comma-1T       52.8   42.4       62.6   47.0   59.4   50.8       36.5  35.5
LLaMA-2 7B     48.5   45.8       76.2   48.4   62.8   50.8       26.1  28.5
Comma-2T       45.8   49.8       64.4   46.2   64.0   52.3       44.2  41.5

4. Licensing, Compliance, and Ethical Considerations

Comma v0.1-2T establishes that high-quality LLMs can be trained solely on openly licensed and public domain data, directly addressing common legal and ethical critiques of existing LLMs. The curation workflow explicitly excludes text with non-commercial or no-derivatives clauses, ambiguous or bundled licenses, and synthetic LLM-generated data of unclear provenance.

  • Each document in the corpus carries traceable provenance metadata, allowing future source-level retraction or compliance audits (a retraction sketch follows this list).
  • All processing, including audit of source licenses and compliance with attribution requirements, is documented and reproducible.
  • The open license approach aligns with emergent LLM regulatory proposals and offers a replicable blueprint for future copyright-compliant LLM development.
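
A hedged sketch of what such a source-level retraction pass might look like is given below. The metadata layout mirrors the illustrative `meta` fields used earlier and is not the corpus's actual schema.

```python
def retract_source(docs, retracted_sources):
    """Drop every document whose provenance metadata names a retracted source,
    and report how many documents each retraction removed."""
    removed = {}
    kept = []
    for doc in docs:
        source = doc.get("meta", {}).get("source")
        if source in retracted_sources:
            removed[source] = removed.get(source, 0) + 1
        else:
            kept.append(doc)
    return kept, removed

# Illustrative corpus slice with placeholder source names.
corpus = [
    {"text": "...", "meta": {"source": "example_wiki", "license": "cc-by-sa-4.0"}},
    {"text": "...", "meta": {"source": "retracted_repo", "license": "mit"}},
]
kept, removed = retract_source(corpus, {"retracted_repo"})
print(len(kept), removed)
```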

5. Technical Implementation and Release Protocols

Comma v0.1-2T leverages standard modern LLM engineering while employing specific design choices for open data:

  • Tokenizer: 64k BPE vocabulary, non-normalizing, follows Llama's regex split.
  • Data mixture: Composed with explicit source weighting; full specifications and weights are published.
  • Curriculum: Early over-sampling of small but high-quality sources via weighted mixture protects representation of vital technical and supervised data.
  • Final checkpoint averaging: The last 10 checkpoints from cool-down are averaged to maximize stability, mirroring Llama 3 and OLMo protocols.
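
Checkpoint averaging itself is straightforward. Below is a minimal PyTorch-style sketch, under the assumption that checkpoints are saved as plain state dicts with identical keys and shapes; the file paths are placeholders.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several saved state dicts (e.g., the last 10
    cool-down checkpoints). Assumes all checkpoints share identical keys/shapes."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    return {k: v / len(paths) for k, v in avg_state.items()}

# Usage (paths are placeholders):
# final_state = average_checkpoints([f"ckpt_{i}.pt" for i in range(10)])
# model.load_state_dict(final_state)
```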

Release artifacts include:

  • All processed data (Common Pile and filtered mixtures), with full license and provenance.
  • All code for data cleaning, mixture construction, and model training.
  • Model checkpoints, training logs, and mixture weights on Hugging Face (a loading sketch follows this list).
  • Detailed documentation for scientific reuse and audit.
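
Assuming the checkpoints are published as standard Hugging Face causal-LM repositories, loading would look roughly like the sketch below; the repository id is a placeholder, not a confirmed model name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "ORG/comma-v0.1-2t"  # placeholder repository id, not confirmed

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The Common Pile is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```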

6. Influence and Prospective Developments

Comma v0.1-2T demonstrates that large, performant, and compliant LLMs can be trained without proprietary or scraped web data. Its architecture, open-source protocols, and documented mixture construction provide a standard for subsequent work in open-data LLMs.

  • The pool of openly licensed content continues to expand (roughly half of the content in Common Pile v0.1 was produced after 2020), suggesting further improvements for LLMs trained under this paradigm.
  • Open models such as Comma v0.1-2T enable deep inspection of data leakage, memorization, data valuation, and ethical implications, fostering empirical research into LLM training data dynamics.
  • The approach may influence regulatory, industrial, and academic standards for lawful, fair, and reproducible model development.

7. Summary Table: Core Properties of Comma v0.1-2T

Feature                        Comma v0.1-2T
Parameter count                7B
Training data                  Common Pile v0.1 (8 TB, exclusively open/public domain)
Token count                    2T
Benchmark parity               Yes (Llama 1/2 7B, OLMo-Twin, etc.)
Model type                     Decoder-only transformer, LLaMA architecture
Context window                 4096 tokens
Tokenizer                      64k BPE, non-normalizing
License properties             All data explicitly open-licensed or public domain
Code/data/model availability   Full pipeline, checkpoints, and data released open source
Notable advantages             Full auditability, policy compliance, scientific utility

Comma v0.1-2T sets a baseline for responsible, transparent, and high-performance LLM development using verifiably open data, and provides a template for models and pipelines compliant with evolving legal and community ethics in AI research and deployment.