Comma v0.1-1T: Transparent Open LLM

Updated 30 June 2025
  • Comma v0.1-1T is a 7-billion-parameter large language model trained exclusively on openly licensed and public domain text, making its training data auditable and license-compliant.
  • It employs a decoder-only Transformer architecture with a custom 64,000-token BPE tokenizer, mirroring state-of-the-art models such as Llama-7B.
  • Benchmark results show competitive performance across reasoning and code tasks, demonstrating that high-quality LLMs can be built without unlicensed web data.

Comma v0.1-1T is a 7-billion-parameter LLM trained for 1 trillion tokens drawn exclusively from openly licensed and public domain text curated in the Common Pile v0.1 dataset. Developed to demonstrate that competitive LLMs can be built without recourse to unlicensed web data, Comma v0.1-1T achieves performance comparable to major models such as Llama 1 7B and Llama 2 7B, whose pretraining data sources are largely internet-crawled and not always publicly available for audit or license verification. The model and its training pipeline are fully released alongside the dataset and data-mixing codebase, establishing a new benchmark for transparency in LLM development.

1. Model Architecture and Tokenization

Comma v0.1-1T employs a decoder-only Transformer architecture structurally aligned with Llama-7B and is implemented using the lingua framework. The model comprises approximately 7 billion parameters arranged in stacked blocks of multi-head causal self-attention and position-wise feedforward (MLP) layers, following established design practice for state-of-the-art LLMs.

A bespoke 64,000-token Byte Pair Encoding (BPE) vocabulary is trained on 600 GB of selected Common Pile documents, using regular-expression preprocessing adapted from Llama 3.2 (Unicode code points are preserved and no normalization is applied). Tokenization is precomputed over the training corpus to maximize training throughput.
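
As an illustration, the tokenizer-training step might look roughly like the following sketch using the Hugging Face `tokenizers` library; the library choice, the file paths, and the special token are assumptions, and the Llama 3.2 pre-tokenization regex is not reproduced (a generic byte-level pre-tokenizer stands in for it).

```python
# Minimal sketch: training a 64k BPE vocabulary on a sample of Common Pile text.
# Assumes the Hugging Face `tokenizers` library; the actual Comma pipeline and
# the Llama 3.2 regex pre-tokenizer are not reproduced here.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

tokenizer = Tokenizer(models.BPE())

# Byte-level pre-tokenization keeps all Unicode code points representable
# without applying any normalization, matching the "no normalization" constraint.
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    special_tokens=["<|endoftext|>"],  # hypothetical special token
    show_progress=True,
)

# `files` would point at the ~600 GB sample of Common Pile documents in
# practice; here it is a placeholder list of plain-text shards.
files = ["common_pile_shard_000.txt", "common_pile_shard_001.txt"]
tokenizer.train(files, trainer)
tokenizer.save("comma_bpe_64k.json")
```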

Mathematically, each layer applies

h' = \text{LayerNorm}(h + \text{SelfAttn}(h))

followed by

h'' = \text{LayerNorm}(h' + \text{MLP}(h'))

at every position $t$ in every layer $l$, where $h$, $h'$, and $h''$ denote the successive hidden representations within that layer. The output head produces conditional next-token predictions via

P(x_{t+1} \mid x_{\leq t}) = \operatorname{Softmax}(W h_t)

with $W$ the output projection matrix.
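
For concreteness, a minimal PyTorch sketch of one decoder block and the output head implementing the update rules above follows; the hidden size, head count, MLP width, and normalization placement shown here are illustrative assumptions, not the released Comma v0.1-1T configuration.

```python
# Sketch of one decoder block following the update rules above.
# Sizes are illustrative assumptions, not the released configuration.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int = 4096, n_heads: int = 32, d_ff: int = 11008):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Causal mask so position t attends only to positions <= t.
        T = h.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        h = self.norm1(h + attn_out)      # h'  = LayerNorm(h + SelfAttn(h))
        h = self.norm2(h + self.mlp(h))   # h'' = LayerNorm(h' + MLP(h'))
        return h

class OutputHead(nn.Module):
    """Maps the final hidden state h_t to P(x_{t+1} | x_{<=t}) = Softmax(W h_t)."""
    def __init__(self, d_model: int = 4096, vocab_size: int = 64_000):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size, bias=False)  # W

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.proj(h), dim=-1)
```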

2. The Common Pile v0.1 Dataset

The Common Pile is an 8TB, openly licensed corpus drawn from 30 source domains, including scientific articles, code repositories, books, encyclopedias, educational documents, legislative texts, government files, and openly licensed crawls of the broader web (e.g., Creative Commons Common Crawl, or CCCC). Proprietary domain material, proprietary books, and ambiguously licensed datasets are excluded, with a complete audit trail provided for each subcorpus.

Filtering and cleaning are implemented as follows:

  • Language filtering: English selection via FastText.
  • Quality filtering: For CCCC, a DataComp-LM classifier filters out the lowest-scoring documents; a unigram log-likelihood filter removes OCR artifacts.
  • Toxicity filtering: FastText toxic-content classifiers trained on the Jigsaw dataset.
  • Personal data redaction: Regular expressions strip emails, IPs, phone numbers.
  • Deduplication: A Bloom filter over 20-grams removes near-duplicate documents within and across sources.
  • Boilerplate stripping: Manual regex rules for major sources.

The dataset mixture is non-uniform, up-weighting high-utility sources (scientific literature, StackExchange, open code) and down-weighting low-information or domain-skewed corpora (e.g., USPTO patents, low-quality web text, large but repetitive sites).
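
Two of the cleaning passes listed above, PII redaction and 20-gram-based deduplication, can be sketched as follows; the regex patterns, the duplicate threshold, and the use of a plain Python set in place of a Bloom filter are illustrative assumptions.

```python
# Illustrative sketch of two cleaning passes: PII redaction with regular
# expressions and 20-gram-based near-duplicate detection. A Python set stands
# in for the Bloom filter used at scale; patterns and thresholds are assumed.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace emails, IP addresses, and phone numbers with placeholder tags."""
    text = EMAIL.sub("<EMAIL>", text)
    text = IPV4.sub("<IP>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

seen_ngrams: set[int] = set()  # stand-in for a Bloom filter

def is_near_duplicate(text: str, n: int = 20, threshold: float = 0.8) -> bool:
    """Flag a document if most of its 20-grams were already observed."""
    tokens = text.split()
    ngrams = [hash(" ".join(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return False
    overlap = sum(g in seen_ngrams for g in ngrams) / len(ngrams)
    seen_ngrams.update(ngrams)
    return overlap >= threshold
```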

3. Pretraining and Optimization Pipeline

Comma v0.1-1T uses a two-stage optimization strategy:

  1. Main training: A curriculum incorporating all sources according to the designed mixture, for nearly 1 trillion tokens.
  2. Cool-down phase: The final steps restrict training to the highest-quality sources while annealing the learning rate to zero, improving final-stage convergence.

Key hyperparameters include:

  • Batch size: 512 sequences × 4,096 tokens (≈2.1M tokens per batch)
  • Optimizer: AdamW
  • Weight decay: 0.2
  • Learning rate: Cosine schedule, max $1 \times 10^{-3}$, min $1 \times 10^{-9}$
  • Warm-up: 2,000 steps
  • Main phase: 460,000 steps
  • Cool-down: 18,000 steps (linear decay)
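
Taken together, the schedule can be sketched as a single function of the step index; only the listed values come from the report, and how the warm-up, cosine, and cool-down segments are stitched together is an assumption.

```python
# Sketch of the learning-rate schedule: linear warm-up, cosine decay across the
# main phase, then a linear cool-down to zero. Only the listed hyperparameter
# values come from the report; the phase stitching is an assumption.
import math

MAX_LR, MIN_LR = 1e-3, 1e-9
WARMUP_STEPS, MAIN_STEPS, COOLDOWN_STEPS = 2_000, 460_000, 18_000

def learning_rate(step: int) -> float:
    if step < WARMUP_STEPS:
        # Linear warm-up from 0 to the peak rate.
        return MAX_LR * step / WARMUP_STEPS
    if step < MAIN_STEPS:
        # Cosine decay from MAX_LR down to MIN_LR over the main phase.
        progress = (step - WARMUP_STEPS) / (MAIN_STEPS - WARMUP_STEPS)
        return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
    # Cool-down: linear decay from the end-of-main rate to zero.
    remaining = max(0.0, 1.0 - (step - MAIN_STEPS) / COOLDOWN_STEPS)
    return MIN_LR * remaining
```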

The loss function is standard autoregressive cross-entropy:

\mathcal{L}_\mathrm{CE} = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_{i+1} \mid x_{\leq i}; \theta)
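
In code, the objective amounts to scoring the logits at each position against the following token; a sketch (tensor shapes are illustrative):

```python
# Autoregressive cross-entropy: logits at position i are scored against the
# next token x_{i+1}. Shapes are illustrative.
import torch
import torch.nn.functional as F

def autoregressive_ce(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); tokens: (batch, seq_len)."""
    shifted_logits = logits[:, :-1, :]   # predictions for positions 1..N
    targets = tokens[:, 1:]              # the tokens actually observed there
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )
```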

Checkpoint averaging over the final 10 model snapshots provides smoother final weights.
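
Checkpoint averaging itself is a simple element-wise mean over parameter tensors; a sketch, assuming each snapshot is stored as a plain PyTorch state dict with placeholder file names:

```python
# Element-wise average of the parameters in the final snapshots.
# Assumes each file stores a plain state_dict; file names are placeholders.
import torch

def average_checkpoints(paths: list[str]) -> dict:
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g. average_checkpoints([f"checkpoint_{i}.pt" for i in range(10)])
```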

4. Performance Benchmarks and Evaluation

Performance is quantitatively assessed on a mainstream suite of LLM benchmarks, mirroring the evaluation setups used for OLMo, Llama, MPT, and other contemporary models.

General knowledge and reasoning tasks: ARC-Challenge (ARC-C), ARC-Easy (ARC-E), MMLU, BoolQ, HellaSwag, OpenBookQA, CommonsenseQA, PIQA, SocialIQA.

Code completion tasks: HumanEval, MBPP.

All models are evaluated zero-shot, except MMLU, which uses a 5-shot setting; code tasks report pass@10.

| Model | ARC-C | ARC-E | MMLU | BoolQ | HellaSwag | OBQA | CSQA | PIQA | SIQA | HumanEval | MBPP | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Comma v0.1-1T | 52.8 | 68.4 | 42.4 | 75.7 | 62.6 | 47.0 | 59.4 | 70.8 | 50.8 | 36.5 | 35.5 | 54.7 |
| LLaMA-1 7B | 44.5 | 67.9 | 34.8 | 75.4 | 76.2 | 51.2 | 61.8 | 77.2 | 50.3 | 19.9 | 27.9 | 53.4 |
| MPT-7B | 46.5 | 70.5 | 30.2 | 74.2 | 77.6 | 48.6 | 63.3 | 77.3 | 49.1 | 27.3 | 33.2 | 54.3 |
| StableLM-7B | 50.8 | 65.4 | 45.2 | 71.7 | 75.6 | 48.2 | 57.2 | 77.0 | 48.2 | 23.1 | 32.0 | 54.0 |
| OpenLLaMA-7B | 44.5 | 67.2 | 40.3 | 72.6 | 72.6 | 50.8 | 62.8 | 78.0 | 49.7 | 27.6 | 33.9 | 54.5 |

Interpretation: Comma v0.1-1T matches or slightly exceeds LLaMA-1 7B and the other baselines on most knowledge, reasoning, and code metrics, with the clearest gains on ARC-C, MMLU, HumanEval, and MBPP, despite being trained exclusively on auditable, openly licensed text rather than the unlicensed social and web content available to LLaMA. Small deficits remain on tasks such as HellaSwag and PIQA, plausibly reflecting the under-representation of informal domains in the Common Pile.
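
The pass@10 numbers in the table are conventionally computed with the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); whether the report uses exactly this estimator, and how many samples are drawn per problem, are assumptions in the sketch below.

```python
# Standard unbiased pass@k estimator commonly used for HumanEval/MBPP.
# The sample count per problem is an assumption for illustration.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of which pass, reported as pass@10.
print(round(pass_at_k(20, 5, 10), 3))
```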

5. Design, Engineering, and Data Curation Principles

The engineering and curation choices in Comma v0.1-1T address both legal/ethical and scientific concerns:

  • Licensing: The entire corpus is traceable to explicit public domain or open source licenses, with non-compliant material excised.
  • Transparency: Auditability of all curation scripts, filtering stages, mixture weights, and per-source statistics.
  • Diversity: Inclusion of over two dozen high-quality sources, biased toward scientific and non-fictional material where open licensing is most robust, while accepting a known coverage gap in topics less represented in open data (e.g., conversational text, product reviews).
  • Filtering discipline: Multi-stage and per-source deduplication, strict sequence quality controls, and PII detection to satisfy privacy standards.

6. Significance and Future Directions

Comma v0.1-1T establishes that LLMs can achieve performance competitive with baselines trained on unlicensed data by carefully mixing, filtering, and curating a sufficiently diverse openly licensed and public domain dataset. This addresses long-standing legal, ethical, and reproducibility critiques of LLM development, and provides a template for future auditable LLM data releases.

Some limitations remain in coverage of genres and linguistic styles that are absent from open resources, particularly those underrepresented because of copyright restrictions. A plausible implication is that augmenting open datasets with synthetic or curated dialogue, or encouraging openly licensed contributions in these areas, could further narrow the remaining gaps.

The open release of the dataset, curation code, training script, and full checkpoints provides an extensible and reproducible framework for both academic and commercial research, lowering barriers to truly open LLM research and deployment.