
Comma v0.1: Open Llama-Style Models

Updated 23 June 2026
  • Comma v0.1 models are a family of large-scale, decoder-only Transformer language models that leverage entirely open, public-domain data for training.
  • They use a Llama-style architecture with 7B parameters, 32 layers, and a custom 64K-token BPE tokenizer to support diverse, multilingual content.
  • Their rigorous preprocessing, transparent data sourcing, and reproducible training pipeline set a new standard for open and legally robust LLM research.

The term Comma v0.1 models refers to a family of large-scale, decoder-only Transformer LLMs trained entirely on public-domain and openly licensed text, introduced alongside the Common Pile v0.1 dataset. These models provide an open alternative to Llama and similar models, attaining competitive performance using only transparent, legally unencumbered data. This demonstrates that high-quality LLMs can be produced without relying on proprietary or indiscriminately scraped content (Kandpal et al., 5 Jun 2025). The Comma v0.1 release includes both the models and the entire training pipeline, supporting reproducibility and further research in the open-science and NLP communities.

1. Model Family, Architecture, and Tokenization

Comma v0.1-1T and Comma v0.1-2T follow the Llama-style Transformer architecture at the 7B parameter scale:

  • Layers: 32 Transformer blocks
  • Hidden dimension ($d_{\text{model}}$): 4096
  • Feed-forward inner dimension: 11008
  • Attention heads: 32 (per-head dimension = 128)
  • Context window: 4096 tokens

Tokenization employs a custom BPE tokenizer with a 64,000-token vocabulary, trained on 600 GB of Common Pile data. Tokenization pre-processing matches Llama 3.2’s splitting paradigm and Hugging Face’s ByteLevel conventions, with no additional Unicode normalization. This enables coverage and granularity suited for the diverse, multilingual, and domain-varied corpus of public-domain content.
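As a rough consistency check on the 7B figure, the dimensions and vocabulary size above can be plugged into a parameter count. This is a minimal sketch that assumes the usual Llama-style layout: bias-free projections, a three-matrix gated feed-forward block, two RMSNorm weight vectors per block, full multi-head attention, and untied input and output embeddings. None of these details are stated in the text above, and the function name is hypothetical.

```python
def llama_style_param_count(
    n_layers: int = 32,
    d_model: int = 4096,
    d_ff: int = 11008,
    n_heads: int = 32,
    vocab_size: int = 64_000,
    tied_embeddings: bool = False,  # assumption: separate input/output embeddings
) -> int:
    """Approximate parameter count for a Llama-style decoder (assumed layout)."""
    assert d_model % n_heads == 0, "d_model must split evenly across heads"

    attention = 4 * d_model * d_model   # Q, K, V and output projections, no biases
    feed_forward = 3 * d_model * d_ff   # gate, up and down projections (assumed)
    norms = 2 * d_model                 # two RMSNorm weight vectors per block
    per_layer = attention + feed_forward + norms

    embeddings = vocab_size * d_model * (1 if tied_embeddings else 2)
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm


total = llama_style_param_count()
print(f"per-head dimension: {4096 // 32}")
print(f"approximate parameters: {total / 1e9:.2f}B")
```

Under these assumptions the count comes out at roughly 7B, in line with the stated scale; a different embedding or attention layout would shift the total somewhat.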

2. Training Data and Preprocessing (Common Pile v0.1)

Comma v0.1 models are trained exclusively on the Common Pile v0.1, an 8 TB corpus extracted from 30 sources and processed through rigorous cleaning:

| Source Domain | Example Corpora & Details |
|---|---|
| Research/Scientific Text | arXiv full text & abstracts, PubMed Central |
| Public-domain Books | Project Gutenberg, BHL, LoC Digitized, etc. |
| Government/Legal | USGPO, Regulations.gov, CourtListener, USPTO |
| Online Discussion | StackExchange, GitHub PRs/issues, Ubuntu IRC |
| Supervised Datasets | Data Provenance Initiative curation |
| Education & Wikis | Wikimedia, WikiTeam, OpenEd, PressBooks, LibreTexts |
| Code | Subset of The Stack V2, PEPs |
| Transcribed Speech | Whisper transcripts of CC BY YouTube videos |
| Web/News/Reviews | 52 CC-use-verified Common Crawl snapshots |

Preprocessing encompasses:

  • FastText English language filtering ($p > 0.5$)
  • DataComp-LM text quality gating
  • Length filtering ($\geq 100$ words for most sources)
  • OCR and noisy content removal (unigram log-likelihood $< -20$)
  • Dual FastText-based toxicity filtering ($> 0.1$ threshold)
  • PII removal via regex (emails, phones, IPs)
  • Fuzzy deduplication with a bloom-filter variant ($> 90\%$ 20-gram overlap)
  • Code-filtering: Red Pajama V1 coding heuristics, language-specific gating

Each corpus is independently processed with these steps prior to compositing the final pretraining mixture. This ensures the corpus is high quality, diverse, legally sound, and aligned with open-science objectives.
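Three of the listed steps (length filtering, regex-based PII removal, and 20-gram overlap deduplication) can be illustrated with the sketch below. It is not the Common Pile pipeline: the regular expressions and placeholder strings are illustrative guesses, all function names are hypothetical, and an exact Python set stands in for the bloom-filter variant, so memory use and false-positive behaviour differ from a real bloom filter.

```python
import re
from typing import Iterable, Iterator

# Illustrative patterns only; the patterns actually used are not given above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PHONE_RE = re.compile(r"(?:\+?\d{1,3}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def redact_pii(text: str) -> str:
    """Replace emails, IPv4 addresses and phone numbers with placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return PHONE_RE.sub("<PHONE_NUMBER>", text)


def long_enough(text: str, min_words: int = 100) -> bool:
    """Length filter: keep documents with at least `min_words` words."""
    return len(text.split()) >= min_words


def word_ngrams(words: list[str], n: int = 20) -> set[tuple[str, ...]]:
    """All distinct word n-grams of a document (empty if it is shorter than n)."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def fuzzy_dedup(docs: Iterable[str], n: int = 20, threshold: float = 0.9) -> Iterator[str]:
    """Drop a document if more than `threshold` of its n-grams were already seen."""
    seen: set[tuple[str, ...]] = set()  # exact set standing in for a bloom filter
    for doc in docs:
        grams = word_ngrams(doc.split(), n)
        if grams and sum(g in seen for g in grams) / len(grams) > threshold:
            continue  # near-duplicate of earlier content
        seen.update(grams)
        yield doc


def preprocess(docs: Iterable[str]) -> Iterator[str]:
    """Length filter, then deduplicate, then redact PII in the survivors."""
    kept = (d for d in docs if long_enough(d))
    return (redact_pii(d) for d in fuzzy_dedup(kept))
```

The order of the steps here is an arbitrary choice for the sketch; the text above only states that each source is processed independently before the final mixture is assembled.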

3. Training Procedure and Hyperparameters

Comma v0.1 models are trained to two distinct dataset scales:

  • Comma v0.1-1T: Trained on 1 trillion tokens sampled from the full Common Pile v0.1 mixture.
  • Comma v0.1-2T: Trained on 2 trillion tokens (same mixture, repeated).

The optimizer is AdamW with a weight decay of 0.2; the remaining hyperparameters closely follow common practice for Llama-scale models.

Training schedule:

  • Stage I: Cosine schedule with 2000-step warmup, 460k main steps (1T) or 230k (2T). $\eta_0 = 1\times10^{-3}$, $\eta_{\min} = 1\times10^{-9}$, period = 500k steps.
  • Stage II (cooldown): Linearly decaying LR to zero on a high-quality subset, for 18k (1T) or 9k (2T) steps.
  • Batch Size: 1T: 512×4096=2,097,152 tokens/step; 2T: 2048×4096=8,388,608 tokens/step.
  • Hardware: The accelerator hardware and total FLOPs for the main runs are not explicitly reported; ablations are reported on AMD MI300A and similar accelerators.
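The schedule and batch sizes above can be cross-checked with a short sketch. The closed form of the cosine curve, the choice to start the cosine after warmup, and the Stage II starting learning rate are assumptions, since the text gives only the endpoints, warmup length, and period. Whether the 2000 warmup steps are counted inside the "main steps" is also not stated; the token totals below count main plus cooldown steps only. All names are hypothetical.

```python
import math

ETA_0 = 1e-3           # peak learning rate
ETA_MIN = 1e-9         # cosine floor
WARMUP_STEPS = 2_000
PERIOD = 500_000       # cosine period in steps


def stage1_lr(step: int) -> float:
    """Linear warmup followed by cosine decay (assumed functional form)."""
    if step < WARMUP_STEPS:
        return ETA_0 * step / WARMUP_STEPS
    t = step - WARMUP_STEPS
    return ETA_MIN + 0.5 * (ETA_0 - ETA_MIN) * (1 + math.cos(math.pi * t / PERIOD))


def stage2_lr(step_in_cooldown: int, cooldown_steps: int, start_lr: float) -> float:
    """Linear decay to zero; `start_lr` is left to the caller because it is not given."""
    return start_lr * max(0.0, 1 - step_in_cooldown / cooldown_steps)


RUNS = {
    "1T": {"main_steps": 460_000, "cooldown_steps": 18_000, "batch_tokens": 512 * 4096},
    "2T": {"main_steps": 230_000, "cooldown_steps": 9_000, "batch_tokens": 2048 * 4096},
}


def total_tokens(run: str) -> int:
    """Tokens consumed over main and cooldown steps for one run."""
    cfg = RUNS[run]
    return (cfg["main_steps"] + cfg["cooldown_steps"]) * cfg["batch_tokens"]


for name in RUNS:
    print(f"{name}: {total_tokens(name) / 1e12:.3f} trillion tokens")
```

Both totals land within about half a percent of the nominal 1T and 2T budgets, which suggests the step counts and batch sizes quoted above are mutually consistent. Because the period (500k) exceeds the number of main steps, the cosine curve under this form does not reach $\eta_{\min}$ before the cooldown stage begins.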

4. Benchmarking and Empirical Results

Comma v0.1 models are evaluated zero-shot (5-shot for MMLU) against major reasoning, knowledge, and code challenges, via the OLMES suite.

Benchmark coverage: ARC-Challenge/Easy, MMLU, BoolQ, HellaSwag, OpenBookQA, CommonSenseQA, PIQA, SocialIQA, HumanEval, MBPP.

Performance Comparison

| Model | ARC-C | ARC-E | MMLU | BoolQ | HSwag | OBQA | CSQA | PIQA | SIQA | HumanEval | MBPP | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RPJ-7B | 42.8 | 68.4 | 27.8 | 68.6 | 70.3 | 49.4 | 57.7 | 76.0 | 46.9 | 11.1 | 15.9 | 48.6 |
| Llama 1 7B | 44.5 | 67.9 | 34.8 | 75.4 | 76.2 | 51.2 | 61.8 | 77.2 | 50.3 | 19.9 | 27.9 | 53.4 |
| StableLM-7B | 50.8 | 65.4 | 45.2 | 71.7 | 75.6 | 48.2 | 57.2 | 77.0 | 48.2 | 23.1 | 32.0 | 54.0 |
| MPT-7B | 46.5 | 70.5 | 30.2 | 74.2 | 77.6 | 48.6 | 63.3 | 77.3 | 49.1 | 27.3 | 33.2 | 54.3 |
| OpenLLaMA | 44.5 | 67.2 | 40.3 | 72.6 | 72.6 | 50.8 | 62.8 | 78.0 | 49.7 | 27.6 | 33.9 | 54.5 |
| Comma 1T | 52.8 | 68.4 | 42.4 | 75.7 | 62.6 | 47.0 | 59.4 | 70.8 | 50.8 | 36.5 | 35.5 | 54.7 |
| Llama 2 7B | 48.5 | 69.5 | 45.8 | 80.2 | 76.2 | 48.4 | 62.8 | 76.7 | 50.8 | 26.1 | 28.5 | 55.8 |
| Comma 2T | 45.8 | 71.8 | 49.8 | 78.6 | 64.4 | 46.2 | 64.0 | 72.5 | 52.3 | 44.2 | 41.5 | 57.4 |

Comma v0.1-1T sets the highest mean among all 1T-token, 7B models examined, with particularly strong results on ARC-Challenge, HumanEval, and MBPP. Comma v0.1-2T maintains competitive standing, leading on 4 out of 11 benchmarks for models trained with 2T tokens.
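The Mean column in the table is consistent with an unweighted average of the 11 benchmark scores, and the 1T comparison can be recomputed directly from the rows. The sketch below does this; treating every row except Llama 2 7B and Comma 2T as the 1T-token group is an assumption, since the table itself does not label training budgets, and the variable names are hypothetical.

```python
BENCHMARKS = ["ARC-C", "ARC-E", "MMLU", "BoolQ", "HSwag", "OBQA",
              "CSQA", "PIQA", "SIQA", "HumanEval", "MBPP"]

# Scores copied from the table above, in BENCHMARKS order.
SCORES = {
    "RPJ-7B":      [42.8, 68.4, 27.8, 68.6, 70.3, 49.4, 57.7, 76.0, 46.9, 11.1, 15.9],
    "Llama 1 7B":  [44.5, 67.9, 34.8, 75.4, 76.2, 51.2, 61.8, 77.2, 50.3, 19.9, 27.9],
    "StableLM-7B": [50.8, 65.4, 45.2, 71.7, 75.6, 48.2, 57.2, 77.0, 48.2, 23.1, 32.0],
    "MPT-7B":      [46.5, 70.5, 30.2, 74.2, 77.6, 48.6, 63.3, 77.3, 49.1, 27.3, 33.2],
    "OpenLLaMA":   [44.5, 67.2, 40.3, 72.6, 72.6, 50.8, 62.8, 78.0, 49.7, 27.6, 33.9],
    "Comma 1T":    [52.8, 68.4, 42.4, 75.7, 62.6, 47.0, 59.4, 70.8, 50.8, 36.5, 35.5],
    "Llama 2 7B":  [48.5, 69.5, 45.8, 80.2, 76.2, 48.4, 62.8, 76.7, 50.8, 26.1, 28.5],
    "Comma 2T":    [45.8, 71.8, 49.8, 78.6, 64.4, 46.2, 64.0, 72.5, 52.3, 44.2, 41.5],
}

# Assumed grouping: everything except the two rows below is a 1T-token model.
ONE_T_GROUP = [m for m in SCORES if m not in ("Llama 2 7B", "Comma 2T")]


def mean_score(model: str) -> float:
    """Unweighted average over the 11 benchmarks, rounded as in the table."""
    values = SCORES[model]
    return round(sum(values) / len(values), 1)


def leads_in_group(model: str, group: list[str]) -> list[str]:
    """Benchmarks where `model` has the strictly highest score within `group`."""
    leads = []
    for i, bench in enumerate(BENCHMARKS):
        best_other = max(SCORES[m][i] for m in group if m != model)
        if SCORES[model][i] > best_other:
            leads.append(bench)
    return leads


for model in SCORES:
    print(f"{model:12s} mean = {mean_score(model)}")
print("Comma 1T leads on:", leads_in_group("Comma 1T", ONE_T_GROUP))
```

Within the assumed 1T group, Comma 1T has the top mean and the highest scores on ARC-Challenge, HumanEval, and MBPP, matching the summary above, while its HellaSwag and PIQA scores are the lowest in that group.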

Ablations using controlled data experiments (on 1.7B-parameter models trained for 28B tokens) indicate that the Common Pile dataset surpasses other open corpora (OLC, OSCAR, Common Corpus), nearly matches or exceeds the Pile on most early-signal tasks, and strictly dominates the Pile and OSCAR on scientific tasks.

5. Analysis and Design Implications

Empirical findings demonstrate:

  • Openly licensed pretraining at the scale of Common Pile v0.1 is sufficient for producing Llama-class LLMs with no statistically significant deficit in general performance.
  • The models' domain profile mirrors the corpus composition: they do well on scientific and reasoning tasks (MMLU, ARC) but trail on benchmarks aligned with underrepresented text types (blogs, tutorials, and sports content in HellaSwag and PIQA).
  • Exclusion of supervised task data and inclusion/exclusion of curriculum do not materially affect results.
  • Preprocessing and deduplication procedures are critical for maximizing model quality and the utility of open-corpus resources. A plausible implication is that further broadening of openly licensed general-domain text (especially conversational, creative, and technical writing) will close the narrow remaining performance gaps to models trained on closed, web-scale data.

6. Implications, Applications, and Future Directions

The release of Comma v0.1 represents a key step in de-risking LLM research by decoupling pretraining from unlicensed data, offering an auditable, extensible baseline for both academic and industrial deployments requiring regulatory compliance or provenance transparency.

Potential applications include:

  • Foundation modeling for both generative tasks and downstream specialization in research, bioinformatics, legal NLP, and scientific QA
  • Platforms requiring verifiable data provenance and license auditability (e.g., government, health, education)
  • Meta-modeling: providing a base for comparative ablations and curriculum design in open-data LLM pretraining.

Ongoing challenges center on:

  • Collection of high-quality open conversational, subjective, and creative texts to further narrow performance gaps on certain commonsense and reasoning benchmarks.
  • Expansion of code data and international corpora while maintaining filtering at scale.

Wider adoption of open licensing by content creators, coupled with continued expansion of public-domain corpora, is essential for the next generation of legally robust, fully open LLMs (Kandpal et al., 5 Jun 2025).

References (1)
