
DCLM-Baseline: Web Pretraining Benchmark

Updated 3 October 2025
  • DCLM-Baseline is a large-scale, meticulously curated web corpus that prioritizes high data quality through advanced filtering and deduplication techniques.
  • Its training methodology uses a standard decoder-only Transformer with compute scaling per the Chinchilla rule, achieving competitive accuracy on 53 benchmark tasks.
  • Ablation studies and performance analyses establish the dataset as a robust reference point for reproducible, efficient language model pretraining.

The DCLM-Baseline dataset is a large-scale, meticulously curated web-pretraining corpus developed as part of the DataComp-LM benchmark. It is designed to provide a standard, openly reproducible baseline for pretraining LLMs in a setting where data curation quality is a central axis of model performance. Models trained on DCLM-Baseline have demonstrated strong results that rival open-weight foundation models trained on proprietary data, while significantly reducing training compute. The dataset has been widely adopted as a reference in open pretraining comparisons and has influenced best practices in web-scale corpus construction, quality filtering, and deduplication.

1. Corpus Composition and Preprocessing

The DCLM-Baseline dataset is constructed from DCLM-Pool, a 240-trillion-token sub-corpus of Common Crawl (all crawls prior to 2023) that the DataComp-LM authors describe as the largest public corpus for language model training; English-language filtering is applied downstream in the pipeline. Key features of its construction include:

  • Extraction: The extraction pipeline uses resiliparse for HTML-to-text conversion. Compared to Common Crawl's preextracted WET files, resiliparse extraction improves the Core score by roughly 2.5 points, and it runs ∼8x faster than trafilatura, which yields similar quality gains.
  • Initial Pool: The raw DCLM-Pool encompasses ∼200 billion documents (∼370TB compressed). Only document main bodies (excluding navigation, boilerplate, and spam content) are retained using heuristics inspired by RefinedWeb.
  • Deduplication: Multi-stage deduplication is employed. At the document level, a scalable Bloom filter–based approach removes both exact and near-duplicates using n-gram signatures. Paragraph-level deduplication is also performed, further increasing the novelty and reducing redundancy.
  • Quality Filtering: Documents are ranked by a fastText-based binary classifier trained on a mixture of OH-2.5 instruction-formatted samples and high-quality posts from r/ExplainLikeImFive. Only the top ∼10% of documents by classifier score are selected for inclusion (a minimal sketch of this step appears at the end of this section).

This pipeline produces a highly filtered dataset that preserves both the diversity and the composition of the open web, with an explicit focus on maximizing informativeness per token.
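As a concrete illustration of the extraction and quality-filtering steps above, the following Python sketch chains resiliparse text extraction with a fastText quality classifier and top-10% selection. The model path, label name, and batch-ranking approach are illustrative assumptions rather than the released DCLM artifacts.

```python
# Minimal sketch of DCLM-style extraction + model-based quality filtering.
# QUALITY_MODEL_PATH and POSITIVE_LABEL are hypothetical placeholders,
# not the released DCLM classifier.
import fasttext
from resiliparse.extract.html2text import extract_plain_text

QUALITY_MODEL_PATH = "quality_classifier.bin"  # hypothetical path
POSITIVE_LABEL = "__label__hq"                 # hypothetical label name

model = fasttext.load_model(QUALITY_MODEL_PATH)

def extract_text(html: str) -> str:
    """HTML -> main-body text, as in the resiliparse-based DCLM pipeline."""
    return extract_plain_text(html, main_content=True)

def quality_score(text: str) -> float:
    """Probability the fastText classifier assigns to the 'high quality' label."""
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get(POSITIVE_LABEL, 0.0)

def filter_top_fraction(docs, fraction=0.10):
    """Keep roughly the top `fraction` of documents by classifier score
    (DCLM-Baseline keeps ~10%). Here a finite batch is ranked; the actual
    pipeline applies a score threshold over the whole pool."""
    ranked = sorted(docs, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]
```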

2. Training Methodology and Scaling

The DCLM-Baseline dataset is used in controlled experiments, central to the DataComp-LM testbed, to measure the impact of data curation on LLMs:

  • Model Architecture: Experiments employ a standard decoder-only Transformer, as implemented in OpenLM, with configurations comparable to GPT-2 or Llama-family architectures.
  • Compute Scaling: Model sizes range from 412M to 7B parameters. The pretraining token budget is determined by the “Chinchilla rule” D = 20 × N (where N is the number of parameters), scaled by a multiplier. Compute cost is estimated as FLOPs = 6ND (a worked example appears below).
  • Evaluation Suite: Performance is assessed on 53 downstream natural language understanding tasks, including MMLU, SQuAD, HellaSwag, COPA, TriviaQA, and mathematical reasoning tasks. Core and Extended metrics aggregate results across 22 and 53 tasks, respectively.

The training and evaluation pipeline is strictly standardized, enabling reproducible, fair comparison of data curation choices at fixed compute budgets.
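As a worked example of the scaling rule above, the following sketch computes the Chinchilla-style token budget and estimated training FLOPs for the smallest and largest DCLM scales, assuming a 1× multiplier.

```python
# Chinchilla-style budgets used in DataComp-LM: D = 20 * N tokens, FLOPs = 6 * N * D.
def token_budget(n_params: float, multiplier: float = 1.0) -> float:
    return 20 * n_params * multiplier

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

for n in (412e6, 7e9):              # 412M and 7B parameter scales
    d = token_budget(n)             # 1x Chinchilla multiplier assumed
    print(f"N={n:.3g}  D={d:.3g} tokens  FLOPs={train_flops(n, d):.3g}")
# N=4.12e+08  D=8.24e+09 tokens  FLOPs=2.04e+19
# N=7e+09     D=1.4e+11 tokens   FLOPs=5.88e+21
```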

3. Performance and Comparative Analysis

The DCLM-Baseline 7B model pretrained on 2.6T tokens achieves:

  • MMLU (5-shot): 64% accuracy, a 6.6 percentage-point improvement over MAP-Neo (previous open-data SOTA) at 40% less compute.
  • Downstream Tasks: Results comparable to models trained on closed data, such as Mistral-7B-v0.3 (63% MMLU) and Llama 3 8B (66% MMLU), with similar averages across the 53 language understanding benchmarks.
  • Compute Efficiency: Approximately 6.6× less compute than Llama 3 8B at similar accuracy.

Averaged performance across the full evaluation suite is competitive, with DCLM-Baseline models matching or approaching best-in-class systems on “Core” and “Extended” grouped metrics. Ablation experiments reveal that fastText-based model filtering confers a ∼3.5 point advantage on the Core metric over using only reference data pools (Wikipedia, GPT-3 approximations).
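The ∼6.6× compute figure above can be sanity-checked with the same FLOPs = 6ND estimate, assuming Llama 3 8B's publicly reported pretraining budget of roughly 15T tokens:

```python
# Rough check of the ~6.6x compute ratio using FLOPs = 6 * N * D.
dclm_flops  = 6 * 7e9 * 2.6e12   # DCLM-Baseline 7B on 2.6T tokens  (~1.1e23)
llama_flops = 6 * 8e9 * 15e12    # Llama 3 8B on ~15T tokens        (~7.2e23)
print(llama_flops / dclm_flops)  # ~6.6
```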

4. Data Curation Strategies and Ablations

DCLM-Baseline explicitly isolates the effect of data quality interventions. Experimental variants include:

  • Filtering Methods: PageRank-based, perplexity-based, BGE feature classifier, AskLLM prompt-based, top-k average logit, and fastText. The fastText classifier—trained with a blend of instructional and community-formatted text—outperforms alternatives in downstream model quality.
  • Deduplication: Multiple deduplication pipelines are compared: MinHash, suffix arrays, and Bloom filters. The Bloom filter approach is adopted for its scalability and practical effectiveness at web scale (see the sketch after this section).
  • Mixing Strategies: Mixing in additional high-quality data sources (Wikipedia, Stack Exchange, arXiv) sometimes degrades performance when applied to the well-curated DCLM-Baseline. This result suggests the primacy of internally consistent, high-quality web pools over noisy or heterogeneous data mixing.

The ablation studies reinforce that model-based filtering and deduplication are primary axes controlling final model performance.
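To make the Bloom filter choice concrete, the sketch below shows the general shape of n-gram Bloom-filter near-duplicate removal; the filter size, hash count, n-gram length, and overlap threshold are illustrative assumptions, not the exact DCLM settings.

```python
# Minimal n-gram Bloom-filter deduplication sketch; parameters are illustrative.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text: str, n: int = 13):
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(0, len(toks) - n + 1))}

def dedup(docs, threshold: float = 0.8, n: int = 13):
    """Drop a document when most of its n-grams have already been seen;
    otherwise register its n-grams and keep it. Paragraph-level dedup
    applies the same idea to paragraph shingles."""
    bf, kept = BloomFilter(), []
    for doc in docs:
        grams = ngrams(doc, n)
        if grams and sum(g in bf for g in grams) / len(grams) > threshold:
            continue  # near-duplicate of earlier content
        for g in grams:
            bf.add(g)
        kept.append(doc)
    return kept
```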

5. Adoption as Reference Dataset and Benchmarks

DCLM-Baseline has been adopted as a strong reference dataset for open pretrained LLM experiments:

  • Zyda-2 (5T tokens, (Tokpanov et al., 9 Nov 2024)): DCLM was a major data source in the Zyda-2 pipeline, contributing 3.348T tokens after cross-deduplication with an 85% Jaccard-similarity LSH filter (sketched below). It required no further filtering, having already undergone strong model-based quality filtering.
  • open-sci-ref-0.01 (Nezhurina et al., 10 Sep 2025): In systematic side-by-side evaluations of eight open datasets, DCLM-Baseline was consistently ranked near the top (typically after Nemotron-CC-HQ) for LLM pretraining performance. For a 1.7B model trained on 1T tokens, results include an average evaluation score of 0.57 (COPA: 0.79, Lambada: 0.68), with some tasks favoring DCLM-Baseline even relative to the top dataset.
  • Nemotron-CC (Su et al., 3 Dec 2024): DCLM-Baseline functions as a reference for aggressive filtering approaches. Newer pipelines (e.g., Nemotron-CC) increase unique token yields through classifier ensembles and synthetic rephrasing and are benchmarked explicitly against DCLM-Baseline on token diversity and long-horizon performance.

These studies attest to the durability of DCLM-Baseline as a foundation for scale-sensitive, reliable benchmarking in the LLM pretraining literature.
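The 85% Jaccard-similarity LSH filter used for Zyda-2's cross-deduplication can be sketched with the datasketch library; the word-level shingling and permutation count here are assumptions, not the Zyda-2 configuration.

```python
# Sketch of MinHash-LSH cross-deduplication at a 0.85 Jaccard threshold.
from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def cross_dedup(reference_docs, candidate_docs, threshold: float = 0.85):
    """Index one corpus, then drop candidates whose estimated Jaccard
    similarity to any indexed document exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for key, doc in enumerate(reference_docs):
        lsh.insert(f"ref-{key}", minhash(doc))
    return [doc for doc in candidate_docs if not lsh.query(minhash(doc))]
```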

| Benchmark | Performance (DCLM-Baseline) | Comparison |
|---|---|---|
| MMLU (5-shot, 7B) | 64% | +6.6 pts vs MAP-Neo; ≈ Llama 3 8B |
| Average (1.7B, 1T tokens) | 0.57 | 2nd after Nemotron-CC-HQ |
| Core (centered accuracy) | High | Matches Mistral-7B |

6. Limitations, Successors, and Evolving Best Practices

Although DCLM-Baseline is an influential standard, recent work has identified key limitations and areas for refinement:

  • Aggressive Filtering: The heuristic cleaning, deduplication, and classifier stages together discard the vast majority of raw tokens (the fastText stage alone keeps only the top ∼10% of surviving documents), prioritizing quality at the expense of diversity and total token horizon. This can hinder performance in long-horizon, multi-epoch training regimes (see the arithmetic sketch after this list).
  • Token Diversity: Successors such as Nemotron-CC (Su et al., 3 Dec 2024) employ classifier ensembling and synthetic rephrasing to produce 4× more unique real tokens. For very high budget training (≥15T tokens), these approaches yield measurably higher average performance (e.g., +5 on MMLU at 8B parameter scale).
  • Synthetic Augmentation: DCLM-Baseline is constructed from real web text, with little to no synthetic augmentation. Generative rephrasing, question-answer expansion, and instruction-style synthesis are now used to enhance token coverage without loss of quality.
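The long-horizon concern in the first two points reduces to simple arithmetic: at a fixed training budget, fewer unique tokens force more epochs of repetition. The figures below are illustrative, using the 2.6T tokens on which the 7B model was trained as a stand-in for the unique-token supply and a hypothetical 4× larger pool for comparison.

```python
# Epochs of repetition implied by a fixed training budget vs. unique-token supply.
def epochs(budget_tokens: float, unique_tokens: float) -> float:
    return budget_tokens / unique_tokens

budget = 15e12                     # a >=15T-token training run
print(epochs(budget, 2.6e12))      # ~5.8 epochs over a 2.6T-token corpus
print(epochs(budget, 4 * 2.6e12))  # ~1.4 epochs with 4x more unique tokens
```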

These limitations are currently being addressed in new datasets and benchmarks, which incorporate DCLM-Baseline as a competitive reference while targeting further gains through improved curation and augmentation schemes.

7. Implications and Future Directions

The DCLM-Baseline dataset established a new standard in open LLM pretraining, emphasizing:

  • The centrality of data curation quality; model-level gains are increasingly determined by filtering and deduplication advances.
  • The necessity of compute-efficient training pipelines with standardized evaluation, facilitating direct attribution of performance improvements to data rather than architecture or hyperparameters.
  • The value of robust baselines for ecosystem benchmarking—enabling reproducible, scalable comparison across future pretraining regimes.

Ongoing research focuses on scaling up the DCLM approach to longer horizons, dynamically integrating privacy/safety constraints, incorporating domain-specific modules (e.g., math, code, science), and developing richer synthetic augmentation. The influence of DCLM-Baseline persists as newer datasets are measured against it for both average accuracy and resource efficiency, and for their ability to provide robust scaling trends across model regimes.
