DataComp-LM Benchmark
- The paper introduces DataComp-LM, a scalable benchmark that isolates the impact of data curation strategies on Transformer performance, achieving a 6.6 percentage point MMLU gain over previous baselines.
- DataComp-LM Benchmark is a data-centric platform built on 240 trillion tokens from Common Crawl and 53 downstream tasks, enabling reproducible pretraining experiments across various model scales.
- Key techniques include advanced deduplication (MinHash and BFF), heuristic and fastText-based filtering, and fixed pretraining recipes, which together optimize training efficiency and model performance.
DataComp-LM (DCLM) is a data-centric benchmark designed to systematically isolate and quantify the impact of training-set curation strategies on the performance of decoder-only Transformer LLMs at scales ranging from approximately 400 million to 7 billion parameters. DCLM combines a massive, standardized corpus—240 trillion tokens extracted from Common Crawl using an open-source extraction pipeline—with a set of fixed pretraining recipes and a comprehensive suite of 53 downstream natural language understanding tasks. The benchmark enables comparative evaluation of data curation techniques, including deduplication, filtering, data mixing, and model-based quality selection. It is positioned as a rigorous, reproducible platform for controlled dataset experiments in LLM research (Li et al., 2024).
1. Dataset Construction and Preprocessing
DCLM uses the DCLM-Pool as its foundational dataset—a web-text corpus sourced from all Common Crawl WARC dumps prior to 2023, parsed with Resiliparse. This results in roughly 200 billion documents (370 TB compressed) and approximately 240 trillion GPT-NeoX tokens. The extraction pipeline is designed to maximize both dataset scale and diversity.
To minimize memorization and enhance data diversity, DCLM employs multiple deduplication strategies:
- MinHash Deduplication: Used in early-stage experiments to approximate Jaccard similarity, with band/row parameters (r=15, b=93) tuned so that the collision probability emulates a (450, 20)-band scheme at reduced computational cost. Documents whose estimated similarity satisfies J(A, B) < 0.8 are retained; near-duplicates above the threshold are removed.
- Big-Friendly Filter (BFF): A scalable Bloom filter approach for document and intra-document deduplication, targeting a false-positive rate of $\varepsilon = 10^{-2}$ over $n \approx 10^{12}$ tokens. Documents or paragraphs with over 80% flagged “seen” n-grams are excluded.
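To make the MinHash step concrete, here is a minimal, stdlib-only sketch of MinHash-based Jaccard estimation with the 0.8 duplicate threshold from above. This is an illustration, not the paper's implementation: the shingle size, number of permutations, and hashing scheme are all assumptions.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles of a document (n=3 is an illustrative choice)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set: set, num_perm: int = 128) -> list:
    """One minimum hash value per seeded hash function."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "big")
            for s in shingle_set
        ))
    return sig

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots estimates J(A, B)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old bridge"
sa = minhash_signature(shingles(doc_a))
sb = minhash_signature(shingles(doc_b))
is_duplicate = estimate_jaccard(sa, sb) >= 0.8  # discard doc_b if True
```

In a banded LSH scheme like the one described above, signatures would additionally be split into bands so that only candidate pairs colliding in some band are compared, avoiding the all-pairs comparison this sketch performs.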
Heuristic filters modeled after RefinedWeb remove boilerplate content, non-English pages, and documents outside predetermined length boundaries. DCLM further minimizes contamination between pretraining and evaluation via n-gram overlap detection tools adapted from Lee et al. (2021).
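The decontamination step can be sketched as a simple n-gram overlap check in the spirit of Lee et al. (2021). The function names, the 13-gram size, and the zero-tolerance threshold are assumptions for illustration, not the paper's exact tooling.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Lower-cased word n-grams of a text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(train_doc: str, eval_texts: list,
                 n: int = 13, thresh: float = 0.0) -> bool:
    """Flag a training document whose n-grams overlap an evaluation set.

    thresh=0.0 flags any overlap; a positive value would require a
    minimum overlapping fraction of the document's n-grams instead.
    """
    eval_grams = set().union(*(ngrams(t, n) for t in eval_texts))
    doc_grams = ngrams(train_doc, n)
    if not doc_grams:
        return False
    overlap = len(doc_grams & eval_grams) / len(doc_grams)
    return overlap > thresh
```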
2. Model-Based and Heuristic Data Filtering
A distinguishing feature of DCLM is its use of model-based data quality filtering. A fastText binary classifier is trained to discriminate high-quality “reference” documents (from OH-2.5 instruction data and top-scoring r/ExplainLikeImFive posts) from generic web text. All documents are scored by this classifier, and only the top 10% by score are retained.
Ablation studies reveal best performance is achieved with unigram+bigram models and a 10% threshold; broader inclusion (15–20%) degrades “Core” benchmark performance. Alternative strategies—perplexity filtering, LLM prompting (AskLLM), PageRank, semantic deduplication (SemDeDup), BGE embedding classifiers, and “top-k average logits”—were tested but were consistently outperformed by the fastText-based approach.
Data mixing is permitted in a distinct “mixing track,” allowing arbitrary combinations of DCLM-Pool and external high-quality corpora (such as Wikipedia, Books, arXiv, or GitHub). However, in DCLM-baseline—which relies on strictly filtered Common Crawl data—incorporation of external sources showed no further improvement; stringent model-based filtering of web text was the most determinative factor.
3. Pretraining Recipes and Fixed Architectures
Pretraining within DCLM adheres to the OpenLM framework, specifying decoder-only Transformer architectures with pre-norm, SwiGLU MLPs, and qk-LayerNorm. For a model with $N$ parameters trained on $D$ tokens (with Chinchilla scaling, $D \approx 20N$), the total compute is $6ND$ FLOPs. DCLM fixes hyperparameters and optimization schedules per scale, as summarized:
| Scale | Parameters (N) | Tokens (D) | FLOPs ($6ND$) | Batch (sequences) | Warmup Steps | Weight Decay |
|---|---|---|---|---|---|---|
| 0.4B | 412M | 8.2B | $2.0\times10^{19}$ | 512 | 2K | 0.033 |
| 1.4B | 1.4B | 28.8B | $2.4\times10^{20}$ | 256 | 5K | 0.033 |
| 6.9B | 6.9B | 138B / 276B | $5.7\times10^{21}$ / $1.1\times10^{22}$ | 2048 | 5K | 0.05 |
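The FLOP figures follow directly from the $6ND$ approximation and the parameter/token counts per scale:

```python
def flops_6nd(params: float, tokens: float) -> float:
    """Chinchilla-style training-compute estimate: 6 * N * D."""
    return 6 * params * tokens

scales = {
    "0.4B": (412e6, 8.2e9),
    "1.4B": (1.4e9, 28.8e9),
    "6.9B (1x)": (6.9e9, 138e9),
    "6.9B (2x)": (6.9e9, 276e9),
}
for name, (n, d) in scales.items():
    print(f"{name}: {flops_6nd(n, d):.2e} FLOPs")
```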
Tokenization uses the GPT-NeoX byte-pair vocabulary (~50K tokens). The learning rate follows linear warmup and cosine decay down to a scale-dependent final “cool-down” value.
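The warmup-plus-cosine schedule described above can be sketched as follows; the peak and final learning-rate values in the usage line are hypothetical, since the paper's per-scale values are not reproduced here.

```python
import math

def lr_schedule(step: int, total: int, warmup: int,
                peak_lr: float, final_lr: float) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))

# illustrative values only (not the paper's hyperparameters)
lr_mid = lr_schedule(step=500, total=1000, warmup=100,
                     peak_lr=1e-3, final_lr=1e-5)
```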
4. Benchmarking, Evaluation, and Scoring
DCLM evaluates each submission on a suite of 53 zero- and few-shot tasks using LLM-Foundry. Of these, 22 “Core” tasks (e.g., BoolQ, HellaSwag, ARC-Easy/Challenge, Big-Bench subsets) are selected for low inter-run variance. The primary metric is MMLU 5-shot accuracy, following a standardized protocol:
- For each question, 5 exemplars from the same domain are prepended as in-context examples.
- Each candidate answer is scored by the summed log-probability of its token sequence under the model.
- The predicted answer is $\hat{a} = \arg\max_{a \in \text{options}} \log p(a \mid \text{context})$.
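The selection rule above amounts to an argmax over summed per-token log-probabilities. A minimal sketch with illustrative numbers (the log-probability values are made up, not model outputs):

```python
def pick_answer(option_logprobs: dict) -> str:
    """Choose the option whose token sequence has the highest
    summed log-probability under the model."""
    totals = {opt: sum(lps) for opt, lps in option_logprobs.items()}
    return max(totals, key=totals.get)

# per-token log-probs for each candidate answer (illustrative values)
logps = {
    "A": [-2.1, -0.4],
    "B": [-0.3, -0.2],
    "C": [-1.5, -1.0],
    "D": [-3.0, -0.1],
}
pred = pick_answer(logps)  # "B" (summed log-prob -0.5 is highest)
```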
Accuracy is the proportion of correct predictions across all samples. To normalize across heterogeneous tasks, DCLM employs “centered accuracy”: for observed accuracy $a$ and random-guessing baseline $b$, the centered score is $(a - b)/(1 - b)$, with overall Core or Extended averages taken across the corresponding task sets.
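Centering rescales each task so that random guessing maps to 0 and perfect accuracy to 1, making averages across tasks with different option counts comparable:

```python
def centered_accuracy(acc: float, baseline: float) -> float:
    """Rescale accuracy so random guessing scores 0 and perfect scores 1."""
    return (acc - baseline) / (1 - baseline)

# 4-option multiple choice: random baseline is 0.25
centered = centered_accuracy(0.637, 0.25)  # 0.516
```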
5. Baseline Results and Comparative Performance
The DCLM-baseline dataset, constructed via Resiliparse extraction, heuristic filtering, BFF deduplication, and fastText (OH-2.5+ELI5) filtering at a 10% cutoff, establishes a new state-of-the-art for open-data LLM pretraining. For a 7B model:
- Trained on 2.6T tokens, using roughly 40% less compute than MAP-Neo.
- Achieves 63.7% MMLU 5-shot accuracy (a 6.6 pp improvement over MAP-Neo, which scored 57.1% after training on 4.5T tokens).
- Attains 45.4 on the DCLM “Core” suite.
Closed-data models such as Mistral-7B-v0.3 and LLaMA 3 8B yield 62.7% and 66.2% MMLU scores, respectively, but at considerably higher compute (LLaMA 3 8B with 15T tokens and over 7× DCLM’s compute). Notably, the DCLM-baseline matches closed-data performance at a fraction of the computational and data resources.
| Model (7B scale) | Tokens (T) | Compute | MMLU 5-Shot (%) | Core |
|---|---|---|---|---|
| MAP-Neo (open data) | 4.5 | — | 57.1 | 40.4 |
| DCLM-Baseline (open data) | 2.6 | ~40% less than MAP-Neo | 63.7 | 45.4 |
| Mistral-7B-v0.3 (closed) | undisclosed | undisclosed | 62.7 | 57.0 |
| LLaMA 3 8B (closed) | 15 | ≈7× DCLM | 66.2 | 57.6 |
6. Key Insights and Recommendations
Empirical results place particular emphasis on model-based filtering; the fastText OH-2.5+ELI5 classifier at a top 10% threshold accounts for more than 6 percentage points gain on MMLU versus the best prior open corpus. Deduplication and the details of HTML extraction (Resiliparse vs. WET parsing, BFF scalability) are also determinative.
The addition of high-quality reference corpora (Wikipedia, books, etc.) can marginally improve performance under less stringent filtering, but under aggressive model-based filtering, further data mixing is either neutral or detrimental to downstream metrics. Human page quality judgements do not correlate strongly with benchmark gains, implying a decoupling between human and algorithmic quality metrics within this context.
A carefully filtered 2.6T-token DCLM-baseline corpus enables a 7B model to attain 64% MMLU 5-shot accuracy in 3700 H100 GPU-hours, compared to 6900 hours for LLaMA 3 8B.
7. Future Directions and Open Research Questions
Recommendations for continued research include extending DCLM toward code- and mathematics-focused textual data, enhancing multilingual and fairness-aware filtering mechanisms, improving toxicity and personally identifiable information suppression, integrating synthetic data, and evaluating outcomes at substantially larger model scales. The methodological foundation and open resources of DCLM are intended to support systematic, reproducible study of next-generation pretraining corpora (Li et al., 2024).