
DataComp for LMs Benchmark

Updated 18 March 2026
  • DataComp for Language Models (DCLM) is an open-data benchmark that evaluates dataset curation—filtering, deduplication, mixing, and augmentation—on LLM performance.
  • It uses fixed model architectures and training recipes, enabling apples-to-apples comparisons across 53 downstream tasks under controlled compute budgets.
  • Advanced strategies such as BETR and REWIRE illustrate that targeted data selection and synthetic augmentation can boost efficiency and accuracy in LLM training.

DataComp for Language Models (DCLM) is an open-data benchmark and testbed for systematically evaluating the impact of training-set design on the downstream performance of decoder-only LLMs. Central to DCLM is a 240-trillion-token corpus drawn from Common Crawl, paired with standardized model architectures and training recipes. DCLM's primary aim is to isolate data curation—filtering, deduplication, mixing, and augmentation—as the experimental variable, while controlling for architecture and optimization. This framework establishes compute and data efficiency as core axes of competition, facilitating apples-to-apples comparisons between dataset construction methodologies and enabling reproducible progress in LLM training (Li et al., 2024).

1. DCLM Benchmark Structure and Objectives

DCLM presents both a large-scale common pool ("DCLM-Pool") and a standardized benchmark workflow. Its foundation is a raw corpus of 240 trillion GPT-NeoX tokens (≈370 TB gzip-compressed HTML), extracted via the Resiliparse HTML parser from all Common Crawl archives up to 2022, with minimal filtering for robots.txt compliance. To prevent inadvertent test contamination, DCLM supplies decontamination tools based on substring hashing and deduplication, but does not remove overlaps by default (Li et al., 2024).

Participants are provided with fixed model architectures (pre-norm decoder-only Transformers) and training parameters via the OpenLM codebase, with scales ranging from 412 million to 7 billion parameters. Each submission is evaluated against a broad, multi-domain suite of 53 downstream tasks, enabling rigorous comparison of dataset curation strategies independent of model or optimization confounders (Li et al., 2024).

Compute budgets (as measured in FLOPs) and training corpus sizes are fixed per model scale, following Chinchilla scaling laws (e.g., for 7B parameter models at 1×: 138B tokens).
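As a rough illustration of how these budgets are derived, the ~20-tokens-per-parameter Chinchilla rule of thumb reproduces the published figures to within a few percent. The function below is an illustrative sketch, not DCLM's code:

```python
def chinchilla_tokens(n_params: float, multiplier: float = 1.0,
                      tokens_per_param: float = 20.0) -> float:
    """Approximate Chinchilla-optimal training-token budget.

    DCLM fixes the token budget per model scale; the ~20 tokens/parameter
    rule of thumb lands close to the stated budgets (e.g., 7B at 1x ->
    ~140B tokens vs. the benchmark's 138B).
    """
    return n_params * tokens_per_param * multiplier

print(f"{chinchilla_tokens(7e9) / 1e9:.0f}B tokens")  # ~140B for 7B at 1x
```

The small gap to the official 138B figure reflects rounding in the benchmark's published budgets rather than a different scaling law.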

2. Data Curation Methodologies

DCLM supports diverse data curation approaches, with a particular focus on filtering and deduplication as key determinants of model performance. Notable strategies include:

  • Heuristic Filtering: Replicates RefinedWeb-style preprocessing—language detection (fastText), boilerplate removal, URL filtering, and regex heuristics.
  • Deduplication: Two-level Bloom filter and n-gram based duplicate removal, with paragraphs and documents scanned for repeated content at a window length of 13 tokens, thresholded at 0.8 for fractional duplication.
  • Model-Based Filtering: Supervised fastText classifier (trained on OpenHermes-2.5 and ELI5 as positives, and heuristic pool negatives) assigns a probability-based quality score. The top 10% of documents are retained.
  • Alternative Scoring: DCLM evaluates perplexity filtering, top-k logit scoring, AskLLM (LLM-in-the-loop annotation), semantic deduplication, and PageRank-based criteria. Ablations confirm fastText classifier filtering consistently outperforms other alternatives by 2–4 percentage points on core evaluation metrics (Li et al., 2024).
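The deduplication step above can be sketched as follows. A plain Python set stands in for the Bloom filter (which trades exact membership for bounded memory at corpus scale), so this is an illustrative sketch rather than DCLM's implementation:

```python
def duplicate_fraction(tokens, seen_ngrams, n=13):
    """Fraction of a document's length-13 token n-grams already seen.

    `seen_ngrams` stands in for the two-level Bloom filter used at scale;
    a set gives exact answers at the cost of memory. Documents whose
    fractional duplication exceeds the 0.8 threshold would be dropped.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    dup = sum(1 for g in ngrams if g in seen_ngrams)
    seen_ngrams.update(ngrams)  # register this document's n-grams
    return dup / len(ngrams)

seen = set()
doc = list(range(100))
frac = duplicate_fraction(doc, seen)   # 0.0: nothing seen yet
frac = duplicate_fraction(doc, seen)   # 1.0: exact repeat of the same doc
keep = frac < 0.8                      # DCLM's fractional-duplication threshold
```

In a production pipeline the same scan runs at both paragraph and document granularity, as described above.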

A summary of curation and filtering steps is presented in the following table:

| Step | Description | Outcome |
|------|-------------|---------|
| Heuristic Filtering | Language detection, boilerplate removal, URL/regex filtering | RefinedWeb subset |
| Deduplication | Bloom filter, n-gram matching, paragraph/document thresholds | Reduced redundancy |
| Model-Based Filtering | fastText supervised scoring (top 10%) | High-quality dataset |
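The final model-based filtering step reduces to a top-decile cut on classifier scores. A minimal sketch, assuming per-document quality scores (e.g., the fastText classifier's positive-class probability) have already been computed:

```python
import numpy as np

def retain_top_fraction(scores, fraction=0.10):
    """Boolean mask keeping documents in the top `fraction` by score.

    In DCLM, `scores` come from the supervised fastText classifier;
    here any array of floats works. Illustrative sketch only.
    """
    scores = np.asarray(scores)
    cutoff = np.quantile(scores, 1.0 - fraction)
    return scores >= cutoff

scores = np.random.default_rng(0).random(10_000)  # stand-in quality scores
mask = retain_top_fraction(scores, 0.10)
print(mask.mean())  # ~0.10 of documents retained
```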

After filtering, DCLM-Baseline comprises approximately 2.6 trillion tokens (≈1% of the raw pool); a 7B decoder-only model trained on it reaches 64% 5-shot MMLU, matching or exceeding closed-data baselines such as Mistral-7B-v0.3 and Llama 3 8B while using substantially less compute (Li et al., 2024).

3. Advanced Data Selection: Benchmark-Targeted Ranking (BETR)

BETR ("Benchmark-Targeted Ranking") introduces explicit benchmark-driven selection of pretraining data, departing from implicit, proxy-based strategies. It operationalizes data selection as a ranking problem, explicitly conditioning on downstream benchmark examples (Mizrahi et al., 16 Jul 2025).

BETR Workflow

  1. Embedding Construction:
    • Sample 10 million documents from DCLM-Pool.
    • Individually embed both document candidates and benchmark training instances with a transformer-based encoder (e.g., Arctic-Embed L 2.0, GTE Large v1.5).
  2. Similarity Scoring:
    • Compute cosine similarity for each document-benchmark pair.
    • Rank documents for each benchmark; map ranks via a non-increasing value function (e.g., v(r)=1/r).
    • Aggregate using a max operator to reward high-similarity near-matches:

    S_j = \max_{1 \leq i \leq M} v(r_{ij})

  3. Labeling and Classifier Extension:

    • Label top 10% by S_j as positive; bottom 90% as negative.
    • Train fastText classifier on these labels to score the entire corpus.
    • Filter by retaining top x% of tokens, with x tuned to compute scale (Mizrahi et al., 16 Jul 2025).
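Steps 1–2 of the workflow above can be sketched with plain NumPy. Embedding models are stubbed out with random vectors, and all shapes and names are assumptions, not the authors' code:

```python
import numpy as np

def betr_scores(doc_emb, bench_emb):
    """BETR document scoring sketch: S_j = max_i v(r_ij) with v(r) = 1/r.

    Embeddings are L2-normalized so the dot product is cosine similarity;
    each benchmark example ranks all documents (rank 1 = most similar),
    and each document keeps its best value over benchmark examples.
    """
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    b = bench_emb / np.linalg.norm(bench_emb, axis=1, keepdims=True)
    sim = b @ d.T                                 # (M bench, N docs)
    order = np.argsort(-sim, axis=1)              # most similar first
    ranks = np.empty_like(order)
    rows = np.arange(sim.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, sim.shape[1] + 1)
    return (1.0 / ranks).max(axis=0)              # S_j per document

rng = np.random.default_rng(0)
docs, bench = rng.normal(size=(1000, 64)), rng.normal(size=(8, 64))
s = betr_scores(docs, bench)
positives = s >= np.quantile(s, 0.90)  # step 3: top 10% -> classifier labels
```

Step 3 then trains a fastText classifier on these labels so the whole corpus can be scored cheaply, without embedding every document.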

Comparative Analysis with DCLM-Baseline

Where DCLM-Baseline uses generic, style/quality-proxy positives (ELI5/OpenHermes-2.5) and negatives from RefinedWeb, BETR targets actual downstream objectives by maximizing alignment between pretraining data and benchmark task distributions.

Quantitatively, BETR yields a mean compute multiplier of 2.1× over DCLM-Baseline, and 4.7× over unfiltered data, consistently improving performance on 9 of 10 core benchmarks. At 7B-10x scale, this corresponds to 1.7–1.8 percentage point accuracy gains, with speedups of 1.8–2.8× across scales (Mizrahi et al., 16 Jul 2025).

Scaling law analysis models loss and accuracy via fitted power and sigmoid functions, mapping FLOPs to optimal filtering-aggressiveness schedules. The optimal retained fraction increases substantially with larger model and data regimes (e.g., top 3% at 10^{20} FLOPs, top 10% at 10^{22}, up to top 30% at 10^{23}) (Mizrahi et al., 16 Jul 2025).

4. Synthetic Data Generation and Augmentation

DCLM supports research on synthetic data generation and augmentation, as exemplified by the REWIRE pipeline. REWIRE targets filtered-out, moderate-quality web documents, prompting a state-of-the-art LLM (Llama-3.3-70B-Instruct) to generate chain-of-thought rewrites, yielding approximately 400B synthetic tokens (Nguyen et al., 5 Jun 2025).

Post-generation, a secondary fastText classifier further filters these rewrites, and the final pretraining mix combines "high-quality" raw and synthetic tokens, typically at a 1:1 ratio. Across 1B, 3B, and 7B model scales, REWIRE mixes yield 1.0, 1.3, and 2.5 pp gains (CORE metric, 22 tasks) compared to natural data alone, and outperform doubling the pool of natural data. Notably, 82% of final rewrites originate from documents that would otherwise be discarded. Gains increase with scale and are robust across diverse evaluation objectives (Nguyen et al., 5 Jun 2025).
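The 1:1 mixing step can be sketched as a simple round-robin over documents. A real pipeline would balance token counts rather than document counts and shuffle shards, so names and structure here are purely illustrative:

```python
import itertools

def mix_one_to_one(raw_docs, synthetic_docs):
    """Interleave raw and synthetic documents at a REWIRE-style 1:1 ratio.

    Illustrative sketch: zip_longest keeps the tail of whichever source
    is longer, so no data is silently dropped.
    """
    mixed = []
    for r, s in itertools.zip_longest(raw_docs, synthetic_docs):
        if r is not None:
            mixed.append(r)
        if s is not None:
            mixed.append(s)
    return mixed

mixed = mix_one_to_one(["raw1", "raw2"], ["syn1", "syn2"])
# -> ["raw1", "syn1", "raw2", "syn2"]
```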

Compared to other synthetic approaches (e.g., QA-synthesis, Wikipedia-style rephrasing), HQ Raw + REWIRE consistently outperforms matching-scale alternatives (Nguyen et al., 5 Jun 2025).

5. Evaluation Suite and Metrics

DCLM models are assessed using a standardized harness built on LLM-Foundry, spanning 53 downstream tasks that probe factuality, reasoning, language understanding, and code. Primary metrics include:

  • 5-shot MMLU accuracy: Evaluated over 40+ subjects, with top models achieving 64% on the open-data 7B-2x baseline (Li et al., 2024).
  • Core Centered Accuracy: Averaged, scaled performance over 22 low-variance tasks, with rescaled accuracies (random = 0, perfect = 1).
  • Extended Centered Accuracy: Analogous metric across all 53 tasks.
  • Specialized Scores: e.g., TruthfulQA and few-shot world knowledge tasks.

These metrics are designed for low variance and generality, ensuring that curation effects transfer beyond any single domain (Li et al., 2024).
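The centered-accuracy rescaling described above amounts to a linear transform per task. A minimal sketch, based only on the description given here:

```python
def centered_accuracy(acc: float, random_baseline: float) -> float:
    """Rescale raw accuracy so random guessing maps to 0 and perfect to 1.

    This is the per-task transform behind DCLM's Core/Extended centered
    accuracy; the benchmark then averages it over 22 (Core) or 53
    (Extended) tasks. Sketch, not the benchmark's harness code.
    """
    return (acc - random_baseline) / (1.0 - random_baseline)

# 4-way multiple choice: 25% accuracy is chance level
print(centered_accuracy(0.64, 0.25))  # 0.52
```

Centering prevents tasks with high chance-level accuracy (e.g., binary choice at 50%) from inflating the average.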

6. Insights, Generalization, and Recommendations

Empirical investigation within DCLM underscores several key findings:

  • Model-Based Filtering Dominates: fastText classifier filtering, using instruction-tuned data as positives, remains the most effective single selection method.
  • Task-Targeted Selection Accelerates Progress: BETR demonstrates that explicit alignment to downstream benchmarks nearly doubles compute efficiency over baselines.
  • Scale-Adaptive Data Retention: Larger models benefit from less aggressive filtering; smaller models require more stringent data selection.
  • Generalization: BETR’s Evaluation-Blind variant (targeting disjoint benchmark sets) generalizes well, matching or exceeding baseline performance on held-out tasks if enough targets are used—confirming that broad benchmark selection builds versatile models, whereas narrow targeting can induce overspecialization.
  • Synthetic Augmentation is Scalable and Effective: REWIRE enables large lifts in usable data supply, particularly as natural web growth stagnates and pretraining compute accelerates.
  • Workflow Efficiency: Lightweight, bag-of-ngrams scorers (fastText) outperform heavier regressor/classifier models for scaling to 100B+ token corpora (Mizrahi et al., 16 Jul 2025, Nguyen et al., 5 Jun 2025).

Open questions remain regarding the efficacy of these methods for multilingual or code LLMs, continuous (rather than hard) weighting of data, and the source of fastText’s surprising advantage in selection quality (Mizrahi et al., 16 Jul 2025). Extension to 30B+ parameter models and specialized fairness/toxicity/privacy filtering is explicitly flagged as future work (Li et al., 2024).

7. Impact and Future Directions

DCLM sets a new paradigm for open-data LLM benchmarking by isolating and systematically evaluating dataset construction. Its findings decisively show that aggressive research into data selection, filtering, and augmentation can yield larger aggregate performance gains than simply scaling token count or model size. With open-sourcing of code, corpora, and leaderboards, DCLM invites further collaborative refinement—especially in multilingual, code, and domain-specific settings—and stands as a foundation for reproducible, data-driven advancement in language modeling (Li et al., 2024, Mizrahi et al., 16 Jul 2025).

Key recommendations for future DCLM-style pipelines include explicit task alignment, scale-adaptive filtering, prioritization of broad benchmark diversity for generalist models, and leveraging lightweight classifier-based scoring for whole-corpus extension. Addressing the impending “data wall” with methods such as REWIRE is central to sustaining continued advancement in LLMs as compute scales increase (Nguyen et al., 5 Jun 2025).
