A Bitter Lesson for Data Filtering

Published 19 May 2026 in cs.LG and cs.AI | (2605.19407v1)

Abstract: We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.

Authors (3)

Summary

The paper demonstrates that unfiltered data, when used with sufficient compute and model capacity, achieves lower validation loss compared to filtered datasets.
It employs extensive experiments across various scales and noise injections to reveal that large models can effectively leverage even low-quality data.
The study derives scaling laws indicating that as compute increases, traditional data filtering becomes less beneficial, challenging established data curation practices.

The Limits of Data Filtering in Large-Scale LLM Pretraining

Introduction

This paper, "A Bitter Lesson for Data Filtering" (2605.19407), presents an extensive empirical and theoretical investigation into the necessity and efficacy of data filtering in large-scale LLM pretraining. Contrary to prevailing expert intuition and standard practice, which typically emphasize stringent data filtering to maximize "data quality," the authors demonstrate that, in the high-compute regime, aggressive data filtering not only becomes unnecessary but can even be suboptimal. In sufficiently large models trained for long enough, even data typically considered "low-quality" or "junk" may be beneficial or, at worst, not harmful.

Experimental Methodology and Setup

The study uses randomly sampled subsets of the full pre-2023 Common Crawl (CC) corpus, ranging from 670 million to 10 billion tokens, and applies a variety of standard filtering pipelines (e.g., DCLM-Baseline, RefinedWeb). The models are Llama-style dense transformers with 15M to 7B parameters, trained with standard hyperparameter tuning and model scaling protocols. The primary metric is validation NLL on multiple benchmark datasets (C4 English portion, Fineweb-Edu, and Cosmopedia), with downstream tasks (ARC-Easy, PIQA, SocialIQA) as a secondary check.

The experiments sweep compute budgets both by scaling model size and by varying training steps (and thus, implicitly, number of epochs), while meticulously maintaining the relative size and filter ratios across datasets for fair comparisons. Supplementary experiments include deliberate injection of purely stochastic or heavily permuted data ("junk data") to probe model robustness.

Key Findings and Numerical Results

1. Unfiltered Data Outperforms Filtered Data at Sufficient Scale

The main empirical finding is robust: in the high-compute limit, models trained directly on unfiltered Common Crawl outperform every filtered counterpart tested, across multiple scaling axes. For example, on a 1B parameter model, the unfiltered CC subset achieves an average validation loss of 3.37, lower than all filtered pools at the same scale. The effect strengthens as model scale and the number of training steps increase, a pattern that holds across two orders of magnitude in data and model size.

2. Robustness to Low-Quality and Junk Data

Injecting low-quality or random data, including randomly sampled word strings and documents with shuffled word order, reveals an unexpected level of resilience. While smaller models show clear performance degradation proportional to the volume of noise, sufficiently large models, given enough optimization steps, close the gap entirely with clean data performance. In several configurations, models trained on mixtures with substantial proportions (even up to eight times) of "junk" data ultimately match or exceed the baseline on clean data, provided model capacity is high and training is sufficiently extensive.
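The two kinds of junk described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the word-length range, and the choice to express the junk ratio in documents rather than tokens are all assumptions made for the example.

```python
import random
import string


def random_string_document(n_words, rng, min_len=3, max_len=8):
    """Build a document of random lowercase 'words' (pure noise).

    The word-length range is a hypothetical choice for illustration.
    """
    words = []
    for _ in range(n_words):
        length = rng.randint(min_len, max_len)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)


def shuffled_word_document(text, rng):
    """Keep a document's words but destroy their order."""
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)


def mix_with_junk(clean_docs, junk_ratio, rng, junk_words=50):
    """Return the clean documents plus junk_ratio times as many noise documents.

    Assumption: the ratio is counted in documents here; a real setup
    would more likely count tokens.
    """
    n_junk = round(junk_ratio * len(clean_docs))
    junk = [random_string_document(junk_words, rng) for _ in range(n_junk)]
    mixture = list(clean_docs) + junk
    rng.shuffle(mixture)
    return mixture


if __name__ == "__main__":
    rng = random.Random(0)
    clean = ["the cat sat on the mat", "large models tolerate noisy data"]
    print(shuffled_word_document(clean[0], rng))
    print(len(mix_with_junk(clean, junk_ratio=8.0, rng=rng)))  # 2 clean + 16 junk documents
```

Passing an explicit `random.Random` instance keeps the noise reproducible, which matters when comparing runs that differ only in junk proportion.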

3. Scaling Laws and Compute Projections

The authors derive empirical scaling laws that relate data pool size, model size, and number of training steps to the crossing point where unfiltered data overtakes filtered data in performance. Notably, these scaling laws are approximately linear in compute, predicting that the entire 240T-token CC pool becomes optimal at pretraining compute levels around $10^{30}$ FLOPs. Although such compute remains aspirational, forecasts suggest feasibility within this decade.
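For a sense of the magnitudes involved, the sketch below applies the standard 6NM compute approximation (6 × parameters × training tokens), which the paper is quoted as using. The specific model and token pairings are illustrative assumptions, not configurations reported by the authors, and the 1-trillion-parameter model is purely hypothetical.

```python
def training_flops(num_params, num_tokens):
    """Approximate training compute as 6 * parameters * tokens."""
    return 6.0 * num_params * num_tokens


def tokens_affordable(flop_budget, num_params):
    """Invert the 6NM rule: how many training tokens a budget buys."""
    return flop_budget / (6.0 * num_params)


def passes_over_pool(flop_budget, num_params, pool_tokens):
    """Number of passes (epochs) over a fixed pool that a budget allows."""
    return tokens_affordable(flop_budget, num_params) / pool_tokens


CC_POOL_TOKENS = 240e12      # full Common Crawl pool size quoted in the text
PROJECTED_BUDGET = 1e30      # projected crossing compute quoted in the text

if __name__ == "__main__":
    # Illustrative pairing: a 7B-parameter model, one pass over 10B tokens.
    print(f"{training_flops(7e9, 10e9):.2e} FLOPs")  # 4.20e+20 FLOPs

    # Hypothetical 1T-parameter model at the projected budget.
    epochs = passes_over_pool(PROJECTED_BUDGET, 1e12, CC_POOL_TOKENS)
    print(f"about {epochs:.0f} passes over the 240T-token pool")
```

The gap between the first number and the projected budget is roughly ten orders of magnitude, which is why the forecast is an extrapolation rather than a measured result.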

4. Distribution Shift and Degradation Edge Cases

Experiments investigating synthetically induced distribution shifts (e.g., shuffling word order) show that the observed robustness mostly extends to such cases, except on specific metrics that heavily depend on word order or initial token prediction. However, these effects are minor in typical LM applications.

5. Theoretical Justification

A formal analysis in the simplified setting of rank-constrained matrix factorization supports the empirical findings: models of sufficient capacity can allocate enough representational degrees of freedom to partition "good" from "bad" data, learning to ignore noise and extract signal even when signal is sparse. Theoretical models further illustrate when filtering could be beneficial—specifically, when the prevalence of incorrect or adversarially mislabeled data is high and indistinguishable from signal with current inductive biases.

Implications and Contradictory Claims

The study makes a strong and explicit claim: with sufficient compute and model size, the optimal data filter for LLM pretraining is, in practice, "no filter." This assertion directly challenges standard paradigms in data curation, which typically invest heavily in large-scale filtering operations premised on maximizing so-called data quality. The findings empirically refute the hypothesis that such filtering universally improves downstream metrics, showing that larger model capacity and compute reduce or completely nullify the previously observed benefits of filtering.

Additional findings contradict prior work suggesting a persistent need for filtering strategies even at large scale, instead situating filtered datasets as only locally optimal in low-compute contexts.

Practical and Theoretical Implications

Practical implications are significant for both budget allocation in large-scale pretraining and dataset construction practices. Over-filtering, as commonly practiced, may waste potentially useful data and delay the attainment of optimal model performance, especially as scaling trends push toward ever larger models and compute budgets. Compute resources might be better spent training larger models for longer on maximal data pools rather than investing in additional data engineering efforts. The work highlights, however, that filtering remains necessary when harmful data (e.g., systematically incorrect or adversarial material) is prevalent, although the authors find no evidence of this being common in current CC extractions.

On the theoretical side, the work reinforces Sutton’s "bitter lesson": in the long run, scalable, optimization-driven methodologies that eschew human-coded heuristics in favor of ingesting more raw data and compute outperform finely engineered, heavily supervised alternatives. The findings urge a reconsideration of inductive biases about what constitutes "useful" data and about model robustness to noise in the high-capacity regime.

Limitations and Future Directions

The study acknowledges several limitations. All experiments use dense transformer architectures; more complex, potentially less stable models (e.g., MoEs) may respond differently to low-quality data. Additionally, the compute requirements for the crossing point favoring unfiltered data are substantial, so in practical, compute-constrained settings, filtering still yields benefits. The projected increase of AI-generated content in web data is another open variable.

Potential future work includes:

Extending experiments to newer architectures and hybrid finetuning regimes.
Evaluating the impact of increasingly sophisticated or adversarial noise distributions.
Exploring the effects of synthetic and heavily engineered data in contrast to organic web-scale corpora.
Revising theoretical frameworks to accommodate more complex real-world signal/noise structures.

Conclusion

This paper provides compelling evidence that, for large-scale LLM pretraining, filtering for data "quality" is ultimately rendered obsolete by scaling model size and compute. Sufficiently large models not only tolerate noise, repetition, and even synthetic "junk," but also extract additional signal from data pools previously considered unusable. These results deliver a strong, quantitatively validated challenge to established best practices in data-centric ML, demonstrating that the returns from unfiltered data, when paired with adequate computational resources, ultimately surpass those obtainable from filtered corpora in the high-compute regime. Future model scaling endeavors and dataset construction should internalize these findings, with "less filtering, more data" as the guiding principle if compute permits.

Explain it Like I'm 14

What this paper is about (big picture)

This paper asks a simple question with a surprising answer: Do we really need to clean (filter) the internet data we use to train very large LLMs? The authors find that, if your model is big enough and you train it long enough, using almost everything, messy web data included, can beat carefully cleaned datasets.

Think of it like learning a language by reading the whole library, messy magazines and all, versus only reading the “best” books. The result here says: if you’re a very strong learner with lots of study time, reading the whole library eventually wins.

What the researchers wanted to know

They focused on a few plain questions:

Is filtering web data actually necessary when you have huge models and lots of compute (computer power)?
Could using the full, messy web (Common Crawl) outperform popular filtered datasets?
How much “junk” can a model handle—like random strings or documents with words shuffled out of order?
How do model size, training time, and data size trade off with each other?
Can we predict when “no filtering” becomes the best strategy?

How they tested it (in everyday terms)

Here’s the setup, translated into simple ideas:

Data “pool”: They use Common Crawl (CC), a giant snapshot of the public web. Imagine it as a massive library of everything online.
Filters: They compare training on:
- The raw CC (no filter).
- Several filtered versions that remove “low-quality” stuff (e.g., non‑English pages, very repetitive text, etc.). Examples include “RefinedWeb” and “DCLM-Baseline.”
Models: They trained different-sized transformers (like Llama-style models), from small (15 million parameters) to larger (up to 7 billion). Think of parameters as the model’s “brain cells.”
Training steps and “epochs”: Steps are how many batches the model sees. An epoch is like rereading the same pile of data once. More epochs = more re-reading.
Compute (FLOPs): The total “amount of math” the training does—like the fuel burned during practice.
What they measured: “Loss” (also called negative log-likelihood). Lower loss means the model is less surprised by text and has learned better. They checked loss on clean, held-out datasets like C4, Fineweb-Edu, and Cosmopedia.
Stress tests with junk:
- Random strings: Completely made-up words like “htb hqovl bwdws…”
- Shuffled documents: Real web pages, but with the words in each document shuffled into a random order.
- These test how robust models are to really messy inputs.
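To make the "loss" measurement above concrete, here is a small sketch of how average negative log-likelihood is computed from the probabilities a model assigns to the correct next tokens. The probabilities are made up for illustration, and treating loss as natural-log units (nats) per token is an assumption, not something the summary states.

```python
import math


def mean_nll(token_probs):
    """Average negative log-likelihood (in nats) over the correct tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(nll):
    """Perplexity: the 'effective number of choices' per token."""
    return math.exp(nll)


if __name__ == "__main__":
    # Made-up probabilities the model gave to each correct next token.
    confident = [0.5, 0.25, 0.125]
    loss = mean_nll(confident)
    print(round(loss, 3))              # 1.386
    print(round(perplexity(loss), 3))  # 4.0

    # A more "surprised" model gets a higher loss on the same text.
    surprised = [0.05, 0.02, 0.01]
    print(mean_nll(surprised) > loss)  # True
```

Under the same assumption about units, a loss of 3.37 would correspond to a perplexity of about 29, meaning the model is roughly as uncertain as if it were choosing among 29 equally likely tokens at each step.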

Analogy: Imagine training a chef. Clean data is like perfect recipes. Junk is like scrambled, messy notes. The question is: with enough practice and a skilled chef, can they still learn good cooking from lots of messy notes?

What they found (the main results)

With enough compute, no filter wins:
- For small models or short training, filtered data often performs better (as you’d expect).
- But as models get bigger and train longer, the raw Common Crawl eventually outperforms all the filtered sets they tried. In other words, the “best filter” becomes “no filter” once you’re large and patient enough.
Big models are surprisingly tough:
- Adding lots of junk data did not ruin learning. In some cases (like adding documents with shuffled words), the model eventually did better than the clean-only version, once training was long enough.
- Even when mixing in large amounts of random strings, big models closed the gap with clean-data performance as training continued.
There’s a predictable trade-off:
- The point where raw CC becomes better than filtered data depends on three knobs: data size, model size, and training steps.
- Bigger models need fewer rereads (epochs) for raw data to win; smaller models need more.
Scaling up to the full web:
- Using their measurements, the authors built simple “scaling laws” (rules of thumb) to predict when the full 240 trillion-token CC pool would beat a strong filtered set (RefinedWeb).
- Their estimate: around 1e+30 FLOPs of compute. That’s huge—much more than most training runs today—but not unimaginable in the future.
Important caveats:
- When compute is limited, filtering still helps a lot.
- Truly harmful or misleading data (e.g., confidently wrong facts) can be bad. They didn’t see tons of this in CC, but it’s a risk to watch.
- The very first tokens in a sequence can be hurt by shuffled-word training (a specific type of distribution shift), though this matters less for most real uses where the model reads more than a couple words.
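The "crossing point" idea in the results above can be sketched with a few lines of code: given loss measurements for a filtered and an unfiltered run at the same training steps, find where the unfiltered curve first drops below the filtered one. The loss values here are invented for illustration, and linear interpolation between measurements is a simplifying assumption rather than the authors' fitting procedure.

```python
def crossing_step(steps, filtered_loss, unfiltered_loss):
    """Estimate the step where unfiltered loss first falls to or below filtered loss.

    Returns None if the curves never cross within the measured range.
    Uses linear interpolation between adjacent measurements.
    """
    gaps = [u - f for u, f in zip(unfiltered_loss, filtered_loss)]
    if gaps[0] <= 0:
        return float(steps[0])
    for i in range(1, len(steps)):
        if gaps[i] <= 0:
            # Fraction of the interval at which the gap reaches zero.
            frac = gaps[i - 1] / (gaps[i - 1] - gaps[i])
            return steps[i - 1] + frac * (steps[i] - steps[i - 1])
    return None


if __name__ == "__main__":
    # Hypothetical measurements, not results from the paper.
    steps = [1000, 2000, 4000, 8000]
    filtered = [3.60, 3.50, 3.45, 3.44]    # improves early, then plateaus
    unfiltered = [3.80, 3.60, 3.40, 3.30]  # starts worse, keeps improving
    print(crossing_step(steps, filtered, unfiltered))

    # With a short training run, the raw data never catches up.
    print(crossing_step(steps[:2], filtered[:2], unfiltered[:2]))  # None
```

The second call mirrors the caveat in the text: when training stops early, no crossing is observed and filtering still looks like the better choice.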

Why this matters: It challenges the common belief that heavy filtering is always necessary. Instead, it supports a “bitter lesson” in AI: simpler approaches that scale (like using more data with fewer rules) can beat carefully hand-crafted filters—if you can afford the compute.

What this could mean going forward

For very large-scale training in the future, teams may save effort on heavy filtering and focus on using more data plus more compute, trusting big models to sort signal from noise.
Today, with smaller budgets, filtering remains valuable and practical.
Safety and quality still matter:
- As the web evolves (including more AI-generated text), careful monitoring will be needed to avoid harmful content and to understand when filtering should still be applied.
Research direction:
- Better ways to predict when “no filter” becomes optimal for a given compute budget.
- Smarter training setups that can reap the benefits of large, messy data while staying safe and factual.

In short: If you have a huge model and lots of training time, reading the whole (messy) internet can eventually beat reading only the “cleanest” parts. But until you can afford that much compute, smart filtering still pays off.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored, framed to guide actionable future research.

Validation at realistic scales: The core claim (no-filter wins with enough compute) is extrapolated from pools ≤10B tokens and models ≤7B parameters; confirm or refute with substantially larger pools (≥100B tokens) and models (≥70B) to bound the true crossing compute.
Projection uncertainty: Crossing-point forecasts (~1e30 FLOPs) rely on quadratic fits over sparse points and regimes with non-monotone high-epoch losses; provide confidence intervals, alternative functional forms, and broader sweeps to reduce extrapolation risk.
Compute-aware policy for finite budgets: Today’s compute (≤1e27 FLOPs) is far below the forecasted crossing; derive practical, compute-conditioned filtering/weighting thresholds that maximize performance under realistic budgets.
Filter search space coverage: Only a handful of heuristic filters (English, repetition, stop-words, RefinedWeb, DCLM-Baseline) were tested; systematically explore threshold landscapes, multi-filter combinations, and stronger content- and structure-aware filters.
Soft weighting vs hard filtering: Evaluate whether learned data weights, mixture sampling, or bandit-style selection can retain the benefits of unfiltered pools while outperforming both pool-only and hard-filter baselines at moderate compute.
Curriculum and ordering effects: Study data curricula, interleaving strategies, and progressive relaxation of filters to reduce the training steps required for the pool to dominate.
Broader downstream evaluation: Move beyond average NLL on C4/Fineweb-Edu/Cosmopedia; assess reasoning (math/code), long-form generation, calibration, multilingual understanding, safety/toxicity, hallucination rates, and post-training outcomes (SFT/RLHF).
Short-context degradation: Quantify user-facing impact where early-token loss matters (autocomplete, short prompts), and evaluate mitigations (e.g., reweighting first-token losses or specialized preambles) when training on shuffled-word–heavy corpora.
Harmful/mislabeled data prevalence: The brief GPT-based audit on MMLU slices is insufficient; build large-scale, audited estimates of non-factual or misleading content in CC and measure its causal effect on factual knowledge and calibration.
Adversarial and structured junk: Go beyond random strings and word shuffling to include SEO spam, templated boilerplate, synthetic clickbait, OCR noise, near-duplicates, machine-translated garbage, Markov-like babble, and adversarial misinformation to locate failure thresholds.
Duplication and heavy epoching: Characterize how document duplication interacts with many epochs (e.g., ≥10–100) for overfitting, memorization, and generalization; compare global vs locality-sensitive dedup at scale.
AI-generated content drift: Quantify how rising synthetic content in newer CC snapshots affects crossing points, downstream safety, and robustness; develop detectors and counterfactual experiments with controlled synthetic fractions.
Multilingual scope: The analysis is English-centric; measure how multilingual data and English-filtering choices affect cross-lingual transfer, contamination of English performance, and the compute required for pool optimality.
Domain composition: Test whether adding domain-diverse “poor-quality” data (code, forums, logs, subtitles) shifts crossing compute and whether some domains remain harmful at any scale.
Architecture generality: Replicate with Mixture-of-Experts and other training paradigms (e.g., retrieval-augmented, long-sequence models) to see if no-filter remains optimal or if certain architectures demand stricter curation.
Context length dependence: With training context set to 1k tokens, evaluate longer contexts (8k–32k+) to see whether document-order distortions and junk interactions change with extended dependencies.
Tokenizer effects: Test sensitivity to tokenizer choice (BPE vs sentencepiece, multilingual vocabularies) on both loss and the utility of shuffled/noisy text.
Optimization and hyperparameter robustness: Report multi-seed variance, alternative schedules (AdamW settings, batch sizes, LR decay), and weight decay sensitivity to ensure the observed crossings are not optimizer artifacts.
Compute accounting fidelity: Replace the 6NM proxy with measured FLOPs and memory/throughput effects (activation checkpointing, sequence length, MoE routing) to refine compute–performance Pareto frontiers and crossing estimates.
Cost-aware end-to-end objectives: Incorporate serving/inference cost and post-training costs (alignment, safety filtering) to test whether “overtraining + no filter” remains optimal for total cost of ownership.
Safety, bias, and PII: Quantify how unfiltered CC impacts toxicity, social bias, and PII memorization under many epochs; benchmark post-training mitigation difficulty and efficacy versus filtered baselines.
Temporal freshness and recency filtering: Assess whether unfiltered pools degrade time-sensitive factuality; measure benefits of recency-weighted sampling or time-aware filters on modern factual tasks.
Active/online data selection: Explore gradient- or influence-based online selectors that downweight harmful or unhelpful examples during training, comparing to static filters and pure pooling.
Theory beyond low-rank factorization: Develop transformer-relevant theory that relates capacity, sample complexity, noise structure, and compute to when “no filter is best,” including conditions on label noise vs covariate shift.
Failure modes at extreme epoching: Investigate why validation losses become non-monotone at very high epochs, whether crossings vanish in some regimes, and how regularization (dropout, Mixout, stochastic depth) alters this behavior.
Interaction with synthetic training data: Empirically test whether synthetic “high-quality” additions shift crossing compute downward (effective tokens) or can strictly dominate low-quality data when used as regularizers or curricula.

Practical Applications

Below is a concise mapping from the paper’s findings to practical, real‑world applications. Each item names the opportunity, who can use it, what tools/workflows might look like, and the assumptions/dependencies to watch.

Immediate Applications

Compute‑aware data curation knobs instead of fixed, “high‑quality only” filters
- What: Replace static, aggressive filters (e.g., English‑only, repetition thresholds) with adjustable thresholds tied to available compute and model size. For mid‑scale runs, relax or disable some heuristic filters to keep more Common Crawl (CC) data.
- Sectors: Software/AI, MLOps, Cloud providers, Academia
- Tools/products/workflows:
- A “filter threshold scheduler” in data pipelines (e.g., Apache Beam/Spark) that automatically loosens fastText language thresholds, repetition limits, and stop‑word gates as FLOP budgets rise.
- Dashboards that surface compute–performance Pareto frontiers and suggest filter settings given model size M and planned tokens N.
- Assumptions/dependencies:
- Dense transformer pretraining (not MoE) and stability under longer training.
- Adequate validation instrumentation (e.g., NLL on C4/FineWeb/Cosmopedia).
- Legal/compliance review for broader web usage.
Leaner data pipelines: prioritize “no‑filter” (plus safety) over heavy quality curation when compute permits
- What: For large enough models and sufficient training steps, training directly on minimally processed CC can outperform filtered datasets; simplify data ingestion by focusing on parsing, deduplication, safety, and decontamination rather than broad “quality” pruning.
- Sectors: Industry LLM teams, Open‑source model builders, Academia
- Tools/products/workflows:
- “Minimal curation” data pipeline templates: HTML parsing, de‑dup, basic PII removal, basic toxicity screens, split management; skip heavy quality‑scoring heuristics.
- Cost calculators comparing curation costs vs. added compute.
- Assumptions/dependencies:
- Compute is not the limiting bottleneck; when compute is tight, filtering still helps.
- Harmful/incorrect content is relatively rare in the corpus used (pre‑2023 CC in the paper); must monitor shift (e.g., AI‑generated content growth).
Controlled “junk” data injection as a regularizer in pretraining
- What: Introduce small proportions of shuffled‑word documents or random strings into pretraining mixtures; at sufficient model sizes/steps, performance matches or exceeds baselines and can regularize training.
- Sectors: Software/AI, Academia
- Tools/products/workflows:
- Data augmenters that generate shuffled‑word variants of web documents at configurable ratios (e.g., +20% to +400%); automated ablations to monitor NLL.
- Assumptions/dependencies:
- Gains materialize primarily at higher compute and larger models; small models may degrade.
- Maintain factual segments for downstream factual tasks; do not shuffle labeled/structured corpora.
Budgeting and planning: compute‑driven dataset choices for small/medium labs
- What: If compute is constrained, keep (or strengthen) filters; if compute allows multi‑epoch training on mid‑size pools and models ≥330M–1B, consider relaxing filters to harvest more tokens.
- Sectors: Academia, Startups, Nonprofits
- Tools/products/workflows:
- “Crossing point” estimator that projects required epochs/steps for raw‑pool to beat filtered variants at a given pool size and model size.
- Assumptions/dependencies:
- Use the paper’s observed interactions: larger models reduce the epochs needed for unfiltered data to win; ensure training doesn’t enter unstable high‑epoch regimes without monitoring.
Re‑prioritize safety and factuality checks over generic “quality” heuristics
- What: Since models tolerate low‑quality/noisy text but are vulnerable to incorrect labels/factual errors, focus data governance on toxicity, PII, deceptive/fabricated content, and known misinformation.
- Sectors: Policy/compliance, Trust & Safety, Healthcare/Finance (regulated)
- Tools/products/workflows:
- Factuality filters/classifiers, adversarial scans for counterfactual assertions, provenance checks, and post‑training alignment pipelines (e.g., RLHF, red‑teaming).
- Assumptions/dependencies:
- Harmful content fraction remains low enough that no‑filter plus safety screens is viable; regulated domains may still need strict domain‑specific filtering.
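The "filter threshold scheduler" idea from the first item above could look something like the sketch below. Everything in it is hypothetical: the `FilterConfig` fields, the strict and relaxed threshold values, and the two FLOP anchor points are placeholders chosen for illustration, and in practice would need to be calibrated against measured crossing points.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterConfig:
    """Hypothetical filter knobs; names and ranges are illustrative only."""
    min_english_score: float    # language-ID confidence required to keep a page
    max_repetition_frac: float  # largest tolerated fraction of repeated lines


def schedule_filters(flop_budget, strict, relaxed, low_flops=1e20, high_flops=1e24):
    """Interpolate filter thresholds between strict and relaxed settings.

    Interpolation is linear in log10(FLOPs); budgets outside the
    [low_flops, high_flops] range are clamped to the nearest endpoint.
    Both anchor budgets are assumed values, not figures from the paper.
    """
    span = math.log10(high_flops) - math.log10(low_flops)
    t = (math.log10(flop_budget) - math.log10(low_flops)) / span
    t = min(1.0, max(0.0, t))
    return FilterConfig(
        min_english_score=strict.min_english_score
        + t * (relaxed.min_english_score - strict.min_english_score),
        max_repetition_frac=strict.max_repetition_frac
        + t * (relaxed.max_repetition_frac - strict.max_repetition_frac),
    )


if __name__ == "__main__":
    strict = FilterConfig(min_english_score=0.9, max_repetition_frac=0.1)
    relaxed = FilterConfig(min_english_score=0.0, max_repetition_frac=1.0)  # effectively no filter
    for budget in (1e19, 1e22, 1e25):
        print(f"{budget:.0e}", schedule_filters(budget, strict, relaxed))
```

A fully relaxed configuration corresponds to the "no filter" end of the spectrum; safety and PII screens would sit outside this scheduler and stay on regardless of budget.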

Long‑Term Applications

Frontier pretraining on full, minimally filtered Common Crawl
- What: For very high compute budgets (projected ~1e30 FLOPs for 240T tokens), training directly on the full CC pool may outperform robustly filtered datasets like RefinedWeb.
- Sectors: Frontier AI labs, Cloud/compute providers
- Tools/products/workflows:
- End‑to‑end “no‑filter” web‑scale data stack focused on scalable parsing, deduplication, safety, and storage; integrated compute planners using token:parameter ratios (e.g., ~600:1) and epoch constraints.
- Assumptions/dependencies:
- Availability of extreme compute and energy; infrastructure for multi‑epoch training at web scale.
- Robust safety, legal, and copyright strategies for large‑scale web content.
Auto‑curation systems that co‑optimize filters with compute and model size
- What: Integrate scaling laws and crossing‑point predictions into AutoML/MLOps, automatically deciding how much to filter given (M, N, FLOPs).
- Sectors: Software/AI, MLOps platforms
- Tools/products/workflows:
- “Curation policy optimizer” that learns a policy over filter knobs conditioned on compute budgets and target metrics; continuous evaluation against NLL/benchmark proxies.
- Assumptions/dependencies:
- Reliable generalization of scaling laws beyond tested ranges and to varied corpora/modalities.
Architecture‑aware training that isolates noise (e.g., MoE layers tuned for noise routing)
- What: Develop model architectures or training objectives that explicitly route or compartmentalize noisy tokens, making “no‑filter” regimes effective at lower compute.
- Sectors: Model research, Hardware–software co‑design
- Tools/products/workflows:
- MoE gating policies or auxiliary losses for “noise channeling”; selective attention masks; adapters specialized for noisy spans.
- Assumptions/dependencies:
- Stability of alternative architectures (MoE) at scale; empirical validation that routing reduces interference without hurting useful signals.
Synthetic–natural mixture design leveraging noise as a regularizer
- What: Purposeful inclusion of low‑structure synthetic text (e.g., shuffled‑word variants) as a training regularizer, combined with curated high‑signal synthetic data.
- Sectors: AI research, Education content generation
- Tools/products/workflows:
- Mixers that target desired unigram/bigram distributions; curriculum schedulers that phase in noise based on training progress.
- Assumptions/dependencies:
- Balance between effective tokens and noise; careful monitoring so noise does not dominate early training or critical domains.
Sector‑specific data governance shifts from “quality” to “factuality/correctness”
- What: In domains like healthcare/finance, deprioritize generic web “quality” heuristics in favor of correctness and provenance checks, while still harvesting broader unlabeled text to expand coverage.
- Sectors: Healthcare, Finance, Legal/Policy
- Tools/products/workflows:
- Domain‑specific fact‑verification pipelines, citation/provenance enforcement, human‑in‑the‑loop audits; post‑training guardrails for safe generation.
- Assumptions/dependencies:
- A high tolerance for risk is not acceptable in these sectors, which still require strict filters for correctness and compliance.
Compute and energy planning for web‑scale “no‑filter” regimes
- What: Grid/energy‑aware scheduling and carbon‑aware training for extremely long runs that exploit large unfiltered pools.
- Sectors: Energy, Cloud, Sustainability policy
- Tools/products/workflows:
- Carbon dashboards tied to training schedules; location/time‑of‑use optimization; commitments to renewable capacity aligned with forecasted 1e29–1e30 FLOP runs.
- Assumptions/dependencies:
- Availability of low‑carbon power and capacity; regulatory alignment on large‑scale compute use.
Expanded data access, storage, and legal frameworks for broad web usage
- What: Policies and infrastructure to responsibly host, index, and use large, minimally filtered web corpora.
- Sectors: Policy, Archives/libraries, Cloud storage
- Tools/products/workflows:
- Versioned CC snapshots with metadata/provenance, opt‑out/rights‑management systems, differential privacy for sensitive content.
- Assumptions/dependencies:
- Evolving copyright/licensing norms; public trust and transparency requirements.

Notes across applications:

The benefits of “no filter” grow with model size and training steps; when compute is the bottleneck, conventional filtering remains valuable.
The paper’s positive results rely on relatively low prevalence of actively harmful/misleading data; if future web data shifts (e.g., more AI‑generated or deceptive content), safety and factuality filters become more critical.
Results were obtained with dense transformers and evaluated mainly with validation loss; downstream task gains generally correlate but should be re‑verified per use case.

Glossary

6NM approximation: A rule-of-thumb formula to estimate training compute for LLMs as 6 × (number of parameters) × (number of tokens). "We calculate the compute for a run with the standard 6NM approximation [Kaplan et al., 2020], where N is the number of total training tokens and M is the number of model parameters."
ARC-Easy: A standardized question-answering benchmark designed to test elementary-level science reasoning. "We also provide results on common benchmarks such as ARC-Easy [Clark et al., 2018] and PIQA [Bisk et al., 2019] in Appendix B."
Chinchilla-optimal: A scaling prescription that balances model size and data tokens to minimize loss for a fixed compute budget. "falls short of the Chinchilla-optimal token budget for a 1 trillion parameter model, even after accounting for diminishing returns when epoching"
Common Crawl (CC): A massive web text corpus commonly used for pretraining LLMs. "The standard approach to select pretraining data for LLMs is to filter text from sources like Common Crawl (CC) [Common Crawl, 2024]."
cosine decay learning rate schedule: A learning rate strategy that decays the rate following a cosine curve, often after a warmup. "Each point consists of a separate training run, with its own warmup and cosine decay learning rate schedule."
covariate shift: A distribution shift where input feature distributions change between training and evaluation while the conditional label distribution remains the same. "We hypothesize that LLMs are highly resistant to covariate shifts,"
DCLM-Baseline: A heavily filtered dataset derived from web text with deduplication and quality-based filtering. "DCLM-Baseline. This dataset applies deduplication and quality-based filtering with a fastText classifier to RefinedWeb."
DCLM-Pool: A large pretraining pool constructed from Common Crawl, used as the unfiltered data source in this work. "the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal as soon as 1e+30 FLOPs."
deduplication: The process of removing duplicate documents or text spans to reduce redundancy in training data. "This dataset applies deduplication and quality-based filtering with a fastText classifier to RefinedWeb."
domain adaptation: Techniques for maintaining performance when models are applied to data from different distributions or domains than the training set. "a large body of research has been dedicated to the problem of domain adaptation and learning under distribution shift"
epoching: Reusing or repeating the same dataset across multiple passes during training (i.e., training for multiple epochs). "even after accounting for diminishing returns when epoching"
fastText classifier: A lightweight text classification tool that uses bag-of-ngrams and subword information; here used for language ID and quality filtering. "quality-based filtering with a fastText classifier"
FLOPs: Floating-point operations; a measure of the total computational cost of training a model. "the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal as soon as 1e+30 FLOPs."
Goodhart's law: The principle that optimizing a proxy metric can cause it to stop reflecting the desired objective. "speculating that this follows from Goodhart's law [1984]"
Llama-style dense transformer: A transformer architecture akin to LLaMA, using fully dense (non-expert) layers for all tokens. "Our models are Llama-style dense transformers ranging from 15 million to 7 billion parameters,"
low-rank matrix factorization: A modeling approach that represents data or transformations with matrices of limited rank to capture core structure. "In low-rank matrix factorization, the simplest 1 hidden layer (linear) neural network, we see exactly this behavior at the population level."
Mixture of Experts (MoEs): Architectures that route tokens to specialized expert subnetworks, potentially improving parameter efficiency but adding training instability. "There may be more unstable architectures such as Mixture of Experts models (MoEs), or phenomena in later stages of training,"
negative log-likelihood (NLL): A loss function equivalent to cross-entropy for language modeling, measuring how well the model predicts observed tokens. "Our main metrics of interest are the loss (negative log-likelihood) on various datasets,"
Pareto frontier: The set of choices that are optimal trade-offs between two competing objectives, here compute and performance. "we take the same runs from Figure 1 and derive a compute-performance Pareto frontier."
perplexity-based filters: Data selection rules that use model perplexity as a proxy for quality or difficulty, often to remove noisy text. "using looser perplexity-based filters mitigates data scarcity."
RefinedWeb: A curated web dataset built with cleaning and filtering steps intended to improve quality for LLM pretraining. "Refined Web. This consists of the filters above along with other similar filters, in an attempt to reproduce the RefinedWeb dataset [Penedo et al., 2023]."
scaling laws: Empirical relationships describing how model performance depends on data, model size, and compute. "Muennighoff et al. [2025] derive scaling laws that factor data repetition into the original Chinchilla scaling laws,"
stop words: Common function words (e.g., “the”, “and”) often used as heuristics in filtering or preprocessing. "This filter ensures that a document contains at least 2 occurrences of English stop words"
tokens-per-parameter ratio: A guideline relating total training tokens to the number of (non-embedding) model parameters for efficient training. "specifying a token-to-non-embedding-parameter ratio (600:1, following DeepSeek V4)."
unigram distribution: The frequency distribution over individual words independent of their order. "despite only the unigram distribution of the documents remaining intact."
weight decay: An L2-regularization technique that penalizes large weights to improve generalization during training. "For each of the models, we tune the training step count and weight decay,"
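
The 6NM approximation defined above is simple enough to sketch in a few lines. The following is a minimal illustration, not code from the paper: the function names `training_flops` and `tokens_for_budget` are hypothetical, and the example pairs a 7 billion parameter model with 10 billion training tokens purely for illustration (both figures sit at the upper end of the ranges the study reports, but the pairing is an assumption).

```python
def training_flops(num_params: float, num_tokens: float) -> float:
    """Estimate training compute as 6 * parameters * tokens (rule of thumb)."""
    if num_params < 0 or num_tokens < 0:
        raise ValueError("num_params and num_tokens must be non-negative")
    return 6.0 * num_params * num_tokens


def tokens_for_budget(flops_budget: float, num_params: float) -> float:
    """Invert the same rule: tokens affordable for a given compute budget."""
    if num_params <= 0:
        raise ValueError("num_params must be positive")
    return flops_budget / (6.0 * num_params)


if __name__ == "__main__":
    # Illustrative (assumed) pairing: 7e9 parameters, 10e9 training tokens.
    flops = training_flops(7e9, 10e9)
    print(f"Estimated training compute: {flops:.2e} FLOPs")

    # Round trip: the token count recovered from that budget.
    tokens = tokens_for_budget(flops, 7e9)
    print(f"Tokens affordable at that budget: {tokens:.2e}")
```

Because the estimate is linear in both factors, doubling either the parameter count or the token count doubles the estimated compute. It counts total training tokens, so repeated passes over the same data (epoching) add to the token count.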
