A Bitter Lesson for Data Filtering
Abstract: We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally “poor” data.
Explain it Like I'm 14
What this paper is about (big picture)
This paper asks a simple question with a surprising answer: Do we really need to clean (filter) the internet data we use to train very large language models (LLMs)? The authors find that, if your model is big enough and you train it long enough, using almost everything (messy web data included) can beat carefully cleaned datasets.
Think of it like learning a language by reading the whole library, messy magazines and all, versus only reading the “best” books. The result here says: if you’re a very strong learner with lots of study time, reading the whole library eventually wins.
What the researchers wanted to know
They focused on a few plain questions:
- Is filtering web data actually necessary when you have huge models and lots of compute (computer power)?
- Could using the full, messy web (Common Crawl) outperform popular filtered datasets?
- How much “junk” can a model handle—like random strings or documents with words shuffled out of order?
- How do model size, training time, and data size trade off with each other?
- Can we predict when “no filtering” becomes the best strategy?
How they tested it (in everyday terms)
Here’s the setup, translated into simple ideas:
- Data “pool”: They use Common Crawl (CC), a giant snapshot of the public web. Imagine it as a massive library of everything online.
- Filters: They compare training on:
- The raw CC (no filter).
- Several filtered versions that remove “low-quality” stuff (e.g., non‑English pages, very repetitive text, etc.). Examples include “RefinedWeb” and “DCLM-Baseline.”
- Models: They trained different-sized transformers (like Llama-style models), from small (15 million parameters) to larger (up to 7 billion). Think of parameters as the model’s “brain cells.”
- Training steps and “epochs”: Steps are how many batches the model sees. An epoch is like rereading the same pile of data once. More epochs = more re-reading.
- Compute (FLOPs): The total “amount of math” the training does—like the fuel burned during practice.
- What they measured: “Loss” (also called negative log-likelihood). Lower loss means the model is less surprised by text and has learned better. They checked loss on clean, held-out datasets like C4, Fineweb-Edu, and Cosmopedia.
- Stress tests with junk:
- Random strings: Completely made-up words like “htb hqovl bwdws…”
- Shuffled documents: Real web pages, but with the words in each document shuffled into a random order.
- These test how robust models are to really messy inputs.
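The bookkeeping quantities in the list above (compute, epochs, and loss) can be made concrete with a small Python sketch. It assumes the 6NM rule of thumb quoted in the glossary (6 × parameters × tokens). The function names and the example run are hypothetical and are not taken from the paper's code.

```python
import math


def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute with the 6NM rule of thumb."""
    return 6.0 * num_params * num_tokens


def epochs_over_pool(num_tokens: float, pool_tokens: float) -> float:
    """How many times a training run rereads a data pool of a given size."""
    return num_tokens / pool_tokens


def mean_nll(token_probs: list[float]) -> float:
    """Average negative log-likelihood (in nats) of the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical run: a 7B-parameter model trained on 140B tokens
# drawn from a 10B-token pool.
flops = training_flops(7e9, 140e9)
epochs = epochs_over_pool(140e9, 10e9)
print(f"compute ~ {flops:.2e} FLOPs, epochs = {epochs:.0f}")

# A model that assigns higher probability to the true tokens has lower loss.
confident = mean_nll([0.9, 0.8, 0.95])
unsure = mean_nll([0.2, 0.1, 0.3])
print(f"NLL confident = {confident:.3f}, unsure = {unsure:.3f}")
```

Keeping compute and epochs as separate functions mirrors the distinction the summary draws: the same FLOP budget can mean few passes over a big pool or many passes over a small one.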
Analogy: Imagine training a chef. Clean data is like perfect recipes. Junk is like scrambled, messy notes. The question is: with enough practice and a skilled chef, can they still learn good cooking from lots of messy notes?
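The two kinds of junk described in the stress tests can be sketched in a few lines of Python. This is a minimal illustration, not the paper's generator: the word lengths (3 to 8 letters), the helper names, and the mixing function are assumptions made for the example.

```python
import random
import string
from collections import Counter


def random_string_doc(num_words: int, rng: random.Random) -> str:
    """Build a document of made-up lowercase 'words' of 3 to 8 letters."""
    words = []
    for _ in range(num_words):
        length = rng.randint(3, 8)
        words.append("".join(rng.choice(string.ascii_lowercase) for _ in range(length)))
    return " ".join(words)


def shuffle_words(doc: str, rng: random.Random) -> str:
    """Shuffle word order within one document; word counts are unchanged."""
    words = doc.split()
    rng.shuffle(words)
    return " ".join(words)


def mix_in_junk(clean_docs, junk_fraction, make_junk, rng):
    """Append junk docs amounting to `junk_fraction` of the clean set size."""
    num_junk = round(junk_fraction * len(clean_docs))
    junk = [make_junk(rng.choice(clean_docs)) for _ in range(num_junk)]
    mixed = list(clean_docs) + junk
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)
clean = ["the cat sat on the mat", "large models tolerate noisy web text"]
shuffled = shuffle_words(clean[0], rng)
# Shuffling preserves the unigram (word-count) distribution of the document.
assert Counter(shuffled.split()) == Counter(clean[0].split())

mixed = mix_in_junk(clean, 1.0, lambda d: shuffle_words(d, rng), rng)
print(len(mixed), "documents after adding +100% shuffled junk")
```

The shuffled variant keeps every word of the source page and only destroys order, which is why it is a milder form of junk than random strings.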
What they found (the main results)
- With enough compute, no filter wins:
- For small models or short training, filtered data often performs better (as you’d expect).
- But as models get bigger and train longer, the raw Common Crawl eventually outperforms all the filtered sets they tried. In other words, the “best filter” becomes “no filter” once you’re large and patient enough.
- Big models are surprisingly tough:
- Adding lots of junk data did not ruin learning. In some cases (like adding documents with shuffled words), the model eventually did better than the clean-only version, once training was long enough.
- Even when mixing in large amounts of random strings, big models closed the gap with clean-data performance as training continued.
- There’s a predictable trade-off:
- The point where raw CC becomes better than filtered data depends on three knobs: data size, model size, and training steps.
- Bigger models need fewer rereads (epochs) for raw data to win; smaller models need more.
- Scaling up to the full web:
- Using their measurements, the authors built simple “scaling laws” (rules of thumb) to predict when the full 240 trillion-token CC pool would beat a strong filtered set (RefinedWeb).
- Their estimate: around 1e+30 FLOPs of compute. That’s huge—much more than most training runs today—but not unimaginable in the future.
- Important caveats:
- When compute is limited, filtering still helps a lot.
- Truly harmful or misleading data (e.g., confidently wrong facts) can be bad. They didn’t see tons of this in CC, but it’s a risk to watch.
- The very first tokens in a sequence can be hurt by shuffled-word training (a specific type of distribution shift), though this matters less for most real uses where the model reads more than a couple of words.
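The idea of predicting a crossing point from measured runs can be illustrated with a short Python sketch. It makes a simplifying assumption: each loss curve is fitted with a straight line in log10(compute), which is cruder than the fits used in the paper. All data points below are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def crossing_log_compute(fit_a, fit_b):
    """log10(FLOPs) where two fitted lines intersect, or None if parallel."""
    (a0, a1), (b0, b1) = fit_a, fit_b
    if a1 == b1:
        return None
    return (b0 - a0) / (a1 - b1)


# Hypothetical (log10 FLOPs, loss) points, invented for illustration only.
log_flops = [18.0, 19.0, 20.0, 21.0]
filtered_loss = [3.60, 3.40, 3.20, 3.00]    # better early, flatter slope
unfiltered_loss = [3.95, 3.65, 3.35, 3.05]  # worse early, steeper slope

filtered_fit = fit_line(log_flops, filtered_loss)
unfiltered_fit = fit_line(log_flops, unfiltered_loss)
cross = crossing_log_compute(filtered_fit, unfiltered_fit)
print(f"curves cross near 1e{cross:.1f} FLOPs")
```

In this toy data the crossing lies beyond the last measured point, so the estimate is an extrapolation and inherits all the uncertainty of the fit.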
Why this matters: It challenges the common belief that heavy filtering is always necessary. Instead, it supports a “bitter lesson” in AI: simpler approaches that scale (like using more data with fewer rules) can beat carefully hand-crafted filters—if you can afford the compute.
What this could mean going forward
- For very large-scale training in the future, teams may save effort on heavy filtering and focus on using more data plus more compute, trusting big models to sort signal from noise.
- Today, with smaller budgets, filtering remains valuable and practical.
- Safety and quality still matter:
- As the web evolves (including more AI-generated text), careful monitoring will be needed to avoid harmful content and to understand when filtering should still be applied.
- Research direction:
- Better ways to predict when “no filter” becomes optimal for a given compute budget.
- Smarter training setups that can reap the benefits of large, messy data while staying safe and factual.
In short: If you have a huge model and lots of training time, reading the whole (messy) internet can eventually beat reading only the “cleanest” parts. But until you can afford that much compute, smart filtering still pays off.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of what remains missing, uncertain, or unexplored, framed to guide actionable future research.
- Validation at realistic scales: The core claim (no-filter wins with enough compute) is extrapolated from pools ≤10B tokens and models ≤7B parameters; confirm or refute with substantially larger pools (≥100B tokens) and models (≥70B) to bound the true crossing compute.
- Projection uncertainty: Crossing-point forecasts (~1e30 FLOPs) rely on quadratic fits over sparse points and regimes with non-monotone high-epoch losses; provide confidence intervals, alternative functional forms, and broader sweeps to reduce extrapolation risk.
- Compute-aware policy for finite budgets: Today’s compute (≤1e27 FLOPs) is far below the forecasted crossing; derive practical, compute-conditioned filtering/weighting thresholds that maximize performance under realistic budgets.
- Filter search space coverage: Only a handful of heuristic filters (English, repetition, stop-words, RefinedWeb, DCLM-Baseline) were tested; systematically explore threshold landscapes, multi-filter combinations, and stronger content- and structure-aware filters.
- Soft weighting vs hard filtering: Evaluate whether learned data weights, mixture sampling, or bandit-style selection can retain the benefits of unfiltered pools while outperforming both pool-only and hard-filter baselines at moderate compute.
- Curriculum and ordering effects: Study data curricula, interleaving strategies, and progressive relaxation of filters to reduce the training steps required for the pool to dominate.
- Broader downstream evaluation: Move beyond average NLL on C4/Fineweb-Edu/Cosmopedia; assess reasoning (math/code), long-form generation, calibration, multilingual understanding, safety/toxicity, hallucination rates, and post-training outcomes (SFT/RLHF).
- Short-context degradation: Quantify user-facing impact where early-token loss matters (autocomplete, short prompts), and evaluate mitigations (e.g., reweighting first-token losses or specialized preambles) when training on shuffled-word–heavy corpora.
- Harmful/mislabeled data prevalence: The brief GPT-based audit on MMLU slices is insufficient; build large-scale, audited estimates of non-factual or misleading content in CC and measure its causal effect on factual knowledge and calibration.
- Adversarial and structured junk: Go beyond random strings and word shuffling to include SEO spam, templated boilerplate, synthetic clickbait, OCR noise, near-duplicates, machine-translated garbage, Markov-like babble, and adversarial misinformation to locate failure thresholds.
- Duplication and heavy epoching: Characterize how document duplication interacts with many epochs (e.g., ≥10–100) for overfitting, memorization, and generalization; compare global vs locality-sensitive dedup at scale.
- AI-generated content drift: Quantify how rising synthetic content in newer CC snapshots affects crossing points, downstream safety, and robustness; develop detectors and counterfactual experiments with controlled synthetic fractions.
- Multilingual scope: The analysis is English-centric; measure how multilingual data and English-filtering choices affect cross-lingual transfer, contamination of English performance, and the compute required for pool optimality.
- Domain composition: Test whether adding domain-diverse “poor-quality” data (code, forums, logs, subtitles) shifts crossing compute and whether some domains remain harmful at any scale.
- Architecture generality: Replicate with Mixture-of-Experts and other training paradigms (e.g., retrieval-augmented, long-sequence models) to see if no-filter remains optimal or if certain architectures demand stricter curation.
- Context length dependence: With training context set to 1k tokens, evaluate longer contexts (8k–32k+) to see whether document-order distortions and junk interactions change with extended dependencies.
- Tokenizer effects: Test sensitivity to tokenizer choice (BPE vs sentencepiece, multilingual vocabularies) on both loss and the utility of shuffled/noisy text.
- Optimization and hyperparameter robustness: Report multi-seed variance, alternative schedules (AdamW settings, batch sizes, LR decay), and weight decay sensitivity to ensure the observed crossings are not optimizer artifacts.
- Compute accounting fidelity: Replace the 6NM approximation with measured FLOPs and memory/throughput effects (activation checkpointing, sequence length, MoE routing) to refine compute–performance Pareto frontiers and crossing estimates.
- Cost-aware end-to-end objectives: Incorporate serving/inference cost and post-training costs (alignment, safety filtering) to test whether “overtraining + no filter” remains optimal for total cost of ownership.
- Safety, bias, and PII: Quantify how unfiltered CC impacts toxicity, social bias, and PII memorization under many epochs; benchmark post-training mitigation difficulty and efficacy versus filtered baselines.
- Temporal freshness and recency filtering: Assess whether unfiltered pools degrade time-sensitive factuality; measure benefits of recency-weighted sampling or time-aware filters on modern factual tasks.
- Active/online data selection: Explore gradient- or influence-based online selectors that downweight harmful or unhelpful examples during training, comparing to static filters and pure pooling.
- Theory beyond low-rank factorization: Develop transformer-relevant theory that relates capacity, sample complexity, noise structure, and compute to when “no filter is best,” including conditions on label noise vs covariate shift.
- Failure modes at extreme epoching: Investigate why validation losses become non-monotone at very high epochs, whether crossings vanish in some regimes, and how regularization (dropout, Mixout, stochastic depth) alters this behavior.
- Interaction with synthetic training data: Empirically test whether synthetic “high-quality” additions shift crossing compute downward (effective tokens) or can strictly dominate low-quality data when used as regularizers or curricula.
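The contrast between hard filtering and soft weighting raised in the list above can be sketched as follows. This is a hypothetical illustration: the quality scores, the temperature parameterization, and the function names are assumptions, and the paper does not evaluate this scheme.

```python
import random


def hard_filter(docs, scores, threshold):
    """Keep only documents whose quality score meets the threshold."""
    return [d for d, s in zip(docs, scores) if s >= threshold]


def soft_weights(scores, temperature):
    """Turn scores into sampling probabilities; higher temperature is flatter."""
    raw = [s ** (1.0 / temperature) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


docs = ["encyclopedia entry", "forum thread", "spam page"]
scores = [0.9, 0.5, 0.1]  # hypothetical quality scores in (0, 1]

kept = hard_filter(docs, scores, threshold=0.4)
probs = soft_weights(scores, temperature=2.0)

rng = random.Random(0)
batch = rng.choices(docs, weights=probs, k=8)
print("hard filter keeps:", kept)
print("soft sampling probabilities:", [round(p, 3) for p in probs])
```

With this parameterization, a very large temperature approaches uniform sampling (no filter), while a temperature near zero concentrates on the top-scored documents, so one knob spans the range between the two extremes.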
Practical Applications
Below is a concise mapping from the paper’s findings to practical, real‑world applications. Each item names the opportunity, who can use it, what tools/workflows might look like, and the assumptions/dependencies to watch.
Immediate Applications
- Compute‑aware data curation knobs instead of fixed, “high‑quality only” filters
- What: Replace static, aggressive filters (e.g., English‑only, repetition thresholds) with adjustable thresholds tied to available compute and model size. For mid‑scale runs, relax or disable some heuristic filters to keep more Common Crawl (CC) data.
- Sectors: Software/AI, MLOps, Cloud providers, Academia
- Tools/products/workflows:
- A “filter threshold scheduler” in data pipelines (e.g., Apache Beam/Spark) that automatically loosens fastText language thresholds, repetition limits, and stop‑word gates as FLOP budgets rise.
- Dashboards that surface compute–performance Pareto frontiers and suggest filter settings given model size M and planned tokens N.
- Assumptions/dependencies:
- Dense transformer pretraining (not MoE) and stability under longer training.
- Adequate validation instrumentation (e.g., NLL on C4/Fineweb-Edu/Cosmopedia).
- Legal/compliance review for broader web usage.
- Leaner data pipelines: prioritize “no‑filter” (plus safety) over heavy quality curation when compute permits
- What: For large enough models and sufficient training steps, training directly on minimally processed CC can outperform filtered datasets; simplify data ingestion by focusing on parsing, deduplication, safety, and decontamination rather than broad “quality” pruning.
- Sectors: Industry LLM teams, Open‑source model builders, Academia
- Tools/products/workflows:
- “Minimal curation” data pipeline templates: HTML parsing, de‑dup, basic PII removal, basic toxicity screens, split management; skip heavy quality‑scoring heuristics.
- Cost calculators comparing curation costs vs. added compute.
- Assumptions/dependencies:
- Compute is not the limiting bottleneck; when compute is tight, filtering still helps.
- Harmful/incorrect content is relatively rare in the corpus used (pre‑2023 CC in the paper); must monitor shift (e.g., AI‑generated content growth).
- Controlled “junk” data injection as a regularizer in pretraining
- What: Introduce small proportions of shuffled‑word documents or random strings into pretraining mixtures; at sufficient model sizes/steps, performance matches or exceeds baselines and can regularize training.
- Sectors: Software/AI, Academia
- Tools/products/workflows:
- Data augmenters that generate shuffled‑word variants of web documents at configurable ratios (e.g., +20% to +400%); automated ablations to monitor NLL.
- Assumptions/dependencies:
- Gains materialize primarily at higher compute and larger models; small models may degrade.
- Maintain factual segments for downstream factual tasks; do not shuffle labeled/structured corpora.
- Budgeting and planning: compute‑driven dataset choices for small/medium labs
- What: If compute is constrained, keep (or strengthen) filters; if compute allows multi‑epoch training on mid‑size pools with models of roughly 330M to 1B parameters or larger, consider relaxing filters to harvest more tokens.
- Sectors: Academia, Startups, Nonprofits
- Tools/products/workflows:
- “Crossing point” estimator that projects required epochs/steps for raw‑pool to beat filtered variants at a given pool size and model size.
- Assumptions/dependencies:
- Use the paper’s observed interactions: larger models reduce the epochs needed for unfiltered data to win; ensure training doesn’t enter unstable high‑epoch regimes without monitoring.
- Re‑prioritize safety and factuality checks over generic “quality” heuristics
- What: Since models tolerate low‑quality/noisy text but are vulnerable to incorrect labels/factual errors, focus data governance on toxicity, PII, deceptive/fabricated content, and known misinformation.
- Sectors: Policy/compliance, Trust & Safety, Healthcare/Finance (regulated)
- Tools/products/workflows:
- Factuality filters/classifiers, adversarial scans for counterfactual assertions, provenance checks, and post‑training alignment pipelines (e.g., RLHF, red‑teaming).
- Assumptions/dependencies:
- Harmful content fraction remains low enough that no‑filter plus safety screens is viable; regulated domains may still need strict domain‑specific filtering.
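One possible shape for the "filter threshold scheduler" mentioned above is sketched below. It is a hypothetical design, not a tool from the paper: the threshold is interpolated linearly in log10(compute) between a strict and a relaxed value, and every number in the example is an assumption. It conditions only on compute, whereas a fuller version would also account for model size.

```python
import math


def scheduled_threshold(flops, strict, relaxed, low_flops, high_flops):
    """Interpolate a filter threshold linearly in log10(compute).

    Below `low_flops` the strict value is used; above `high_flops`
    the relaxed value is used.
    """
    lo, hi = math.log10(low_flops), math.log10(high_flops)
    frac = (math.log10(flops) - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))
    return strict + frac * (relaxed - strict)


# Hypothetical knob: minimum language-ID confidence for keeping a page.
for budget in (1e19, 1e21, 1e23, 1e25):
    t = scheduled_threshold(budget, strict=0.65, relaxed=0.0,
                            low_flops=1e20, high_flops=1e24)
    print(f"{budget:.0e} FLOPs -> keep pages with score >= {t:.2f}")
```

Clamping at both ends keeps the schedule predictable: small runs always get the strict filter, and runs above the upper budget get no filter at all.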
Long‑Term Applications
- Frontier pretraining on full, minimally filtered Common Crawl
- What: For very high compute budgets (projected ~1e30 FLOPs for 240T tokens), training directly on the full CC pool may outperform robustly filtered datasets like RefinedWeb.
- Sectors: Frontier AI labs, Cloud/compute providers
- Tools/products/workflows:
- End‑to‑end “no‑filter” web‑scale data stack focused on scalable parsing, deduplication, safety, and storage; integrated compute planners using token:parameter ratios (e.g., ~600:1) and epoch constraints.
- Assumptions/dependencies:
- Availability of extreme compute and energy; infrastructure for multi‑epoch training at web scale.
- Robust safety, legal, and copyright strategies for large‑scale web content.
- Auto‑curation systems that co‑optimize filters with compute and model size
- What: Integrate scaling laws and crossing‑point predictions into AutoML/MLOps, automatically deciding how much to filter given (M, N, FLOPs).
- Sectors: Software/AI, MLOps platforms
- Tools/products/workflows:
- “Curation policy optimizer” that learns a policy over filter knobs conditioned on compute budgets and target metrics; continuous evaluation against NLL/benchmark proxies.
- Assumptions/dependencies:
- Reliable generalization of scaling laws beyond tested ranges and to varied corpora/modalities.
- Architecture‑aware training that isolates noise (e.g., MoE layers tuned for noise routing)
- What: Develop model architectures or training objectives that explicitly route or compartmentalize noisy tokens, making “no‑filter” regimes effective at lower compute.
- Sectors: Model research, Hardware–software co‑design
- Tools/products/workflows:
- MoE gating policies or auxiliary losses for “noise channeling”; selective attention masks; adapters specialized for noisy spans.
- Assumptions/dependencies:
- Stability of alternative architectures (MoE) at scale; empirical validation that routing reduces interference without hurting useful signals.
- Synthetic–natural mixture design leveraging noise as a regularizer
- What: Purposeful inclusion of low‑structure synthetic text (e.g., shuffled‑word variants) as a training regularizer, combined with curated high‑signal synthetic data.
- Sectors: AI research, Education content generation
- Tools/products/workflows:
- Mixers that target desired unigram/bigram distributions; curriculum schedulers that phase in noise based on training progress.
- Assumptions/dependencies:
- Balance between effective tokens and noise; careful monitoring so noise does not dominate early training or critical domains.
- Sector‑specific data governance shifts from “quality” to “factuality/correctness”
- What: In domains like healthcare/finance, deprioritize generic web “quality” heuristics in favor of correctness and provenance checks, while still harvesting broader unlabeled text to expand coverage.
- Sectors: Healthcare, Finance, Legal/Policy
- Tools/products/workflows:
- Domain‑specific fact‑verification pipelines, citation/provenance enforcement, human‑in‑the‑loop audits; post‑training guardrails for safe generation.
- Assumptions/dependencies:
- High risk tolerance is not acceptable; these sectors still require strict filters for correctness and compliance.
- Compute and energy planning for web‑scale “no‑filter” regimes
- What: Grid/energy‑aware scheduling and carbon‑aware training for extremely long runs that exploit large unfiltered pools.
- Sectors: Energy, Cloud, Sustainability policy
- Tools/products/workflows:
- Carbon dashboards tied to training schedules; location/time‑of‑use optimization; commitments to renewable capacity aligned with forecasted 1e29–1e30 FLOP runs.
- Assumptions/dependencies:
- Availability of low‑carbon power and capacity; regulatory alignment on large‑scale compute use.
- Expanded data access, storage, and legal frameworks for broad web usage
- What: Policies and infrastructure to responsibly host, index, and use large, minimally filtered web corpora.
- Sectors: Policy, Archives/libraries, Cloud storage
- Tools/products/workflows:
- Versioned CC snapshots with metadata/provenance, opt‑out/rights‑management systems, differential privacy for sensitive content.
- Assumptions/dependencies:
- Evolving copyright/licensing norms; public trust and transparency requirements.
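The "compute planner" idea from the first long-term item can be sketched with the 6NM rule and a fixed token:parameter ratio. This is arithmetic on numbers quoted in this summary (~1e30 FLOPs, ~600:1, a 240T-token pool), not the paper's stated configuration. It also assumes the ratio applies to total parameters, whereas the glossary quote refers to non-embedding parameters.

```python
import math


def plan_run(flops_budget, tokens_per_param, pool_tokens):
    """Split a FLOP budget into model size and tokens under C = 6 * N * M.

    With N = tokens_per_param * M, the budget gives
    M = sqrt(C / (6 * tokens_per_param)).
    """
    params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    tokens = tokens_per_param * params
    return {
        "params": params,
        "tokens": tokens,
        "epochs": tokens / pool_tokens,
    }


# Inputs quoted in the text: ~1e30 FLOPs, 600:1 ratio, 240T-token pool.
plan = plan_run(1e30, tokens_per_param=600, pool_tokens=240e12)
print(f"params ~ {plan['params']:.2e}, tokens ~ {plan['tokens']:.2e}, "
      f"epochs ~ {plan['epochs']:.1f}")
```

Under these assumptions the token budget exceeds the pool size many times over, which is why the epoch constraint appears alongside the ratio in the planner description.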
Notes across applications:
- The benefits of “no filter” grow with model size and training steps; when compute is the bottleneck, conventional filtering remains valuable.
- The paper’s positive results rely on relatively low prevalence of actively harmful/misleading data; if future web data shifts (e.g., more AI‑generated or deceptive content), safety and factuality filters become more critical.
- Results were obtained with dense transformers and evaluated mainly with validation loss; downstream task gains generally correlate but should be re‑verified per use case.
Glossary
- 6NM approximation: A rule-of-thumb formula to estimate training compute for LLMs as 6 × (number of parameters) × (number of tokens). "We calculate the compute for a run with the standard 6NM approximation [Kaplan et al., 2020], where N is the number of total training tokens and M is the number of model parameters."
- ARC-Easy: A standardized question-answering benchmark designed to test elementary-level science reasoning. "We also provide results on common benchmarks such as ARC-Easy [Clark et al., 2018] and PIQA [Bisk et al., 2019] in Appendix B."
- Chinchilla-optimal: A scaling prescription that balances model size and data tokens to minimize loss for a fixed compute budget. "falls short of the Chinchilla-optimal token budget for a 1 trillion parameter model, even after accounting for diminishing returns when epoching"
- Common Crawl (CC): A massive web text corpus commonly used for pretraining LLMs. "The standard approach to select pretraining data for LLMs is to filter text from sources like Common Crawl (CC) [Common Crawl, 2024]."
- cosine decay learning rate schedule: A learning rate strategy that decays the rate following a cosine curve, often after a warmup. "Each point consists of a separate training run, with its own warmup and cosine decay learning rate schedule."
- covariate shift: A distribution shift where input feature distributions change between training and evaluation while the conditional label distribution remains the same. "We hypothesize that LLMs are highly resistant to covariate shifts,"
- DCLM-Baseline: A heavily filtered dataset derived from web text with deduplication and quality-based filtering. "DCLM-Baseline. This dataset applies deduplication and quality-based filtering with a fastText classifier to RefinedWeb."
- DCLM-Pool: A large pretraining pool constructed from Common Crawl, used as the unfiltered data source in this work. "the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal as soon as 1e+30 FLOPs."
- deduplication: The process of removing duplicate documents or text spans to reduce redundancy in training data. "This dataset applies deduplication and quality-based filtering with a fastText classifier to RefinedWeb."
- domain adaptation: Techniques for maintaining performance when models are applied to data from different distributions or domains than the training set. "a large body of research has been dedicated to the problem of domain adaptation and learning under distribution shift"
- epoching: Reusing or repeating the same dataset across multiple passes during training (i.e., training for multiple epochs). "even after accounting for diminishing returns when epoching"
- fastText classifier: A lightweight text classification tool that uses bag-of-ngrams and subword information; here used for language ID and quality filtering. "quality-based filtering with a fastText classifier"
- FLOPs: Floating-point operations; a measure of the total computational cost of training a model. "the 240 trillion token Common Crawl pool from DCLM-Pool may become optimal as soon as 1e+30 FLOPs."
- Goodhart's law: The principle that optimizing a proxy metric can cause it to stop reflecting the desired objective. "speculating that this follows from Goodhart's law [1984]"
- Llama-style dense transformer: A transformer architecture akin to LLaMA, using fully dense (non-expert) layers for all tokens. "Our models are Llama-style dense transformers ranging from 15 million to 7 billion parameters,"
- low-rank matrix factorization: A modeling approach that represents data or transformations with matrices of limited rank to capture core structure. "In low-rank matrix factorization, the simplest 1 hidden layer (linear) neural network, we see exactly this behavior at the population level."
- Mixture of Experts (MoEs): Architectures that route tokens to specialized expert subnetworks, potentially improving parameter efficiency but adding training instability. "There may be more unstable architectures such as Mixture of Experts models (MoEs), or phenomena in later stages of training,"
- negative log-likelihood (NLL): A loss function equivalent to cross-entropy for language modeling, measuring how well the model predicts observed tokens. "Our main metrics of interest are the loss (negative log-likelihood) on various datasets,"
- Pareto frontier: The set of choices that are optimal trade-offs between two competing objectives, here compute and performance. "we take the same runs from Figure 1 and derive a compute-performance Pareto frontier."
- perplexity-based filters: Data selection rules that use model perplexity as a proxy for quality or difficulty, often to remove noisy text. "using looser perplexity-based filters mitigates data scarcity."
- RefinedWeb: A curated web dataset built with cleaning and filtering steps intended to improve quality for LLM pretraining. "Refined Web. This consists of the filters above along with other similar filters, in an attempt to reproduce the RefinedWeb dataset [Penedo et al., 2023]."
- scaling laws: Empirical relationships describing how model performance depends on data, model size, and compute. "Muennighoff et al. [2025] derive scaling laws that factor data repetition into the original Chinchilla scaling laws,"
- stop words: Common function words (e.g., “the”, “and”) often used as heuristics in filtering or preprocessing. "This filter ensures that a document contains at least 2 occurrences of English stop words"
- tokens-per-parameter ratio: A guideline relating total training tokens to the number of (non-embedding) model parameters for efficient training. "specifying a token-to-non-embedding-parameter ratio (600:1, following DeepSeek V4)."
- unigram distribution: The frequency distribution over individual words independent of their order. "despite only the unigram distribution of the documents remaining intact."
- weight decay: An L2-regularization technique that penalizes large weights to improve generalization during training. "For each of the models, we tune the training step count and weight decay,"