RefinedWeb Dataset: Scalable Web Data for LMs

Updated 27 July 2025
  • RefinedWeb is a large-scale, filtered and deduplicated web text dataset built from Common Crawl to support language model pretraining.
  • Its pipeline employs sequential filtering, content extraction, and multi-level deduplication to remove low-quality and redundant data.
  • Empirical evaluations show models trained on RefinedWeb achieve comparable or superior performance to those using curated corpora.

RefinedWeb is a large-scale, filtered, and deduplicated web text dataset constructed from Common Crawl for training high-capacity LLMs. Distinguished by its focus on comprehensive filtering and deduplication rather than traditional curation from high-quality corpora, RefinedWeb serves as a test case for whether state-of-the-art LLMs can be trained to high performance using processed web data exclusively. The dataset’s pipeline processes trillions of tokens, and its empirical performance on benchmark evaluations matches or surpasses models trained on curated sources, demonstrating new scalability and accessibility paradigms for LLM pretraining (Penedo et al., 2023).

1. Dataset Construction and Processing Pipeline

RefinedWeb is built solely from Common Crawl, ultimately comprising approximately 5 trillion tokens, with a public release of 600 billion tokens. The construction pipeline—denoted MacroData Refinement (MDR)—operates on raw HTML (WARC files) rather than extracted text files (WET).

The pipeline consists of sequential, modular stages:

  • URL filtering and domain blocklisting: A maintained blocklist with ~4.6 million domains, plus a scoring system based on matching “soft,” “hard,” and “strict” violation word-lists, eliminates disallowed or low-value sites. High-quality domains (Wikipedia, arXiv, etc.) are specifically excluded to avoid overlap with curated corpora.
  • Content extraction: The trafilatura library is used for main content extraction, after which regular-expression-based normalization limits excessive newlines and strips long, extraneous URL sequences.
  • Language filtering: Using the fastText language classifier (as in CCNet), documents whose top-language score falls below a threshold of 0.65 are dropped.
  • Heuristic quality checks: Documents are filtered for repetition (using patterns inspired by Rae et al., 2021), anomalous character or word ratios, and minimum/maximum length heuristics. A line-wise filter removes boilerplate such as navigation items, social counters, and similar artifacts; documents in which more than 5% of lines are affected are dropped entirely (a minimal sketch of these extraction and filtering steps follows this list).
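
The following Python sketch illustrates how the extraction, language-filtering, and line-wise steps above can be composed. The regex patterns, thresholds, and helper names are illustrative assumptions rather than the exact MDR implementation; only the overall ordering (extract, then filter by language, then apply line-wise corrections) mirrors the pipeline described above.

```python
import re

import fasttext      # language identification, as in CCNet
import trafilatura   # main-content extraction from raw HTML

# Illustrative assumptions: the model path, threshold, and line patterns
# below are placeholders, not the exact MDR configuration.
LID_MODEL = fasttext.load_model("lid.176.bin")   # fastText language-ID model
LANG_THRESHOLD = 0.65                            # drop documents scoring below this
LINE_PATTERNS = [                                # boilerplate-like line markers
    re.compile(r"^\s*(home|log ?in|sign ?up|share|subscribe)\s*$", re.I),
    re.compile(r"^\s*\d+\s+(likes?|comments?|views?)\s*$", re.I),
]

def extract_text(html: str) -> str | None:
    """Pull main content from raw WARC HTML and normalize whitespace."""
    text = trafilatura.extract(html)
    if not text:
        return None
    return re.sub(r"\n{3,}", "\n\n", text)       # limit excessive newlines

def passes_language_filter(text: str, lang: str = "en") -> bool:
    """Keep a document only if its top language matches and is confident enough."""
    labels, scores = LID_MODEL.predict(text.replace("\n", " "))
    return labels[0] == f"__label__{lang}" and scores[0] >= LANG_THRESHOLD

def linewise_filter(text: str, max_modified_frac: float = 0.05) -> str | None:
    """Drop boilerplate lines; discard the document if >5% of its lines change."""
    lines = text.splitlines()
    kept = [l for l in lines if not any(p.match(l) for p in LINE_PATTERNS)]
    if lines and (len(lines) - len(kept)) / len(lines) > max_modified_frac:
        return None                              # too heavily modified: drop it
    return "\n".join(kept)

def process_document(html: str) -> str | None:
    """Extraction -> language filter -> line-wise filter, in MDR's order."""
    text = extract_text(html)
    if text is None or not passes_language_filter(text):
        return None
    return linewise_filter(text)
```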

Extensive deduplication is introduced at multiple levels:

  • Fuzzy deduplication: A MinHash scheme over word 5-grams (n=5) computes 9,000 hashes per document, arranged as 20 buckets of 450 hashes each (b=20, r=450). Documents whose signatures agree on every hash within at least one bucket are flagged as potential duplicates; this approximates the Jaccard index and catches even template-based or lightly paraphrased copies (a minimal sketch of this banding scheme follows the list).
  • Exact substring deduplication: After MinHash screening, duplicate exact substrings of length ≥50 tokens are excised using a suffix array (Manber & Myers, 1993). The preferred “cut” variant physically removes these spans from text; alternatives (masking, full document removal) are described but not implemented in the final pipeline.
  • URL deduplication: URLs already processed in earlier Common Crawl dumps are detected and filtered out, avoiding double-counting of pages across snapshots.
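
A minimal, self-contained sketch of the MinHash banding scheme is given below. The parameters are shrunk (20 buckets of 5 hashes) so the example runs instantly; the production configuration described above uses 9,000 hashes arranged as 20 buckets of 450. The helper names (`shingles`, `minhash_signature`, `lsh_buckets`) are illustrative, not from the paper.

```python
import hashlib
from collections import defaultdict

# Illustrative, shrunken parameters; the production pipeline uses
# 9,000 hashes arranged as 20 buckets of 450 hashes each.
NUM_BUCKETS, ROWS_PER_BUCKET = 20, 5
NUM_HASHES = NUM_BUCKETS * ROWS_PER_BUCKET

def shingles(text: str, n: int = 5) -> set[str]:
    """Word 5-grams: the unit over which Jaccard similarity is approximated."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(text: str) -> list[int]:
    """One minimum hash per seed; matching entries estimate Jaccard overlap."""
    sh = shingles(text)
    return [
        min(int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in sh)
        for seed in range(NUM_HASHES)
    ]

def lsh_buckets(docs: dict[str, str]) -> dict[tuple, set[str]]:
    """Group documents whose signatures agree on *all* rows of some bucket."""
    buckets: dict[tuple, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(text)
        for b in range(NUM_BUCKETS):
            band = tuple(sig[b * ROWS_PER_BUCKET:(b + 1) * ROWS_PER_BUCKET])
            buckets[(b, band)].add(doc_id)
    # Any bucket holding more than one document flags a candidate-duplicate group.
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

In the full pipeline, fuzzy deduplication of this kind is followed by the exact-substring and URL stages described above.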

2. Empirical Performance and Comparison with Curated Corpora

RefinedWeb-trained LLMs are systematically evaluated against models trained on standard curated datasets such as The Pile. Across both small- and large-scale (e.g., 1B and 7B parameter) settings:

  • Zero-shot accuracy: Models pretrained on RefinedWeb outperform or match those trained on The Pile across a battery of tasks and task aggregates (LAMBADA, PIQA, HellaSwag, Winogrande, and the “main,” “core,” and “ext” groupings).
  • Perplexity (bits-per-byte): RefinedWeb-based models achieve lower bits-per-byte at equivalent compute budgets (see the formula after this list).
  • Ablation studies: Each MDR component (URL filtering, content extraction, line-wise and document-wise filtering, deduplication) is shown to independently and cumulatively improve downstream performance.
  • Experiments substituting RefinedWeb for curated data in popular pretraining recipes reveal that models achieve or surpass GPT-3 performance (under similar computational constraints).
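
For reference, bits-per-byte can be derived from the model's average per-token cross-entropy and the raw byte length of the evaluated text; the formula and notation below are a standard definition supplied for clarity, not taken verbatim from the paper.

```latex
% Bits-per-byte from average per-token cross-entropy (in nats), where
% N_tok tokens cover N_bytes bytes of raw text; notation is illustrative.
\[
  \mathrm{bpb}
    \;=\; \frac{N_{\mathrm{tok}}}{N_{\mathrm{bytes}}}
          \cdot \frac{\mathcal{L}_{\mathrm{CE}}}{\ln 2},
  \qquad
  \mathcal{L}_{\mathrm{CE}}
    \;=\; -\frac{1}{N_{\mathrm{tok}}} \sum_{i=1}^{N_{\mathrm{tok}}}
          \ln p_\theta\!\left(x_i \mid x_{<i}\right).
\]
```

Lower values indicate that the model compresses the held-out text more efficiently, which makes the metric comparable across tokenizers.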

These findings contradict the notion that curated, “high-quality” datasets are required for modern LLM pretraining at scale.

3. Scalability, Efficiency, and Industrial Deployment

The RefinedWeb pipeline is engineered for scalability:

  • Distributed processing: Processing is performed on distributed CPU clusters (thousands of vCPUs) and large-memory instances (e.g., AWS x2iedn, with ≥2 TiB RAM), capable of handling all Common Crawl data since 2008.
  • Data volume reduction: Deduplication removes 45–75% of raw candidate data, enabling the extraction of ≥5T unique tokens while controlling for redundancy.
  • Resource targets: The pipeline is documented to operate at global web scale, supplying the tens to hundreds of billions of training tokens consumed by extremely large LLMs.
  • The release of 600B tokens provides a public subset, but the processing methods demonstrate viability for continuous extension with future Common Crawl dumps.

The design supports both research-grade experimentation and industrial-scale LLM pipeline deployment.

4. Comparison with Other Web-Scale Corpora

RefinedWeb occupies a specific niche in the landscape of large web-derived training corpora:

  • Versus WanJuan-CC: WanJuan-CC applies more aggressive document removal, discarding up to 90.2% of data via LSH deduplication, and integrates model-based safety/quality filtering (covering toxicity, pornography, and PII masking); RefinedWeb’s approach is less aggressive, targeting a larger total token count with somewhat less rigorous removal of unsafe or low-quality data. WanJuan-CC achieves lower safety-AUC metrics and slightly superior performance (e.g., lower PPL, higher LAMBADA and SuperGLUE scores) on standardized validation tests (Qiu et al., 29 Feb 2024).
  • Versus Zyda: Zyda integrates RefinedWeb as one of several constituent datasets but applies cross-dataset, high-aggression deduplication (LSH on 13-grams with aggressive thresholds) and a unified, manually tuned filtering pipeline, resulting in a 1.3T-token dataset with wider coverage and improved performance on various language modeling benchmarks (Tokpanov et al., 4 Jun 2024).
  • Versus FineWeb: FineWeb relies on per-snapshot MinHash deduplication and custom filtering heuristics over 96 Common Crawl snapshots, producing a 15T-token dataset. FineWeb and its educational subset (FineWeb-Edu) outperform other open datasets on domain-specific benchmarks, indicating the critical impact of detailed filtering and domain-oriented selection (Penedo et al., 25 Jun 2024).
  • Bias assessment: Despite using similar filtering and extraction procedures to C4, DolmaCC, and RedPajama-V2, RefinedWeb maintains a “fingerprint” bias detectable by neural classifiers, traceable to subtle vocabulary, formatting, and content distribution differences. Models trained exclusively on RefinedWeb propagate these patterns into their outputs, which can have implications for generalization and fairness (Mansour et al., 3 Dec 2024).

5. Influence on Model Architectures and Training Methodologies

The large scale and high diversity of RefinedWeb have enabled a wide range of architectural experimentation:

  • Efficient Transformers: Low-rank feedforward networks (FFNs) trained on RefinedWeb achieve a 2.6× speed-up at a 32% parameter budget while maintaining comparable perplexity, with steeper loss-scaling curves than dense baselines, indicating superior scaling efficiency for structured architectures (Wei et al., 13 Jul 2024); a minimal sketch of a low-rank FFN follows this list.
  • Novel attention mechanisms: Models like PLDR-LLM leverage Power Law Graph Attention and DAG regularization, taking advantage of the clean, ample RefinedWeb token supply to explore non-standard inductive biases and supervision regimes (Gokden, 22 Oct 2024).
  • Scaling law investigation: Training curves for models of varied sizes confirm that the large, unique, and relatively deduplicated content of RefinedWeb is appropriate for investigating data–performance scaling behaviors at high parameter counts (Penedo et al., 2023).
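
As a concrete illustration of the low-rank FFN idea referenced in the first bullet above, the PyTorch sketch below factorizes each feedforward projection through a narrow rank bottleneck. The dimensions, rank, and class names are illustrative assumptions, not the cited paper's implementation.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """A d_in x d_out projection stored as two thin matrices, cutting the
    parameter count from d_in*d_out to roughly rank*(d_in + d_out)."""
    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)   # d_in -> rank
        self.up = nn.Linear(rank, d_out, bias=True)     # rank -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))

class LowRankFFN(nn.Module):
    """Transformer feedforward block with both projections low-rank factorized."""
    def __init__(self, d_model: int = 768, d_ff: int = 3072, rank: int = 256):
        super().__init__()
        self.fc1 = LowRankLinear(d_model, d_ff, rank)
        self.fc2 = LowRankLinear(d_ff, d_model, rank)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

# Rough parameter comparison against a dense FFN of the same shape (bias ignored).
dense_params = 2 * 768 * 3072
low_rank_params = 2 * 256 * (768 + 3072)
print(f"low-rank / dense parameter ratio: {low_rank_params / dense_params:.2f}")
```

Whether such a factorization preserves quality depends on the chosen rank and training budget; the cited results report that trade-off empirically rather than guaranteeing it.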

Experiments demonstrate that models trained on RefinedWeb respond consistently to efficiency and regularization strategies, making the dataset well suited for methodological studies.

6. Applications, Accessibility, and Limitations

RefinedWeb carries several practical implications:

  • Wide applicability: The dataset supports autoregressive language modeling at scale, zero/few-shot evaluation, and potentially retrieval-augmented pretraining, given its alignment with real-world web distributions (Penedo et al., 2023).
  • Open access: A 600B-token subset is made publicly available, supporting model training, benchmarking, and reproducibility without the need for proprietary curation resources.
  • Safety and bias: While effective as a general corpus, RefinedWeb does not implement the most aggressive safety and PII-masking strategies (unlike datasets such as WanJuan-CC). Models trained solely on this data may reflect and propagate subtle formatting, vocabulary, and topical biases specific to the dataset’s processing pipeline (Mansour et al., 3 Dec 2024). When output safety, domain specificity, or fairness is paramount, careful downstream filtering or the use of alternatively filtered datasets is advised.
  • Future prospects: Ongoing work explores multilingual filtering, refined heuristic pipelines, and deduplication strategies (especially relating to repeated-epoch regimes in multi-hundred-billion-token training settings).

RefinedWeb’s methodology demonstrates that, with robust filtering and deduplication, very large-scale language modeling can rely exclusively on processed web data, reducing dependence on curated specialty corpora and enabling routine experimentation in the trillion-token regime.