This paper introduces REWIRE (REcycling the Web with guIded REwrite), a method to address the "data wall" problem in pre-training LLMs, where the growth of high-quality public text data is not keeping pace with the demand from increasingly larger models. The core idea is to "recycle" web documents that are typically discarded by stringent quality filters by using an LLM to rewrite and improve them.
Problem: LLM performance scales with model and data size. However, high-quality text data is limited, and aggressive filtering often discards up to 99% of crawled web data. This creates a bottleneck for training larger and more capable LLMs.
Proposed Solution: REWIRE
The REWIRE pipeline aims to enhance both the quality and quantity of pre-training data through the following steps (a code sketch of the full generate-then-filter loop follows this list):
- Starting Material: Taking "moderate-quality" web documents (e.g., those that pass initial rule-based filters like RefinedWeb heuristics but might be discarded by stricter model-based filters).
- Guided Rewriting: Using a powerful LLM (Llama-3.3-70B-Instruct in this case) to perform chain-of-thought reasoning on the original document. This involves identifying the document's purpose, reasoning about steps to achieve that purpose, and then generating an improved version. The process aims to make the content more coherent, elaborate, and informative.
- Quality Control for Synthetic Data: Applying a model-based filter (a fastText classifier trained to distinguish high-quality synthetic/human text from lower-quality rewritten outputs) to the generated rewritten documents. Only the top-scoring rewritten texts are retained.
- Data Mixing: Combining these high-quality rewritten texts with high-quality raw web texts (e.g., DCLM-Baseline) to form the final pre-training dataset.
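As promised above, here is a minimal sketch of the generate-then-filter loop. It is illustrative, not the authors' released code: the prompt is a paraphrase of the guided rewriting instruction, and the classifier path and label names are assumptions. Only the overall structure (guided rewrite, fastText scoring, top-fraction retention) follows the paper's description.

```python
# Illustrative sketch of the REWIRE generate-then-filter loop.
# Prompt wording, classifier path, and label names are assumptions.
import fasttext
from vllm import LLM, SamplingParams

GUIDED_REWRITE_PROMPT = """\
First, identify the purpose of the document below. Then reason step by
step about how to achieve that purpose more coherently and informatively.
Finally, write the improved version of the document.

Document:
{document}
"""

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=4096)
scorer = fasttext.load_model("rewire_quality.bin")  # hypothetical path

def quality_score(text: str) -> float:
    """Probability the classifier assigns the (assumed) high-quality label."""
    labels, probs = scorer.predict(text.replace("\n", " "))  # fastText needs one line
    return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]

def rewire(documents: list[str], keep_fraction: float = 0.10) -> list[str]:
    """Rewrite moderate-quality docs, then keep the top-scoring fraction."""
    prompts = [GUIDED_REWRITE_PROMPT.format(document=d) for d in documents]
    outputs = llm.generate(prompts, params)
    rewritten = [o.outputs[0].text for o in outputs]
    ranked = sorted(rewritten, key=quality_score, reverse=True)
    # The kept texts are then mixed ~1:1 with high-quality raw web text.
    return ranked[: int(len(ranked) * keep_fraction)]
```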
Experimental Setup:
- Data Pool: Experiments start with DCLM-RefinedWeb (Common Crawl data with RefinedWeb heuristics and global deduplication).
- Model Scales: Experiments were conducted with Llama-2 architecture models at 1B, 3B, and 7B parameter scales.
- Training Budget: Followed DCLM token budgets (e.g., 28.8B tokens for the 1B model), simulating a "long token horizon" where data is limited and must be repeated (up to 4 epochs in the main experiments, more in the appendix; see the arithmetic sketch after this list).
- Evaluation: Performance was measured on MMLU (5-shot) and the CORE benchmark (average over 22 tasks from DataComp-LM).
- Baselines:
- Raw text alone (DCLM-Baseline-style filtering at top 10%, plus a top-20% variant)
- Rewritten text alone (top 10%)
- PreSelect (another high-quality raw text selection method)
- Nemotron-CC synthetic data variations (HQ diverse QAs, HQ extracted knowledge, MQ Wikipedia rephrasing)
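To make the long-token-horizon setting concrete: with the 1B-scale budget repeated for up to 4 epochs, the unique data pool only needs to cover about a quarter of the budget. The implied pool size below is our back-of-the-envelope arithmetic, not a figure reported in the paper.

```python
# Back-of-the-envelope for the 1B-model setting (our arithmetic).
token_budget = 28.8e9        # DCLM training budget at the 1B scale
max_epochs = 4               # repetition used in the main experiments
unique_tokens = token_budget / max_epochs
print(f"unique tokens needed: {unique_tokens / 1e9:.1f}B")  # 7.2B
```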
Key Results:
- Performance Boost: Mixing high-quality raw texts with REWIRE-generated texts consistently outperformed training on only filtered web data. Improvements on CORE were 1.0, 1.3, and 2.5 percentage points for 1B, 3B, and 7B models, respectively. MMLU performance also saw significant gains.
- Equivalent to More Raw Data: The combination of raw and REWIRE texts achieved performance comparable to, or even exceeding, training on twice the amount of high-quality raw web documents. This suggests REWIRE can effectively "double" the token yield from a limited pool.
- Outperforms Other Synthetic Methods: REWIRE, when mixed with raw text, generally yielded better average performance on CORE compared to mixing raw text with Nemotron-CC's synthetic data variants (extracted knowledge, Wikipedia rephrasing). Nemotron-CC's diverse QAs were particularly effective for MMLU, likely due to format alignment.
- Recycling Low-Quality Data: About 81.7% of the documents in the REWIRE-generated set (top 10%) originated from raw documents that would have been discarded by the raw text (top 10%) filter. There was little correlation (Spearman correlation = 0.179) between the quality score of an original document and its rewritten version, indicating REWIRE effectively transforms lower-quality documents into high-quality training material.
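A sketch of how these two statistics could be computed from parallel quality scores (variable names and score sources are placeholders; the scores would come from the respective fastText filters described above):

```python
# Sketch: measure how much REWIRE "recycles" documents the raw filter
# would discard, and how weakly original quality predicts rewritten quality.
from scipy.stats import spearmanr

def recycling_stats(orig_scores, rewritten_scores, keep_fraction=0.10):
    """Both lists are parallel: index i scores the same document before
    and after rewriting."""
    n = len(orig_scores)
    n_keep = int(n * keep_fraction)
    by_orig = sorted(range(n), key=lambda i: orig_scores[i], reverse=True)
    by_rew = sorted(range(n), key=lambda i: rewritten_scores[i], reverse=True)
    top_orig = set(by_orig[:n_keep])
    # Share of kept rewrites whose originals the raw filter would drop.
    recycled = sum(i not in top_orig for i in by_rew[:n_keep]) / n_keep
    rho, _ = spearmanr(orig_scores, rewritten_scores)
    return recycled, rho  # paper reports ~0.817 and rho ~ 0.179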
Rewriting Quality Analysis:
- Semantic Similarity: REWIRE-generated texts maintain semantic similarity with the original documents but also introduce modifications and new information, unlike simple paraphrasing. Cosine similarity analysis showed REWIRE texts differ more from their originals than Wikipedia-style rephrasing does (this check and the diversity metric below are sketched in code after this list).
- Text Diversity:
- N-gram diversity: REWIRE texts were more diverse than Nemotron-CC's extracted knowledge and diverse QAs, and comparable to Wikipedia rephrasing, though still less diverse than raw web text.
- Embedding visualization (t-SNE): REWIRE texts formed a distinct cluster from raw texts and Wikipedia rephrasing, suggesting they increase data coverage.
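Both text-level checks could look roughly like the following; the embedding model is our choice, as the paper does not mandate a specific encoder.

```python
# Sketch of the two checks: embedding cosine similarity to the original
# document, and n-gram diversity (unique n-grams / total n-grams).
from itertools import islice
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # encoder choice is ours

def cosine_to_original(original: str, rewritten: str) -> float:
    a, b = embedder.encode([original, rewritten], convert_to_tensor=True)
    return util.cos_sim(a, b).item()

def ngram_diversity(text: str, n: int = 3) -> float:
    tokens = text.split()
    ngrams = list(zip(*(islice(tokens, i, None) for i in range(n))))
    return len(set(ngrams)) / max(len(ngrams), 1)
```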
Practical Implications and Implementation Details:
- Prompting Strategy: The guided rewriting uses a detailed prompt instructing the LLM to perform meta-reasoning (understand task, break down problem, plan solution) before generating the improved text. The full prompt is provided in the appendix.
- Filtering Synthetic Data: A crucial step is filtering the rewritten documents. The paper trains a fastText classifier specifically for this, using positive examples such as OpenHermes 2.5 and ELI5 subreddit posts, and negative examples from random REWIRE generations. Aggressive filtering (e.g., keeping only the top 10%) is important (see the training sketch after this list).
- Computational Cost: Generating REWIRE data is computationally intensive due to the use of a large LLM (Llama-3.3-70B-Instruct). Generating 100B tokens took approximately 88K H100 GPU hours. The authors argue this cost can be amortized by using the data for multiple models and epochs.
- Mixing Ratio: While a 1:1 mix of raw and rewritten data was effective, tuning the mixing ratio can matter, especially at larger scales (e.g., a 60% raw / 40% rewritten mix yielded better CORE performance at the 3B scale in one experiment).
- Truthfulness: Evaluations on TruthfulQA and DCLM World Knowledge tasks showed that adding REWIRE data improved, rather than harmed, the resulting model's truthfulness and knowledge capabilities.
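A minimal sketch of training the synthetic-data filter with the fastText library, as promised above. File layout, label names, and hyperparameters are assumptions; only the data sources follow the paper's description.

```python
# Sketch: train the synthetic-data quality filter. Positives are drawn
# from sources like OpenHermes 2.5 and ELI5 posts; negatives are random
# REWIRE generations. Labels and hyperparameters here are illustrative.
import fasttext

# train.txt holds one single-line document per row, label first, e.g.
#   __label__hq <high-quality instruction/answer or ELI5 text ...>
#   __label__lq <randomly sampled REWIRE generation ...>
model = fasttext.train_supervised(
    input="train.txt",
    lr=0.1,
    epoch=5,
    wordNgrams=2,  # include bigram features
)
model.save_model("rewire_quality.bin")

# Scoring a rewritten document (fastText requires single-line input).
rewritten_doc = "An improved web document produced by the rewriting step."
labels, probs = model.predict(rewritten_doc.replace("\n", " "))
hq_score = probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]
```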
Limitations:
- Generation Cost: High cost due to using a large LLM for rewriting.
- Hallucination Risk: Allowing the LLM to modify content introduces a risk of hallucination, though the truthfulness evaluations above did not surface harm. Future work could add further verification filters.
Future Work:
- Exploring different filtering strategies for rewritten data (e.g., diversity-aware sampling).
- Developing methods to select synthetic data complementary to existing training sets.
- Extending REWIRE with fine-grained controls to promote diversity in text dimensions (style, format, skills).
Conclusion:
REWIRE presents a simple and effective method to enhance pre-training data by recycling and improving lower-quality web documents. It helps address the "data wall" by generating high-quality synthetic data that complements existing raw data, leading to improved LLM performance. The method is general-purpose and complementary to domain-specific synthetic data generation techniques.