The RefinedWeb Dataset for Falcon LLM
The paper presents the RefinedWeb dataset, a web-only pretraining corpus developed for training Falcon LLM, challenging the prevailing notion that curated corpora are essential for effective LLM training. This work demonstrates that properly filtered and deduplicated web data can produce models with performance competitive with, or superior to, models trained on curated datasets like The Pile.
Key Contributions
- Introduction of RefinedWeb: The authors propose RefinedWeb, an extensive English pretraining dataset derived solely from web data, comprising roughly five trillion tokens in total.
- Outperforming Curated Datasets: Evidence is presented that LLMs trained exclusively on RefinedWeb outperform those trained on curated corpora, as measured by zero-shot benchmarks.
- Deduplication and Filtering: The work emphasizes the importance of large-scale deduplication and filtering. The MacroData Refinement (MDR) pipeline applies multiple stages, including preprocessing, language identification, and both fuzzy and exact deduplication, removing nearly 90% of the original web content to produce a high-quality dataset (a minimal pipeline sketch follows this list).
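As a concrete illustration of what such a staged filtering pipeline can look like, here is a minimal Python sketch; it is not the authors' MDR code. The fastText model path (`lid.176.bin`), the heuristic thresholds, and the helper names (`keep_language`, `keep_quality`, `refine`) are assumptions chosen purely for demonstration.

```python
# Illustrative multi-stage filtering sketch (not the paper's MDR implementation).
import re
import fasttext  # pip install fasttext

# Assumed local copy of a fastText language-identification model.
LID_MODEL = fasttext.load_model("lid.176.bin")

def keep_language(text, lang="en", min_confidence=0.65):
    """Stage: keep documents confidently identified as the target language."""
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    return labels[0] == f"__label__{lang}" and probs[0] >= min_confidence

def keep_quality(text):
    """Stage: coarse heuristic quality filters (illustrative thresholds only)."""
    words = text.split()
    if len(words) < 50:                                    # too short to be useful
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:                       # likely gibberish or markup
        return False
    symbol_ratio = len(re.findall(r"[#{}<>|]", text)) / max(len(text), 1)
    return symbol_ratio < 0.1                              # drop markup-heavy pages

def refine(documents):
    """Run documents through the staged filters, yielding survivors."""
    for doc in documents:
        if keep_language(doc) and keep_quality(doc):
            yield doc
```

In a full pipeline, exact and fuzzy deduplication would run after these filters; a fuzzy-deduplication sketch appears in the next section.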
Strong Numerical Results
Models pretrained on RefinedWeb demonstrate superior zero-shot performance across a range of tasks, outperforming models trained on The Pile and other notable datasets. The deduplication strategy, which combines MinHash and suffix-array techniques, effectively reduces redundancy and consistently improves model performance across evaluation suites.
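To make the fuzzy-deduplication step concrete, below is a minimal sketch using MinHash with locality-sensitive hashing via the `datasketch` library; the similarity threshold, shingle size, and toy documents are illustrative assumptions, and the suffix-array exact-substring removal that the paper pairs with MinHash is not reproduced here.

```python
# Minimal fuzzy-deduplication sketch with MinHash + LSH (illustrative only).
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128, ngram=5):
    """Build a MinHash signature over word n-grams (shingles)."""
    m = MinHash(num_perm=num_perm)
    words = text.lower().split()
    for i in range(max(len(words) - ngram + 1, 1)):
        shingle = " ".join(words[i:i + ngram])
        m.update(shingle.encode("utf-8"))
    return m

def deduplicate(docs, threshold=0.8, num_perm=128):
    """Keep only documents whose signatures match no previously kept document."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, doc in enumerate(docs):
        sig = minhash(doc, num_perm=num_perm)
        if not lsh.query(sig):          # no near-duplicate among kept docs
            lsh.insert(str(idx), sig)
            kept.append(doc)
    return kept

# Toy example: doc_b is a near-duplicate of doc_a, doc_c is unrelated.
base = [f"token{i}" for i in range(200)]
doc_a = " ".join(base)
doc_b = " ".join(base[:-5] + ["changed"] * 5)
doc_c = " ".join(f"other{i}" for i in range(200))
print(len(deduplicate([doc_a, doc_b, doc_c])))  # expect 2: doc_b collapses into doc_a
```

In practice, thresholds and shingle sizes strongly affect how aggressively near-duplicates are removed, which is one reason deduplication choices matter for downstream model quality.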
Implications and Directions
The research challenges current approaches to LLM training, suggesting that scalable web-derived datasets can meet or exceed the performance of curated sources. This finding matters because the demand for training data grows rapidly as models scale.
The practical implication is the potential to streamline the data pipeline, reducing dependence on manually curated datasets, which are labor-intensive and difficult to scale. Theoretically, the results call into question existing assumptions about data-quality hierarchies in LLM training.
Future Developments in AI
Future work could further explore the general applicability of MDR to other data domains and languages, as well as investigate deduplication's role in data efficiency. Additionally, understanding how varying qualities of web data impact emergent model behaviors could refine pretraining processes.
This paper contributes significantly to the dialogue on scaling pretraining data, suggesting that a shift toward raw web data, combined with robust cleaning techniques, can be an effective path forward for building large-scale, high-performance LLMs.