The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (2306.01116v1)

Published 1 Jun 2023 in cs.CL and cs.AI

Abstract: LLMs are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable is curation and whether we will run out of unique high-quality data soon. At variance with previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3/7.5B parameters LLMs trained on it.

The RefinedWeb Dataset for Falcon LLM

The paper presents the RefinedWeb dataset, a web-only pretraining corpus developed for training Falcon LLM, challenging the prevailing notion that curated corpora are essential for effective LLM training. This work demonstrates that properly filtered and deduplicated web data can produce models whose performance is competitive with, if not superior to, models trained on curated datasets such as The Pile.

Key Contributions

  1. Introduction of RefinedWeb: The authors propose RefinedWeb, an extensive English pretraining dataset derived solely from web data, showcasing its scale with a total of five trillion tokens.
  2. Outperforming Curated Datasets: Evidence is presented that LLMs trained exclusively on this dataset outperform those trained on curated corpora, as validated by zero-shot benchmarks.
  3. Deduplication and Filtering: The work emphasizes the importance of large-scale deduplication and filtering. The authors' pipeline, MacroData Refinement (MDR), proceeds through multiple stages, including preprocessing, language identification, and both fuzzy and exact deduplication, removing nearly 90% of the original web content to produce a high-quality dataset (a minimal pipeline sketch follows this list).
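
The summary does not reproduce the MDR code, so the following is only a minimal Python sketch of the general shape of such a pipeline: heuristic filtering stages followed by exact deduplication via content hashing. All function names, heuristics, and thresholds here are illustrative assumptions, not the authors' actual implementation.

```python
import hashlib
import re

def language_filter(doc: str, min_english_ratio: float = 0.8) -> bool:
    """Crude stand-in for a language-identification stage (real pipelines
    typically use a trained classifier): keep mostly-ASCII documents."""
    ascii_chars = sum(1 for c in doc if ord(c) < 128)
    return len(doc) > 0 and ascii_chars / len(doc) >= min_english_ratio

def quality_filter(doc: str, min_words: int = 50, max_symbol_ratio: float = 0.1) -> bool:
    """Document-level heuristics: drop very short or symbol-heavy pages."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for c in doc if c in "#{}<>|")
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def exact_dedup_key(doc: str) -> str:
    """Hash of whitespace-normalized, lowercased text, used to drop exact duplicates."""
    normalized = re.sub(r"\s+", " ", doc.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def refine(raw_docs):
    """Apply the filtering stages, then exact deduplication, in order."""
    seen = set()
    for doc in raw_docs:
        if not language_filter(doc) or not quality_filter(doc):
            continue
        key = exact_dedup_key(doc)
        if key in seen:  # exact duplicate of an earlier document: skip
            continue
        seen.add(key)
        yield doc

docs = ["Example web page text " * 20, "Example web page text " * 20, "短い非英語のページ"]
print(len(list(refine(docs))))  # 1: the exact duplicate and the non-English page are dropped
```
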

Strong Numerical Results

Models pretrained on RefinedWeb demonstrate superior zero-shot performance across a range of tasks, outperforming models trained on The Pile and other curated corpora. The deduplication strategy, which combines MinHash-based fuzzy matching with exact substring removal via suffix arrays, effectively reduces redundancy and consistently improves model performance across datasets.
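
As a rough illustration of the fuzzy half of that strategy, the snippet below estimates Jaccard similarity between two documents with a hand-rolled MinHash over word shingles. The shingle size, permutation count, and similarity threshold are illustrative assumptions rather than the paper's settings, and the exact-substring (suffix array) stage is omitted.

```python
import hashlib
import random

def shingles(text: str, n: int = 5) -> set:
    """Word n-grams ("shingles") used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(features: set, num_perm: int = 128, seed: int = 0) -> list:
    """One minimum hash value per salted hash function; similar documents
    share a large fraction of signature slots."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{salt}:{f}".encode()).digest()[:8], "big")
            for f in features
        )
        for salt in salts
    ]

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"

sig_a = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))
# Pairs above a chosen similarity threshold (e.g. ~0.8) would be treated as near-duplicates.
print(f"estimated Jaccard similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```
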

Implications and Directions

The research challenges current approaches to LLM pretraining, showing that scalable, web-derived datasets can match or exceed the performance of curated sources. This finding is significant because the demand for training data grows rapidly as models scale.

The practical implication is the potential to streamline the data pipeline, reducing dependence on manually curated datasets, which are labor-intensive and difficult to scale. Theoretically, the results call into question established assumptions about data-quality hierarchies in LLM training.

Future Developments in AI

Future work could explore the applicability of MDR to other data domains and languages, and further investigate deduplication's role in data efficiency. Additionally, understanding how variations in web data quality affect emergent model behaviors could further refine pretraining practices.

This paper contributes significantly to the discussion on scaling pretraining data, suggesting that leveraging raw web data with robust cleaning techniques can be an effective path forward for building large-scale, high-performance LLMs.

Authors (9)
  1. Guilherme Penedo (7 papers)
  2. Quentin Malartic (6 papers)
  3. Daniel Hesslow (12 papers)
  4. Ruxandra Cojocaru (6 papers)
  5. Alessandro Cappelli (10 papers)
  6. Hamza Alobeidli (3 papers)
  7. Baptiste Pannier (4 papers)
  8. Ebtesam Almazrouei (7 papers)
  9. Julien Launay (17 papers)
Citations (642)