FineWeb: A Large-Scale Web-Derived Corpus for LLM Pretraining

Updated 27 June 2025

FineWeb is a public, large-scale web-derived text corpus and associated data curation framework engineered to advance the quality and transparency of LLM pretraining datasets. Developed to address the opacity and inaccessibility of the data behind leading closed-source LLMs, FineWeb distinguishes itself through a methodical, empirically validated pipeline for extraction, filtering, deduplication, annotation, and benchmarking, applied at unprecedented scale (15 trillion tokens) and justified by rigorous ablation studies. Its influence extends across dataset design, model pretraining practices, bias measurement, multilingual and domain-specific adaptation, line-level annotation, and open resources for domain-specialized and retrieval-augmented systems.

1. Origin, Scope, and Dataset Structure

FineWeb is constructed from 96 Common Crawl snapshots spanning 2013 to early 2024 and comprises 15 trillion GPT-2 tokens of predominantly English text (with derived subsets supporting multilingual and Chinese corpora). Each record is an extracted web page with rich metadata: main text, unique ID, crawl snapshot details, URL, crawl date, language score, and token count. FineWeb's pipeline emphasizes the use of raw WARC files (as opposed to pre-stripped WET files) and leverages Trafilatura for main-content extraction to minimize boilerplate and maximize downstream LLM utility.
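
For readers who want to inspect these fields directly, the following is a minimal sketch that streams a few records with the Hugging Face datasets library. It assumes the public hub identifier HuggingFaceFW/fineweb, the sample-10BT configuration, and the column names listed above; adjust all of these if the hosted schema differs.

```python
# Minimal sketch: stream a few FineWeb records and inspect their metadata.
# Dataset id, config name, and field names are assumptions based on the public
# release described above; verify them against the hosted dataset card.
from datasets import load_dataset

fw = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # small sample config; full dumps use per-snapshot configs
    split="train",
    streaming=True,       # avoid downloading the full corpus
)

for i, record in enumerate(fw):
    # Expected fields per the description above: text, id, dump, url, date,
    # language, language_score, token_count (verify against the hosted schema).
    print({k: record[k] for k in ("url", "dump", "language_score", "token_count") if k in record})
    print(record["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```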

A prominent feature is the FineWeb-Edu subset: a 1.3 trillion token selection filtered for educational value using a model-based classifier trained on Llama-3-70B-Instruct-generated annotations. The dataset, all processing scripts, annotation protocols, benchmarking setup, and ablation model checkpoints are openly released under the permissive ODC-By license.

2. Design Principles: Extraction, Filtering, Deduplication

FineWeb’s curation pipeline proceeds in carefully sequenced stages:

  • Text Extraction: Trafilatura extracts main content from raw WARC files; ablations show this outperforms alternative extractors by reducing boilerplate and raising downstream LLM benchmark scores.
  • Layered Filtering:
    • Base filters include a URL blacklist (for adult/NSFW domains), fastText-based language identification with a P(English) ≥ 0.65 threshold, and repetition/quality metrics (from MassiveText).
    • Additional filters adapt best practices from C4 (e.g., removal of boilerplate phrases, but the terminal punctuation rule was dropped as too aggressive).
    • More than 50 custom metric-threshold pairs, each ablated for its effect: for example, documents with a low fraction of lines ending in terminal punctuation, a duplicate-line fraction ≥ 0.1, or a proportion ≥ 0.67 of short lines (< 30 characters) are removed (a simplified filtering and PII-masking sketch appears at the end of this section).
  • Deduplication: Rather than global deduplication (which causes data quality regression by discarding older, higher-quality content), FineWeb deduplicates within each crawl snapshot using MinHash over 5-grams with a 75% similarity threshold. The probabilistic match formula is given by

P_match = 1 - (1 - s^8)^14

where s is the n-gram similarity between two documents (see the sketch after this list).

  • PII Masking: Personal information such as emails and IP addresses is scrubbed via regex.
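
The threshold behavior of this deduplication scheme can be checked numerically. The sketch below assumes only what the formula above implies (8 hashes per bucket, 14 buckets) and evaluates the match probability at a few similarity levels:

```python
# Sketch of the MinHash bucketing math from the formula above: with 14 buckets
# of 8 hashes each, the probability that two documents share at least one
# bucket is P_match = 1 - (1 - s^8)^14, where s is their true 5-gram similarity.
def match_probability(s: float, hashes_per_bucket: int = 8, buckets: int = 14) -> float:
    """Probability that two documents with n-gram similarity s collide in MinHash LSH."""
    return 1.0 - (1.0 - s ** hashes_per_bucket) ** buckets

for s in (0.50, 0.70, 0.75, 0.80, 0.90, 0.95):
    print(f"similarity {s:.2f} -> P(match) ≈ {match_probability(s):.3f}")

# The curve rises steeply around the 75% similarity threshold, which is what
# makes this parameterization behave as an approximate near-duplicate detector.
```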

Each curation stage is empirically justified to produce monotonic improvements on LLM benchmarks, as shown in ablation results.
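
For illustration, the document-level heuristics and PII scrubbing described above can be expressed as a few simple predicates. The following is a simplified sketch that mirrors only the thresholds quoted in this section (language score, duplicate-line fraction, short-line proportion) and uses toy regex patterns; it is not the released FineWeb pipeline.

```python
import re

# Illustrative document-level filters mirroring the thresholds quoted above.
# Simplified sketch only; the official processing scripts are released separately.

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def keep_document(text: str, language_score: float) -> bool:
    """Return True if the document passes the language and line-level heuristics."""
    if language_score < 0.65:                      # fastText P(English) threshold
        return False
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    # Fraction of duplicate lines must stay below 0.1 (simplified duplicate count).
    dup_fraction = 1.0 - len(set(lines)) / len(lines)
    if dup_fraction >= 0.1:
        return False
    # At most two thirds of the lines may be shorter than 30 characters.
    short_fraction = sum(len(ln) < 30 for ln in lines) / len(lines)
    if short_fraction >= 0.67:
        return False
    return True

def mask_pii(text: str) -> str:
    """Scrub emails and IPv4 addresses via regex, as in the PII-masking stage."""
    text = EMAIL_RE.sub("<email>", text)
    return IPV4_RE.sub("<ip>", text)
```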

3. Empirical Performance and Benchmarking

FineWeb sets a new performance standard for open LLM pretraining datasets:

  • LLMs trained on FineWeb outperform those trained on RefinedWeb, C4, Dolma, RedPajama2, and others, when evaluated with the same architecture (1.82B parameters) and token counts.
  • Benchmarks employed: CommonSense QA, HellaSwag, OpenBook QA, PIQA, SIQA, WinoGrande, ARC, and MMLU. Models are evaluated using the lighteval protocol, ensuring reproducibility.
  • FineWeb-Edu shows marked performance gains on knowledge/reasoning benchmarks: for instance, MMLU accuracy rises from 33% (FineWeb 350B tokens) to 37% (FineWeb-Edu), and ARC from 46% to 57%.

The public release also includes checkpoints for the ablation models trained on each pipeline variant, a level of transparency rarely matched by comparable datasets.

4. Educational Filtering and Domain Adaptation

FineWeb underpins new strategies for extracting knowledge-rich and domain-specific subsets:

  • FineWeb-Edu is assembled via a two-stage process: Llama-3-70B-Instruct educational scoring (0–5 scale) over 460k samples, followed by a linear regression classifier on Snowflake-arctic-embed-m embeddings, with documents selected at predicted score ≥ 3 (F1 = 82%). Inference at this scale consumed roughly 6,000 H100 GPU hours (a minimal classifier sketch follows this list).
  • Domain adaptation (e.g., the OnlySports dataset): the methodology is replicated for English-language sports, law, medicine, and astronomy text, starting from FineWeb or FineWeb-Edu, filtering with domain-specific lexicons, embedding similarity, or classifiers, followed by high-precision downstream filtering. For instance, ORBIT (astronomy) distills a well-curated 10B-token subset for fine-tuning, yielding a +7-point improvement on MMLU Astronomy.
  • Multilingual and cross-lingual expansion: paradigms from FineWeb inform the model-based filtering of FineWeb2, which automatically adapts extraction, deduplication, and threshold assignment to 1,868 language-script pairs, yielding a 20TB (5B document) dataset with competitive or superior downstream performance for non-English LLMs.
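
A minimal sketch of such an educational scorer is shown below, assuming the sentence-transformers and scikit-learn APIs. The Snowflake/snowflake-arctic-embed-m model id, the training annotations, and the regression head here are placeholders standing in for the released classifier, not a reproduction of it.

```python
# Sketch of a FineWeb-Edu style scorer: embed documents, fit a linear regression
# head on 0-5 educational scores produced by an annotator LLM, and keep documents
# predicted at >= 3. Training data and model ids are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LinearRegression

embedder = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")  # assumed HF model id

def fit_edu_scorer(train_texts: list[str], train_scores: list[float]) -> LinearRegression:
    """Fit a linear regression head on document embeddings vs. annotated scores."""
    X = embedder.encode(train_texts, normalize_embeddings=True)
    head = LinearRegression()
    head.fit(X, train_scores)
    return head

def select_educational(head: LinearRegression, docs: list[str], threshold: float = 3.0) -> list[str]:
    """Keep documents whose predicted educational score clears the threshold."""
    X = embedder.encode(docs, normalize_embeddings=True)
    preds = head.predict(X)
    return [doc for doc, score in zip(docs, preds) if score >= threshold]
```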

FineWeb data is also a principal component in downstream mixtures such as Zyda-2, where upweighting FineWeb-Edu tokens is shown to enhance model quality.

5. Bias, Artefacts, and Benchmarking of Data Filters

Recent analyses reveal that, despite shared origins and broadly similar pipelines, FineWeb possesses a unique “fingerprint”: machine classifiers reliably distinguish its sequences from those of C4, RefinedWeb, DolmaCC, and others, achieving up to 87% accuracy in binary classification and 80% in 3-way splits where chance is 33%. These biases persist after surface rephrasing, formatting removal, or even generation by models pretrained on the datasets. The bias propagates through LLM training, meaning that models inherit distributional artefacts—impacting generalization and mixture estimation. This underscores the need for transparency and holistic documentation in open web pretraining pipelines.
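The experimental setup behind these fingerprint measurements can be illustrated with any off-the-shelf text classifier. The sketch below uses TF-IDF features with logistic regression as a stand-in for whatever classifiers the cited analyses employ; it shows the protocol (train on sequences labeled by source corpus, evaluate on held-out sequences), not their exact models or numbers.

```python
# Illustrative dataset-fingerprint experiment: train a classifier to guess which
# corpus a text sequence came from. TF-IDF + logistic regression is a simple
# stand-in; the cited analyses may use stronger classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

def fingerprint_accuracy(sequences: list[str], corpus_labels: list[str]) -> float:
    """corpus_labels name the source of each sequence, e.g. 'fineweb', 'c4', 'refinedweb'."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        sequences, corpus_labels, test_size=0.2, random_state=0, stratify=corpus_labels
    )
    clf = make_pipeline(TfidfVectorizer(max_features=50_000), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

# Accuracy well above chance (50% for two corpora, 33% for three) indicates a
# distributional fingerprint of the curation pipeline rather than of the web itself.
```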

6. Impact on Downstream Agent and Retrieval Systems

FineWeb plays an increasingly central role as a foundation corpus for information retrieval, retrieval-augmented generation (RAG), and open QA systems:

  • DeepResearchGym: indexes FineWeb (a recent Common Crawl snapshot) with MiniCPM dense embeddings and sharded DiskANN approximate nearest-neighbor retrieval behind public RESTful APIs, powering large-scale, low-latency, fully reproducible IR experiments on complex deep-research questions. Benchmarks show parity with or superiority to commercial search APIs, with improved alignment, faithfulness, and report quality on controlled tasks (a minimal retrieval sketch follows this list).
  • RAG Challenge Leaderboards: FineWeb-10BT variants are used as substrates for major open QA competitions (SIGIR LiveRAG), supporting dense, sparse, and hybrid retrieval, cluster-based context organization, and LLM-based evaluators. Systems such as RAGtifier, TopClustRAG, and RAGentA leverage FineWeb passages in advanced multi-stage and multi-agent pipelines, attaining top leaderboard rankings for answer correctness, factual faithfulness, and evidence attribution.
  • Data annotation at scale: Advances such as FinerWeb-10BT introduce LLM-based line-level filtering (via GPT-4o mini and DeBERTa-v3 classifiers), providing granular quality labels and demonstrating that models trained on the cleaned subset reach higher HellaSwag accuracy up to 32% faster, even with 25% less pretraining data.
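
As a rough illustration of the retrieval pattern these systems build on, the sketch below embeds FineWeb passages and serves top-k nearest-neighbor search. It deliberately substitutes a generic sentence-transformers encoder and an exact FAISS index for the MiniCPM embeddings and sharded DiskANN service described above, so it shows the pattern rather than the DeepResearchGym implementation.

```python
# Minimal dense-retrieval sketch over FineWeb passages. Encoder choice and the
# exact (non-sharded) FAISS index are stand-ins for the systems described above.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder encoder

def build_index(passages: list[str]) -> faiss.IndexFlatIP:
    """Embed passages and build an exact inner-product index (cosine on normalized vectors)."""
    vecs = encoder.encode(passages, normalize_embeddings=True)
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(np.asarray(vecs, dtype="float32"))
    return index

def search(index: faiss.IndexFlatIP, query: str, k: int = 5) -> list[int]:
    """Return positions of the top-k passages, e.g. to feed a RAG prompt."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return ids[0].tolist()
```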

FineWeb’s design and openness have enabled robust, repeatable, and extensible benchmarking for both IR and foundational LLM model development.

7. Extensions, Recent Innovations, and Open Resources

FineWeb’s influence is further amplified by its extensibility and openness:

  • Used as a seed for machine-translated corpora (TransWebEdu), enabling high-quality LLM pretraining for under-resourced languages and demonstrating state-of-the-art cross-lingual generalization with an order of magnitude less data than closed-source competitors.
  • Underlies efficient data-filtering innovations, e.g., Ultra-FineWeb, which integrates rapid verification, empirically driven seed selection, and fastText classification, yielding 1T English and 120B Chinese tokens with further improvements in model performance and faster curation (a minimal classifier sketch follows this list).
  • Serves as the base for FineWeb2, scaling to 1,000+ languages with language-informed tokenization, adaptive thresholding, and duplication-aware rehydration for optimal dataset balance.
  • All datasets, code, annotation models, and evaluation protocols are openly released, enabling broad adoption and principled critique within the community.
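
A minimal sketch of the fastText classification step used in this style of filtering is given below; the label scheme, training file, and hyperparameters are illustrative assumptions, not the Ultra-FineWeb configuration.

```python
# Sketch of a fastText keep/drop quality classifier in the spirit of the
# Ultra-FineWeb filtering step described above. Paths, labels, and
# hyperparameters are placeholders.
import fasttext

# train.txt lines look like: "__label__keep <document text>" / "__label__drop <document text>"
model = fasttext.train_supervised(
    input="train.txt", lr=0.1, epoch=5, wordNgrams=2, dim=100
)

def keep(text: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier predicts __label__keep with enough confidence."""
    labels, probs = model.predict(text.replace("\n", " "), k=1)  # fastText input must be single-line
    return labels[0] == "__label__keep" and probs[0] >= threshold
```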

Table: FineWeb and Key Derivative Datasets

| Dataset | Size/Scope | Core Methods | Use Cases |
|---|---|---|---|
| FineWeb | 15T tokens, English | Empirical extraction, deduplication | General LLM pretraining; IR |
| FineWeb-Edu | 1.3T tokens, English | LLM-annotated quality classifier | Education/reasoning LLMs, domain curation |
| OnlySports | 600B tokens, English sports | Domain classifier, MapReduce | Domain-specific LMs |
| Zyda-2 | 5T tokens, English + others | Cross-dataset dedup, scoring | SOTA Zamba2 model pretraining |
| Nemotron-CC | 6.3T tokens, English | Ensemble filtering, synthetic data | Long-horizon pretraining |
| FinerWeb-10BT | 10B tokens, English | LLM line-level filtering | Efficient model pretraining |
| FineWeb2 | 20TB, 5B docs, multilingual | Adaptive pipeline, upsampling | Multilingual LLMs across 1,000+ languages |


FineWeb and its descendants have defined new standards for scale, rigor, and openness in LLM data curation, setting the foundation for reproducible model development, domain and multilingual pretraining, data-centric bias analysis, and robust evaluation of both models and complex agentic systems.