
FineWeb2: Multilingual Pretraining Pipeline for LLMs

Updated 27 June 2025

FineWeb2 is a multilingual pretraining data pipeline and public dataset designed to provide high-quality, scalable text corpora for LLM training across over one thousand languages. Developed as an evolution of earlier web-scale filtering efforts, FineWeb2 addresses the key challenges of language-specific adaptation, deduplication, filtering, and corpus rebalancing in a single modular framework, and demonstrates measurable improvements over prior state-of-the-art datasets. Its open-source pipeline, large data release, and principled evaluation suite advance best practices in multilingual dataset construction for LLM research and deployment.

1. Pipeline Architecture and Language Adaptation

FineWeb2 introduces an adaptable and language-aware pipeline that processes raw Common Crawl web data into high-quality, ready-to-train corpora. The pipeline includes the following main stages:

  1. Data Extraction: Raw WARC files are downloaded from Common Crawl. Main textual content is extracted using the trafilatura library, with blocklist-based removal of unwanted domains (e.g., adult content).
  2. Language Identification (LID): Uses GlotLID (covering ~1,900 languages and script variants) for precise document-level language detection. For each language $l$, a confidence threshold $\tau_l$ is set (a sketch of this thresholding follows the list):

$\tau_l = \max\{0.3,\ \min\{0.9,\ \mathrm{Med}(X) - \sigma(X)\}\}$

where $X$ is the confidence distribution, $\mathrm{Med}$ the median, and $\sigma$ the standard deviation. This adaptive thresholding accommodates diverse language resource levels and misclassification risks.

  3. Deduplication: MinHash deduplication is performed globally per language, using word-level 5-grams suited to each language's tokenization or segmentation (a minimal MinHash sketch also follows this list). The cluster size (deduplication multiplicity) of each document is recorded for later rebalancing.
  4. Tokenization (Word Segmentation): Employs SpaCy, Stanza, and language-specific libraries (IndicNLP, PyThaiNLP, Kiwipiepy, etc.) for accurate segmentation. For languages lacking direct support, family-based proxy tokenization is applied.
  5. Quality Filtering: Multiple heuristic filters (document/word length, symbol ratios, repetitions, n-gram duplication) are automatically adapted to each language. Thresholds are derived from statistics in Wikipedia or other clean reference corpora. Filtering strategies include 10Tail (bottom 10% by metric), mean-std quantile, and median-ratio relative to an English reference (a threshold-derivation sketch closes this section).
  6. Enhanced Filtering for Low-Resource Languages: To address high false-positive rates, precision-based document filtering is added using language-specific high-affinity word lists and whitelisting of trusted URLs.
  7. Rehydration (Dedup-Informed Upsampling): A deduplication- and filter-aware upsampling procedure that rebalances the corpus according to empirical cluster-size/quality distributions (see Section 4).
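
As a concrete illustration of step 2's adaptive thresholding, the sketch below computes $\tau_l$ from a sample of GlotLID confidence scores. It is a minimal reconstruction of the formula above, assuming per-language confidence samples as input; the function name and the synthetic data are illustrative, not the released pipeline code.

```python
import numpy as np

def lid_confidence_threshold(confidences: np.ndarray) -> float:
    """Adaptive LID threshold: tau_l = max(0.3, min(0.9, Med(X) - sigma(X))).

    `confidences` is assumed to be a sample of GlotLID confidence scores for
    documents predicted as language l (an illustrative input, not the
    pipeline's actual interface).
    """
    med = float(np.median(confidences))
    std = float(np.std(confidences))
    return max(0.3, min(0.9, med - std))

# Synthetic example: tightly clustered confidences (typical of well-resourced
# languages) yield a stricter threshold than a noisy, spread-out distribution.
rng = np.random.default_rng(0)
well_resourced = rng.normal(0.92, 0.03, 10_000).clip(0.0, 1.0)
noisy = rng.normal(0.55, 0.20, 2_000).clip(0.0, 1.0)
print(lid_confidence_threshold(well_resourced))
print(lid_confidence_threshold(noisy))
```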

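Step 3's MinHash deduplication over word-level 5-grams can be sketched as follows. This toy version computes signatures and an estimated Jaccard similarity for a pair of documents, assuming whitespace tokenization; the production pipeline instead runs globally per language with language-specific segmentation and LSH bucketing.

```python
import hashlib

def word_ngrams(text: str, n: int = 5) -> set[str]:
    # Word-level n-gram shingles; whitespace splitting stands in for the
    # language-specific segmenters used in the real pipeline.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(shingles: set[str], num_perm: int = 64) -> list[int]:
    # Toy MinHash: for each seeded hash function, keep the minimum hash
    # value observed over the document's shingles.
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bank today"
sim = estimated_jaccard(minhash_signature(word_ngrams(doc_a)),
                        minhash_signature(word_ngrams(doc_b)))
print(f"estimated Jaccard similarity: {sim:.2f}")  # shingle overlap of the near-duplicates
```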
Each component is designed to be parameterized per language, allowing the pipeline to automatically adapt across a spectrum from high-resource to severely under-resourced languages, as evidenced by observed improvements in downstream evaluation (see Section 3).
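
Step 5's per-language threshold derivation can likewise be sketched from reference-corpus statistics. The two functions below correspond to the 10Tail and mean-std strategies named above, applied to an assumed array of per-document metric values (e.g., alphabetic-character ratio) drawn from a clean reference corpus such as Wikipedia; the exact quantiles, signs, and inputs are illustrative assumptions, not the paper's released settings.

```python
import numpy as np

def tail10_threshold(reference_values: np.ndarray) -> float:
    # "10Tail"-style cutoff: the 10th percentile of the metric on a clean
    # reference corpus; documents falling into that tail would be filtered.
    return float(np.percentile(reference_values, 10))

def mean_std_threshold(reference_values: np.ndarray, k: float = 1.0) -> float:
    # Mean-std-style cutoff: mean minus k standard deviations (sign and k
    # depend on whether low or high metric values indicate noisy documents).
    return float(reference_values.mean() - k * reference_values.std())

# e.g., fraction of alphabetic characters per document in a reference corpus
reference_alpha_ratio = np.random.default_rng(1).beta(20, 2, 50_000)
print(tail10_threshold(reference_alpha_ratio))
print(mean_std_threshold(reference_alpha_ratio))
```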

2. Evaluation Methodology and Task Selection

A distinctive contribution of FineWeb2 is its evaluation pipeline for task/benchmark selection and ablation. To ensure reliable early signals for LLM training quality, the following process is implemented:

  1. Task Collection: A broad collection of translated or multilingual natural language tasks (MMLU, HellaSwag, MLQA, X-CSQA, X-CODAH, FQuAD, etc.) is curated for each of nine “canary” languages representing diverse families and resource levels.
  2. Quantitative Selection Criteria (computed in the sketch that follows this list):
    • Monotonicity: Average Spearman correlation $\bar{\rho} \geq 0.5$ between task score and training step.
    • Signal-to-Noise Ratio: Averaged over evaluation checkpoints $s$,

    $\mathrm{SNR} = \frac{1}{n} \sum_{s=0}^{n} \frac{\mu^*_s}{\sigma_s}$

    Tasks with $\mathrm{SNR} \geq 20$ are retained.
    • Non-Randomness: Ratio of the improvement over the random baseline to the final standard deviation, with $\geq 3$ as the cutoff.
    • Ordering Consistency: Kendall's Tau-$a$ (optional).

  3. Task Categorization: Balanced coverage across Reading Comprehension (RC), General Knowledge (GK), Natural Language Understanding (NLU), and Common-Sense Reasoning (CR) is enforced.

  4. Aggregation: Macro-averaging across categories, with raw scores rescaled:

$\mathrm{new\_score} = \frac{\mathrm{score} - b}{1 - b}$

where $b$ is the random baseline for the task.
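
The selection statistics and the rescaling above can be sketched as follows, assuming a small matrix of task scores indexed by seed and training checkpoint. The shape conventions, the reading of $\mu^*_s$ and $\sigma_s$ as the rescaled mean and across-seed standard deviation at checkpoint $s$, and the use of scipy/numpy are assumptions, not the paper's released evaluation code.

```python
import numpy as np
from scipy.stats import spearmanr

def rescale(scores: np.ndarray, baseline: float) -> np.ndarray:
    # new_score = (score - b) / (1 - b), with b the task's random baseline.
    return (scores - baseline) / (1.0 - baseline)

def monotonicity(scores: np.ndarray) -> float:
    # Average Spearman correlation between score and training step,
    # computed per seed; `scores` has shape (num_seeds, num_checkpoints).
    steps = np.arange(scores.shape[1])
    rhos = []
    for run in scores:
        rho, _ = spearmanr(steps, run)
        rhos.append(rho)
    return float(np.mean(rhos))

def signal_to_noise(scores: np.ndarray, baseline: float) -> float:
    # SNR = (1/n) * sum_s mu*_s / sigma_s, reading mu*_s as the mean rescaled
    # score and sigma_s as the across-seed std at checkpoint s (assumption).
    rescaled = rescale(scores, baseline)
    mu = rescaled.mean(axis=0)
    sigma = np.maximum(rescaled.std(axis=0), 1e-8)  # avoid division by zero
    return float(np.mean(mu / sigma))

# Toy example: 3 seeds x 8 checkpoints of accuracies, random baseline 0.25.
rng = np.random.default_rng(0)
scores = 0.30 + 0.05 * np.arange(8) + rng.normal(0.0, 0.01, size=(3, 8))
print(monotonicity(scores))                    # close to 1.0 for a clean upward trend
print(signal_to_noise(scores, baseline=0.25))
# A task would be retained if monotonicity >= 0.5 and SNR >= 20.
```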

This results in reliable, noise-robust, and comparable evaluations across languages and pipeline stages.

3. Performance Studies and Observed Improvements

Incremental improvements are measured at each pipeline stage (LID, deduplication, filtering, rehydration) through a series of LLM ablations on the “canary” languages:

  • Aggregate metrics: Macro-averaged, rescaled accuracy/F1 over the selected suite.

| Language | LID only | +Dedup | +Filtering | +Rehydration (Final) |
|----------|----------|--------|------------|----------------------|
| Arabic   | 21.7     | 22.1   | 24.2       | 25.2                 |
| French   | 18.3     | 18.0   | 21.9       | 23.6                 |
| ...      | ...      | ...    | ...        | ...                  |

  • Models trained on FineWeb2 consistently surpass those trained on CC-100, mC4, CulturaX, HPLT, and other prior multilingual corpora, on both pipeline-optimized (“canary”) and held-out languages. For example, on 11 of 14 tested unseen languages (including German, Indonesian, Japanese, and Vietnamese), FineWeb2-trained models deliver superior downstream scores.

  • The pipeline demonstrates robustness, with improvements holding for both high- and low-resource languages and across domains (Wikipedia vs. web).

4. Deduplication-Informed Rebalancing ("Rehydration")

FineWeb2 introduces a principled, dataset- and language-agnostic rehydration method to address quality imbalances after deduplication:

  • Each kept document is assigned a deduplication cluster count ($c$).

  • For each $c$, the downstream filtering removal rate $r_c$ is computed; $r_{\mathrm{global}}$ is the global removal rate.

  • Upsampling weights $w_c$ are defined (a short sketch follows below):

    • $w_c = 10$ for the cluster size with the lowest removal rate (indicative of high quality).
    • $w_c = 1$ for cluster sizes with removal rate $> r_{\mathrm{global}}$ (no upsampling).
    • Linear interpolation for intermediate $c$ values.

This empirical approach is based on the observation that the highest-quality documents mostly populate the “middle” of deduplication cluster sizes (i.e., neither unique nor massively duplicated). This avoids biasing the corpus toward low-quality unique content or boilerplate-heavy duplicates and provides consistent improvement in evaluation metrics.
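
A minimal sketch of the weighting scheme described above, assuming per-cluster-size removal rates have already been measured. The choice to interpolate linearly on the removal rate (rather than on $c$ itself), and the translation of fractional weights into sampling, are assumptions for illustration rather than the exact released implementation.

```python
def rehydration_weights(removal_rate_by_size: dict[int, float],
                        global_removal_rate: float,
                        max_weight: float = 10.0) -> dict[int, float]:
    # Upsampling weight w_c per deduplication cluster size c:
    #   w_c = max_weight at the cluster size with the lowest filter removal rate,
    #   w_c = 1 wherever the removal rate exceeds the global rate,
    #   linear interpolation (here on the removal rate) in between.
    best_rate = min(removal_rate_by_size.values())
    weights = {}
    for c, rate in removal_rate_by_size.items():
        if rate >= global_removal_rate:
            weights[c] = 1.0
        elif global_removal_rate == best_rate:
            weights[c] = max_weight
        else:
            t = (rate - best_rate) / (global_removal_rate - best_rate)
            weights[c] = max_weight + t * (1.0 - max_weight)
    return weights

# Toy example: mid-sized clusters are filtered least, so they are upsampled most.
removal_rates = {1: 0.30, 2: 0.18, 4: 0.10, 8: 0.12, 16: 0.28, 64: 0.45}
print(rehydration_weights(removal_rates, global_removal_rate=0.25))
```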

5. Multilingual Coverage, Dataset Release, and Tooling

  • Scale and Scope: FineWeb2 covers 1,868 language–script pairs, with substantial corpora for over 1,000 languages. The dataset is 20 TB in size, comprising over 5 billion documents drawn from 96 Common Crawl snapshots (2013–2024).
  • Per-Language Adaptation: All pipeline parameters—LID thresholds, segmentation, filtering—are set independently per language/script and, where needed, per domain/resource.
  • Low-Resource Handling: For very low-resource languages, the pipeline uses precision-enhancing heuristics (e.g., high-affinity word inclusion, URL white/blacklisting), and tracks corpus composition (e.g., Wikipedia/Bible dominance).
  • Public Access: The complete dataset, curation pipeline, and evaluation code are released as open source.

Researchers can directly reuse, retrain, or extend the pipeline to custom languages or adapt filtering/deduplication as desired.
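
To make the per-language parameterization concrete, the record below shows the kind of per-language/script configuration such a pipeline could carry. All field names and values are hypothetical placeholders, not the actual FineWeb2 configuration schema.

```python
from dataclasses import dataclass, field

@dataclass
class LanguageConfig:
    # Hypothetical per-language/script settings (illustrative only).
    lang_script: str                     # e.g. "tha_Thai" (language + script)
    lid_threshold: float                 # adaptive GlotLID cutoff tau_l
    segmenter: str                       # e.g. "pythainlp", "spacy", "stanza"
    dedup_ngram_size: int = 5            # word-level n-grams for MinHash
    filter_thresholds: dict = field(default_factory=dict)
    trusted_url_whitelist: list = field(default_factory=list)

thai = LanguageConfig(
    lang_script="tha_Thai",
    lid_threshold=0.72,                  # placeholder value, not from the paper
    segmenter="pythainlp",
    filter_thresholds={"min_alpha_ratio": 0.55, "max_line_dup_ratio": 0.30},
)
print(thai)
```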

6. Context, Limitations, and Comparative Position

Within the landscape of multilingual LLM data pipelines, FineWeb2 distinguishes itself through fully adaptive, per-language curation, deduplication-informed rebalancing, and comprehensive, quantitative ablation studies. Previous efforts (e.g., CC-100, mC4) lack systematic adaptation for low- and medium-resource languages, and often apply filtering and deduplication in a one-size-fits-all manner. FineWeb2’s methodologies address these deficiencies, as reflected in improved model performance across benchmarks.

Recent research suggests that heuristic filtering approaches such as those in FineWeb2, while robust and scalable, may be outperformed by model-based semantic filtering (as in JQL (Ali et al., 28 May 2025)), particularly for subtle cross-lingual quality cues or high-level semantic criteria. Nevertheless, FineWeb2 sets a new baseline for scale, practicality, and performance in open multilingual LLM pretraining data.

7. Implications for Multilingual LLM Development

FineWeb2 provides an open, reproducible foundation for training large-scale multilingual LLMs. By making all pipeline steps and design choices transparent and adaptive, it enables:

  • High-quality, broad-coverage multilingual datasets for new and under-resourced languages.
  • Systematic evaluation of design trade-offs via robust early-signal tasks and metrics.
  • Research agility for developing new filtering/deduplication techniques and domain extensions.
  • Standardization and reproducibility in multilingual model training and benchmarking, lowering barriers for non-corporate and academic groups.

FineWeb2’s adaptive, modular methodology supports flexible scaling, robust performance, and open contributions, enabling future advances in multilingual natural language understanding and generation.