
FinerWeb-10BT: Curated Corpus & RAG Benchmark

Updated 14 May 2026
  • FinerWeb-10BT is a large-scale, annotated web corpus refined via LLM-based line-level filtering to boost data quality for diverse model training and evaluation tasks.
  • It employs both sparse BM25 and dense vector indexing to facilitate hybrid retrieval architectures and efficient RAG system development.
  • The corpus serves as a standardized benchmark underpinning advances in QA accuracy, semantic retrieval, and scalable data-centric innovations.

FinerWeb-10BT is a curated, annotated, and evaluated large-scale web corpus that serves both as a resource for research on LLM training data quality and as a testbed for retrieval-augmented generation (RAG) systems. Its base corpus, FineWeb-10BT, is derived from the larger FineWeb corpus, a rigorously cleaned, de-duplicated slice of CommonCrawl. The dataset and its filtered and annotated variants have played a central role in RAG benchmark competitions and data quality studies, supporting evaluation of retrieval architectures, clustering, neural re-ranking, and prompt optimization workflows. FineWeb-10BT consists of approximately 10–15 million English web documents, segmented into chunks of up to 512 tokens and indexed using both sparse and dense retrieval mechanisms. The quality-enhanced variant, FinerWeb-10BT, incorporates line-level LLM-based filtering, setting a technical precedent for scalable, semantics-driven data selection. The corpus, its indices, and associated tools form a standardized substrate for system-level advances in QA accuracy, answer faithfulness, and data-centric model innovation.

1. Corpus Construction and Data Quality Annotation

FineWeb-10BT is constructed by sampling either 10 million or 15 million web documents (depending on competition snapshot) from the larger FineWeb dataset, which originates from CommonCrawl. The FineWeb pipeline first applies rigorous HTML cleaning, boilerplate removal, and near-duplicate elimination; no additional post-sampling filtering by language, domain, or toxicity is performed, leaving the corpus broadly heterogeneous and challenging for downstream evaluation (Carmel et al., 7 Jul 2025). Documents span a broad spectrum of domains, with general-web coverage including news, health, technology, entertainment, commerce, and more.

A distinct contribution is FinerWeb-10BT, a quality-annotated subset of 10 billion tokens, generated by applying LLM-based line-level noise detection and removal (Henriksson et al., 13 Jan 2025). In this pipeline, a 20,000-document sample is labeled line-by-line using GPT-4o mini. The model outputs descriptive labels for low-quality content, which are clustered into nine principal categories: Clean, Formatting/Style/Errors, Bibliographical/Citation, Promotional/Spam, Contact Information, Navigation/Interface Elements, Technical Specs/Metadata, Legal/Administrative, and Offensive/Inappropriate. A DeBERTa-v3 classifier, fine-tuned on these annotations, scales this taxonomy to the entire corpus. Filtering at thresholds of quality_score < 0.50 (removing 8% of lines) and < 0.90 (removing 25%) preserves substantial high-quality content and enables demonstrable improvements in training efficiency and downstream model accuracy.
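
The threshold step described above can be illustrated with a minimal sketch. It assumes each line already carries a classifier score in [0, 1], where higher means cleaner; the function names and the sample scores are hypothetical and are not taken from the FinerWeb-10BT release.

```python
from typing import Iterable, List, Tuple

ScoredLine = Tuple[str, float]  # (line text, quality score in [0, 1])


def filter_lines(scored_lines: Iterable[ScoredLine], threshold: float) -> List[str]:
    """Keep lines whose quality score is at or above the threshold."""
    return [text for text, score in scored_lines if score >= threshold]


def removed_fraction(scored_lines: Iterable[ScoredLine], threshold: float) -> float:
    """Fraction of lines that fall below the threshold and would be dropped."""
    scored = list(scored_lines)
    if not scored:
        return 0.0
    removed = sum(1 for _, score in scored if score < threshold)
    return removed / len(scored)


# Hypothetical scores, for illustration only.
sample = [
    ("The committee published its findings on Tuesday.", 0.97),
    ("Click here to subscribe to our newsletter!", 0.12),
    ("Home | About | Contact", 0.05),
    ("Rainfall totals were above the seasonal average.", 0.88),
]

kept_lenient = filter_lines(sample, 0.50)  # drops only clearly noisy lines
kept_strict = filter_lines(sample, 0.90)   # also drops borderline lines
```

The stricter threshold trades recall of good content for precision, which is why the two cut-offs remove different shares of the corpus.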

2. Segmentation, Indexing, and Retrieval Infrastructure

For RAG and IR system testing, FineWeb-10BT documents are further split into non-overlapping “chunks” of up to 512 tokens using sentence-aware splitting (e.g., LlamaIndex splitter) (Carmel et al., 7 Jul 2025). Each chunk is indexed in two primary modalities:

  • Sparse Index (OpenSearch BM25):
    • Built over raw or cleaned text using the “english” analyzer for tokenization and lowercasing.
    • Standard BM25 scoring:

    $$\mathrm{score}_{\mathrm{BM25}}(q,d) = \sum_{t \in q} \mathrm{IDF}(t)\;\frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1 (1 - b + b\,|d|/\mathrm{avgdl})}$$

    with typical OpenSearch parameters $k_1 = 1.2$, $b = 0.75$ (but some variants use $k_1 = 0.9$, $b = 0.4$) (Carmel et al., 7 Jul 2025, Cofala et al., 17 Jun 2025, Fensore et al., 27 Jun 2025).

  • Dense Vector Index (Pinecone):

    • Built using E5-base-v2 or BGE encoders, producing 768-dimensional or similar embeddings, with cosine similarity or inner-product as the search metric.
    • Pinecone’s slab-based system compacts data via IVF-PQ and HNSW structures.
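
The BM25 formula given for the sparse index can be sketched in a few lines of Python. This is an illustrative implementation over pre-tokenized documents, not OpenSearch's code. The IDF definition is an assumption, since the text does not specify it; the sketch uses the common smoothed form log(1 + (N - df + 0.5) / (df + 0.5)).

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(
    query: Sequence[str],
    docs: Sequence[Sequence[str]],
    k1: float = 1.2,
    b: float = 0.75,
) -> List[float]:
    """Score every tokenized document in `docs` against the tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = sum(len(d) for d in docs) / n_docs or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            f = tf.get(term, 0)
            if f == 0:
                continue
            # Assumed IDF variant (smoothed, always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    ["bm25", "ranking", "function"],
    ["dense", "vector", "search"],
    ["bm25", "bm25", "sparse", "index"],
]
toy_scores = bm25_scores(["bm25"], toy_docs)
```

Here f(t,d) is the term frequency, |d| the document length in tokens, and avgdl the mean length over the collection. Lowering b (as in the 0.4 variant) weakens length normalization, and lowering k1 makes term frequency saturate sooner.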

Hybrid retrieval is prevalent: systems typically issue BM25 and dense queries in parallel, merge the results via Reciprocal Rank Fusion (RRF) or normalized score addition, and may further re-rank candidates using cross-encoder models (e.g., BGE-M3, RankLLaMA, ms-marco-MiniLM-L6-v2) (Bakagianni et al., 18 Jun 2025, Fensore et al., 27 Jun 2025).
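
Reciprocal Rank Fusion can be sketched as follows. Each passage receives 1 / (k + rank) from every ranked list in which it appears, and the contributions are summed. The constant k = 60 is the value commonly used in the RRF literature and is an assumption here; the cited systems may use a different setting.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of passage ids (best first) into one list."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


bm25_ranking = ["a", "b", "c"]   # hypothetical sparse results
dense_ranking = ["b", "d", "a"]  # hypothetical dense results
fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is one reason it is a popular default for hybrid search.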

3. Line-Level Quality Filtering and Evaluation

The FinerWeb-10BT workflow demonstrates the efficacy of LLM-based, line-level filtering for identifying low-quality web text overlooked by heuristic document-level rules (Henriksson et al., 13 Jan 2025). Starting from a 20,000-document sample, GPT-4o mini yields 547 unique labels for low-quality content, collapsed into a nine-class taxonomy. Clean prose comprises 86% of lines; the major low-quality classes individually range from 0.74% to 4%. A DeBERTa-v3 classifier achieves a Clean-vs-noise F1 of 0.90 on held-out data, with a bias toward mislabeling low-quality lines as Clean in order to minimize the risk of discarding “borderline” good content.

Filtering at a 0.90 threshold removes 25% of tokens, yet GPT-2 models trained on the reduced dataset gain roughly 0.01 in absolute accuracy on HellaSwag (~0.31 vs ~0.30), and the filtered models reach the unfiltered baseline's best accuracy ~6000 steps (32%) sooner. These results, robust across five training runs, validate the semantics-driven approach as both data- and energy-efficient. The filtering pipeline and taxonomy are agnostic to model and language, suggesting applicability beyond English GPT-2.

4. Use in Retrieval-Augmented Generation (RAG) Competitions

FineWeb-10BT has become a de facto standard for large-scale, reproducible RAG benchmarking, most notably as the official corpus for the SIGIR 2025 LiveRAG Challenge (Carmel et al., 7 Jul 2025, Cofala et al., 17 Jun 2025, Bakagianni et al., 18 Jun 2025, Fensore et al., 27 Jun 2025). During LiveRAG, teams developed complex RAG pipelines under constraints:

  • Retrieval from provided BM25 or dense indices over 10–15M documents, with up to 512-token chunks (~75M passages in some builds).
  • Answer synthesis using Falcon3-10B-Instruct, with prompt context size limited by model constraints.
  • Evaluation via LLM-based “judge” scoring as well as human review, with correctness and faithfulness as primary metrics.

A typical high-ranking pipeline incorporates:

  1. Parallel BM25 and dense (e.g., E5) retrieval (top-200).
  2. Score fusion and selection of top-k (commonly 100).
  3. Optional neural re-ranking to filter down to top-5 to top-10 high-precision passages.
  4. Prompting Falcon3-10B-Instruct using either a denoising instruction (e.g., InstructRAG) or diversification strategies (e.g., via clustering).
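
The four steps above can be sketched as a small orchestration function. The retriever and re-ranker interfaces, the function names, and the prompt wording are all hypothetical stand-ins; in particular, the prompt is a generic grounding template, not the InstructRAG prompt, and fusion is done with RRF (k = 60) as one of several possible choices.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical interface: (query, top_n) -> ranked passage ids, best first.
Retriever = Callable[[str, int], List[str]]
# Hypothetical interface: (query, candidate ids) -> re-ordered ids, best first.
Reranker = Callable[[str, Sequence[str]], List[str]]


def fuse_rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion over ranked lists of passage ids."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


def select_passages(
    query: str,
    sparse: Retriever,
    dense: Retriever,
    rerank: Optional[Reranker] = None,
    retrieve_n: int = 200,
    fused_k: int = 100,
    final_k: int = 5,
) -> List[str]:
    """Steps 1-3: parallel retrieval, fusion, optional re-ranking."""
    rankings = [sparse(query, retrieve_n), dense(query, retrieve_n)]
    candidates = fuse_rrf(rankings)[:fused_k]
    if rerank is not None:
        candidates = rerank(query, candidates)
    return list(candidates[:final_k])


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Step 4: assemble a generic grounded prompt for the generator."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


# Toy stand-ins for the real indices and re-ranker.
def toy_sparse(query: str, n: int) -> List[str]:
    return ["p1", "p2", "p3"][:n]


def toy_dense(query: str, n: int) -> List[str]:
    return ["p3", "p1", "p4"][:n]


def toy_rerank(query: str, ids: Sequence[str]) -> List[str]:
    return list(reversed(ids))


top_fused = select_passages("example question", toy_sparse, toy_dense, final_k=2)
top_reranked = select_passages(
    "example question", toy_sparse, toy_dense, rerank=toy_rerank, final_k=2
)
```

Keeping retrieval, fusion, and re-ranking behind plain callables makes it easy to swap a real OpenSearch or Pinecone client in for the toy functions without touching the orchestration logic.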

Notable systems include TopClustRAG, which performs k-means clustering over passage representations (TF–IDF+SVD) for semantic diversification and multi-stage LLM synthesis, achieving 2nd place in faithfulness on LiveRAG (Bakagianni et al., 18 Jun 2025), and the RAGtifier pipeline, which uses Pinecone retrieval followed by BGE-M3 reranking and achieves high correctness and faithfulness (Cofala et al., 17 Jun 2025). In systematic evaluations, hybrid fusion of sparse and dense retrieval yields higher recall at deeper ranks, and neural re-ranking delivers large MAP improvements, albeit with substantial computational cost (Fensore et al., 27 Jun 2025).

| System | Correctness | Faithfulness | Distinctive Features |
|---|---|---|---|
| RAGtifier | 1.13 (4th) | 0.55 (4th) | Pinecone + BGE-M3 rerank + InstructRAG |
| TopClustRAG | 0.685 (7th) | 0.460 (2nd) | Hybrid + K-means + cluster synthesis |
| Hybrid baseline | - | - | BM25 + E5, no neural rerank |

Correctness combines coverage and relevance; faithfulness evaluates evidence grounding.

5. Dynamic Test Sets, Prompt Engineering, and Analysis

The LiveRAG Challenge uses “dynamic” test sets generated at evaluation time from synthetic question templates (DataMorgana), spanning several axes: factuality, premise, phrasing, linguistic variation, and user expertise (Fensore et al., 27 Jun 2025). Index snapshots may “move” over time to simulate web drift, requiring robust and adaptive retrieval.

Evaluation employs a range of metrics:

  • Retrieval: Mean Average Precision (MAP), nDCG@10, Recall@k, MRR.
  • Generation: ROUGE, BLEU, embedding-based cosine similarity (e.g., MiniLM), and refusal rate (fraction of properly abstained answers when context is insufficient).
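
The retrieval metrics listed above have compact per-query definitions, sketched below under two assumptions: relevance judgments are binary, and ranked lists contain no duplicate ids. MAP and MRR are the means of `average_precision` and `reciprocal_rank` over all queries.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked_ids = ["d3", "d1", "d7", "d2"]  # hypothetical system output
relevant_ids = {"d1", "d2"}            # hypothetical judgments
```

For this toy query, recall@2, reciprocal rank, and average precision all equal 0.5, because the first relevant passage sits at rank 2 and the second at rank 4.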

Prompt optimization, including few-shot and chain-of-thought (CoT) variants (e.g., DSPy toolkit), can markedly improve semantic similarity (cosine sim 0.771 vs 0.668 for base prompt), but often at the expense of over-confidence (i.e., zero refusals). Conservative base-prompt hybrids maintain a better balance between performance and answer abstention.

Vocabulary alignment between query and document is the strongest predictor of RAG success: document-similar query phrasing increases cosine similarity and reduces refusal rates, while document-distant phrasing degrades performance (Fensore et al., 27 Jun 2025). Factoid and verbose questions are more tractable than open-ended or short queries, with similar variation across user expertise and cluster definitions.

6. FinerWeb-10BT in Optical Networking (Distinct System Homonym)

Notably, “FinerWeb-10BT” also refers to a high-capacity bidirectional 10 Gb/s free-space-optical (FSO) bridge described by Honz & Schrenk (Honz et al., 21 Mar 2025). This system connects Ethernet ports over an FSO channel with advanced turbulence mitigation via wavelength-set diversity, centralized beamforming, and spectral channel sounding. It achieves robust performance (BER ≤ $10^{-9}$, full-duplex 2×10 Gb/s over 63 m links) compared with prior FSO bridges, introducing architectural innovations such as a 91-core focal-plane array, combined with PID-based feedback for beam steering, and maximal-ratio-combining across wavelength sets. The two meanings (web corpus vs. FSO bridge) are contextually disjoint but share the “10BT” label, which denotes 10 Gb/s throughput in one case and token scale in the other.

7. Significance and Influence

FineWeb-10BT and its filtered derivatives have established principled benchmarks for:

  • Evaluating retrieval architectures under scale, heterogeneity, and adversarial noise.
  • Data-centric LLM training with empirically validated, semantics-driven filtering, demonstrating improved accuracy and efficiency over traditional heuristics even with substantial data reduction (Henriksson et al., 13 Jan 2025).
  • Fair comparison of diverse RAG prompting and synthesizing approaches, including clustering, cluster-level synthesis, and LLM-based answer abstention and correctness judgment (Carmel et al., 7 Jul 2025, Bakagianni et al., 18 Jun 2025).
  • Providing a robust, openly accessible resource for future work in scalable RAG, hybrid information retrieval, and large-scale semantic evaluation.

A plausible implication is that line-level, LLM-driven filtering (as in FinerWeb-10BT) will displace coarse heuristics as the dominant approach in the coming generation of model-centric data curation pipelines. Similarly, the retrieval and synthesis architectures tested on FineWeb-10BT may become de facto baselines for high-fidelity, trustworthy RAG systems.
