MainWebBench: Evaluating Web Extraction Methods

Updated 21 November 2025
  • MainWebBench is a comprehensive, human-annotated benchmark that evaluates web content extraction fidelity including preservation of structured elements.
  • It employs rigorous metrics and dual-pass human annotation to assess extraction methods across diverse real-world webpages.
  • The benchmark's scalable dataset and detailed evaluation protocols enable statistically robust comparisons that advance LLM pretraining quality.

MainWebBench is a comprehensive, human-annotated benchmark purpose-built for evaluating web content extraction methods, with an emphasis on both overall text fidelity and robust preservation of complex structured elements such as code blocks, mathematical formulas, and tables. MainWebBench addresses critical limitations in prevailing web corpus construction methodologies, which typically rely on heuristic-based extraction and lack rigorous ground-truth oversight for structured content. It provides a reproducible platform for quantifying extractor performance, offering granular annotations, rigorous metrics, and a scale suitable for statistically meaningful comparisons, thus representing a key advance in benchmarking the extraction step critical for pretraining LLMs and associated downstream applications (Ma et al., 20 Nov 2025).

1. Motivation and Objectives

Web data underpins modern LLM pretraining, but the quality of HTML-to-text extraction has long been treated as a black box, dominated by density-based heuristics. This has resulted in systematic loss or corruption of high-value structured elements (e.g., MathJax equations, code blocks, complex tables), which are indispensable for scientific, technical, and educational web content. MainWebBench was designed to:

  • Quantify overall content extraction fidelity across heterogeneous web domains.
  • Explicitly assess structured element preservation, a requirement neglected by prior benchmarks.
  • Enable rigorous, statistically robust evaluation of both traditional and model-based extraction pipelines at scale (7,887 pages) (Ma et al., 20 Nov 2025).

This focus directly addresses the absence of large-scale, human-annotated ground truth and structured-element fidelity metrics in previous benchmarks.

2. Dataset Construction and Semantic Coverage

MainWebBench was assembled from 7,887 real-world web pages drawn via a hybrid sampling strategy: 90% random selection from the Common Crawl web graph to ensure coverage across the web’s “long tail,” and 10% from top-trafficked domains (via Chinaz Alexa) to represent professionally designed sites. This yields annotations spanning 5,434 unique top-level and 5,904 second-level domains.

The dataset is deliberately heterogeneous, covering news articles, technical tutorials, academic pages, Q&A threads, blogs, and corporate content. Complexity is quantified by DOM characteristics and other page features, with subsets stratified into simple (lowest 30%), medium (middle 40%), and hard (top 30%) classes.
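
For concreteness, the 30/40/30 stratification can be reproduced with a percentile split over a per-page complexity score. The sketch below is illustrative only: it assumes a generic numeric score per page (e.g., DOM node count) and is not the benchmark's exact procedure.

```python
import numpy as np

def stratify_by_complexity(pages, scores):
    """Split pages into simple/medium/hard classes by complexity-score percentiles.

    `scores` is a hypothetical per-page complexity measure (e.g., DOM node count);
    the 30/40/30 split mirrors the stratification described above.
    """
    lo, hi = np.percentile(scores, [30, 70])
    buckets = {"simple": [], "medium": [], "hard": []}
    for page, score in zip(pages, scores):
        if score <= lo:
            buckets["simple"].append(page)
        elif score <= hi:
            buckets["medium"].append(page)
        else:
            buckets["hard"].append(page)
    return buckets
```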

Each page is annotated with:

  • Ground-truth Main-HTML (the primary content as a valid DOM subtree).
  • Markdown conversion of the Main-HTML.
  • Metadata: language, conversationality, complexity level, and presence indicators for code, tables, or equations.

An additional subset, WebMainBench-Structured (545 pages), targets structured-rich content for specialized evaluation:

  • Mathematical formulas: 257 pages
  • Code blocks: 127 pages
  • Tables: 179 pages

A fixed semantic tagset is employed for annotation: headings, paragraphs, inline and display formulas, inline code and code blocks, tables (by type), lists, images, video, and audio (Ma et al., 20 Nov 2025).
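
Taken together, a per-page annotation record along the lines described above might look as follows. The field names are illustrative assumptions for exposition, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PageAnnotation:
    """Illustrative record layout for one annotated MainWebBench page."""
    url: str
    main_html: str           # ground-truth Main-HTML (primary-content DOM subtree)
    main_markdown: str       # Markdown conversion of the Main-HTML
    language: str            # page language
    is_conversational: bool  # conversationality indicator
    complexity: str          # "simple" | "medium" | "hard"
    has_code: bool
    has_table: bool
    has_formula: bool
    semantic_tags: List[str] = field(default_factory=list)  # e.g., "heading", "table", "display_formula"
```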

3. Annotation Protocol and Principles

The annotation workflow enforces contextual integrity and human-authorship:

  1. Contextual integrity: Only the content essential to the page’s main communicative purpose (e.g., abstracts, reference lists) is included; unrelated boilerplate, ads, and navigation are excluded.
  2. Human-generated content: Narrative text, user-contributed comments, and substantive paragraphs are preserved, with programmatic metadata (timestamps, counters) omitted.

Technical procedure involves:

  • Tag-level selection using a custom web annotation interface.
  • Dual-pass annotation: an initial expert annotation and an independent review, followed by final adjudication by a senior inspector.
  • Exclusion of any page with rendering failures (Ma et al., 20 Nov 2025).

This procedure establishes a high-fidelity reference for both main content and structured element extraction.

4. Evaluation Metrics and Benchmarking Protocol

MainWebBench employs a combination of textual and structured similarity metrics:

  • ROUGE-N F1 (N=5): Measures n-gram overlap for the overall text extraction between extractor output (Markdown) and reference. Defined as

$\mathrm{ROUGE}\mbox{-}N\,F_1 = \frac{2PR}{P+R}$

where $P$ is precision and $R$ is recall over the n-gram overlap.

  • EditSim: For code blocks and formulas, based on the normalized Levenshtein distance:

$\mathrm{EditSim}(s_1, s_2) = 1 - \frac{\mathrm{EditDist}(s_1, s_2)}{\max(|s_1|, |s_2|)}$

quantifying element-level similarity (a minimal computation sketch for ROUGE-N F1 and EditSim follows this list).

  • TEDS (Tree-Edit Distance Similarity): For tables, uses DOM tree-edit distance to capture structure and content preservation:

$\mathrm{TEDS}(T_1, T_2) = 1 - \frac{\mathrm{EditDist}(T_1, T_2)}{\max(|T_1|, |T_2|)}$

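The two string-level metrics follow directly from the definitions above. The sketch below uses plain whitespace tokenization and a dynamic-programming Levenshtein distance; the benchmark's exact tokenization and element-alignment steps may differ, and TEDS (which requires a tree-edit-distance implementation over table DOMs) is omitted.

```python
from collections import Counter

def rouge_n_f1(candidate_tokens, reference_tokens, n=5):
    """ROUGE-N F1 over n-gram overlap between extractor output and reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def edit_sim(s1, s2):
    """EditSim: 1 minus the normalized Levenshtein distance between two strings."""
    if not s1 and not s2:
        return 1.0
    prev = list(range(len(s2) + 1))  # DP row for the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution / match
        prev = curr
    return 1.0 - prev[-1] / max(len(s1), len(s2))
```
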
The protocol involves, for each MainWebBench page:

  1. Extraction by candidate tool to produce Main-HTML and Markdown.
  2. ROUGE-5 F1 computation against ground truth Markdown.
  3. For structured subset: alignment and EditSim on code blocks and formulas; TEDS on tables (Ma et al., 20 Nov 2025).
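
Putting the protocol together, a minimal evaluation loop might look like the sketch below. The `extract` callable is a hypothetical stand-in for any candidate extractor, `rouge_n_f1` refers to the helper in the previous sketch, and only the overall ROUGE-5 step is shown; the structured subset additionally aligns code and formula elements for EditSim and tables for TEDS.

```python
def evaluate_extractor(extract, pages):
    """Score one candidate extraction method over annotated pages.

    `extract` maps a page's raw HTML to extracted Markdown; `pages` yields
    (raw_html, annotation) pairs where the annotation carries the ground-truth
    Markdown (see the PageAnnotation sketch earlier).
    """
    scores = []
    for raw_html, annotation in pages:
        candidate = extract(raw_html)                # step 1: candidate extraction
        scores.append(rouge_n_f1(candidate.split(),  # step 2: ROUGE-5 F1 vs. reference
                                 annotation.main_markdown.split()))
    return sum(scores) / len(scores)
```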

5. Quantitative Results and Comparative Analysis

MainWebBench supports systematic comparison of extraction methods, exemplified by the following results:

Mean ROUGE-N F1 across MainWebBench

| Extractor | All | Simple | Medium | Hard | Tables | Code | Equation | Conversational |
|---|---|---|---|---|---|---|---|---|
| Resiliparse | 0.623 | 0.710 | 0.628 | 0.530 | 0.547 | 0.647 | 0.783 | 0.535 |
| Trafilatura | 0.636 | 0.712 | 0.628 | 0.531 | 0.550 | 0.574 | 0.717 | 0.577 |
| MinerU-HTML | 0.818 | 0.884 | 0.818 | 0.754 | 0.769 | 0.837 | 0.889 | 0.767 |

Structured Element Preservation (WebMainBench-Structured)

| Extractor | Code EditSim | Formula EditSim | Table TEDS |
|---|---|---|---|
| Trafilatura | 0.131 | 0.611 | 0.341 |
| Resiliparse | 0.064 | 0.678 | 0.023 |
| MinerU-HTML | 0.909 | 0.940 | 0.739 |

These results establish that model-based extraction (MinerU-HTML) achieves substantial improvements over heuristic baselines, particularly for code, mathematics, and table fidelity. MinerU-HTML’s overall ROUGE-5 F1 (0.8182) outpaces Trafilatura (0.6358) and Resiliparse (0.6233). For structured elements, MinerU-HTML attains 0.9093 EditSim on code blocks and 0.9399 on formulas, versus Trafilatura’s 0.1305 and 0.6107, respectively. Table TEDS reaches 0.7388 (MinerU-HTML) versus 0.3405 (Trafilatura) (Ma et al., 20 Nov 2025).

6. Significance and Implications

MainWebBench demonstrates that the quality of web content extraction—especially for structured elements—is a critical determinant of downstream model performance. The evaluation suite’s rigor reveals large, previously hidden gaps between heuristic and model-driven extraction, with substantial implications for the construction of AI training corpora.

A key finding is the systematic underperformance of heuristics on code- and math-heavy pages, which suggests that advances in extraction methodology may be as impactful for LLM pretraining as aggressive filtering or deduplication. Subsequent experiments with the AICC corpus (built using MainWebBench and MinerU-HTML) support this claim, showing an improvement in average accuracy of more than 1 percentage point across 13 downstream benchmarks relative to heuristic-based extraction on the same filtered data.

The availability of MainWebBench, along with the open release of ground-truth annotations and evaluation code, enables the research community to make reproducible, meaningful advances in the study of web data extraction and its impact on broader AI applications (Ma et al., 20 Nov 2025).
