WebMainBench-Structured Benchmark
- WebMainBench-Structured is a comprehensive benchmark for evaluating main-content extraction that combines scale, diversity, and rigorous human annotation.
- It comprises 7,887 web pages from over 5,400 unique domains, addressing the limited scale and the narrow layout and language coverage of earlier benchmarks.
- The benchmark utilizes semantic block annotation, HTML simplification, and ROUGE-N F1 metrics to provide a realistic and challenging evaluation framework.
WebMainBench-Structured is a comprehensive, gold-standard benchmark designed for main-content extraction from a large and diverse corpus of web pages. Developed in the context of evaluating and advancing automated HTML content extraction systems, WebMainBench addresses limitations in previous benchmarks by providing scale, diversity, rigorous annotation, and fine-grained metadata. The dataset and its associated evaluation protocol serve as critical infrastructure for measuring progress in web content extraction, document parsing, and LM-based information retrieval tasks.
1. Dataset Scope and Motivation
WebMainBench contains 7,887 web pages, each annotated with human-verified main-content extraction labels. The dataset was developed to address the inadequacies of existing evaluation sets, which were limited in both scale (≤1,400 pages) and scope (narrow content or layout diversity). WebMainBench is approximately seven times larger than previous public alternatives and encompasses substantial diversity in layout (e.g., articles, tables, code blocks, math, forums), language (English, Chinese, and others), and difficulty (“easy,” “medium,” “hard”). Each sample is paired with rich metadata, enabling detailed stratified and comparative analyses.
Primary intended applications include precise benchmarking of boilerplate removal and main-content extraction methods, and stress-testing extraction systems across a broad range of layout and content pathologies (e.g., multilingual pages, documents with complex structure, or conversational forums) (Liu et al., 28 Nov 2025).
2. Data Collection and Annotation Process
The dataset was constructed as follows:
- Source Composition: 90% of entries were randomly sampled from Common Crawl, targeting the diverse “long tail” of the web; 10% were drawn from top-ranked sites (using Chinaz Alexa rankings) to ensure representation from professionally engineered home pages.
- Domain Diversity: The dataset spans 5,434 unique top-level domains and 5,904 unique second-level domains.
- Content Types: The corpus includes news, blogs, Q&A and forum threads, product and “how-to” pages, and subtypes containing tables, code, math, and conversational interfaces.
- Language Annotation: Language metadata is provided as either “en” or “non_en,” assigned using GPT-5. The pipeline intentionally balances English, Chinese, and other languages, with the benchmark reflecting this distribution.
- Block-Level Annotation: Annotation is performed at the level of “semantic blocks”—contiguous HTML segments typically rendering as discrete units (e.g., paragraphs, table bodies, list items). For the training set, each block receives a binary label: “main” (primary content) or “other” (boilerplate, navigation, ads).
- Annotation Guidelines: Material is labeled according to contextual integrity—integral article content (e.g., abstracts, reference lists) is included; extraneous elements such as sidebars, footers, and auto-generated metadata (e.g., share buttons, view counts) are excluded.
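To make the guidelines concrete, the following hypothetical example shows block-level labels of the kind described above (the block HTML and record structure are invented for illustration; the paper's actual annotation format is not reproduced here):

```python
# Hypothetical illustration of block-level binary annotation; the block
# contents and record layout are invented, not the paper's actual format.
annotated_blocks = [
    {"block": "<p>First paragraph of the article…</p>", "label": "main"},
    {"block": "<table>…reference list…</table>", "label": "main"},        # integral content
    {"block": "<ul><li>Home</li><li>About</li></ul>", "label": "other"},  # navigation
    {"block": "<div>Share · 1,204 views</div>", "label": "other"},        # auto-generated metadata
]
```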
A plausible implication is that by combining random and curated sources, and annotating at the semantic-block level, WebMainBench provides a challenging and realistic testbed for extraction methods intended to generalize to the open web.
3. Formal Structure and Data Schema
Each entry in WebMainBench is stored as a JSON object (one per line in a JSONL file). The typical schema is:
```json
{
  "track_id": "XXXX",
  "html": "<html>…</html>",            // raw page HTML
  "main_html": "<html>…</html>",       // ground-truth subtree
  "convert_main_content": "# …",       // main content as Markdown
  "meta": {
    "language": "en",                  // "en" or "non_en"
    "style": "Normal",                 // "Normal" or "Conversational"
    "level": "easy",                   // "easy" / "medium" / "hard"
    "table": "without",                // "with" / "without"
    "code": "without",                 // "with" / "without"
    "equation": "without"              // "with" / "without"
  }
}
```
Key components:
- `html`: Full HTML of the source page.
- `main_html`: Annotated subtree corresponding to the main content.
- `convert_main_content`: Ground-truth main content converted to Markdown for textual comparison.
- `meta`: Metadata for filtering or stratified reporting (language, style, difficulty, presence of tables/code/equations).
No official train/dev/test division is provided; the benchmark is intended solely for evaluation.
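Given the JSONL layout above, loading and stratifying the benchmark is straightforward. The sketch below is illustrative (the file name `webmainbench.jsonl` is an assumption; the field names follow the schema):

```python
# Minimal sketch: load the benchmark and slice it by metadata for
# stratified reporting. The file name is an assumption.
import json

def load_benchmark(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

samples = load_benchmark("webmainbench.jsonl")

# Example stratum: "hard" non-English pages that contain tables.
hard_non_en_tables = [
    s for s in samples
    if s["meta"]["level"] == "hard"
    and s["meta"]["language"] == "non_en"
    and s["meta"]["table"] == "with"
]
print(len(hard_non_en_tables), "pages in this stratum")
```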
The extraction task is cast as a sequence classification problem over semantic blocks $(b_1, \dots, b_n)$, with the model predicting a label $y_i \in \{\text{main}, \text{other}\}$ for each block $b_i$.
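The task contract can be sketched as follows (the classifier interface is an assumption; any per-block model fits this shape):

```python
# Sketch of the block-classification contract: predict y_i for each
# semantic block b_i and keep only the blocks labeled "main".
from typing import Callable, List

def extract_main(blocks: List[str],
                 classify: Callable[[str], str]) -> List[str]:
    labels = [classify(b) for b in blocks]        # y_i ∈ {"main", "other"}
    return [b for b, y in zip(blocks, labels) if y == "main"]
```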
4. Preprocessing: HTML Simplification and Semantic Block Chunking
To address inference cost and input-size bottlenecks in LLMs, WebMainBench is used in conjunction with a preprocessing pipeline that includes:
- Tag Pruning: Removal of non-content tags (`<style>`, `<script>`, `<header>`, `<aside>`, etc.).
- Attribute Filtering: Retention of only `class` and `id` attributes.
- Block-Level Chunking: Division into semantic blocks using line-break-inducing tags (`<p>`, `<ul>`, `<ol>`, `<table>`), with further splitting of overlarge lists/tables.
- Truncation: Partial reduction of overlong blocks (e.g., limiting to 200 characters or a subset of table cells).
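A minimal sketch of these steps, using BeautifulSoup, is shown below. This is not the authors' actual pipeline: the tag lists and the 200-character cap come from the description above, while the nesting behavior and splitting of overlarge lists/tables are simplified away.

```python
# Minimal sketch of the simplification pipeline described above
# (illustrative; not the paper's implementation).
from bs4 import BeautifulSoup

NON_CONTENT_TAGS = ["style", "script", "header", "aside"]  # tags pruned per the paper
BLOCK_TAGS = ["p", "ul", "ol", "table"]                    # line-break-inducing tags
MAX_BLOCK_CHARS = 200                                      # truncation limit per the paper

def simplify_html(raw_html: str) -> list[str]:
    soup = BeautifulSoup(raw_html, "html.parser")

    # 1. Tag pruning: detach non-content subtrees entirely.
    for tag in soup.find_all(NON_CONTENT_TAGS):
        tag.extract()

    # 2. Attribute filtering: keep only class and id.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ("class", "id")}

    # 3. Block-level chunking on line-break-inducing tags (a real pipeline
    #    would also handle nested blocks and split overlarge lists/tables).
    blocks = [str(t) for t in soup.find_all(BLOCK_TAGS)]

    # 4. Truncation: a crude character cap stands in for the paper's
    #    partial reduction of overlong blocks.
    return [b[:MAX_BLOCK_CHARS] for b in blocks]
```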
This process produces a token-efficient representation: denoting the average token count after simplification as $T_{\text{simp}}$ and the original as $T_{\text{orig}}$, the pipeline yields $T_{\text{simp}} \ll T_{\text{orig}}$.
This suggests improved computational tractability, which is especially significant for smaller or context-limited models, without sacrificing structural fidelity.
5. Evaluation Protocol and Metrics
The primary metric for evaluation is ROUGE-N F1, computed on the Markdown (`convert_main_content`) output:

$$\mathrm{F1}_N = \frac{2\, P_N R_N}{P_N + R_N},$$

where $P_N$ and $R_N$ represent precision and recall over $N$-gram overlaps (Jieba tokenization).
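For reference, the metric can be computed as in the sketch below (illustrative; the paper's exact choice of $N$ and implementation are not reproduced here, and Jieba tokenization for Chinese text is assumed to happen upstream):

```python
# Minimal ROUGE-N F1 over pre-tokenized prediction and reference text.
from collections import Counter

def rouge_n_f1(pred_tokens: list[str], ref_tokens: list[str], n: int = 1) -> float:
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(pred_tokens), ngrams(ref_tokens)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if not pred or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```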
Dripper, a 0.6B-parameter model, demonstrates the following results on WebMainBench (Liu et al., 28 Nov 2025):
| Model | ROUGE-N F1 (%) |
|---|---|
| Dripper (0.6B) | 81.58 |
| Dripper + fallback (Trafilatura) | 83.13 |
As noted above, the entire dataset is reserved for test-time evaluation. The evaluation is further supported by a controlled decoding process: a custom logits processor (a finite-state machine) restricts the vocabulary and output formatting, ensuring adherence to strict JSON-based output constraints and preventing hallucinated, schema-violating output.
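The sketch below illustrates the general mechanism of vocabulary-constrained decoding with a custom logits processor. It uses a static allowed-token set as a stand-in for the paper's finite-state machine, which would update the allowed set at each decoding step:

```python
# Minimal sketch of vocabulary-constrained decoding via a custom logits
# processor (illustrative; not the paper's FSM implementation).
import torch
from transformers import LogitsProcessor

class ConstrainedVocabProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids: list[int]):
        self.allowed = torch.tensor(allowed_token_ids)

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Force every token outside the allowed set to -inf so the model
        # can only emit schema-conforming output.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed.to(scores.device)] = 0.0
        return scores + mask
```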
6. Usage Guidelines and Licensing
WebMainBench, its annotation schema, and related extraction models are to be released at https://github.com/opendatalab/MinerU-HTML. The authors specify no restrictive license in the paper but anticipate an open-source or CC-BY license, which would make the dataset suitable for academic and commercial use. Users are advised to verify the final licensing terms upon dataset release.
WebMainBench-Structured thus constitutes a pivotal benchmark for advancing and evaluating main-content extraction research, featuring large scale, fine-grained annotation, metadata for stratified analysis, and a rigorous evaluation methodology (Liu et al., 28 Nov 2025).