WebMainBench-Structured Benchmark
- WebMainBench-Structured is a comprehensive benchmark for evaluating main-content extraction that combines scale, diversity, and rigorous human annotation.
- It comprises 7,887 web pages from over 5,400 unique domains, addressing the limited scale and the narrow layout and language coverage of earlier benchmarks.
- The benchmark utilizes semantic block annotation, HTML simplification, and ROUGE-N F1 metrics to provide a realistic and challenging evaluation framework.
WebMainBench-Structured is a comprehensive, gold-standard benchmark designed for main-content extraction from a large and diverse corpus of web pages. Developed in the context of evaluating and advancing automated HTML content extraction systems, WebMainBench addresses limitations in previous benchmarks by providing scale, diversity, rigorous annotation, and fine-grained metadata. The dataset and its associated evaluation protocol serve as critical infrastructure for measuring progress in web content extraction, document parsing, and LM-based information retrieval tasks.
1. Dataset Scope and Motivation
WebMainBench contains 7,887 web pages, each annotated with human-verified main-content extraction labels. The dataset was developed to address the inadequacies of existing evaluation sets, which were limited in both scale (≤1,400 pages) and scope (narrow content or layout diversity). WebMainBench is approximately seven times larger than previous public alternatives and encompasses substantial diversity in layout (e.g., articles, tables, code blocks, math, forums), language (English, Chinese, and others), and difficulty (“easy,” “medium,” “hard”). Each sample is paired with rich metadata, enabling detailed stratified and comparative analyses.
Primary intended applications include precise benchmarking of boilerplate removal and main-content extraction methods, and stress-testing extraction systems across a broad range of layout and content pathologies (e.g., multilingual pages, documents with complex structure, or conversational forums) (Liu et al., 28 Nov 2025).
2. Data Collection and Annotation Process
The dataset was constructed as follows:
- Source Composition: 90% of entries were randomly sampled from Common Crawl, targeting the diverse “long tail” of the web; 10% were drawn from top-ranked sites (using Chinaz Alexa rankings) to ensure representation from professionally engineered home pages.
- Domain Diversity: The dataset spans 5,434 unique top-level domains and 5,904 unique second-level domains.
- Content Types: The corpus includes news, blogs, Q&A and forum threads, product and “how-to” pages, and subtypes containing tables, code, math, and conversational interfaces.
- Language Annotation: Language metadata is provided as either “en” or “non_en,” assigned using GPT-5. The pipeline intentionally balances English, Chinese, and other languages, with the benchmark reflecting this distribution.
- Block-Level Annotation: Annotation is performed at the level of “semantic blocks”—contiguous HTML segments typically rendering as discrete units (e.g., paragraphs, table bodies, list items). For the training set, each block receives a binary label: “main” (primary content) or “other” (boilerplate, navigation, ads).
- Annotation Guidelines: Material is labeled according to contextual integrity—integral article content (e.g., abstracts, reference lists) is included; extraneous elements such as sidebars, footers, and auto-generated metadata (e.g., share buttons, view counts) are excluded.
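To make the guidelines concrete, the following hypothetical example shows block-level labels of the kind described above (the block HTML and record structure are invented for illustration; the paper's actual annotation format is not reproduced here):

```python
# Hypothetical illustration of block-level binary annotation; the block
# contents and record layout are invented, not the paper's actual format.
annotated_blocks = [
    {"block": "<p>First paragraph of the article…</p>", "label": "main"},
    {"block": "<table>…reference list…</table>", "label": "main"},        # integral content
    {"block": "<ul><li>Home</li><li>About</li></ul>", "label": "other"},  # navigation
    {"block": "<div>Share · 1,204 views</div>", "label": "other"},        # auto-generated metadata
]
```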
A plausible implication is that by combining random and curated sources, and annotating at the semantic-block level, WebMainBench provides a challenging and realistic testbed for extraction methods intended to generalize to the open web.
3. Formal Structure and Data Schema
Each entry in WebMainBench is stored as a JSON object (one per line in a JSONL file). The typical schema is:
```json
{
  "track_id": "XXXX",
  "html": "<html>…</html>",            // raw page HTML
  "main_html": "<html>…</html>",       // ground-truth subtree
  "convert_main_content": "# …",       // main content as Markdown
  "meta": {
    "language": "en",                  // "en" or "non_en"
    "style": "Normal",                 // "Normal" or "Conversational"
    "level": "easy",                   // "easy" / "medium" / "hard"
    "table": "without",                // "with" / "without"
    "code": "without",                 // "with" / "without"
    "equation": "without"              // "with" / "without"
  }
}
```
Key components:
- `html`: Full HTML of the source page.
- `main_html`: Annotated subtree corresponding to the main content.
- `convert_main_content`: Ground-truth main content converted to Markdown for textual comparison.
- `meta`: Metadata for filtering or stratified reporting (language, style, difficulty, presence of tables/code/equations).
No official train/dev/test division is provided; the benchmark is intended solely for evaluation.
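Given the JSONL layout above, loading and stratifying the benchmark is straightforward. The sketch below is illustrative (the file name `webmainbench.jsonl` is an assumption; the field names follow the schema):

```python
# Minimal sketch: load the benchmark and slice it by metadata for
# stratified reporting. The file name is an assumption.
import json

def load_benchmark(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

samples = load_benchmark("webmainbench.jsonl")

# Example stratum: "hard" non-English pages that contain tables.
hard_non_en_tables = [
    s for s in samples
    if s["meta"]["level"] == "hard"
    and s["meta"]["language"] == "non_en"
    and s["meta"]["table"] == "with"
]
print(len(hard_non_en_tables), "pages in this stratum")
```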
The extraction task is cast as a sequence classification problem over semantic blocks $(b_1, \dots, b_n)$, with the model predicting a label $y_i \in \{\text{main}, \text{other}\}$ for each block $b_i$.
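The task contract can be sketched as follows (the classifier interface is an assumption; any per-block model fits this shape):

```python
# Sketch of the block-classification contract: predict y_i for each
# semantic block b_i and keep only the blocks labeled "main".
from typing import Callable, List

def extract_main(blocks: List[str],
                 classify: Callable[[str], str]) -> List[str]:
    labels = [classify(b) for b in blocks]        # y_i ∈ {"main", "other"}
    return [b for b, y in zip(blocks, labels) if y == "main"]
```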
4. Preprocessing: HTML Simplification and Semantic Block Chunking
To address inference cost and input-size bottlenecks in LLMs, WebMainBench is used in conjunction with a preprocessing pipeline that includes:
- Tag Pruning: Removal of non-content tags (`<style>`, `<script>`, `<header>`, `<aside>`, etc.).
- Attribute Filtering: Retention of only `class` and `id` attributes.
- Block-Level Chunking: Division into semantic blocks using line-break-inducing tags (`<p>`, `<ul>`, `<ol>`, `<table>`), with further splitting of overlarge lists/tables.
- Truncation: Partial reduction of overlong blocks (e.g., limiting to 200 characters or a subset of table cells).
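A minimal sketch of these steps, using BeautifulSoup, is shown below. This is not the authors' actual pipeline: the tag lists and the 200-character cap come from the description above, while the nesting behavior and splitting of overlarge lists/tables are simplified away.

```python
# Minimal sketch of the simplification pipeline described above
# (illustrative; not the paper's implementation).
from bs4 import BeautifulSoup

NON_CONTENT_TAGS = ["style", "script", "header", "aside"]  # tags pruned per the paper
BLOCK_TAGS = ["p", "ul", "ol", "table"]                    # line-break-inducing tags
MAX_BLOCK_CHARS = 200                                      # truncation limit per the paper

def simplify_html(raw_html: str) -> list[str]:
    soup = BeautifulSoup(raw_html, "html.parser")

    # 1. Tag pruning: detach non-content subtrees entirely.
    for tag in soup.find_all(NON_CONTENT_TAGS):
        tag.extract()

    # 2. Attribute filtering: keep only class and id.
    for tag in soup.find_all(True):
        tag.attrs = {k: v for k, v in tag.attrs.items() if k in ("class", "id")}

    # 3. Block-level chunking on line-break-inducing tags (a real pipeline
    #    would also handle nested blocks and split overlarge lists/tables).
    blocks = [str(t) for t in soup.find_all(BLOCK_TAGS)]

    # 4. Truncation: a crude character cap stands in for the paper's
    #    partial reduction of overlong blocks.
    return [b[:MAX_BLOCK_CHARS] for b in blocks]
```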
This process produces a token-efficient representation: denoting the average token count after simplification as $T_{\text{simp}}$ and the original as $T_{\text{orig}}$, the pipeline yields $T_{\text{simp}} \ll T_{\text{orig}}$.
This suggests improved computational tractability, which is especially significant for smaller or context-limited models, without sacrificing structural fidelity.
5. Evaluation Protocol and Metrics
The primary metric for evaluation is ROUGE-N F1, computed on the Markdown (`convert_main_content`) output:

$$\mathrm{F1}_N = \frac{2\, P_N R_N}{P_N + R_N},$$

where $P_N$ and $R_N$ represent precision and recall over $N$-gram overlaps (Jieba tokenization).
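For reference, the metric can be computed as in the sketch below (illustrative; the paper's exact choice of $N$ and implementation are not reproduced here, and Jieba tokenization for Chinese text is assumed to happen upstream):

```python
# Minimal ROUGE-N F1 over pre-tokenized prediction and reference text.
from collections import Counter

def rouge_n_f1(pred_tokens: list[str], ref_tokens: list[str], n: int = 1) -> float:
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(pred_tokens), ngrams(ref_tokens)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    if not pred or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```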
Dripper, a 0.6B-parameter model, demonstrates the following results on WebMainBench (Liu et al., 28 Nov 2025):
| Model | ROUGE-N F1 (%) |
|---|---|
| Dripper (0.6B) | 81.58 |
| Dripper + fallback (Trafilatura) | 83.13 |
As noted above, the entire dataset is reserved for test-time evaluation. The evaluation is further supported by a controlled decoding process: a custom logits processor (a finite-state machine) restricts the vocabulary and output formatting, ensuring adherence to strict JSON-based output constraints and preventing hallucinated, schema-violating output.
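The sketch below illustrates the general mechanism of vocabulary-constrained decoding with a custom logits processor. It uses a static allowed-token set as a stand-in for the paper's finite-state machine, which would update the allowed set at each decoding step:

```python
# Minimal sketch of vocabulary-constrained decoding via a custom logits
# processor (illustrative; not the paper's FSM implementation).
import torch
from transformers import LogitsProcessor

class ConstrainedVocabProcessor(LogitsProcessor):
    def __init__(self, allowed_token_ids: list[int]):
        self.allowed = torch.tensor(allowed_token_ids)

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Force every token outside the allowed set to -inf so the model
        # can only emit schema-conforming output.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed.to(scores.device)] = 0.0
        return scores + mask
```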
6. Usage Guidelines and Licensing
WebMainBench, its annotation schema, and related extraction models are to be released at https://github.com/opendatalab/MinerU-HTML. The authors specify no restrictive license in the paper but anticipate an open-source or CC-BY license, which would make the dataset suitable for academic and commercial use. Users are advised to verify the final licensing terms upon dataset release.
WebMainBench-Structured thus constitutes a pivotal benchmark for advancing and evaluating main-content extraction research, featuring large scale, fine-grained annotation, metadata for stratified analysis, and a rigorous evaluation methodology (Liu et al., 28 Nov 2025).