MinerU-HTML: Scalable Web Content Extraction
- MinerU-HTML is a model-based HTML content extraction framework that casts main-content extraction as block-level sequence labeling, producing AI-pretraining-ready corpora with high fidelity.
- It employs a two-stage formatting pipeline that converts cleaned HTML to semantically rich JSON and then to Markdown, preserving structured elements such as code, formulas, and tables; code and formulas are recovered with >90% edit similarity.
- Applied in constructing the AICC corpus, MinerU-HTML significantly enhances LLM pretraining performance, achieving a +1.08 percentage point accuracy gain over heuristic extractions.
MinerU-HTML is a model-based HTML content extraction pipeline for building high-quality, AI-pretraining-ready corpora at web scale. In contrast to heuristic extractors, it treats main-content extraction as a supervised learning problem over HTML blocks and couples the extractor with a formatting stage that preserves structured elements such as code, formulas, and tables. The sections below examine the extraction approach, its evaluation, the resulting corpus construction, and the downstream impact on LLM pretraining.
1. Model-Based HTML Parsing: Motivation and Core Approach
HTML-to-text extraction is a critical bottleneck in constructing web-scale datasets for LLM pretraining. Historically, systems such as Trafilatura and Resiliparse have employed heuristic methods, relying on text-density features or DOM traversal rules. These systems frequently fail to preserve structured elements (e.g., formulas, code blocks, tables) due to their limited semantic awareness, resulting in significant information loss—MathML is often discarded, code indentation and backtick delimiters are removed, and tables are flattened with their row/column relationships destroyed (Ma et al., 20 Nov 2025).
MinerU-HTML addresses this by treating HTML extraction as a supervised sequence labeling task at the block level. Each HTML document is segmented into an ordered sequence of blocks, and each block receives a binary label: 1 for "main content", 0 for "boilerplate/other." Labeling is performed by a dedicated classifier fine-tuned from Qwen3-0.6B (24 layers, 2,048 hidden units, 32K context, 0.6B parameters, multilingual). The model's decision space is strictly controlled: only well-formed JSON tokens encoding the binary labels are permitted at generation time, enforced by a constrained finite-state-machine decoder. The training objective is the cross-entropy loss summed over all blockwise predictions.
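A minimal sketch of this formulation, assuming hypothetical helpers `segment_blocks` and `constrained_decode` (the released tool's segmentation and decoding are more elaborate):

```python
# Illustration of blockwise binary labeling with constrained decoding.
# This is a simplified stand-in, not the released MinerU-HTML implementation.
import json
import re

def segment_blocks(html: str) -> list[str]:
    # Toy segmentation: split on closing block-level tags. The real system
    # segments the parsed DOM rather than raw markup.
    return [b.strip() for b in re.split(r"</(?:p|div|pre|table|li)>", html) if b.strip()]

def constrained_decode(per_block_logits: list[tuple[float, float]]) -> str:
    # Stand-in for the FSM-constrained decoder: for each block only the
    # tokens '0' or '1' are reachable, so the output is always valid JSON.
    labels = [int(main > boiler) for boiler, main in per_block_logits]
    return json.dumps(labels)

blocks = segment_blocks("<p>Main article text.</p><div>nav | footer</div>")
scores = [(0.1, 2.3), (1.8, 0.2)]          # hypothetical (boilerplate, main) scores
labels = json.loads(constrained_decode(scores))
main_content = [b for b, y in zip(blocks, labels) if y == 1]
print(main_content)   # -> ['<p>Main article text.']
```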
2. Two-Stage Formatting and Semantic Preservation
Following extraction, MinerU-HTML employs a two-stage formatting pipeline, which first semantically labels extracted blocks, then produces AI-ready Markdown.
- Stage 1 translates the cleaned HTML into an ordered JSON list, with an explicit type label for each content element: `{ "type": T, "content": { ... } }`. Supported types include title, paragraph, list, code block, inline code, formula (inline/display), table, image, video, and audio. Specialized detectors accurately group `<pre>`/`<code>` regions, discriminate between Markdown-eligible and complex tables (the latter are retained as HTML), and wrap formulas in LaTeX delimiters as either inline or display math, with language inference for code blocks and LaTeX symbol preservation for math.
- Stage 2 renders the JSON list into Markdown via a consistent type-to-Markdown mapping (a fuller sketch follows below). Example (from pseudocode):

````python
for element in content_list:
    if element['type'] == 'code':
        markdown.append('```' + element['content']['lang'])
        markdown.append(element['content']['raw'])
        markdown.append('```')
````

This mapping, together with the corresponding formula branch, ensures high-fidelity preservation of code and formulas.
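A minimal self-contained sketch of such a renderer, covering code, formulas, and tables; the field names (`lang`, `raw`, `latex`, `display`, `html`) are illustrative assumptions rather than the released tool's exact schema:

````python
# Sketch of a Stage 2 type-to-Markdown renderer (assumed field names).
def render_element(element: dict) -> str:
    t, c = element['type'], element['content']
    if t == 'code':
        # Re-emit fenced code with the inferred language tag.
        return '```{}\n{}\n```'.format(c.get('lang', ''), c['raw'])
    if t == 'formula':
        # Display formulas get $$...$$, inline formulas get $...$.
        return ('$$' + c['latex'] + '$$') if c.get('display') else ('$' + c['latex'] + '$')
    if t == 'table':
        # Complex tables are retained as HTML; Markdown-eligible ones would
        # instead be converted to pipe tables.
        return c['html']
    return c.get('text', '')

print(render_element({'type': 'formula',
                      'content': {'latex': r'E = mc^2', 'display': True}}))
````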
3. Benchmarks, Evaluation, and Structured Element Recovery
MinerU-HTML's efficacy is validated using MainWebBench, a set of 7,887 human-annotated HTML pages covering both the general Common Crawl long tail and complex layouts from Alexa's top-tier domains. Each page is annotated with rich metadata (language, style, structured content types). Evaluation emphasizes not just ROUGE-N F1 overlap (for main content), but also strict metrics for structural preservation:
- EditSim (edit-distance-based similarity) for code and formulas; a common formulation is sketched after this list.
- TEDS (tree-edit distance-based similarity) for tables.
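The paper's exact normalization is not reproduced here; a standard definition of edit similarity on a 0-to-1 scale, consistent with the reported numbers, is

$$\mathrm{EditSim}(s, r) = 1 - \frac{\mathrm{Lev}(s, r)}{\max(|s|, |r|)},$$

where $s$ is the extracted string, $r$ the reference, and $\mathrm{Lev}$ the Levenshtein edit distance.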
Comparative results:
| Extractor | Main Content F1 | Code EditSim | Formula EditSim | Table TEDS |
|---|---|---|---|---|
| Trafilatura | 0.624 | 0.1305 | 0.6107 | 0.3405 |
| Resiliparse | 0.623 | 0.0641 | 0.6778 | 0.0227 |
| MinerU-HTML | 0.818 | 0.9093 | 0.9399 | 0.7388 |
MinerU-HTML outperforms heuristics by large margins across all fidelity metrics (Ma et al., 20 Nov 2025). Structured element preservation is particularly notable: code and formulas are restored with >90% edit similarity.
4. Corpus Construction and Scaling for LLM Pretraining
The MinerU-HTML pipeline underlies the construction of AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus. The system processes two Common Crawl snapshots by first clustering pages according to their DOM templates. For each template cluster, a representative page is run through MinerU-HTML's sequence-labeling inference (on GPU), and the resulting labels are distilled into XPath/CSS rules that handle the remaining pages at CPU scale; only 0.4% of pages require LM inference.
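A minimal sketch of this GPU/CPU split, assuming a hypothetical `distill_rules_with_lm` helper and a simple tag-skeleton fingerprint (the production clustering and rule distillation are more sophisticated):

```python
# Sketch of template-level caching: one LM labeling pass per DOM template,
# cached XPath rules for all other pages sharing that template.
import hashlib
from lxml import html as lxml_html

rule_cache: dict[str, list[str]] = {}  # template signature -> main-content XPaths

def template_signature(page_html: str) -> str:
    # Fingerprint the tag skeleton only, so pages built from the same
    # template map to the same cache key regardless of their text.
    tree = lxml_html.fromstring(page_html)
    skeleton = ",".join(el.tag for el in tree.iter() if isinstance(el.tag, str))
    return hashlib.sha256(skeleton.encode()).hexdigest()

def distill_rules_with_lm(page_html: str) -> list[str]:
    # Placeholder for the expensive path: run the sequence-labeling model on
    # this page and turn its positive blocks into XPath selectors.
    return ["//article//p"]  # stand-in rule for illustration

def extract_main_text(page_html: str) -> list[str]:
    sig = template_signature(page_html)
    if sig not in rule_cache:           # rare path, ~0.4% of pages
        rule_cache[sig] = distill_rules_with_lm(page_html)
    tree = lxml_html.fromstring(page_html)
    return [el.text_content().strip() for xp in rule_cache[sig] for el in tree.xpath(xp)]

print(extract_main_text("<html><body><article><p>Hello</p></article><nav>menu</nav></body></html>"))
```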
Post-extraction, both AICC and baseline TfCC (Trafilatura-extracted) corpora undergo identical five-stage post-processing: SHA-256 deduplication, language ID filtering (FastText), quality filtering (Gopher/LLM rules), safety filtering, and fuzzy deduplication (MinHash + LSH). The final sizes (after filtering) are 372B tokens (AICC) and 317B tokens (TfCC), with AICC documents being on average 1.16× longer and preferred by an LLM judge in 72% of pairwise comparisons.
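As an illustration of the final fuzzy-deduplication stage, the sketch below uses the `datasketch` library's MinHash + LSH; the shingle size, permutation count, and similarity threshold shown are assumptions, not the reported AICC settings:

```python
# Near-duplicate filtering with MinHash + LSH (illustrative parameters).
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # Character 5-gram shingles; the real pipeline may shingle differently.
    for shingle in {text[i:i + 5] for i in range(max(len(text) - 4, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)   # assumed Jaccard threshold
docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumps over the lazy dog!",  # near-duplicate of d1
    "d3": "a completely different document about web corpora",
}
kept = []
for key, text in docs.items():
    m = doc_minhash(text)
    if not lsh.query(m):        # no previously kept near-duplicate found
        lsh.insert(key, m)
        kept.append(key)
print(kept)   # 'd2' is dropped as a near-duplicate of 'd1'
```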
5. Downstream Impact on LLM Pretraining
In controlled pretraining experiments (1.5B-parameter transformer, 62B-token training budget, strict parity on filtering and training), MinerU-HTML’s corpus (AICC) produces measurable, statistically significant downstream gains:
- Overall accuracy (13 benchmarks):
  - AICC: 50.82% (Δ = +1.08 pp over TfCC)
  - TfCC: 49.74%
  - FineWeb: 49.61%
  - RefinedWeb: 49.13%
- Category breakdown:
| Category | AICC | TfCC | FineWeb | RefinedWeb |
|---|---|---|---|---|
| General Knowledge | 47.54% | 45.61% | 46.86% | 44.57% |
| Reasoning | 59.83% | 59.34% | 60.69% | 59.43% |
| Reading Comprehension | 42.37% | 42.02% | 36.68% | 41.10% |
Performance gains persist throughout training, with models trained on higher-quality extraction consistently outperforming those trained on heuristically extracted content (Ma et al., 20 Nov 2025). This provides direct empirical evidence that structural preservation during extraction is as impactful for LLM capabilities as aggressive web corpus filtering.
6. Best Practices, Limitations, and Future Directions
- HTML extraction should not be treated as a solved step—improvements in extraction can have effects on par with those from improved filtering.
- Model-based sequence labeling, typified by MinerU-HTML, robustly restores code, tables, and formulas, and allows further improvement via scaling and better supervision.
- Future work includes integrating JavaScript/SPA rendering, learned rather than heuristic cluster assignment, and extension to multimodal extraction (images, video).
- Released artifacts include the MinerU-HTML extraction tool, MainWebBench benchmark, and AICC corpus (Ma et al., 20 Nov 2025).
This suggests that further aligning extraction pipeline quality with LLM requirements may raise the performance ceiling for model pretraining. A plausible implication is the need for dynamic, extensible extractors that evolve with the complexity of web layouts and markup conventions.
References:
- "AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser" (Ma et al., 20 Nov 2025).
- "Master of Web Puppets: Abusing Web Browsers for Persistent and Stealthy Computation" (Papadopoulos et al., 2018).
- "A first look at browser-based Cryptojacking" (Eskandari et al., 2018).