MinerU Parser: High-Precision PDF Extraction
- MinerU Parser is an open-source, multi-module system designed for high-precision content extraction from diverse PDF documents including academic papers and textbooks.
- It integrates fine-tuned vision-language models with rule-based pre- and postprocessing to accurately detect layouts, formulas, and tables.
- Successive versions, notably MinerU2.5, enhance performance through a decoupled two-stage parsing pipeline that boosts throughput and accuracy.
MinerU Parser is an open-source, multi-module system designed for high-precision content extraction from heterogeneous PDF documents. It combines fine-tuned vision-language models, rule-based pre- and postprocessing, and a modular inference pipeline, and is engineered for robust performance across academic papers, textbooks, scanned images, and structurally complex or visually noisy documents. Successive versions, notably MinerU and MinerU2.5, have established strong performance baselines and state-of-the-art benchmarks through decoupled architecture and extensive training on curated, multi-type document corpora (Wang et al., 27 Sep 2024, Niu et al., 26 Sep 2025).
1. System Architecture and Pipeline
MinerU’s architecture implements a four-stage sequential pipeline:
- Document Preprocessing: The input PDF is validated for format integrity (type, encryption, password protection) and classified as either “text-based” or “scanned”. Supported languages include English and Chinese. PyMuPDF provides low-level I/O and metadata management.
- Document Content Parsing: The core relies on the PDF-Extract-Kit, which bundles five specialized models:
- Layout Detection: LayoutLMv3-base with additional detection head.
- Formula Detection: YOLOv8 backbone with detection head.
- Table Recognition: TableMaster (multi-stage) and StructEqTable (end-to-end Transformer).
- Formula Recognition: UniMERNet architecture.
- OCR: PaddleOCR for regions not otherwise classified.
- Post-Processing: This stage resolves overlapping bounding boxes using containment- and IoU-based heuristics. Grouping into human reading order follows a “top→bottom, left→right” strategy.
- Format Conversion: Intermediate output is serialized as JSON; rendered output is available in Markdown or custom JSON schemas. Optional support for cropped imagery of tables and figures is included.
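The four stages compose into a simple sequential driver. A minimal sketch follows; the stage functions `detect_layout` and `render_markdown` are illustrative stand-ins rather than MinerU's actual API, while `resolve_overlaps` and `group_by_reading_order` are given in Section 3:

```python
# Hypothetical driver for the four-stage pipeline; stage functions are
# illustrative stand-ins, not MinerU's real interfaces.
import fitz  # PyMuPDF, which MinerU uses for low-level I/O

def parse_pdf(path: str) -> str:
    # Stage 1: preprocessing -- validate and classify the document.
    doc = fitz.open(path)
    if doc.is_encrypted:
        raise ValueError("encrypted PDFs require a password")
    is_scanned = all(not page.get_text().strip() for page in doc)

    # Stage 2: content parsing -- run the PDF-Extract-Kit models per page.
    blocks = []
    for page in doc:
        pixmap = page.get_pixmap()  # rasterize for the vision models
        blocks += detect_layout(pixmap, ocr=is_scanned)

    # Stage 3: post-processing -- de-overlap and order (see Section 3).
    blocks = group_by_reading_order(resolve_overlaps(blocks))

    # Stage 4: format conversion.
    return render_markdown(blocks)
```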
MinerU2.5 introduces a coarse-to-fine, two-stage parsing schema:
- Stage I: Performs global layout analysis on thumbnails (e.g., 1036×1036) using NaViT + Patch Merger + LLM components.
- Stage II: Executes targeted high-resolution recognition (text/formula/table) on native-resolution image crops, using context-specific prompts and the same VLM pipeline (Niu et al., 26 Sep 2025).
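A short sketch of the decoupled inference; the `vlm.generate` interface and the box/crop handling are assumptions, while the prompt strings follow those listed in Section 2:

```python
# Hypothetical sketch of MinerU2.5's coarse-to-fine inference; `vlm.generate`
# and the crop helpers are illustrative, not the released API.
PROMPTS = {"text": "Text Recognition:", "table": "Table Recognition:"}

def parse_page(page_image, vlm):
    # Stage I: global layout analysis on a fixed-size thumbnail.
    thumbnail = page_image.resize((1036, 1036))
    layout = vlm.generate(thumbnail, prompt="Layout Detection:")

    # Stage II: targeted recognition on native-resolution crops,
    # with a content-specific prompt per block type.
    results = []
    for block in layout:
        crop = page_image.crop(block.xyxy)  # native resolution, no downscaling
        prompt = PROMPTS.get(block.type, "Text Recognition:")
        results.append((block, vlm.generate(crop, prompt=prompt)))
    return results
```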
2. Modeling Approaches and Training Corpus
PDF-Extract-Kit (MinerU)
- Layout Detection: LayoutLMv3-base model fine-tuned for document structural categories. Uses standard object-detection loss:
  $\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}$, where $\mathcal{L}_{\text{cls}}$ is the classification term (cross-entropy) and $\mathcal{L}_{\text{reg}}$ is either an $L_1$ or a GIoU regression loss.
- Formula Detection: YOLOv8, optimized for inline/display/ignore classes; loss $\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{dfl}}$ (classification, CIoU box regression, and distribution-focal terms, the standard YOLOv8 objective).
- Formula Recognition: UniMERNet (ResNet/Swin encoder + Transformer decoder) with seq2seq cross-entropy on LaTeX tokens.
- Table Recognition: TableMaster (multi-stage grid/OCR) and StructEqTable (direct HTML/LaTeX markup); trained on PubTabNet and DocGenome.
- OCR: PaddleOCR (detection + CRNN).
Training involves ∼150K diverse document pages, annotated for layout, formulas, and more. Iterative sampling upweights mis-classified categories.
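The seq2seq recognition objective above is ordinary token-level cross-entropy; a minimal PyTorch sketch, with the tensor shapes and `pad_id` convention as assumptions:

```python
import torch
import torch.nn.functional as F

def seq2seq_ce_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Cross-entropy over LaTeX tokens, as in UniMERNet-style training.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) gold token ids, with pad_id at padded positions
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*seq_len, vocab)
        targets.reshape(-1),                  # (batch*seq_len,)
        ignore_index=pad_id,                  # padding contributes no gradient
    )
```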
MinerU2.5 Model Suite
- Backbone: NaViT vision encoder (675M), Qwen2-Instruct LLM (500M), integrated via pixel-unshuffle patch merging and MLP projection.
- Prompts: Decoupled, stage-specific—e.g., “Layout Detection:”, “Text Recognition:”, “Table Recognition:”.
- Training Corpus: Over 6.9M pretraining samples spanning layout, text, formula, and table blocks. Auto-annotation/refinement leverages prior MinerU2, Qwen2.5-VL-72B, UniMERNet, and a dedicated QA revision loop. The final fine-tuning set includes 630K hard-case annotations mined via IMIC.
3. Algorithms, Mathematical Formulations, and Pseudocode
- Intersection over Union (IoU): $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$, the ratio of intersection area to union area of two boxes.
- Overlap Resolution:
```python
from itertools import combinations

def resolve_overlaps(bboxes):
    # Drop any box fully contained in another, then shrink partially
    # overlapping text boxes so they no longer intersect.
    filtered = []
    for A in bboxes:
        if not any(contains(B, A) for B in bboxes if B is not A):
            filtered.append(A)
    for A, B in combinations(filtered, 2):
        if IoU(A, B) > 0 and A.type == 'text' and B.type == 'text':
            shrink_boxes(A, B)  # mutates the boxes in place
    return filtered
```
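The helper predicates are not spelled out in the source; one plausible set of implementations over axis-aligned boxes (the `BBox` container is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class BBox:
    x1: float
    y1: float
    x2: float
    y2: float
    type: str = "text"

def area(b: BBox) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)

def contains(outer: BBox, inner: BBox) -> bool:
    # True when `inner` lies entirely within `outer`.
    return (outer.x1 <= inner.x1 and outer.y1 <= inner.y1 and
            outer.x2 >= inner.x2 and outer.y2 >= inner.y2)

def IoU(a: BBox, b: BBox) -> float:
    # Intersection area divided by union area (zero when disjoint).
    ix = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    iy = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def shrink_boxes(a: BBox, b: BBox) -> None:
    # One plausible policy: trim the vertical overlap so the boxes no
    # longer intersect (the source does not give the exact rule).
    if a.y1 <= b.y1:
        a.y2 = min(a.y2, b.y1)
    else:
        b.y2 = min(b.y2, a.y1)
```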
- Reading-Order Grouping:
```python
def group_by_reading_order(bboxes):
    # Naive "top -> bottom, left -> right" ordering via the top-left corner.
    return sorted(bboxes, key=lambda b: (b.y1, b.x1))
```
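A quick usage check over the hypothetical `BBox` helpers above. Note the naive sort key works at block granularity (whole columns as single boxes); applied to line-level boxes it would interleave the columns of a multi-column page:

```python
boxes = [
    BBox(0, 0, 100, 20),     # title spanning the page
    BBox(0, 30, 48, 200),    # left column
    BBox(52, 30, 100, 200),  # right column
    BBox(2, 32, 46, 60),     # nested inside the left column -> dropped
]
ordered = group_by_reading_order(resolve_overlaps(boxes))
# ordered: title, left column, right column
```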
- MinerU2.5 Losses (both stages share the VLM's autoregressive objective):
  - Layout: cross-entropy over serialized layout tokens, $\mathcal{L}_{\text{layout}} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x_{\text{thumb}})$, with $x_{\text{thumb}}$ the downscaled page thumbnail.
  - Recognition: $\mathcal{L}_{\text{rec}} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x_{\text{crop}})$ over content tokens (text, LaTeX, or HTML), conditioned on the native-resolution crop.
4. Performance Benchmarks
MinerU Baselines (Layout, Formula)
Layout detection:

| Model | mAP | AP50 | AR50 | Comments |
|---|---|---|---|---|
| DocXchain | 52.8 | 69.5 | 77.3 | Academic papers |
| LayoutLMv3-SFT (MinerU) | 77.6 | 93.3 | 95.5 | Academic papers; highest across metrics |

Formula detection:

| Model | AP50 | AR50 | Comments |
|---|---|---|---|
| Pix2Text-MFD | 60.1 | 64.6 | Academic papers |
| YOLOv8-FT (MinerU) | 87.7 | 89.9 | Highest for formula detection |

Formula recognition:

| Model | CDM | Comments |
|---|---|---|
| Pix2tex | 0.636 | Formula recognition |
| UniMERNet | 0.968 | MinerU, best result |
MinerU2.5 (End-to-End, Efficiency)
Accuracy (end-to-end):

| Model | Params | Overall↑ | TextEdit↓ | FormulaCDM↑ | TableTEDS↑ | TableTEDS-S↑ | OrderEdit↓ |
|---|---|---|---|---|---|---|---|
| MonkeyOCR-p-3B | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |

Throughput:

| Model | Params | Pages/s | Hardware |
|---|---|---|---|
| dots.ocr | 3.0B | 0.28 | A100 80G |
| MonkeyOCR-p | 3.7B | 0.47 | A100 80G |
| MinerU2.5 | 1.2B | 2.12 | A100 80G |
MinerU2.5 achieves higher accuracy across all tasks with significantly lower parameter count and >4× throughput relative to contemporary 3–4B general-purpose systems. Ablations indicate decoupling reduces total FLOPs by approximately 10× compared to monolithic native-resolution VLMs.
5. Integration and Customization
- API and CLI Flexibility
  - Configuration is achievable through YAML files or command-line overrides (e.g., `enable_ocr`, `languages`, `drop_headers`/`drop_footers`, model selection for tables/formulas).
  - Example (Python):
```python
from mineru import MinerU

parser = MinerU(enable_ocr=True, languages=['en', 'zh'],
                drop_headers=True, drop_footers=True)
result = parser.parse('sample.pdf')
```
- Example (CLI):
```bash
mineru -i sample.pdf -o output.md --format markdown --lang en --drop-footers --enable-table-crop
```
- Batch and Pipeline Integration
- Batch processing via simple Python loops or the provided multiprocessing loader (a minimal loop is sketched after this list).
- Embedding in Retrieval-Augmented Generation (RAG) pipelines is facilitated by Markdown output.
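A minimal batch loop over the Python API shown above; the `result.markdown` accessor is an assumption, since this section does not document the return type of `parse()`:

```python
from pathlib import Path
from mineru import MinerU  # constructor arguments as in the earlier example

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)

parser = MinerU(enable_ocr=True, languages=["en"])
for pdf in sorted(Path("corpus").glob("*.pdf")):
    result = parser.parse(str(pdf))
    # `result.markdown` is assumed; Markdown output feeds directly
    # into a RAG ingestion pipeline.
    (out_dir / f"{pdf.stem}.md").write_text(result.markdown)
```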
- Custom Callbacks and Postprocessing
  - User-defined overlap resolution can be set via `parser.set_overlap_handler(my_fn)`; a hedged example follows this list.
  - For text-based (non-scanned) PDFs, `only_api_extraction=True` accelerates processing by skipping redundant region detection.
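A sketch of a custom handler. `set_overlap_handler` is named above, but its contract is not specified; the list-in/list-out signature and the reuse of the `area`/`IoU` helpers from Section 3 are assumptions:

```python
def keep_larger(boxes):
    # Hypothetical policy: on any overlap, keep only the larger box
    # rather than shrinking both.
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(IoU(box, other) == 0 for other in kept):
            kept.append(box)
    return kept

parser.set_overlap_handler(keep_larger)
```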
- Hardware Requirements
- GPUs are required for UniMERNet and YOLOv8; LayoutLMv3 and PaddleOCR operate on CPU.
6. Algorithmic and Design Innovations
MinerU and MinerU2.5 exemplify design decisions that emphasize both quality and computational efficiency:
- Fine-tuned models and annotation strategies (including IMIC-driven hard case mining and AI-augmented QA).
- Stage decoupling (global structure on thumbnails; local recognition on high-res crops) in MinerU2.5 yields substantial compute savings for negligible accuracy trade-off—e.g., pixel-unshuffle reduces Stage I FLOPs by ∼25% for <0.1% loss in F1.
- Multi-model, modular architecture allows tailored deployment according to use case and resource availability.
A plausible implication is that such modular, decoupled approaches are likely to be further generalized in document intelligence, as the trade-off between accuracy and inference cost becomes increasingly relevant in large corpus and cloud-scale processing.
7. Impact, Limitations, and Prospects
MinerU establishes open-source, high-fidelity content extraction for the research and enterprise ecosystem, setting new SOTA baselines in layout, formula, and table recognition. Its robust design is well-suited for academic, educational, and archival material digitization, as well as downstream LLM integration. Limitations include the necessity of a GPU for peak recognition accuracy on complex layouts and relatively narrow language support. Future directions could include expanding language coverage and joint learning strategies to further reduce resource consumption without sacrificing accuracy.