MinerU Parser: High-Precision PDF Extraction
- MinerU Parser is an open-source, multi-module system designed for high-precision content extraction from diverse PDF documents including academic papers and textbooks.
- It integrates fine-tuned vision-language models with rule-based pre- and postprocessing to accurately detect layouts, formulas, and tables.
- Successive versions, notably MinerU2.5, enhance performance through a decoupled two-stage parsing pipeline that boosts throughput and accuracy.
MinerU Parser is an open-source, multi-module system designed for high-precision content extraction from heterogeneous PDF documents. It combines fine-tuned vision-language models, rule-based pre- and postprocessing, and a modular inference pipeline, and is engineered for robust performance across academic papers, textbooks, scanned images, and structurally complex or visually noisy documents. Successive versions, notably MinerU and MinerU2.5, have established strong performance baselines and state-of-the-art benchmarks through decoupled architecture and extensive training on curated, multi-type document corpora (Wang et al., 27 Sep 2024, Niu et al., 26 Sep 2025).
1. System Architecture and Pipeline
MinerU’s architecture implements a four-stage sequential pipeline:
- Document Preprocessing: The input PDF is validated for format integrity (type, encryption, password protection) and classified as either “text-based” or “scanned”. Supported languages include English and Chinese. PyMuPDF provides low-level I/O and metadata management.
- Document Content Parsing: The core relies on the PDF-Extract-Kit, which bundles five specialized models:
- Layout Detection: LayoutLMv3-base with additional detection head.
- Formula Detection: YOLOv8 backbone with detection head.
- Table Recognition: TableMaster (multi-stage) and StructEqTable (end-to-end Transformer).
- Formula Recognition: UniMERNet architecture.
- OCR: PaddleOCR for regions not otherwise classified.
- Post-Processing: This stage resolves overlapping bounding boxes using containment- and IoU-based heuristics. Grouping into human reading order follows a “top→bottom, left→right” strategy.
- Format Conversion: Intermediate output is serialized as JSON; rendered output is available in Markdown or custom JSON schemas. Optional support for cropped imagery of tables and figures is included.
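The four stages compose into a simple sequential driver. A minimal sketch follows; the stage functions `detect_layout` and `render_markdown` are illustrative stand-ins rather than MinerU's actual API, while `resolve_overlaps` and `group_by_reading_order` are given in Section 3:

```python
# Hypothetical driver for the four-stage pipeline; stage functions are
# illustrative stand-ins, not MinerU's real interfaces.
import fitz  # PyMuPDF, which MinerU uses for low-level I/O

def parse_pdf(path: str) -> str:
    # Stage 1: preprocessing -- validate and classify the document.
    doc = fitz.open(path)
    if doc.is_encrypted:
        raise ValueError("encrypted PDFs require a password")
    is_scanned = all(not page.get_text().strip() for page in doc)

    # Stage 2: content parsing -- run the PDF-Extract-Kit models per page.
    blocks = []
    for page in doc:
        pixmap = page.get_pixmap()  # rasterize for the vision models
        blocks += detect_layout(pixmap, ocr=is_scanned)

    # Stage 3: post-processing -- de-overlap and order (see Section 3).
    blocks = group_by_reading_order(resolve_overlaps(blocks))

    # Stage 4: format conversion.
    return render_markdown(blocks)
```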
MinerU2.5 introduces a coarse-to-fine, two-stage parsing schema:
- Stage I: Performs global layout analysis on thumbnails (e.g., 1036×1036) using NaViT + Patch Merger + LLM components.
- Stage II: Executes targeted high-resolution recognition (text/formula/table) on native-resolution image crops, using context-specific prompts and the same VLM pipeline (Niu et al., 26 Sep 2025).
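A short sketch of the decoupled inference; the `vlm.generate` interface and the box/crop handling are assumptions, while the prompt strings follow those listed in Section 2:

```python
# Hypothetical sketch of MinerU2.5's coarse-to-fine inference; `vlm.generate`
# and the crop helpers are illustrative, not the released API.
PROMPTS = {"text": "Text Recognition:", "table": "Table Recognition:"}

def parse_page(page_image, vlm):
    # Stage I: global layout analysis on a fixed-size thumbnail.
    thumbnail = page_image.resize((1036, 1036))
    layout = vlm.generate(thumbnail, prompt="Layout Detection:")

    # Stage II: targeted recognition on native-resolution crops,
    # with a content-specific prompt per block type.
    results = []
    for block in layout:
        crop = page_image.crop(block.xyxy)  # native resolution, no downscaling
        prompt = PROMPTS.get(block.type, "Text Recognition:")
        results.append((block, vlm.generate(crop, prompt=prompt)))
    return results
```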
2. Modeling Approaches and Training Corpus
PDF-Extract-Kit (MinerU)
- Layout Detection: LayoutLMv3-base model fine-tuned for document structural categories. Uses standard object-detection loss:
  $\mathcal{L}_{\text{det}} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{reg}}$, where $\mathcal{L}_{\text{cls}}$ is the classification term (cross-entropy) and $\mathcal{L}_{\text{reg}}$ is either an $L_1$ or a GIoU regression loss.
- Formula Detection: YOLOv8, optimized for inline/display/ignore classes; loss $\mathcal{L} = \mathcal{L}_{\text{cls}} + \mathcal{L}_{\text{box}} + \mathcal{L}_{\text{dfl}}$ (classification, CIoU box regression, and distribution-focal terms, the standard YOLOv8 objective).
- Formula Recognition: UniMERNet (ResNet/Swin encoder + Transformer decoder) with seq2seq cross-entropy on LaTeX tokens.
- Table Recognition: TableMaster (multi-stage grid/OCR) and StructEqTable (direct HTML/LaTeX markup); trained on PubTabNet and DocGenome.
- OCR: PaddleOCR (detection + CRNN).
Training involves ∼150K diverse document pages, annotated for layout, formulas, and more. Iterative sampling upweights mis-classified categories.
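The seq2seq recognition objective above is ordinary token-level cross-entropy; a minimal PyTorch sketch, with the tensor shapes and `pad_id` convention as assumptions:

```python
import torch
import torch.nn.functional as F

def seq2seq_ce_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Cross-entropy over LaTeX tokens, as in UniMERNet-style training.

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) gold token ids, with pad_id at padded positions
    """
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*seq_len, vocab)
        targets.reshape(-1),                  # (batch*seq_len,)
        ignore_index=pad_id,                  # padding contributes no gradient
    )
```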
MinerU2.5 Model Suite
- Backbone: NaViT vision encoder (675M), Qwen2-Instruct LLM (500M), integrated via pixel-unshuffle patch merging and MLP projection.
- Prompts: Decoupled, stage-specific—e.g., “Layout Detection:”, “Text Recognition:”, “Table Recognition:”.
- Training Corpus: Over 6.9M pretraining samples spanning layout, text, formula, and table blocks. Auto-annotation/refinement leverages prior MinerU2, Qwen2.5-VL-72B, UniMERNet, and a dedicated QA revision loop. The final fine-tuning set includes 630K hard-case annotations mined via IMIC.
3. Algorithms, Mathematical Formulations, and Pseudocode
- Intersection over Union (IoU): $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$, the ratio of intersection area to union area of two boxes.
- Overlap Resolution:
```python
from itertools import combinations

def resolve_overlaps(bboxes):
    # Drop any box fully contained in another, then shrink partially
    # overlapping text boxes so they no longer intersect.
    filtered = []
    for A in bboxes:
        if not any(contains(B, A) for B in bboxes if B is not A):
            filtered.append(A)
    for A, B in combinations(filtered, 2):
        if IoU(A, B) > 0 and A.type == 'text' and B.type == 'text':
            shrink_boxes(A, B)  # mutates the boxes in place
    return filtered
```
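The helper predicates are not spelled out in the source; one plausible set of implementations over axis-aligned boxes (the `BBox` container is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class BBox:
    x1: float
    y1: float
    x2: float
    y2: float
    type: str = "text"

def area(b: BBox) -> float:
    return max(0.0, b.x2 - b.x1) * max(0.0, b.y2 - b.y1)

def contains(outer: BBox, inner: BBox) -> bool:
    # True when `inner` lies entirely within `outer`.
    return (outer.x1 <= inner.x1 and outer.y1 <= inner.y1 and
            outer.x2 >= inner.x2 and outer.y2 >= inner.y2)

def IoU(a: BBox, b: BBox) -> float:
    # Intersection area divided by union area (zero when disjoint).
    ix = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    iy = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = ix * iy
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def shrink_boxes(a: BBox, b: BBox) -> None:
    # One plausible policy: trim the vertical overlap so the boxes no
    # longer intersect (the source does not give the exact rule).
    if a.y1 <= b.y1:
        a.y2 = min(a.y2, b.y1)
    else:
        b.y2 = min(b.y2, a.y1)
```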
- Reading-Order Grouping:
```python
def group_by_reading_order(bboxes):
    # Naive "top -> bottom, left -> right" ordering via the top-left corner.
    return sorted(bboxes, key=lambda b: (b.y1, b.x1))
```
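A quick usage check over the hypothetical `BBox` helpers above. Note the naive sort key works at block granularity (whole columns as single boxes); applied to line-level boxes it would interleave the columns of a multi-column page:

```python
boxes = [
    BBox(0, 0, 100, 20),     # title spanning the page
    BBox(0, 30, 48, 200),    # left column
    BBox(52, 30, 100, 200),  # right column
    BBox(2, 32, 46, 60),     # nested inside the left column -> dropped
]
ordered = group_by_reading_order(resolve_overlaps(boxes))
# ordered: title, left column, right column
```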
- MinerU2.5 Losses (both stages share the VLM's autoregressive objective):
  - Layout: cross-entropy over serialized layout tokens, $\mathcal{L}_{\text{layout}} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x_{\text{thumb}})$, with $x_{\text{thumb}}$ the downscaled page thumbnail.
  - Recognition: $\mathcal{L}_{\text{rec}} = -\sum_{t}\log p_\theta(y_t \mid y_{<t}, x_{\text{crop}})$ over content tokens (text, LaTeX, or HTML), conditioned on the native-resolution crop.
4. Performance Benchmarks
MinerU Baselines (Layout, Formula)
Layout detection:

| Model | mAP | AP50 | AR50 | Comments |
|---|---|---|---|---|
| DocXchain | 52.8 | 69.5 | 77.3 | Academic papers |
| LayoutLMv3-SFT (MinerU) | 77.6 | 93.3 | 95.5 | Academic papers; highest across metrics |

Formula detection:

| Model | AP50 | AR50 | Comments |
|---|---|---|---|
| Pix2Text-MFD | 60.1 | 64.6 | Academic papers |
| YOLOv8-FT (MinerU) | 87.7 | 89.9 | Highest for formula detection |

Formula recognition:

| Model | CDM | Comments |
|---|---|---|
| Pix2tex | 0.636 | Formula recognition |
| UniMERNet | 0.968 | MinerU, best result |
MinerU2.5 (End-to-End, Efficiency)
Accuracy (end-to-end):

| Model | Params | Overall↑ | TextEdit↓ | FormulaCDM↑ | TableTEDS↑ | TableTEDS-S↑ | OrderEdit↓ |
|---|---|---|---|---|---|---|---|
| MonkeyOCR-p-3B | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |

Throughput:

| Model | Params | Pages/s | Hardware |
|---|---|---|---|
| dots.ocr | 3.0B | 0.28 | A100 80G |
| MonkeyOCR-p | 3.7B | 0.47 | A100 80G |
| MinerU2.5 | 1.2B | 2.12 | A100 80G |
MinerU2.5 achieves higher accuracy across all tasks with significantly lower parameter count and >4× throughput relative to contemporary 3–4B general-purpose systems. Ablations indicate decoupling reduces total FLOPs by approximately 10× compared to monolithic native-resolution VLMs.
5. Integration and Customization
- API and CLI Flexibility
  - Configuration is achievable through YAML files or command-line overrides (e.g., `enable_ocr`, `languages`, `drop_headers`/`drop_footers`, model selection for tables/formulas).
  - Example (Python):
```python
from mineru import MinerU

parser = MinerU(enable_ocr=True, languages=['en', 'zh'],
                drop_headers=True, drop_footers=True)
result = parser.parse('sample.pdf')
```
- Example (CLI):
```bash
mineru -i sample.pdf -o output.md --format markdown --lang en --drop-footers --enable-table-crop
```
- Batch and Pipeline Integration
- Batch processing via simple Python loops or the provided multiprocessing loader (a minimal loop is sketched after this list).
- Embedding in Retrieval-Augmented Generation (RAG) pipelines is facilitated by Markdown output.
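A minimal batch loop over the Python API shown above; the `result.markdown` accessor is an assumption, since this section does not document the return type of `parse()`:

```python
from pathlib import Path
from mineru import MinerU  # constructor arguments as in the earlier example

out_dir = Path("out")
out_dir.mkdir(exist_ok=True)

parser = MinerU(enable_ocr=True, languages=["en"])
for pdf in sorted(Path("corpus").glob("*.pdf")):
    result = parser.parse(str(pdf))
    # `result.markdown` is assumed; Markdown output feeds directly
    # into a RAG ingestion pipeline.
    (out_dir / f"{pdf.stem}.md").write_text(result.markdown)
```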
- Custom Callbacks and Postprocessing
  - User-defined overlap resolution can be set via `parser.set_overlap_handler(my_fn)`; a hedged example follows this list.
  - For text-based (non-scanned) PDFs, `only_api_extraction=True` accelerates processing by skipping redundant region detection.
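A sketch of a custom handler. `set_overlap_handler` is named above, but its contract is not specified; the list-in/list-out signature and the reuse of the `area`/`IoU` helpers from Section 3 are assumptions:

```python
def keep_larger(boxes):
    # Hypothetical policy: on any overlap, keep only the larger box
    # rather than shrinking both.
    kept = []
    for box in sorted(boxes, key=area, reverse=True):
        if all(IoU(box, other) == 0 for other in kept):
            kept.append(box)
    return kept

parser.set_overlap_handler(keep_larger)
```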
- Hardware Requirements
- GPUs are required for UniMERNet and YOLOv8; LayoutLMv3 and PaddleOCR operate on CPU.
6. Algorithmic and Design Innovations
MinerU and MinerU2.5 exemplify design decisions that emphasize both quality and computational efficiency:
- Fine-tuned models and annotation strategies (including IMIC-driven hard case mining and AI-augmented QA).
- Stage decoupling (global structure on thumbnails; local recognition on high-res crops) in MinerU2.5 yields substantial compute savings for negligible accuracy trade-off—e.g., pixel-unshuffle reduces Stage I FLOPs by ∼25% for <0.1% loss in F1.
- Multi-model, modular architecture allows tailored deployment according to use case and resource availability.
A plausible implication is that such modular, decoupled approaches are likely to be further generalized in document intelligence, as the trade-off between accuracy and inference cost becomes increasingly relevant in large corpus and cloud-scale processing.
7. Impact, Limitations, and Prospects
MinerU establishes open-source, high-fidelity content extraction for the research and enterprise ecosystem, setting new SOTA baselines in layout, formula, and table recognition. Its robust design is well-suited for academic, educational, and archival material digitization, as well as downstream LLM integration. Limitations include the necessity of a GPU for peak recognition accuracy on complex layouts and relatively narrow language support. Future directions could include expanding language coverage and joint learning strategies to further reduce resource consumption without sacrificing accuracy.