
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing (2509.22186v2)

Published 26 Sep 2025 in cs.CV and cs.CL

Abstract: We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

Summary

  • The paper demonstrates a novel decoupled two-stage parsing strategy that mitigates token redundancy and improves efficiency in high-resolution document parsing.
  • It integrates a NaViT vision encoder, Qwen2-Instruct language model, and pixel-unshuffle patch merger for precise layout analysis and fine-grained content recognition.
  • Experimental evaluations show state-of-the-art performance on benchmarks, with notable improvements in speed and accuracy for tables, formulas, and overall layout detection.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Introduction and Motivation

MinerU2.5 introduces a 1.2B-parameter vision-language model (VLM) specifically designed for high-resolution document parsing. The model addresses the computational and semantic challenges inherent in document images, which typically exhibit high resolution, dense content, and complex layouts. Existing approaches, whether modular pipelines or end-to-end VLMs, suffer from error propagation, hallucination, and severe inefficiency due to token redundancy. MinerU2.5's core innovation is a decoupled, coarse-to-fine two-stage parsing strategy that separates global layout analysis from local content recognition, enabling both high accuracy and computational efficiency.

Model Architecture and Two-Stage Parsing

MinerU2.5's architecture comprises a Qwen2-Instruct LLM (0.5B params), a NaViT vision encoder (675M params) supporting dynamic resolutions and aspect ratios, and a patch merger utilizing pixel-unshuffle for efficient token aggregation. The model leverages M-RoPE for improved positional encoding generalization.
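
To make the token-reduction role of the patch merger concrete, the following is a minimal sketch of a pixel-unshuffle merge in PyTorch. The grid sizes, embedding width, and downscale factor are illustrative assumptions, not the released MinerU2.5 configuration.

```python
import torch
import torch.nn.functional as F

def pixel_unshuffle_merge(vision_tokens: torch.Tensor, h: int, w: int, r: int = 2) -> torch.Tensor:
    """Merge each r x r neighborhood of visual tokens into a single token by
    stacking the neighbors along the channel dimension (pixel-unshuffle),
    cutting the sequence length by r**2 before tokens reach the language model.

    vision_tokens: (batch, h * w, dim) patch embeddings from the vision encoder.
    Returns: (batch, h * w // r**2, dim * r**2).
    """
    b, n, d = vision_tokens.shape
    assert n == h * w and h % r == 0 and w % r == 0
    x = vision_tokens.view(b, h, w, d).permute(0, 3, 1, 2)  # (b, d, h, w)
    x = F.pixel_unshuffle(x, downscale_factor=r)            # (b, d * r * r, h / r, w / r)
    return x.flatten(2).transpose(1, 2)                     # (b, h * w / r**2, d * r**2)

# Illustration: a 1036x1036 page split into 28x28 patches gives a 37x37 token grid;
# since 37 is odd, assume padding to 38x38 so the 2x2 merge divides evenly.
tokens = torch.randn(1, 38 * 38, 1024)
print(pixel_unshuffle_merge(tokens, h=38, w=38).shape)  # torch.Size([1, 361, 4096])
```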

The two-stage parsing strategy is as follows:

  • Stage I: Layout Analysis. The input document is downsampled to 1036×1036 pixels for rapid, global layout analysis. This stage identifies structural elements (text blocks, tables, formulas, images) with minimal computational cost.
  • Stage II: Content Recognition. Guided by the detected layout, the model crops native-resolution regions from the original document and performs fine-grained recognition (text, table, formula) within these local windows. Crops are limited to 2048×28×28 pixels to balance detail preservation and efficiency (Figure 1).

    Figure 1: MinerU2.5's two-stage framework: global layout analysis on downsampled images, followed by fine-grained content recognition on native-resolution crops.

This decoupling reduces the O(N²) token complexity of end-to-end approaches, mitigates hallucination, and allows independent optimization of each stage.
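
The coarse-to-fine flow can be summarized in a short sketch. `analyze_layout` and `recognize_region` below are hypothetical placeholders for the two model calls; the 1036-pixel thumbnail comes from the description above, while reading the 2048×28×28 limit as a budget of 2048 patches of 28×28 pixels, and the resizing heuristic for oversized crops, are assumptions.

```python
from dataclasses import dataclass
from PIL import Image

PATCH = 28              # assumed ViT patch size implied by the 2048 x 28 x 28 crop limit
MAX_CROP_PATCHES = 2048

@dataclass
class Region:
    box: tuple        # (x0, y0, x1, y1) in thumbnail coordinates
    category: str     # "text", "table", "formula", "image", ...

def parse_document(page: Image.Image, analyze_layout, recognize_region) -> list[dict]:
    # Stage I: global layout analysis on a downsampled copy of the page.
    thumb = page.resize((1036, 1036))
    regions = analyze_layout(thumb)   # hypothetical model call, ordered by reading order

    results = []
    sx, sy = page.width / 1036, page.height / 1036
    for region in regions:
        # Map the predicted box back to native resolution and crop the original image.
        x0, y0, x1, y1 = region.box
        crop = page.crop((int(x0 * sx), int(y0 * sy), int(x1 * sx), int(y1 * sy)))

        # Respect the crop budget: at most 2048 patches of 28x28 pixels (an assumption).
        patches = max(1, crop.width // PATCH) * max(1, crop.height // PATCH)
        if patches > MAX_CROP_PATCHES:
            scale = (MAX_CROP_PATCHES / patches) ** 0.5
            crop = crop.resize((int(crop.width * scale), int(crop.height * scale)))

        # Stage II: fine-grained recognition on the native-resolution crop.
        results.append({"category": region.category,
                        "content": recognize_region(crop, region.category)})
    return results
```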

Data Engine and Training Pipeline

MinerU2.5's performance is underpinned by a comprehensive data engine that generates large-scale, high-quality training corpora. The pipeline consists of:

  1. Data Curation: Stratified sampling ensures diversity in layout, document type, element balance, and language.
  2. Pre-training Data Preparation: Automated annotation is refined using expert models for text, tables, and formulas.
  3. Fine-tuning Dataset Construction: The IMIC (Iterative Mining via Inference Consistency) strategy identifies hard cases via stochastic inference consistency metrics (PageIoU, TEDS, CDM), which are then manually curated (Figure 2).

    Figure 2: Data engine workflow: curation, automated annotation refinement, and IMIC-driven hard case mining for fine-tuning.

Task Reformulation and Metric Innovations

Layout Analysis

MinerU2.5 introduces a hierarchical tagging system for layout elements, ensuring comprehensive coverage (including headers, footers, page numbers) and fine granularity (distinct tags for code, references, lists). Layout analysis is reformulated as a multi-task problem, jointly predicting position, class, rotation, and reading order.
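
As an illustration of the multi-task formulation, a single page's layout output might bundle the fields below; the tag names, coordinate convention, and output format are assumptions for illustration, not the model's actual schema.

```python
# Hypothetical layout output for one page under the multi-task formulation:
# each element jointly carries position, class, rotation, and reading order.
layout_predictions = [
    {"bbox": [72, 64, 540, 120],   "class": "title",       "rotation": 0,  "reading_order": 0},
    {"bbox": [72, 140, 300, 620],  "class": "text",        "rotation": 0,  "reading_order": 1},
    {"bbox": [310, 140, 540, 400], "class": "table",       "rotation": 90, "reading_order": 2},
    {"bbox": [310, 420, 540, 620], "class": "formula",     "rotation": 0,  "reading_order": 3},
    {"bbox": [72, 660, 540, 690],  "class": "page_footer", "rotation": 0,  "reading_order": 4},
]
```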

The PageIoU metric is proposed for layout evaluation, measuring page-level coverage and aligning quantitative scores with qualitative human assessment (Figure 3).

Figure 3: PageIoU metric: page-level coverage aligns with visual inspection, resolving inconsistencies in IoU-based recall.
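
The paper's exact PageIoU formulation is not reproduced in this summary; as a rough illustration of a page-level coverage score, one can rasterize predicted and ground-truth boxes onto page masks and compare their overlap, which rewards covering the page's annotated content rather than matching boxes one-to-one. This is an assumed simplification, not the official metric.

```python
import numpy as np

def page_coverage_iou(pred_boxes, gt_boxes, page_h, page_w):
    """Illustrative page-level coverage score (not the paper's exact PageIoU):
    rasterize all predicted and all ground-truth boxes onto binary page masks
    and compute the IoU of the two masks, so the score reflects how much of
    the page's annotated content is covered rather than box-to-box matching."""
    pred_mask = np.zeros((page_h, page_w), dtype=bool)
    gt_mask = np.zeros((page_h, page_w), dtype=bool)
    for x0, y0, x1, y1 in pred_boxes:
        pred_mask[y0:y1, x0:x1] = True
    for x0, y0, x1, y1 in gt_boxes:
        gt_mask[y0:y1, x0:x1] = True
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 1.0

# An under-segmented prediction spanning two ground-truth paragraphs still scores
# highly on coverage, matching the visual judgment the metric is meant to capture.
print(page_coverage_iou([(50, 50, 550, 400)],
                        [(50, 50, 550, 200), (50, 220, 550, 400)],
                        page_h=800, page_w=600))  # ~0.94
```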

Formula Recognition

MinerU2.5 employs the ADR (Atomic Decomposition and Recombination) framework, decomposing compound formulas into atomic lines, recognizing each into LaTeX, and structurally recombining the results. This approach addresses the limitations of monolithic formula recognition and reduces structural hallucination (Figure 4).

Figure 4: ADR framework: compound formulas are decomposed, recognized, and recombined for high-fidelity LaTeX output.
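
A minimal sketch of an ADR-style flow, assuming hypothetical helpers `segment_lines` (splits a compound formula image into atomic line crops) and `recognize_line` (maps one line image to LaTeX); the recombination step shown here simply joins the lines in an `aligned` environment.

```python
def recognize_compound_formula(formula_img, segment_lines, recognize_line) -> str:
    """Illustrative ADR-style flow with hypothetical helpers: `segment_lines`
    splits a compound formula image into per-line crops, and `recognize_line`
    maps a single atomic line image to a LaTeX string."""
    # Atomic decomposition: split the compound formula into line-level crops.
    line_imgs = segment_lines(formula_img)

    # Atomic recognition: each line is short, so decoding is less prone to the
    # structural hallucination seen when the whole formula is decoded at once.
    latex_lines = [recognize_line(img) for img in line_imgs]

    # Recombination: stitch the per-line results back into one multi-line block.
    if len(latex_lines) == 1:
        return latex_lines[0]
    return "\\begin{aligned}\n" + " \\\\\n".join(latex_lines) + "\n\\end{aligned}"
```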

Table Recognition

The model adopts a four-stage pipeline: detection, rotation correction, recognition into OTSL (Optimized Table Structure Language), and conversion to HTML. OTSL reduces structural token redundancy and sequence length, improving recognition of complex tables (Figure 5).

Figure 5: Table recognition pipeline: detection, geometric normalization, OTSL generation, and HTML conversion.
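
To illustrate why OTSL-style sequences convert readily to HTML, the sketch below handles a small assumed subset of OTSL tokens; the full vocabulary also includes span tokens for merged cells, which are omitted here.

```python
def otsl_to_html(tokens: list[str], cells: list[str]) -> str:
    """Simplified OTSL-to-HTML conversion over an assumed token subset:
    'fcel' opens a cell whose text is the next entry of `cells`, 'ecel' is an
    empty cell, and 'nl' closes a row. Span tokens for merged cells, which the
    full OTSL vocabulary also defines, are omitted to keep the sketch short."""
    rows, row, text = [], [], iter(cells)
    for tok in tokens:
        if tok == "fcel":
            row.append(f"<td>{next(text)}</td>")
        elif tok == "ecel":
            row.append("<td></td>")
        elif tok == "nl":
            rows.append("<tr>" + "".join(row) + "</tr>")
            row = []
    if row:  # flush a trailing row without an explicit 'nl'
        rows.append("<tr>" + "".join(row) + "</tr>")
    return "<table>" + "".join(rows) + "</table>"

print(otsl_to_html(["fcel", "fcel", "nl", "fcel", "ecel", "nl"],
                   ["Model", "Score", "MinerU2.5"]))
# <table><tr><td>Model</td><td>Score</td></tr><tr><td>MinerU2.5</td><td></td></tr></table>
```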

Iterative Mining via Inference Consistency (IMIC)

IMIC leverages stochastic inference to identify hard cases (samples with low output consistency across multiple runs) using task-specific metrics. These cases are prioritized for manual annotation, focusing human effort on the most valuable data for model improvement (Figure 6).

Figure 6: IMIC strategy: low-consistency samples in layout, table, and formula recognition are mined for targeted annotation.
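
A minimal sketch of consistency-based hard-case mining in the spirit of IMIC. `run_model` and `pairwise_metric` are hypothetical stand-ins for the parser and a task metric such as PageIoU, TEDS, or CDM; the number of stochastic runs and the consistency threshold are illustrative.

```python
from itertools import combinations

def mine_hard_cases(samples, run_model, pairwise_metric,
                    n_runs: int = 5, threshold: float = 0.9):
    """Illustrative IMIC-style mining: run stochastic inference several times per
    sample, score the agreement between runs with a task-specific metric, and
    flag low-consistency samples for manual annotation."""
    hard_cases = []
    for sample in samples:
        outputs = [run_model(sample, temperature=0.8) for _ in range(n_runs)]
        scores = [pairwise_metric(a, b) for a, b in combinations(outputs, 2)]
        consistency = sum(scores) / len(scores)
        if consistency < threshold:
            hard_cases.append((sample, consistency))
    # Lowest-consistency samples first: these are where manual annotation pays off most.
    return sorted(hard_cases, key=lambda item: item[1])
```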

Quantitative Evaluation

MinerU2.5 achieves state-of-the-art results across multiple benchmarks:

  • OmniDocBench: Overall score of 90.67, outperforming MonkeyOCR-pro-3B and dots.ocr. Best-in-class edit distance (0.047), formula CDM (88.46), table TEDS (88.22), and reading order edit distance (0.044).
  • Ocean-OCR: Lowest edit distance (0.033) and highest F1-score (0.945) on English documents; highest F1-score (0.965) and precision (0.966) on Chinese documents.
  • olmOCR-bench: Overall score of 75.2, leading in arXiv Math (76.6), Old Scans Math (54.6), and Long Tiny Text (83.5).
  • Layout Analysis: Top Full Page F1-score@PageIoU across OmniDocBench, D4LA, and DocLayNet.
  • Table Recognition: SOTA on FinTabNet, OCRBench v2, and in-house benchmarks; competitive on PubTabNet and CC-OCR.
  • Formula Recognition: SOTA on SCE, LaTeX-80M, Fuzzy Math, and Complex datasets; second-best on CPE, HWE, SPE, and Chinese (Figure 7).

    Figure 7: MinerU2.5's performance highlights on OmniDocBench: consistently surpasses general-purpose and domain-specific VLMs in text, formula, table, and reading order tasks.

Qualitative Analysis

MinerU2.5 demonstrates robust parsing across diverse document types (academic literature, books, reports, slides, newspapers, magazines), complex tables (rotated, merged cells, borderless, colored, dense), and intricate formulas (multi-line, nested, mixed-language, degraded); see Figures 8, 9, and 10.

Figure 8: Layout and markdown output for academic literature, books, and textbooks.

Figure 9: Rendered outputs for various types of tables.

Figure 10: Rendered outputs for various types of formulas.

Comparisons with previous versions and other SOTA models show MinerU2.5's superior handling of rotated tables, merged cells, Chinese formulas, multi-line formulas, finer layout detection, and watermark pages (Figures 11 and 12).

Figure 11: MinerU2.5 outperforms previous versions in rotated table recognition.

Figure 12: MinerU2.5 excels in formula recognition with Chinese content compared to prior models.

Deployment and Efficiency

MinerU2.5 is deployed via an optimized vLLM-based pipeline, with an asynchronous backend and decoupled inference stages to minimize latency. Dynamic sampling penalties are applied based on detected layout types to suppress degenerate repetition. The model achieves 2.12 pages/s and 2337.25 tokens/s on an A100 80G, outperforming MonkeyOCR-pro-3B by 4× and dots.ocr by 7× in throughput.
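
At the reported rates, this works out to roughly 1,100 output tokens per page (2337.25 / 2.12). The sketch below illustrates one way the two stages can be decoupled asynchronously so that recognition requests for one page overlap with layout analysis for the next; `layout_async` and `recognize_async` are hypothetical async wrappers around a model server, not the released vLLM pipeline.

```python
import asyncio

async def parse_pages(pages, layout_async, recognize_async, max_concurrency: int = 8):
    """Illustrative decoupled serving loop (not the released vLLM pipeline):
    Stage I layout requests and Stage II recognition requests are issued as
    independent async calls, so crops from one page are recognized while the
    next page's layout analysis is still in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def recognize(crop, category):
        async with sem:                       # bound in-flight recognition requests
            return await recognize_async(crop, category)

    async def parse_page(page):
        regions = await layout_async(page)    # Stage I: layout on the downsampled page
        # Stage II: fan out one recognition request per detected region.
        return await asyncio.gather(*(recognize(r.crop, r.category) for r in regions))

    return await asyncio.gather(*(parse_page(page) for page in pages))
```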

Implications and Future Directions

MinerU2.5's decoupled architecture and data-centric training pipeline establish a new paradigm for efficient, high-fidelity document parsing. Its ability to rapidly convert unstructured documents into structured data is critical for curating pre-training corpora and enhancing RAG systems. The preservation of semantic integrity in tables, formulas, and layouts positions MinerU2.5 as a foundational tool for next-generation AI applications in knowledge extraction, information retrieval, and multimodal understanding.

Future work may explore further scaling of the data engine, integration with more advanced layout and semantic tagging systems, and adaptation to multilingual and domain-specific document types. The IMIC strategy offers a blueprint for continual model improvement via targeted annotation, potentially generalizable to other structured data extraction tasks.

Conclusion

MinerU2.5 demonstrates that a decoupled, coarse-to-fine VLM architecture, combined with a rigorous data engine and targeted hard case mining, can achieve state-of-the-art document parsing accuracy and efficiency. Its design resolves key trade-offs in high-resolution document understanding and sets a new standard for practical deployment in large-scale, real-world scenarios. The model's contributions extend beyond document parsing, offering substantial benefits for downstream AI systems reliant on structured document data.
