MinerU2.5: Efficient Document Parsing VLM
- MinerU2.5 is a 1.2B-parameter vision-language model that uses a decoupled coarse-to-fine parsing strategy to efficiently extract structured information from complex documents.
- It features a modular architecture with a Vision Encoder, Patch Merger, and Language Decoder to accurately recognize layout, text, formulas, and tables.
- Empirical benchmarks demonstrate that MinerU2.5 outperforms competing models in speed and accuracy, setting new state-of-the-art standards in document parsing.
MinerU2.5 is a 1.2-billion-parameter document parsing vision-language model (VLM) designed for efficient high-resolution document understanding. It features a decoupled architecture that performs state-of-the-art layout and content recognition with significantly reduced computational overhead. MinerU2.5 employs a two-stage, coarse-to-fine parsing strategy that separates global document layout analysis from fine-grained content recognition, allowing efficient and accurate extraction of structured information from complex, long-form documents spanning multiple modalities including dense text, mathematical formulas, and tables (Niu et al., 26 Sep 2025).
1. Model Architecture
MinerU2.5 consists of three integral modules: the Vision Encoder (~675M parameters), the Patch Merger (~10M parameters), and the Language Decoder (~500M parameters).
- Vision Encoder: Built on NaViT ("Patch 'n' Pack"), it uses window-free multi-head self-attention and 2D-RoPE positional encodings to support arbitrary input aspect ratios and resolutions. An input image of height H and width W is decomposed into a sequence of (H/p) × (W/p) tokens, where p is the patch size. The encoder is initialized from Qwen2-VL weights to benefit from pre-existing image-text alignment.
- Patch Merger: A 2×2 pixel-unshuffle operation merges neighboring visual tokens, reducing the token count by a factor of four. The resulting embeddings are projected into the LLM's space via a two-layer MLP.
- Language Decoder: Based on the Qwen2-Instruct LLM backbone, it replaces standard 1D-RoPE with multi-dimensional rotary embeddings (M-RoPE) to enhance resolution invariance for content crops. This head emits both high-level layout specifications (bounding boxes, classes, rotation, reading order) and transcriptions (text, LaTeX, table structures).
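The 2×2 pixel-unshuffle merge in the Patch Merger can be sketched as a pure reshape: each 2×2 neighborhood of visual tokens is concatenated into a single token, cutting the sequence length by 4× before the MLP projection. The embedding dimension below is a hypothetical placeholder; the model's actual sizes are not stated here.

```python
import numpy as np

def pixel_unshuffle_merge(tokens, h, w):
    """2x2 pixel-unshuffle: concatenate each 2x2 neighborhood of visual
    tokens on an h-by-w grid into one token, reducing the token count 4x."""
    b, n, d = tokens.shape
    assert n == h * w and h % 2 == 0 and w % 2 == 0
    # Split the grid into 2x2 blocks, then flatten each block's 4 tokens.
    grid = tokens.reshape(b, h // 2, 2, w // 2, 2, d)
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(b, (h // 2) * (w // 2), 4 * d)

tokens = np.random.randn(1, 16 * 16, 1152)   # hypothetical encoder width
merged = pixel_unshuffle_merge(tokens, 16, 16)
print(merged.shape)  # 4x fewer tokens, 4x wider, before the 2-layer MLP
```

In the model, the merged embeddings are then projected into the decoder's hidden space by the two-layer MLP described above.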
During inference, the MLP adaptor's output merges with language tokens. The joint transformer produces the next token, which may specify text or layout properties. Generation follows the standard autoregressive factorization P(y | x) = ∏_t P(y_t | y_<t, x), where x is the visual-plus-prompt context and y_t is the t-th output token.
Training employs a compound objective L_total = L_cls + L_box + L_rot + L_order + L_text, with cross-entropy and L1 losses for class, box, rotation, and order prediction, plus a negative log-likelihood for text recognition.
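The compound training objective can be sketched as follows. Only the loss types (cross-entropy, L1, NLL) are given in the source; equal weighting and the dictionary field names here are assumptions for illustration.

```python
import numpy as np

def compound_loss(pred, target):
    """Sketch of the compound objective: cross-entropy for element class,
    L1 for box / rotation / reading order, NLL for transcription tokens.
    Equal loss weights are an assumption, not the paper's exact recipe."""
    # Cross-entropy over predicted class logits (softmax + log).
    z = pred["class_logits"] - pred["class_logits"].max()
    probs = np.exp(z) / np.exp(z).sum()
    l_cls = -np.log(probs[target["class"]])
    # L1 regression losses for geometry and ordering.
    l_box = np.abs(pred["box"] - target["box"]).mean()
    l_rot = abs(pred["rotation"] - target["rotation"])
    l_order = abs(pred["order"] - target["order"])
    # Negative log-likelihood over the ground-truth transcription tokens.
    l_text = -np.log(pred["token_probs"][target["tokens"]]).mean()
    return l_cls + l_box + l_rot + l_order + l_text

pred = {"class_logits": np.array([2.0, 0.5, -1.0]),
        "box": np.array([0.10, 0.10, 0.50, 0.50]),
        "rotation": 0.0, "order": 1.0,
        "token_probs": np.array([0.7, 0.2, 0.1])}
target = {"class": 0, "box": np.array([0.12, 0.10, 0.50, 0.50]),
          "rotation": 0.0, "order": 1.0, "tokens": np.array([0, 1])}
loss = compound_loss(pred, target)
print(round(loss, 3))
```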
2. Decoupled Two-Stage Parsing Methodology
MinerU2.5 avoids the prohibitive cost of native-resolution global encoding by splitting parsing into two stages:
- Stage I: Global Layout Analysis The entire input image is resized to a fixed thumbnail (1036×1036 px). The encoder processes this low-resolution image, and the LLM head infers a list of document elements, each defined by bounding box coordinates, category, rotation, and reading-order index. The overall cost is fixed by the thumbnail resolution and independent of the native page size, typically providing a 5–10× efficiency gain over native-resolution global encoding.
- Stage II: Fine-Grained Content Recognition Detected regions of interest are cropped from the source image at full resolution (crops restricted to max 2048×2048 px), and each crop is independently processed for detailed recognition: text, formulas, or tables. The total cost is the sum of per-crop costs, each bounded by the 2048×2048 cap, and scales with the number of detected regions rather than the full page area. This enables preservation of fine details in localized regions without global processing cost.
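The two stages above compose into a simple "layout, then crop, then recognize" loop. The sketch below stubs the model's two prompting modes as callables (`analyze_layout`, `recognize` are hypothetical interfaces; in the real system one decoder serves both stages):

```python
from dataclasses import dataclass

THUMB = 1036      # fixed thumbnail side for Stage I
MAX_CROP = 2048   # per-crop resolution cap for Stage II

@dataclass
class Element:
    box: tuple        # (x0, y0, x1, y1) in thumbnail coordinates
    category: str     # e.g. "text", "formula", "table"
    rotation: float
    order: int        # reading-order index

def parse_page(page_w, page_h, analyze_layout, recognize):
    """Sketch of the decoupled two-stage pipeline: Stage I layout on a
    fixed thumbnail, Stage II recognition on native-resolution crops."""
    scale = max(page_w, page_h) / THUMB
    # Stage I: detect elements on the thumbnail, sort by reading order.
    elements = sorted(analyze_layout(THUMB), key=lambda e: e.order)
    results = []
    for el in elements:
        # Map thumbnail boxes back to native resolution, cap crop size.
        x0, y0, x1, y1 = (int(c * scale) for c in el.box)
        w = min(x1 - x0, MAX_CROP)
        h = min(y1 - y0, MAX_CROP)
        # Stage II: fine-grained recognition on each crop independently,
        # which is what makes the stage trivially parallelizable.
        results.append((el.category,
                        recognize((x0, y0, x0 + w, y0 + h), el.category)))
    return results

els = [Element((0, 0, 100, 50), "text", 0.0, 1),
       Element((0, 60, 100, 120), "table", 0.0, 0)]
out = parse_page(2072, 2072, lambda side: els,
                 lambda box, cat: cat.upper())
print(out)
```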
Empirical benchmarks demonstrate that on A100 80 GB hardware, MinerU2.5 reaches 2.12 pages/s, significantly exceeding MonkeyOCR-Pro-3B (0.47) and dots.ocr (0.28). Token throughput measurements reflect the same improvement: 2337 tokens/s for MinerU2.5 versus 520 and 311, respectively.
3. Data Engine and Training Pipeline
MinerU2.5 employs a comprehensive data engine capable of curating, generating, and refining large-scale, diverse pretraining and fine-tuning datasets via a closed-loop process:
- Data Curation: Aggregates a vast internal collection of public and commercial PDFs with clustering via page-layout embeddings, stratified by document type and balanced for element diversity and language ratio (Chinese/English).
- Pre-training Dataset: Auto-annotation relies on the prior MinerU2 pipeline; further refinement uses Qwen2.5-VL-72B-Instruct for text, UniMERNet for formulas, and a proprietary table model. Overall, the pre-training corpus consists of ~6.9 million samples spanning layout, text, formula, and table modalities.
- Fine-tuning (IMIC Framework): Iterative Mining via Inference Consistency identifies low-consistency or "hard" samples using PageIoU (layout), CDM (formulas), and TEDS (tables). Hard cases are escalated to AI-assisted annotation (e.g., Gemini-2.5-Pro proposals with human QA via Dingo), emphasizing edge-cases like rotated tables, nested formulas, and dense structures. The final supervised fine-tuning (SFT) set contains ~630,000 exemplars.
Augmentations (including spatial, background, color, and degradation variants) are performed on-the-fly, with transformations chosen according to element type to enhance robustness.
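The IMIC mining loop from the fine-tuning stage can be sketched as a consistency filter: score each sample with the relevant task metric (PageIoU, CDM, or TEDS), keep high-agreement samples, and escalate the rest for AI-assisted annotation plus human QA. The threshold and scoring interface below are assumptions for illustration.

```python
def mine_hard_samples(samples, score_fn, threshold=0.85):
    """Sketch of the IMIC idea: samples whose inference consistency
    (measured by PageIoU / CDM / TEDS agreement) falls below a threshold
    are flagged as "hard" and escalated to AI-assisted annotation.
    The 0.85 threshold is hypothetical."""
    easy, hard = [], []
    for sample in samples:
        consistency = score_fn(sample)   # metric agreement in [0, 1]
        (easy if consistency >= threshold else hard).append(sample)
    return easy, hard

# Toy corpus: (page id, precomputed consistency score).
samples = [("p1", 0.97), ("p2", 0.41), ("p3", 0.88)]
easy, hard = mine_hard_samples(samples, score_fn=lambda s: s[1])
print(hard)  # low-consistency pages, escalated for re-annotation
```

In the described pipeline, the re-annotated hard samples flow back into the SFT set, closing the loop and concentrating labeling effort on edge cases such as rotated tables and nested formulas.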
4. Computational Efficiency and Empirical Validation
MinerU2.5 demonstrates superior computational efficiency and recognition accuracy across benchmarks:
Inference Speed and Memory:
- End-to-end on RTX 4090 48GB: 1.70 pages/s
- A100 80GB: 2.12 pages/s
- H200 141GB: 4.47 pages/s
Even baseline, non-optimized throughput (0.95 pages/s; 1045 tokens/s) surpasses contemporary VLMs by 2×–4×.
Full-Document Parsing Accuracy:
- OmniDocBench (1355 pages, 9 types): Overall 90.67 vs. 88.85 (MonkeyOCR-Pro-3B), 88.41 (dots.ocr); best-in-class for text (Edit Distance 0.047), formula (CDM 88.46), table (TEDS 88.22), and reading-order (Edit 0.044).
- Ocean-OCR (dense English/Chinese): English Edit Distance 0.033 (best), F1=0.945; Chinese Edit Distance 0.082 (2nd best), F1=0.965 (best); BLEU up to 0.909, METEOR 0.950.
- olmOCR-bench (1402 docs, 7 subsets): Overall 75.2 (vs. 73.6 dots.ocr); ArXiv-Math 76.6 (vs. 72.2); Old-Scans-Math 54.6 (best); Tiny-Text 83.5.
Element-Specific Recognition Tasks:
- Layout Analysis (zero-shot): State-of-the-art Full-Page F1@PageIoU on OmniDocBench, D⁴LA, and DocLayNet.
- Table Recognition: FinTabNet TEDS=95.97, TEDS-S=97.61 (SOTA); PubTabNet TEDS=89.07 (2nd); in-house TEDS=71.48 (vs. Gemini-2.5 Pro 69.72).
- Formula Recognition: Wins 4/7 public+private sets measured by CDM (e.g., SCE 96.4, LaTeX-Matrix 90.6).
5. Innovations, Ablations, and Key Contributions
MinerU2.5 introduces several architectural and methodological innovations:
- Decoupled Coarse-to-Fine Parsing: This architecture enables native-resolution fidelity for local recognition while maintaining tractable computational complexity, sidestepping the scaling of traditional end-to-end vision-LLMs.
- Parallelizable "Layout โ Crop โ Recognize" Pipeline: By splitting documents into semantically meaningful subregions, the pipeline streamlines inference, allows for parallelized content extraction, and improves interpretability while reducing hallucination typical in monolithic VLMs.
- Unified Multi-Task Layout Head: Simultaneously predicts position, class, rotation, and reading-order via a single pass, with performance evaluated using the PageIoU metric, which is calibrated to human assessment.
- Specialized Representations: The Formula ADR ("Atomic Decomposition & Recombination") and table OTSL representations mitigate challenges with long-sequence generation and improve handling of complex structures.
- Iterative Mining with AI-Human Feedback Loop: The data engine's IMIC strategy ensures rapid surfacing, annotation, and assimilation of difficult samples, enabling continuous improvement and robustness in edge cases.
In aggregate, these advancements allow MinerU2.5 to achieve new state-of-the-art accuracy across diverse document parsing benchmarks, while operating at a fraction of the computational cost of larger general-purpose models. This positions MinerU2.5 as an efficient solution for high-fidelity, large-scale structuring of document collections (Niu et al., 26 Sep 2025).