- The paper introduces MonkeyOCR, a vision-language model using a Structure-Recognition-Relation triplet paradigm for precise document parsing.
- It decomposes parsing into structure detection, content recognition, and relation prediction to minimize error propagation and reduce computational load.
- Evaluated on the comprehensive MonkeyDoc dataset and OmniDocBench, MonkeyOCR outperforms larger models in accuracy and speed across multiple document types.
Document parsing, the process of converting unstructured document content into machine-readable structured information, is crucial for applications ranging from automated workflows to digital archiving. Existing approaches face significant challenges: pipeline-based methods suffer from error propagation through sequential stages (e.g., MinerU (2409.18839), Marker), while end-to-end models (e.g., Qwen2.5-VL (2502.13923)) struggle with the computational cost of processing high-resolution, multi-modal pages efficiently.
The paper introduces MonkeyOCR, a novel vision-LLM designed to address these limitations through a Structure-Recognition-Relation (SRR) triplet paradigm. This approach decomposes document parsing into three fundamental steps:
- Structure Detection: Identifying and localizing different semantic regions ("Where is it?"), such as text blocks, tables, formulas, and images.
- Content Recognition: Extracting the content within each detected region ("What is it?").
- Relation Prediction: Determining the logical reading order and structural relationships between regions ("How is it organized?").
This decomposition balances the modularity of pipeline methods with the benefits of joint modeling for recognition, aiming for higher accuracy and efficiency without accumulating errors from distinct single-task models.
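To make the decomposition concrete before the implementation details, here is a minimal sketch of how the three stages could be composed. All names are illustrative rather than MonkeyOCR's actual API; the stage functions are passed in as callables.

```python
# Illustrative composition of the SRR stages (hypothetical names, not the paper's API).
from dataclasses import dataclass

@dataclass
class Block:
    bbox: tuple          # (x1, y1, x2, y2) in page coordinates
    block_type: str      # e.g. "text", "table", "formula", "figure"
    content: str = ""    # filled in by the recognition stage

def parse_page(image, detect_layout, recognize_block, predict_reading_order) -> str:
    # 1. Structure detection: where is each region?
    blocks = detect_layout(image)                      # -> list[Block]
    # 2. Content recognition: what does each region contain?
    for block in blocks:
        block.content = recognize_block(image, block)
    # 3. Relation prediction: how are the regions organized?
    order = predict_reading_order([b.bbox for b in blocks])
    # Assemble the recognized blocks in reading order.
    return "\n\n".join(blocks[i].content for i in order)
```

Because each stage is just a callable here, any component (detector, LMM, reading-order model) can be swapped without touching the others, which is the modularity the SRR design aims for.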
Implementation Details of the SRR Paradigm:
- Structure Detection: This stage uses a YOLO-based model (similar to DocLayout-YOLO (2410.12628)) to process the input document image. It outputs a set of bounding boxes for the identified regions along with their corresponding types (e.g., text, table, formula, figure); a code sketch follows the summary below.
```
Input:  Document image I (H x W x 3)
Model:  YOLO-based detector
Output: Bounding boxes B = {b_1, ..., b_n}, types T = {t_1, ..., t_n}
```
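As one concrete (hedged) possibility, this stage could be driven through the standard ultralytics YOLO API; the weight file name below is a placeholder, not the checkpoint released with MonkeyOCR, which builds on a DocLayout-YOLO-style detector.

```python
# Sketch of structure detection with a YOLO-style layout detector.
# "layout_detector.pt" is a placeholder; class names depend on the training data.
from ultralytics import YOLO

detector = YOLO("layout_detector.pt")

def detect_layout(image_path: str):
    result = detector(image_path)[0]                            # one Results object per image
    boxes = result.boxes.xyxy.tolist()                          # [[x1, y1, x2, y2], ...]
    labels = [result.names[int(c)] for c in result.boxes.cls]   # e.g. "text", "table"
    return list(zip(boxes, labels))
```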
- Block-level Content Recognition: Each detected region, defined by its bounding box b_i, is cropped from the original image (I_crop^i). The cropped regions are then processed in parallel by a unified Large Multimodal Model (LMM). A type-specific prompt p_ti is provided to the LMM along with each cropped image to guide recognition for that content type (e.g., transcribing text, recognizing LaTeX for formulas, extracting HTML for tables); see the sketch after the block below.
```
Input:  Cropped regions {I_crop^1, ..., I_crop^n}, type prompts {p_t1, ..., p_tn}
Model:  Unified LMM
Output: Structured content C = {c_1, ..., c_n} (e.g., transcribed text, LaTeX string, HTML table)
```
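A minimal sketch of the parallel, prompt-guided recognition step: `lmm_transcribe(crop, prompt)` stands in for whatever inference call wraps the unified LMM, the prompt wording is illustrative, and the blocks are the (bbox, type) pairs produced by the detection stage.

```python
# Sketch of block-level recognition with type-specific prompts.
# `lmm_transcribe(crop, prompt)` is a placeholder for the actual LMM inference call.
from concurrent.futures import ThreadPoolExecutor
from PIL import Image

PROMPTS = {
    "text":    "Transcribe the text in this image.",
    "formula": "Write the formula in this image as LaTeX.",
    "table":   "Convert the table in this image to HTML.",
}

def recognize_blocks(page: Image.Image, blocks, lmm_transcribe, max_workers: int = 8):
    crops = [page.crop(tuple(map(int, bbox))) for bbox, _ in blocks]
    prompts = [PROMPTS.get(block_type, PROMPTS["text"]) for _, block_type in blocks]
    # Each crop is small compared with the full page, so recognition calls can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lmm_transcribe, crops, prompts))
```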
This block-level processing is key to efficiency, as it avoids feeding the entire high-resolution page into the LMM for recognition, reducing context length and computational cost.
- Relation Prediction: The bounding boxes B are input to a dedicated model (potentially building on techniques like LayoutReader (2011.13534) or XY-Cut++ (2504.10258)) to predict the logical reading-order sequence S = {s_1, s_2, ..., s_n}, where s_i is the index of the i-th block in the reading sequence. The recognized content C is then assembled according to this sequence to produce the final structured document output D; a heuristic stand-in is sketched after the block below.
```
Input:        Bounding boxes B = {b_1, ..., b_n}
Model:        Reading-order predictor
Output:       Reading sequence S = {s_1, ..., s_n}
Final output: D = {c_s1, c_s2, ..., c_sn}
```
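The paper trains a dedicated reading-order model; as a rough, purely illustrative stand-in, the sketch below orders blocks with a top-to-bottom, left-to-right heuristic (in the spirit of XY-cut) and then assembles the recognized content.

```python
# Heuristic stand-in for relation prediction: group boxes into rows, then read each
# row left to right. MonkeyOCR uses a learned predictor; this is for illustration only.
def predict_reading_order(bboxes, row_tolerance: float = 20.0):
    indexed = sorted(enumerate(bboxes), key=lambda item: (item[1][1], item[1][0]))
    order, current_row, row_top = [], [], None
    for idx, (x1, y1, x2, y2) in indexed:
        if row_top is not None and y1 - row_top > row_tolerance:
            order.extend(i for _, i in sorted(current_row))   # flush the finished row
            current_row, row_top = [], None
        if row_top is None:
            row_top = y1
        current_row.append((x1, idx))
    order.extend(i for _, i in sorted(current_row))
    return order

def assemble(contents, order) -> str:
    # D = {c_s1, ..., c_sn}: concatenate recognized blocks in reading order.
    return "\n\n".join(contents[i] for i in order)
```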
The MonkeyDoc Dataset:
To train and evaluate MonkeyOCR, the authors introduce MonkeyDoc, claimed to be the most comprehensive document parsing dataset to date. It comprises 3.9 million block-level instances covering layout detection, reading-order prediction, and text, table, formula, and code-block recognition, and spans more than ten document types in both English and Chinese. MonkeyDoc was constructed through a multi-stage pipeline combining filtered public datasets (e.g., M⁶Doc, DocLayNet, PubTabNet), meticulous manual annotation, programmatic data synthesis (especially for Chinese tables and formulas), and expert-model-driven automatic labeling (e.g., PPOCR (2206.03001), and LayoutReader (2011.13534) for reading order). This diverse, large-scale dataset is essential for training a robust and generalizable document parsing model.
Performance and Evaluation:
MonkeyOCR was evaluated on OmniDocBench (2412.07626), a benchmark designed for real-world PDF document parsing across 9 document types and 3 languages.
- Overall Performance: MonkeyOCR achieved state-of-the-art overall performance on both English and Chinese tasks compared to several pipeline tools (MinerU, Marker, Mathpix) and expert/general VLMs (GOT-OCR, Nougat, GPT-4o, Qwen2.5-VL, InternVL3).
- Task-Specific Performance: It showed significant improvements on challenging content types like formulas (+15.0% CDM) and tables (+8.6% TEDS) compared to MinerU on average.
- Performance Across Document Types: MonkeyOCR demonstrated strong generalization, achieving the best overall performance across the nine diverse document types in OmniDocBench and the highest accuracy in six categories.
- Comparison with Larger Models: The 3B-parameter MonkeyOCR outperformed much larger models such as Qwen2.5-VL-72B (72B parameters) by 7.4% and even the closed-source Gemini 2.5 Pro by 0.8% on English document parsing tasks, highlighting its efficiency and effectiveness. Although MonkeyOCR slightly trails Gemini 2.5 Pro on Chinese documents, a specialized version, MonkeyOCR*, shows further improvements in Chinese parsing.
- Efficiency: MonkeyOCR showed competitive or superior inference speed, processing 0.84 pages per second for multi-page documents compared to MinerU's 0.65 and Qwen2.5-VL-7B's 0.12.
Implementation Considerations:
- Computational Requirements: The 3B model can be deployed efficiently for inference on a single NVIDIA RTX 3090 GPU by integrating with tools like LMDeploy (a minimal serving sketch follows this list). Training requires significant resources (53 hours on 32 A800 GPUs for the 3B model), indicative of the substantial hardware needed to train large LMMs and detection models.
- Modularity: The SRR structure allows for potential updates or replacements of individual components (detector, LMM, relation predictor) as better models become available, offering flexibility compared to monolithic end-to-end systems.
- Parallelism: The block-level content recognition stage is inherently parallelizable, contributing to the system's efficiency, especially for documents with many regions.
- Dataset Dependence: Performance relies heavily on the quality and diversity of the training data, underscoring the importance of the MonkeyDoc dataset. The specialized MonkeyOCR* demonstrates the benefit of data tailored to specific language characteristics (Chinese layout).
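For deployment, a minimal sketch of serving a vision-language checkpoint through LMDeploy's pipeline API is shown below; the model path, image file, and prompt are placeholders, so check the MonkeyOCR release for the actual weights and recommended usage.

```python
# Sketch of single-GPU inference via LMDeploy's VLM pipeline (paths are placeholders).
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("path/to/monkeyocr-recognition-weights")   # hypothetical local checkpoint

crop = load_image("table_crop.png")                        # a cropped table region
response = pipe(("Convert the table in this image to HTML.", crop))
print(response.text)
```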
Real-World Applications:
MonkeyOCR's ability to accurately and efficiently parse diverse document types and content (text, tables, formulas) in both English and Chinese makes it suitable for numerous real-world applications:
- Automated Business Workflows: Extracting information from invoices, forms, reports, etc.
- Digital Archiving: Converting physical or image-based documents into structured, searchable formats.
- Intelligent Education: Parsing textbooks, exam papers, and lecture slides, including complex formulas and diagrams.
- Medical Record Management: Structuring patient records for easier analysis and retrieval.
The release of code and models will facilitate the practical implementation and application of MonkeyOCR in these domains.