- The paper introduces MonkeyOCR, a vision-language model using a Structure-Recognition-Relation triplet paradigm for precise document parsing.
- It decomposes parsing into structure detection, content recognition, and relation prediction to minimize error propagation and reduce computational load.
- Evaluated on the comprehensive MonkeyDoc dataset and OmniDocBench, MonkeyOCR outperforms larger models in accuracy and speed across multiple document types.
Document parsing, the process of converting unstructured document content into machine-readable structured information, is crucial for applications ranging from automated workflows to digital archiving. Existing approaches face significant challenges: pipeline-based methods suffer from error propagation through sequential stages (e.g., MinerU (2409.18839), Marker), while end-to-end models (e.g., Qwen2.5-VL (2502.13923)) struggle with the computational cost of processing high-resolution, multi-modal pages efficiently.
The paper introduces MonkeyOCR, a novel vision-LLM designed to address these limitations through a Structure-Recognition-Relation (SRR) triplet paradigm. This approach decomposes document parsing into three fundamental steps:
- Structure Detection: Identifying and localizing different semantic regions ("Where is it?"), such as text blocks, tables, formulas, and images.
- Content Recognition: Extracting the content within each detected region ("What is it?").
- Relation Prediction: Determining the logical reading order and structural relationships between regions ("How is it organized?").
This decomposition balances the modularity of pipeline methods with the benefits of joint modeling for recognition, aiming for higher accuracy and efficiency without accumulating errors from distinct single-task models.
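To make the decomposition concrete before the implementation details, here is a minimal sketch of how the three stages could be composed. All names are illustrative rather than MonkeyOCR's actual API; the stage functions are passed in as callables.

```python
# Illustrative composition of the SRR stages (hypothetical names, not the paper's API).
from dataclasses import dataclass

@dataclass
class Block:
    bbox: tuple          # (x1, y1, x2, y2) in page coordinates
    block_type: str      # e.g. "text", "table", "formula", "figure"
    content: str = ""    # filled in by the recognition stage

def parse_page(image, detect_layout, recognize_block, predict_reading_order) -> str:
    # 1. Structure detection: where is each region?
    blocks = detect_layout(image)                      # -> list[Block]
    # 2. Content recognition: what does each region contain?
    for block in blocks:
        block.content = recognize_block(image, block)
    # 3. Relation prediction: how are the regions organized?
    order = predict_reading_order([b.bbox for b in blocks])
    # Assemble the recognized blocks in reading order.
    return "\n\n".join(blocks[i].content for i in order)
```

Because each stage is just a callable here, any component (detector, LMM, reading-order model) can be swapped without touching the others, which is the modularity the SRR design aims for.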
Implementation Details of the SRR Paradigm:
- Structure Detection: This stage uses a YOLO-based model (similar to DocLayout-YOLO (2410.12628)) to process the input document image. It outputs a set of bounding boxes for the identified regions along with their corresponding types (e.g., text, table, formula, figure); a code sketch follows the summary below.
```
Input:  Document image I (H x W x 3)
Model:  YOLO-based detector
Output: Bounding boxes B = {b_1, ..., b_n}, types T = {t_1, ..., t_n}
```
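As one concrete (hedged) possibility, this stage could be driven through the standard ultralytics YOLO API; the weight file name below is a placeholder, not the checkpoint released with MonkeyOCR, which builds on a DocLayout-YOLO-style detector.

```python
# Sketch of structure detection with a YOLO-style layout detector.
# "layout_detector.pt" is a placeholder; class names depend on the training data.
from ultralytics import YOLO

detector = YOLO("layout_detector.pt")

def detect_layout(image_path: str):
    result = detector(image_path)[0]                            # one Results object per image
    boxes = result.boxes.xyxy.tolist()                          # [[x1, y1, x2, y2], ...]
    labels = [result.names[int(c)] for c in result.boxes.cls]   # e.g. "text", "table"
    return list(zip(boxes, labels))
```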
- Block-level Content Recognition: Each detected region, defined by its bounding box b_i, is cropped from the original image (I_crop^i). The cropped regions are then processed in parallel by a unified Large Multimodal Model (LMM). A type-specific prompt p_ti is provided to the LMM along with each cropped image to guide recognition for that content type (e.g., transcribing text, recognizing LaTeX for formulas, extracting HTML for tables); see the sketch after the block below.
```
Input:  Cropped regions {I_crop^1, ..., I_crop^n}, type prompts {p_t1, ..., p_tn}
Model:  Unified LMM
Output: Structured content C = {c_1, ..., c_n} (e.g., transcribed text, LaTeX string, HTML table)
```
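A minimal sketch of the parallel, prompt-guided recognition step: `lmm_transcribe(crop, prompt)` stands in for whatever inference call wraps the unified LMM, the prompt wording is illustrative, and the blocks are the (bbox, type) pairs produced by the detection stage.

```python
# Sketch of block-level recognition with type-specific prompts.
# `lmm_transcribe(crop, prompt)` is a placeholder for the actual LMM inference call.
from concurrent.futures import ThreadPoolExecutor
from PIL import Image

PROMPTS = {
    "text":    "Transcribe the text in this image.",
    "formula": "Write the formula in this image as LaTeX.",
    "table":   "Convert the table in this image to HTML.",
}

def recognize_blocks(page: Image.Image, blocks, lmm_transcribe, max_workers: int = 8):
    crops = [page.crop(tuple(map(int, bbox))) for bbox, _ in blocks]
    prompts = [PROMPTS.get(block_type, PROMPTS["text"]) for _, block_type in blocks]
    # Each crop is small compared with the full page, so recognition calls can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lmm_transcribe, crops, prompts))
```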
This block-level processing is key to efficiency, as it avoids feeding the entire high-resolution page into the LMM for recognition, reducing context length and computational cost.
- Relation Prediction: The bounding boxes B are input to a dedicated model (potentially building on techniques like LayoutReader (2011.13534) or XY-Cut++ (2504.10258)) to predict the logical reading-order sequence S = {s_1, s_2, ..., s_n}, where s_i is the index of the i-th block in the reading sequence. The recognized content C is then assembled according to this sequence to produce the final structured document output D; a heuristic stand-in is sketched after the block below.
```
Input:        Bounding boxes B = {b_1, ..., b_n}
Model:        Reading-order predictor
Output:       Reading sequence S = {s_1, ..., s_n}
Final output: D = {c_s1, c_s2, ..., c_sn}
```
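The paper trains a dedicated reading-order model; as a rough, purely illustrative stand-in, the sketch below orders blocks with a top-to-bottom, left-to-right heuristic (in the spirit of XY-cut) and then assembles the recognized content.

```python
# Heuristic stand-in for relation prediction: group boxes into rows, then read each
# row left to right. MonkeyOCR uses a learned predictor; this is for illustration only.
def predict_reading_order(bboxes, row_tolerance: float = 20.0):
    indexed = sorted(enumerate(bboxes), key=lambda item: (item[1][1], item[1][0]))
    order, current_row, row_top = [], [], None
    for idx, (x1, y1, x2, y2) in indexed:
        if row_top is not None and y1 - row_top > row_tolerance:
            order.extend(i for _, i in sorted(current_row))   # flush the finished row
            current_row, row_top = [], None
        if row_top is None:
            row_top = y1
        current_row.append((x1, idx))
    order.extend(i for _, i in sorted(current_row))
    return order

def assemble(contents, order) -> str:
    # D = {c_s1, ..., c_sn}: concatenate recognized blocks in reading order.
    return "\n\n".join(contents[i] for i in order)
```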
The MonkeyDoc Dataset:
To train and evaluate MonkeyOCR, the authors introduce MonkeyDoc, claimed to be the most comprehensive document parsing dataset to date. It comprises 3.9 million block-level instances covering layout detection, reading-order prediction, and text, table, formula, and code-block recognition, and spans more than ten document types in both English and Chinese. MonkeyDoc was constructed through a multi-stage pipeline combining filtered public datasets (e.g., M⁶Doc, DocLayNet, PubTabNet), meticulous manual annotation, programmatic data synthesis (especially for Chinese tables and formulas), and expert-model-driven automatic labeling (e.g., PPOCR (2206.03001), and LayoutReader (2011.13534) for reading order). This diverse, large-scale dataset is essential for training a robust and generalizable document parsing model.
Performance and Evaluation:
MonkeyOCR was evaluated on OmniDocBench (2412.07626), a benchmark designed for real-world PDF document parsing across 9 document types and 3 languages.
- Overall Performance: MonkeyOCR achieved state-of-the-art overall performance on both English and Chinese tasks compared to several pipeline tools (MinerU, Marker, Mathpix) and expert/general VLMs (GOT-OCR, Nougat, GPT-4o, Qwen2.5-VL, InternVL3).
- Task-Specific Performance: It showed significant improvements on challenging content types like formulas (+15.0% CDM) and tables (+8.6% TEDS) compared to MinerU on average.
- Performance Across Document Types: MonkeyOCR demonstrated strong generalization, achieving the best overall performance across the nine diverse document types in OmniDocBench and the highest accuracy in six categories.
- Comparison with Larger Models: The 3B-parameter MonkeyOCR outperformed much larger models such as Qwen2.5-VL-72B (72B parameters) by 7.4% and even the closed-source Gemini 2.5 Pro by 0.8% on English document parsing tasks, highlighting its efficiency and effectiveness. Although MonkeyOCR slightly trails Gemini 2.5 Pro on Chinese documents, a specialized version, MonkeyOCR*, shows further improvements in Chinese parsing.
- Efficiency: MonkeyOCR showed competitive or superior inference speed, processing 0.84 pages per second for multi-page documents compared to MinerU's 0.65 and Qwen2.5-VL-7B's 0.12.
Implementation Considerations:
- Computational Requirements: The 3B model can be deployed efficiently for inference on a single NVIDIA RTX 3090 GPU by integrating with tools like LMDeploy (a minimal serving sketch follows this list). Training requires significant resources (53 hours on 32 A800 GPUs for the 3B model), indicative of the substantial hardware needed to train large LMMs and detection models.
- Modularity: The SRR structure allows for potential updates or replacements of individual components (detector, LMM, relation predictor) as better models become available, offering flexibility compared to monolithic end-to-end systems.
- Parallelism: The block-level content recognition stage is inherently parallelizable, contributing to the system's efficiency, especially for documents with many regions.
- Dataset Dependence: Performance relies heavily on the quality and diversity of the training data, underscoring the importance of the MonkeyDoc dataset. The specialized MonkeyOCR* demonstrates the benefit of data tailored to specific language characteristics (Chinese layout).
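For deployment, a minimal sketch of serving a vision-language checkpoint through LMDeploy's pipeline API is shown below; the model path, image file, and prompt are placeholders, so check the MonkeyOCR release for the actual weights and recommended usage.

```python
# Sketch of single-GPU inference via LMDeploy's VLM pipeline (paths are placeholders).
from lmdeploy import pipeline
from lmdeploy.vl import load_image

pipe = pipeline("path/to/monkeyocr-recognition-weights")   # hypothetical local checkpoint

crop = load_image("table_crop.png")                        # a cropped table region
response = pipe(("Convert the table in this image to HTML.", crop))
print(response.text)
```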
Real-World Applications:
MonkeyOCR's ability to accurately and efficiently parse diverse document types and content (text, tables, formulas) in both English and Chinese makes it suitable for numerous real-world applications:
- Automated Business Workflows: Extracting information from invoices, forms, reports, etc.
- Digital Archiving: Converting physical or image-based documents into structured, searchable formats.
- Intelligent Education: Parsing textbooks, exam papers, and lecture slides, including complex formulas and diagrams.
- Medical Record Management: Structuring patient records for easier analysis and retrieval.
The release of code and models will facilitate the practical implementation and application of MonkeyOCR in these domains.