MonkeyOCR v1.5: Unified Document Parsing
- MonkeyOCR v1.5 is a unified vision-language framework for robust document parsing that efficiently handles complex layouts with multi-level tables and embedded content.
- It leverages a two-stage divide-and-conquer pipeline combined with a visual-consistency-based reinforcement learning scheme to optimize table parsing.
- Empirical evaluations on OmniDocBench v1.5 show state-of-the-art accuracy and superior performance over existing OCR methods in visually challenging scenarios.
MonkeyOCR v1.5 is a unified vision-language framework for robust document parsing, designed to address intricate real-world document layouts with multi-level tables, embedded images or formulas, and cross-page structures. It employs a two-stage “divide-and-conquer” pipeline, leveraging a large multimodal model for global layout understanding and reading order prediction, followed by specialized region-level content recognition. Notable technical advances include a visual-consistency-based reinforcement learning (RL) scheme for complex table parsing, and two modules—Image-Decoupled Table Parsing (IDTP) and Type-Guided Table Merging (TGTM)—for handling embedded images and multi-segment tables. This framework achieves state-of-the-art performance on the OmniDocBench v1.5 benchmark and demonstrates exceptional robustness in visually complex scenarios (Zhang et al., 13 Nov 2025).
1. Two-Stage Unified Parsing Pipeline
MonkeyOCR v1.5 utilizes a two-stage pipeline to decouple the global understanding of document structure from the fine-grained recognition of content within localized regions:
- Stage I: Layout and Reading-Order Prediction
  Given the full page image $I$, the VLM predicts, for each layout region $r_i$, a tuple of:
  - Bounding box $b_i$
  - Reading-order index $o_i$
  - Category $c_i$ (e.g., text, table, formula, figure)
  - Rotation angle $\theta_i$
  The generation follows
  $$\{r_i\}_{i=1}^{N} = \{(b_i, o_i, c_i, \theta_i)\}_{i=1}^{N} = \mathrm{VLM}_{\text{layout}}(I),$$
  with constrained decoding enforcing well-formed output: boxes within page bounds, categories drawn from a fixed label set, and reading-order indices forming a permutation of $\{1, \dots, N\}$.
- Stage II: Region-Level Content Recognition
For each detected region $r_i = (b_i, o_i, c_i, \theta_i)$:
- Crop and rotate: $I_i = \mathrm{Rotate}(\mathrm{Crop}(I, b_i), -\theta_i)$
- Type-specific recognition: $y_i = \mathrm{VLM}_{\text{rec}}(I_i, c_i)$, where the category $c_i$ selects the recognition prompt (text, formula, or table)
- Merging for document output: $D = \mathrm{Concat}\big(y_{\sigma(1)}, \dots, y_{\sigma(N)}\big)$,
where $\sigma$ denotes ascending reading-order index. This separation enables the VLM to perform both layout reasoning and robust, localized recognition with minimized error propagation; both stages are sketched in code below.
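As a concrete illustration of the two-stage flow, here is a minimal sketch under assumed interfaces: the `Region` schema, `predict_layout`, and `recognize` are stand-ins invented for this example, not the released MonkeyOCR v1.5 API.

```python
from dataclasses import dataclass
from PIL import Image

# Hypothetical Stage I region schema; field names are illustrative,
# not taken from the MonkeyOCR v1.5 release.
@dataclass
class Region:
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    order: int                               # reading-order index o_i
    category: str                            # c_i, e.g. "text", "table", "formula"
    angle: float                             # rotation angle theta_i in degrees

def parse_page(page: Image.Image, predict_layout, recognize) -> str:
    """Two-stage parse: global layout first, then per-region recognition.

    `predict_layout` (Stage I) and `recognize` (Stage II) stand in for the
    VLM calls; their signatures are assumptions for illustration only.
    """
    regions: list[Region] = predict_layout(page)              # Stage I
    # Constrained decoding is meant to guarantee a valid reading order:
    assert sorted(r.order for r in regions) == list(range(1, len(regions) + 1))
    outputs = []
    for r in sorted(regions, key=lambda r: r.order):          # ascending index sigma
        box = tuple(int(v) for v in r.bbox)
        crop = page.crop(box).rotate(-r.angle, expand=True)   # crop + de-rotate
        outputs.append(recognize(crop, r.category))           # type-specific recognition
    return "\n\n".join(outputs)                               # merged document output
```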
2. Visual-Consistency-Based Reinforcement Learning for Table Parsing
Parsing complex table structures in real-world documents traditionally requires expensive manual annotation. MonkeyOCR v1.5 introduces a self-supervised RL scheme using “render-and-compare” visual consistency:
- Reward Model Training:
- Construct positive/negative pairs via perturbation or error sampling.
- Render each candidate HTML $h$ to an image $\hat{I} = \mathrm{Render}(h)$ and compare it with the gold table crop $I$.
- Train a VLM-based binary reward model $R_\phi(I, \hat{I}) \in [0, 1]$ that scores whether the rendered candidate is visually consistent with the original crop.
- Policy Optimization (GRPO):
The supervised fine-tuned policy $\pi_\theta$ is improved over a dataset $\mathcal{D}$ of labeled and unlabeled table crops, maximizing
$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[R(x, y)\big],$$
with GRPO's group-relative policy gradient: for each crop $x$, a group of $G$ candidates $\{y_g\}_{g=1}^{G}$ is sampled, rewards are normalized within the group into advantages $A_g = (R(x, y_g) - \bar{R}) / \sigma_R$, and the update follows
$$\nabla_\theta J(\theta) \approx \frac{1}{G} \sum_{g=1}^{G} A_g \, \nabla_\theta \log \pi_\theta(y_g \mid x)$$
(both the reward and the group-relative update are sketched below).
This RL-based refinement steers the parser toward table structures where rendered HTML aligns visually with the observed tables, avoiding the need for additional annotation (Zhang et al., 13 Nov 2025).
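A minimal sketch of the render-and-compare reward and the group-relative advantage computation follows. The renderer choice (wkhtmltoimage) and the scorer callable are assumptions for illustration; candidate sampling and the optimizer step are elided.

```python
import subprocess, tempfile
import torch

def render_html_to_png(html: str, out_png: str) -> None:
    """Render candidate table HTML to an image for render-and-compare.

    wkhtmltoimage is one possible headless renderer; the renderer actually
    used by MonkeyOCR v1.5 is not specified in this summary.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as f:
        f.write(html)
        src = f.name
    subprocess.run(["wkhtmltoimage", src, out_png], check=True)

def visual_reward(gold_crop_png: str, candidate_html: str, score_fn) -> float:
    """Reward = visual consistency between the rendered candidate and the
    gold table crop, judged by a (VLM-based) binary scorer `score_fn`."""
    rendered = gold_crop_png + ".rendered.png"
    render_html_to_png(candidate_html, rendered)
    return float(score_fn(gold_crop_png, rendered))  # assumed callable in [0, 1]

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize rewards within each sampled group.

    `rewards` has shape (batch, G); each row holds the visual-consistency
    rewards of G candidate HTML parses for one table crop.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# The policy-gradient step then weights token log-probs by these advantages:
#   loss = -(grpo_advantages(R).detach() * logprobs).mean()
```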
3. Specialized Modules: IDTP and TGTM
MonkeyOCR v1.5 addresses two longstanding challenges in document parsing via dedicated modules:
- Image-Decoupled Table Parsing (IDTP):
Many tables contain embedded images that disrupt text-centric parsers. IDTP operates by:
1. Detecting embedded images in table regions (via YOLOv10).
2. Replacing each detected figure with a placeholder token `<img id="k"/>` and tracking the placeholder-to-image mapping.
3. Running the masked table through the VLM table parser, yielding output HTML that contains the `<img id="k"/>` tokens.
4. Post-processing to reinsert the original images at their placeholder positions.
This decoupling preserves the integrity of textual recognition and layout structure regardless of non-textual cell contents; a reinsertion sketch follows.
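The post-processing step can be illustrated concretely; the placeholder format follows the `<img id="k"/>` convention above, while the regex-based substitution and file naming are assumed implementation details.

```python
import re

def reinsert_images(masked_html: str, id_to_src: dict[str, str]) -> str:
    """Replace each <img id="k"/> placeholder with the original image source.

    `id_to_src` maps placeholder ids to the cropped embedded-image files
    that were masked out before table parsing.
    """
    def repl(match: re.Match) -> str:
        k = match.group(1)
        return f'<img src="{id_to_src[k]}"/>'
    return re.sub(r'<img id="(\w+)"\s*/>', repl, masked_html)

# Usage: the parser only saw placeholders; post-processing restores the figures.
html = '<table><tr><td>Label</td><td><img id="0"/></td></tr></table>'
print(reinsert_images(html, {"0": "crops/table_fig_0.png"}))
```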
- Type-Guided Table Merging (TGTM):
  Table segments split across pages or columns are first classified by split type, and each detected pattern is merged accordingly:
- Pattern 1: Duplicate headers are dropped and bodies concatenated.
- Patterns 2 and 3: A BERT-based row-boundary classifier determines if row splits warrant cell merging or straightforward concatenation.
- Throughout, column indices are realigned, headings normalized, and span conflicts resolved.
Together, these modules enable parsing of tables that span multiple pages or columns and contain embedded images, cases where previous OCR systems typically fail. A sketch of the duplicate-header merge (Pattern 1) follows.
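A minimal sketch of Pattern 1, assuming the fragments arrive as parsed HTML strings; the use of BeautifulSoup and the helper's name are illustrative, not taken from the release.

```python
from bs4 import BeautifulSoup  # assumed HTML toolkit, not specified by the report

def merge_duplicate_header(frag_a: str, frag_b: str) -> str:
    """TGTM Pattern 1: drop the repeated header row of the second fragment
    and concatenate the table bodies."""
    a = BeautifulSoup(frag_a, "html.parser")
    b = BeautifulSoup(frag_b, "html.parser")
    rows_a = a.find_all("tr")
    rows_b = b.find_all("tr")
    # If fragment B's first row duplicates fragment A's header, drop it.
    if rows_a and rows_b and rows_a[0].get_text() == rows_b[0].get_text():
        rows_b = rows_b[1:]
    for row in rows_b:
        a.table.append(row)  # move remaining rows under fragment A's table
    return str(a)
```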
4. Empirical Results and Comparative Analysis
MonkeyOCR v1.5 is empirically validated on OmniDocBench v1.5 and public table benchmarks, using the following metrics:
- OmniDocBench v1.5 metrics:
- Overall accuracy (%)
- Text edit distance (lower is better)
- Formula correctness (CDM, Character Detection Matching; higher is better)
- Table accuracy (TEDS and structure-only TEDS-s, Tree-Edit-Distance-based Similarity; higher is better)
- Reading-order edit distance (lower is better; normalized edit distance is illustrated after the comparison table)
- Comparison Table (Table 4 from the report):
| Model | Overall ↑ | Text (Edit) ↓ | Formula (CDM) ↑ | Table (TEDS) ↑ | TEDS-s ↑ | Reading Order (Edit) ↓ |
|---|---|---|---|---|---|---|
| PPOCR-VL | 91.9 | 0.039 | 88.7 | 91.0 | 94.9 | 0.048 |
| MinerU 2.5 | 90.7 | 0.047 | 88.5 | 88.2 | 92.4 | 0.044 |
| MonkeyOCR v1.5 (Ours) | 92.9 | 0.046 | 91.2 | 92.0 | 95.0 | 0.049 |
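For reference, the text and reading-order scores above are normalized edit distances. The following is a generic implementation of that metric, not OmniDocBench's exact evaluation code:

```python
def normalized_edit_distance(pred: str, gold: str) -> float:
    """Levenshtein distance divided by the longer string's length (lower is better)."""
    m, n = len(pred), len(gold)
    dp = list(range(n + 1))  # one rolling row of the standard DP table
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (pred[i - 1] != gold[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(m, n, 1)

print(normalized_edit_distance("MonkeyOCR", "MonkeyOCR v1.5"))  # ~0.357
```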
Furthermore, on the OCRFlux-Complex subset, MonkeyOCR v1.5 attains 90.9% table-recognition accuracy, outperforming both PPOCR-VL and MinerU 2.5 by approximately 9.2%.
Qualitatively, the system maintains the lowest edit distances on dense, multi-column newspaper layouts, accurately reconstructs tables with embedded images, and preserves continuity across cross-page tables (Zhang et al., 13 Nov 2025).
5. Technical and Conceptual Contributions
Key technical and conceptual advancements of MonkeyOCR v1.5 include:
- Streamlined, VLM-based two-stage pipeline for simultaneous layout and reading-order prediction with region-level recognition.
- Visual-consistency-based RL leveraging unlabeled data for table parsing, obviating the expense of new ground-truth annotation.
- Specialized modules (IDTP, TGTM) for reliable parsing of tables with embedded images and the reconstruction of cross-segment tables.
- Superior performance on challenging document benchmarks compared to existing OCR baselines, particularly in visually intricate scenarios.
These contributions mark a substantial advance in document parsing for complex, real-world layouts.
6. Limitations and Prospective Extensions
The technical report does not enumerate explicit limitations. A plausible implication is that latency and multimodality coverage could be practical concerns. Suggested future directions include:
- Extending the visual-consistency RL framework to formulas and mixed-content document regions.
- Reducing inference latency to enable real-time applications.
- Scaling the system to support additional languages and highly ornate layouts.
- Investigating deeper end-to-end integration of global and local parsing stages (Zhang et al., 13 Nov 2025).
In summary, MonkeyOCR v1.5 advances the frontier in vision-language document parsing, particularly in handling complex tabular structures and visually multifaceted documents.