
Image-Decoupled Table Parsing (IDTP)

Updated 5 December 2025
  • The paper demonstrates that decoupling visual content from textual table structure yields improved structural fidelity, as seen in TEDS improvements from TableFormer and MonkeyOCR v1.5.
  • IDTP is a methodology that separates image localization (via masking and object detection) from structure decoding, ensuring clear boundaries and accurate HTML reconstruction.
  • The approach employs multi-task loss functions and reinforcement learning techniques to optimize both cell detection and image placeholder alignment for robust table parsing.

Image-Decoupled Table Parsing (IDTP) refers to a class of methodologies that structurally and algorithmically "decouple" the visual parsing of table cell content and embedded non-textual elements (such as images or figures) from the extraction and modeling of table structure. The fundamental objective is to prevent non-textual visual cues from interfering with the recognition of cell boundaries, structural tokens, and textual cell contents. Recent implementations, notably in TableFormer and MonkeyOCR v1.5, have realized the IDTP paradigm with marked quantitative and qualitative advances, particularly in robustly handling complex tables with embedded images, diverse layouts, and variable content modalities (Nassar et al., 2022, Zhang et al., 13 Nov 2025).

1. Motivation and Problem Definition

Tables in document images may contain text, figures, or photographs intermingled within structured rows and columns. Traditional table parsing architectures process the entire table region as a single image, so the features from embedded non-textual regions can interfere with both cell detection and table-structure decoding. The result is degraded boundary detection, spurious or missing cells, dropped images, and impaired HTML structure recovery. IDTP addresses this by explicitly decoupling image/figure content from the sequence modeling and structure inference pipeline. In MonkeyOCR v1.5, this is achieved through detection and masking of embedded figures prior to structure decoding, ensuring that HTML generation focuses solely on the structural/textual properties, with deferred re-integration of extracted images for accurate restoration (Zhang et al., 13 Nov 2025).

2. Core Architectural Paradigms

Both TableFormer and MonkeyOCR v1.5 instantiate IDTP by introducing clear architectural separation between modules handling visual (cell and/or image) localization and those responsible for structure prediction.

  • TableFormer couples a ResNet-18 based CNN encoder for image feature extraction with two downstream decoders: a transformer-based structure decoder (that emits tokenized HTML tags) and an object-detection decoder aligned with HTML <td> tokens. The structure decoder is strictly responsible for emitting table structure, while the cell detector regresses bounding boxes and "empty vs. non-empty" labels for each cell, based on hidden states at <td> token positions (Nassar et al., 2022).
  • MonkeyOCR v1.5 incorporates an explicit IDTP module as a data-preparation step. YOLOv10 is used to localize embedded image regions within a table crop. Detected regions are masked out (zeroed), producing a masked image I^m that is fed to a Vision Transformer (ViT)-based vision–language model for table-HTML decoding. The HTML decoder's output includes explicit placeholder tokens (<img id=j>) for image cells. Original images are reinserted in post-processing, guaranteeing fidelity and perfect image retention (Zhang et al., 13 Nov 2025).

The high-level workflow for IDTP in MonkeyOCR v1.5 can be summarized as:

| Step | Component | Functionality |
|---|---|---|
| Figure localization | YOLOv10 | Detect bounding boxes of embedded images within the table crop |
| Image decoupling & input | Masking / mapping | Mask detected image cells in I^o to produce I^m; record crop-to-ID mappings |
| Structure decoding | Vision–LLM (ViT + transformer) | Decode I^m into HTML with placeholder tokens for images |
| Image reintegration | Post-processing | Replace each <img id=j> token with the exact original image patch per recorded mapping |
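The masking-and-mapping bookkeeping in this workflow can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper name and the nested-list image representation are assumptions, and boxes are taken as (x0, y0, x1, y1) with exclusive upper bounds.

```python
def mask_table_image(image, boxes):
    """Zero out detected figure regions and record crops keyed by placeholder id.

    `image` is a 2-D grid (list of pixel rows); `boxes` are (x0, y0, x1, y1)
    detections. Returns (masked_image, {id: crop}) so the structure decoder
    only ever sees the masked input, and crops can be reinserted later.
    """
    masked = [row[:] for row in image]          # copy; keep the original intact
    crops = {}
    for img_id, (x0, y0, x1, y1) in enumerate(boxes):
        crops[img_id] = [row[x0:x1] for row in image[y0:y1]]
        for y in range(y0, y1):
            for x in range(x0, x1):
                masked[y][x] = 0                # region the decoder sees as blank
    return masked, crops
```

The key design point IDTP relies on is visible here: the decoder input (`masked`) and the restoration data (`crops`) are produced together, so the crop-to-ID mapping is exact by construction.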

3. Mathematical Formulation and Loss Functions

IDTP frameworks are trained to jointly optimize structural and localization objectives, with additional auxiliary losses in IDTP-aware systems.

  • TableFormer introduces a multi-task loss with cross-entropy over HTML tags and a composite box regression loss:

\ell = \lambda\,\ell_{s} + (1 - \lambda)\,\ell_{\text{box}}

where \ell_{s} is the cross-entropy loss for structure tokens, and

\ell_{\text{box}} = \lambda_{\text{iou}}\,\ell_{\text{iou}} + \lambda_{l1}\,\|b_{\text{pred}} - b_{\text{gt}}\|_1

combines generalized-IoU and L1 box regression. An explicit Tree-Edit-Distance Score (TEDS) metric is used:

\text{TEDS} = 1 - \frac{\text{EditDist}(T_{\text{pred}}, T_{\text{gt}})}{\max(|T_{\text{pred}}|, |T_{\text{gt}}|)}

measuring structural similarity at the HTML tag tree level (Nassar et al., 2022).
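The TEDS formula's shape can be illustrated numerically. Note the hedge: real TEDS computes a tree edit distance over the HTML tag tree, whereas the sketch below substitutes plain Levenshtein distance over serialized tag sequences — a simplification for intuition, not the metric used in the papers.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def teds_like(pred_tags, gt_tags):
    """1 - EditDist / max(|pred|, |gt|), mirroring the TEDS normalization."""
    denom = max(len(pred_tags), len(gt_tags))
    return 1.0 if denom == 0 else 1.0 - edit_distance(pred_tags, gt_tags) / denom
```

For example, a prediction that drops one cell from a four-token-per-row table loses score in proportion to the two missing tags, which matches the intuition that TEDS penalizes each structural deviation relative to table size.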

  • MonkeyOCR v1.5's IDTP module adds an auxiliary cross-entropy loss focused on correct placement of image-placeholder tokens and, optionally, an alignment loss that renders the predicted HTML and compares structural masks:

L_{\text{total}} = L_{\text{struct}}(\hat{y}, y^*) + \lambda\,L_{\text{img}}(\hat{y}, y^*) + \mu\,L_{\text{align}}(I^o, \hat{y})

Here, L_{\text{img}} penalizes misplacement of <img id=j> tokens, thereby reinforcing correct alignment of image objects within the structure. L_{\text{align}} optionally penalizes discrepancies between the rendered structure and the true table boundary map (Zhang et al., 13 Nov 2025).
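A minimal numeric sketch of this weighted combination, assuming per-token cross-entropy as the form of the structure and image terms; the helper names and default weights are illustrative, not values from the paper.

```python
import math

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target token at each position."""
    return -sum(math.log(p[t]) for p, t in zip(probs, target_ids)) / len(target_ids)

def total_loss(struct_probs, struct_targets, img_probs, img_targets,
               align_loss=0.0, lam=0.5, mu=0.1):
    """L_total = L_struct + lam * L_img + mu * L_align (illustrative weights)."""
    l_struct = cross_entropy(struct_probs, struct_targets)
    l_img = cross_entropy(img_probs, img_targets)
    return l_struct + lam * l_img + mu * align_loss
```

The point of the decomposition is that placeholder placement gets its own gradient signal (L_img) instead of being diluted inside the generic structure loss.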

Visual-consistency reinforcement learning is further used to optimize global structure+image consistency by maximizing expected reward under a separately trained reward model, leveraging a generalized RPO loop to close the perception-structure gap.

4. Data Integration and Preprocessing

Successful operationalization of IDTP depends on the availability of paired structural (HTML) and cell-level or image-level location supervision. TableFormer leverages programmatic PDF sources and datasets (PubTabNet, FinTabNet), where ground-truth (text, cell-box) pairs obviate the need for OCR and provide precise targets for the bounding box decoder. In cases like TableBank with incomplete bounding box annotations, missing bboxes are inferred or the table is pruned if reconstruction is ambiguous, ensuring strict one-to-one correspondence between <td> tokens and targets (Nassar et al., 2022). This approach enables training the object-detection decoder exclusively on supervised, non-OCR targets.
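The strict one-to-one correspondence between <td> tokens and box targets might be enforced with a validation pass of the following shape. This is a hypothetical sketch of the pruning rule described above, not TableFormer's actual data tooling.

```python
def validate_table(html_tokens, cell_boxes):
    """Keep a training sample only if every <td> token has exactly one box.

    Returns a list of (<td> token index, box) pairs, or None to signal that
    the table should be pruned because reconstruction is ambiguous.
    """
    td_positions = [i for i, tok in enumerate(html_tokens) if tok == "<td>"]
    if len(td_positions) != len(cell_boxes):
        return None                      # counts disagree: prune from training
    return list(zip(td_positions, cell_boxes))
```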

MonkeyOCR v1.5 extends the paradigm to complex, real-world cases by integrating automatic figure localization (YOLOv10) and placeholder-based mapping of image content to structure. The masking and mapping are purely data-preparation steps; no structure decoder ever receives the original images as input, enforcing architectural decoupling (Zhang et al., 13 Nov 2025).

5. Experimental Results and Quantitative Impact

Both TableFormer and MonkeyOCR v1.5 demonstrate substantial quantitative advances attributable to IDTP, specifically in structural fidelity and end-to-end content+structure accuracy.

TableFormer reports:

| Dataset | EDD (LSTM) TEDS | TableFormer TEDS |
|---|---|---|
| PubTabNet Simple | 91.1% | 98.5% |
| PubTabNet Complex | 88.7% | 95.0% |
| FinTabNet Simple | 88.4% | 97.5% |
| FinTabNet Complex | 92.1% | 96.0% |

Further improvements in mean average precision (mAP) for bounding box detection and end-to-end TEDS including cell text are also reported. Notably, integrating TableFormer’s explicit Cell BBox Decoder into EDD yields a 3–4 point mAP improvement, demonstrating the benefit of decoupled object detection for cell localization (Nassar et al., 2022).

MonkeyOCR v1.5, equipped with IDTP, achieves:

| Benchmark | MinerU2.5 TEDS | PaddleOCR-VL TEDS | MonkeyOCR v1.5 TEDS |
|---|---|---|---|
| PubTabNet | 89.1% | 85.2% | 90.7% |
| OCRFlux-Complex (figures) | 81.7% | 81.7% | 90.9% |

On OCRFlux-Complex, which contains many embedded figures and heterogeneous cell types, IDTP yields a +9.2 point improvement over baselines. Only MonkeyOCR and PaddleOCR-VL claim full support for embedded-image restoration, with qualitative results showing perfect image re-insertion (Zhang et al., 13 Nov 2025).

6. Integration in Document Understanding Pipelines

IDTP is typically embedded as a modular subroutine within broader document parsing and intelligence pipelines. In MonkeyOCR v1.5, IDTP operates as an intermediate between region detection and vision-language decoding. The workflow is:

  1. Layout analysis produces table regions.
  2. IDTP detects image cells, masks them, and runs the masked image through the VLM for HTML table structure decoding (including image placeholders).
  3. Extracted image patches are re-inserted into the HTML according to the output IDs.
  4. Downstream, a Type-Guided Table Merging module merges partial tables split across pages or columns by analyzing header and row continuation.

This modularity ensures IDTP can be adapted or extended for other document intelligence systems capable of region-level detection and structured sequence modeling (Zhang et al., 13 Nov 2025).
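The placeholder substitution in step 3 of this pipeline can be sketched with a regular expression. The <img id=j> token syntax follows the format described above; the mapping of ids to saved crop paths is an assumed storage scheme for illustration.

```python
import re

def reinsert_images(html, crops):
    """Replace each <img id=j> placeholder with a concrete <img> tag.

    `crops` maps integer placeholder ids to saved crop paths (illustrative).
    Unknown ids are left untouched so detection failures stay visible.
    """
    def swap(match):
        img_id = int(match.group(1))
        if img_id not in crops:
            return match.group(0)        # missed mapping: keep the placeholder
        return f'<img src="{crops[img_id]}">'
    return re.sub(r"<img id=(\d+)>", swap, html)
```

Leaving unmapped placeholders in place, rather than dropping them, is one way to make the detection-sensitivity limitation discussed in Section 7 auditable downstream.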

7. Limitations and Future Directions

IDTP's reliance on accurate figure localization and precise mapping between image regions and HTML placeholders introduces sensitivity to detection errors—missed or mislocalized images may not be restorable. The architecture presupposes datasets with sufficient annotation and clear mapping between structure and content. A plausible implication is that further gains could be realized by developing end-to-end networks that internally learn both structure-image associations and are robust to incomplete or noisy supervision.

The success of decoupling relies on a modular approach and on loss functions that penalize both structure and content misalignment. As the complexity and modality diversity of real-world tables grow (e.g., with formulas, charts, and non-rectangular layouts), research into joint vision–language methods and improved reinforcement-learning strategies (such as visual-consistency reward modeling) may further advance the state of the art.
