Image-based table recognition: data, model, and evaluation (1911.10683v5)

Published 25 Nov 2019 in cs.CV

Abstract: Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g., Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop the largest publicly available table recognition dataset PubTabNet (https://github.com/ibm-aur-nlp/PubTabNet), containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 9.7% absolute TEDS score.

Summary

The paper introduces PubTabNet, a benchmark dataset with 568,000 table images that supports robust training for table recognition.
The paper proposes an Encoder-Dual-Decoder (EDD) model that decodes table structure and cell content with separate decoders and outperforms the prior state of the art by 9.7% absolute TEDS score.
The paper introduces the TEDS metric, which accurately evaluates table structure and content by capturing multi-hop cell misalignment and OCR errors.

Image-based Table Recognition: Data, Model, and Evaluation

This paper presents a comprehensive study of image-based table recognition using deep learning. The authors introduce PubTabNet, a dataset of 568,000 table images with corresponding HTML representations, generated automatically by matching the XML and PDF representations of articles in the PubMed Central Open Access Subset (PMCOA).

The paper addresses three core components involved in table recognition:

Data

PubTabNet offers a significant contribution to table recognition by providing a large-scale, diverse set of tables sourced from over 6,000 journals. This dataset excels in diversity and complexity, with tables represented in HTML format, allowing for integration into web applications. The training set was meticulously curated by filtering out erroneous annotations and removing inconsistencies, ensuring high-quality data for model development.

Model

The authors propose a novel Encoder-Dual-Decoder (EDD) architecture consisting of an encoder, a structure decoder, and a cell decoder. Unlike standard single-decoder approaches, EDD splits the task in two: the structure decoder reconstructs the table structure as HTML tags and helps the cell decoder recognize the content of each cell. The architecture is attention-based, with the decoders attending to the visual features produced by the encoder. In the reported experiments, EDD outperforms the prior state of the art by 9.7% absolute TEDS score.
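To make the division of labor between the two decoders concrete, the sketch below shows how their outputs could be combined into one HTML string. It is only an illustration: the function name `merge_structure_and_cells`, the token vocabulary, and the rule that every `<td>` or `<th>` token consumes one cell string are assumptions, not the authors' implementation, and spanning cells are ignored.

```python
def merge_structure_and_cells(structure_tokens, cell_contents):
    """Interleave structure tokens with per-cell content strings.

    structure_tokens: HTML tag tokens, as a structure decoder might emit them.
    cell_contents: one string per cell, in reading order, as a cell decoder
        might emit them (hypothetical interface).
    """
    cells = iter(cell_contents)
    html = []
    for token in structure_tokens:
        html.append(token)
        # Assumption: each opening cell tag triggers exactly one cell string.
        if token in ("<td>", "<th>"):
            html.append(next(cells, ""))  # empty cell if content runs out
    return "".join(html)


structure = ["<table>", "<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>", "</table>"]
contents = ["Dose", "5 mg"]
print(merge_structure_and_cells(structure, contents))
# <table><tr><td>Dose</td><td>5 mg</td></tr></table>
```

In the model itself the coupling is tighter than this post-hoc merge suggests, since the structure decoder helps the cell decoder recognize content rather than merely supplying slots to fill.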

Evaluation

A new Tree-Edit-Distance-based Similarity (TEDS) metric is introduced to evaluate table recognition performance. TEDS addresses limitations of the previously established metric by capturing multi-hop cell misalignment and OCR errors. It represents each table as a tree of HTML elements and measures the edit distance between the predicted and ground-truth trees, so structure and content are assessed jointly. The authors validate the metric through perturbation experiments that examine how it responds to controlled errors.
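The paper defines the score as TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), where |T| is the number of nodes in tree T. The following is a minimal sketch of that idea, not the authors' evaluation code. It assumes unit cost for inserting or deleting a node, cost 1 for substituting nodes with different tags, and a normalized Levenshtein distance between the text of two `td` cells; row and column spans are ignored. The edit distance uses a simple memoized forest recursion that is only practical for small tables, and all function names are illustrative.

```python
from functools import lru_cache


def node(tag, text="", *children):
    """A tree node is a hashable tuple: (tag, text, children)."""
    return (tag, text, tuple(children))


def size(forest):
    """Total number of nodes in a tuple of trees."""
    return sum(1 + size(n[2]) for n in forest)


def levenshtein(a, b):
    """Classic dynamic-programming string edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def rename_cost(v, w):
    """Substitution cost between two nodes (simplified, spans ignored)."""
    if v[0] != w[0]:
        return 1.0
    if v[0] == "td" and (v[1] or w[1]):
        return levenshtein(v[1], w[1]) / max(len(v[1]), len(w[1]))
    return 0.0


@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Ordered edit distance between two forests (tuples of trees)."""
    if not f:
        return float(size(g))
    if not g:
        return float(size(f))
    v, w = f[-1], g[-1]
    return min(
        forest_dist(f[:-1] + v[2], g) + 1.0,  # delete rightmost root of f
        forest_dist(f, g[:-1] + w[2]) + 1.0,  # insert rightmost root of g
        forest_dist(f[:-1], g[:-1]) + forest_dist(v[2], w[2]) + rename_cost(v, w),
    )


def teds(ta, tb):
    """Similarity in [0, 1]; 1.0 means identical trees."""
    return 1.0 - forest_dist((ta,), (tb,)) / max(size((ta,)), size((tb,)))


def row(*cells):
    return node("tr", "", *(node("td", c) for c in cells))


truth = node("table", "", row("a", "b"), row("c", "d"))
ocr_error = node("table", "", row("a", "b"), row("c", "x"))
missing_row = node("table", "", row("a", "b"))

print(teds(truth, truth))        # 1.0
print(teds(truth, ocr_error))    # one wrong cell out of 7 nodes
print(teds(truth, missing_row))  # three missing nodes out of 7
```

Because a dropped row removes a whole subtree while a single misread cell changes one node, the two kinds of error are penalized on the same scale, which is the property the metric is designed to provide.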

Implications and Future Directions

The practical implications of this research span various domains, particularly where structured data extraction from documents is critical, such as legal, healthcare, and financial sectors. Theoretically, the introduction of a dual-decoder architecture paves the way for further exploration into multi-task learning within computer vision tasks, setting a precedent for future model designs and performance benchmarks in AI.

The authors propose future work focusing on expanding PubTabNet to include cell coordinate annotations, which would enhance the model’s capabilities to predict cell locations. Additionally, integrating table detection networks with the EDD model could facilitate end-to-end systems for comprehensive table detection and recognition. This research underscores the continuous evolution of table recognition technology and its potential applications across various industries.
