Chargrid: Towards Understanding 2D Documents (1809.08799v1)

Published 24 Sep 2018 in cs.CL, cs.CV, cs.LG, and cs.NE

Abstract: We introduce a novel type of text representation that preserves the 2D layout of a document. This is achieved by encoding each document page as a two-dimensional grid of characters. Based on this representation, we present a generic document understanding pipeline for structured documents. This pipeline makes use of a fully convolutional encoder-decoder network that predicts a segmentation mask and bounding boxes. We demonstrate its capabilities on an information extraction task from invoices and show that it significantly outperforms approaches based on sequential text or document images.

Summary

The paper introduces the chargrid approach, a 2D text mapping that preserves document layout for improved semantic understanding.
It employs a fully convolutional encoder-decoder network to segment documents and accurately extract key-value pairs.
The method significantly outperforms baselines based on sequential text or on document images when extracting structured information from invoices.

Chargrid: A New Paradigm for Understanding 2D Documents

The paper "Chargrid: Towards Understanding 2D Documents" presents a novel approach for document understanding by introducing a two-dimensional text representation termed "chargrid." This method aims to preserve and leverage the spatial structure of documents, a feature often overlooked by traditional NLP methods that primarily operate on one-dimensional text sequences. The authors propose a fully convolutional encoder-decoder network architecture to exploit the chargrid representation for tasks such as information extraction, specifically focusing on extracting key-value pairs from invoices.

Core Contributions

Chargrid Representation: The chargrid approach encodes a document by mapping each character to a two-dimensional grid based on its spatial location. This preservation of layout is crucial, especially for structured documents where position and alignment convey semantic meaning.
Document Understanding Pipeline: A generic pipeline is established using a fully convolutional network. This network predicts segmentation masks and bounding boxes, facilitating tasks like information extraction from structured documents.
Strong Performance on Information Extraction: The authors demonstrate that chargrid significantly outperforms traditional NLP models, which typically do not consider 2D document structures, and even surpasses image-only approaches that fail to capture textual semantics effectively.
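
The chargrid construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the input format (one tuple per character with an exclusive bounding box already expressed in grid units), the function names build_chargrid and one_hot, and the convention that index 0 means background are all assumptions made for the example. The one-hot step reflects our understanding that the grid of indices is expanded into one channel per character class before it is fed to a convolutional network.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: (character, x0, y0, x1, y1), with the box
# given in grid units and x1/y1 exclusive, e.g. from OCR or a PDF parser.
CharBox = Tuple[str, int, int, int, int]


def build_chargrid(chars: List[CharBox], width: int, height: int,
                   vocab: Dict[str, int]) -> List[List[int]]:
    """Return a height x width grid of character indices.

    Index 0 is reserved for background. Characters missing from
    `vocab` are also mapped to 0 in this sketch.
    """
    grid = [[0] * width for _ in range(height)]
    for ch, x0, y0, x1, y1 in chars:
        idx = vocab.get(ch, 0)
        # Fill every cell covered by the character's bounding box,
        # clipping the box to the page.
        for r in range(max(0, y0), min(height, y1)):
            for c in range(max(0, x0), min(width, x1)):
                grid[r][c] = idx
    return grid


def one_hot(grid: List[List[int]], num_classes: int) -> List[List[List[int]]]:
    """Expand each index into a one-hot vector (one channel per class)."""
    return [[[1 if v == k else 0 for k in range(num_classes)] for v in row]
            for row in grid]


# Two characters on a 6 x 3 page: "A" at the left, "B" to its right.
vocab = {"A": 1, "B": 2}
chars = [("A", 0, 0, 2, 2), ("B", 3, 0, 5, 2)]
grid = build_chargrid(chars, width=6, height=3, vocab=vocab)
for row in grid:
    print(row)
```

Because every cell inside a character's box carries that character's index, characters that are close on the page stay close in the grid, which is the property a convolutional model can exploit.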

Numerical Results

The paper reports a quantitative evaluation on a large dataset of invoices, in which the chargrid representation is strongest on complex, layout-dependent fields such as vendor addresses and line-item details. Compared to models based on sequential text, chargrid is markedly more accurate where the spatial arrangement is essential for understanding the content. Sequential models received negative scores on some complex fields, which indicates poor extraction quality, whereas chargrid's scores on those fields remained positive and more robust.
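
A score can only be negative if the measure penalizes errors relative to the size of the ground truth. The sketch below assumes a measure of that kind, in the style of a word error rate: one minus the number of insertions, deletions, and modifications needed to turn the prediction into the ground truth, divided by the number of ground-truth instances. The function name and signature are hypothetical, and the numbers in the example are made up for illustration, not taken from the paper.

```python
def extraction_score(insertions: int, deletions: int, modifications: int,
                     n_ground_truth: int) -> float:
    """Edit-based accuracy: 1 - (errors / number of ground-truth instances).

    Assumed form of the measure; it is at most 1 and becomes negative
    when the number of errors exceeds the number of ground-truth instances.
    """
    if n_ground_truth <= 0:
        raise ValueError("n_ground_truth must be positive")
    errors = insertions + deletions + modifications
    return 1.0 - errors / n_ground_truth


# Illustrative values only: 4 ground-truth instances.
print(extraction_score(0, 1, 0, 4))   # one missed instance
print(extraction_score(3, 1, 2, 4))   # more errors than ground-truth instances
```

Under this assumption, a negative score means that correcting the model's output would take more edits than entering the field values from scratch.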

Implications and Future Directions

The implications of chargrid are both practical and theoretical. Practically, it improves the extraction of structured information from documents on which conventional NLP methods struggle because layout carries meaning. Theoretically, it bridges the gap between sequential NLP and visual processing methods by treating document layout as a central element of document semantics.

Future research could extend chargrid to other document types and to tasks such as named entity recognition and document classification, as well as to documents that mix natural images with text. Experimenting with word-level grids or word embeddings in place of character-level grids could further refine semantic understanding. Hybrid models that combine image and chargrid representations also warrant further exploration, even though the current findings indicate that image data does not significantly improve performance on the studied task.

Overall, the chargrid paradigm aligns with evolving demands in document processing, especially where the spatial organization of content is non-trivial. This paper sets a foundation for subsequent developments and applications in the field, highlighting a potential shift in how NLP interacts with document layout and structure.