Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
The paper "Graph Convolution for Multimodal Information Extraction from Visually Rich Documents" presents a novel approach to information extraction from visually rich documents (VRDs) by leveraging graph convolution networks. VRDs are prevalent in various real-world scenarios, such as purchase receipts and insurance policies, where both textual and visual information is essential to comprehend the document's content. This paper addresses a critical gap in traditional information extraction models like BiLSTM-CRF, which typically fail to incorporate visual context, hence diminishing their effectiveness in extracting structured information from VRDs.
Methodology
The research introduces a graph convolution-based model that combines textual and visual information from documents. VRDs are modeled as graphs in which nodes represent text segments and edges capture visual relationships between them, such as their positional arrangement. Node embeddings are produced by a Bi-LSTM over each segment's text, while edge embeddings encode visual features such as the distances and relative sizes between segment pairs. Graph convolution layers aggregate these textual and visual features into composite segment embeddings, which are then combined with token embeddings and decoded by a BiLSTM-CRF layer for entity extraction.
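To make the architecture concrete, below is a minimal PyTorch sketch of these three pieces: a Bi-LSTM segment encoder, pairwise visual edge features, and a single attention-based graph convolution layer. All class names, dimensions, and the exact edge-feature set and normalization are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the paper's graph convolution idea. Node embeddings
# come from a Bi-LSTM over each text segment, edge embeddings encode
# pairwise visual relations, and the convolution aggregates
# (node_i, edge_ij, node_j) triplets with attention over neighbors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Encode each text segment's token embeddings with a Bi-LSTM."""
    def __init__(self, token_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(token_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):          # tokens: (num_segments, seq_len, token_dim)
        _, (h, _) = self.lstm(tokens)   # h: (2, num_segments, hidden_dim)
        return torch.cat([h[0], h[1]], dim=-1)  # (num_segments, 2*hidden_dim)

def edge_features(boxes):
    """Visual features for every segment pair: horizontal and vertical
    offsets plus relative width/height. Normalizing by the source
    segment's height is an assumed choice, not taken from the paper."""
    x, y, w, h = boxes.T                             # boxes: (n, 4) floats
    dx = (x[None, :] - x[:, None]) / h[:, None]      # horizontal distance
    dy = (y[None, :] - y[:, None]) / h[:, None]      # vertical distance
    rw = w[None, :] / h[:, None]                     # relative width
    rh = h[None, :] / h[:, None]                     # relative height
    return torch.stack([dx, dy, rw, rh], dim=-1)     # (n, n, 4)

class GraphConvLayer(nn.Module):
    """One layer: score each (node_i, edge_ij, node_j) triplet, attend
    over neighbors j, and update node_i with the weighted sum."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        self.triplet = nn.Linear(2 * node_dim + edge_dim, node_dim)
        self.attn = nn.Linear(node_dim, 1)

    def forward(self, nodes, edges):                 # nodes: (n, d), edges: (n, n, e)
        n = nodes.size(0)
        hi = nodes[:, None, :].expand(n, n, -1)      # node_i broadcast over j
        hj = nodes[None, :, :].expand(n, n, -1)      # node_j broadcast over i
        h = torch.relu(self.triplet(torch.cat([hi, edges, hj], dim=-1)))
        alpha = F.softmax(self.attn(h).squeeze(-1), dim=-1)  # (n, n) weights
        return torch.einsum('ij,ijd->id', alpha, h)  # updated node embeddings
```

In the full model, the composite node embeddings produced this way would be concatenated with token embeddings and fed to the BiLSTM-CRF tagger; the sketch covers only the graph side.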
Experimental Insights
The authors conducted extensive experiments on two real-world datasets: Value-Added Tax Invoices (VATI) and International Purchase Receipts (IPR). Comparisons with baseline models show that the proposed graph convolution method achieves significant F1 gains on both datasets, and that it outperforms traditional one-dimensional text sequence methods, confirming the benefit of combining visual and textual cues in VRDs. The ablation studies reveal that visual features contribute substantially, providing contextual information that is especially valuable for extracting complex entities, and show that the attention mechanism effectively highlights the salient parts of a document.
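As a rough illustration of the visual-feature ablation described above, one could zero out the edge features in the sketch from the Methodology section and compare entity-level F1 scores; the helper below is hypothetical.

```python
# Hypothetical ablation helper, reusing the GraphConvLayer sketch above:
# replacing the visual edge features with zeros isolates their contribution,
# so the resulting drop in entity-level F1 reflects what the visual
# modality adds over text alone.
import torch

def forward_without_visual(layer, nodes, edges):
    """Run a graph convolution layer with visual edge features ablated."""
    return layer(nodes, torch.zeros_like(edges))
```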
Implications of Findings
This paper's findings underscore the importance of multimodal information integration for VRDs. By demonstrating robust performance improvements and adaptability across varied document layouts, the proposed graph convolution approach has clear implications for advancing information extraction (IE) systems in practical business applications. Its success on noisy documents with diverse templates suggests it could scale effectively to industries with complex document-processing needs.
Future Directions
For future work, the authors suggest extending the graph convolution framework to additional tasks such as document classification, which promises broader impact on document understanding systems. Further exploration of adaptive mechanisms in the graph layers may optimize performance for specific use cases, and integrating additional visual features not covered in this paper, such as font size and color, could further improve the model. Developing techniques for handling dynamic graph structures may likewise enhance the generalizability of graph convolution methods.
In conclusion, this paper offers a significant advance in extracting information from visually rich documents. By leveraging graph convolutional networks, the approach charts a promising direction for artificial intelligence applications that must synthesize textual and visual data, positioning it as a potent tool for tackling the complexities inherent in document-intensive domains.