
PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks (2004.07464v3)

Published 16 Apr 2020 in cs.CV

Abstract: Computer vision with state-of-the-art deep learning models has achieved huge success in the field of Optical Character Recognition (OCR) including text detection and recognition tasks recently. However, Key Information Extraction (KIE) from documents as the downstream task of OCR, having a large number of use scenarios in real-world, remains a challenge because documents not only have textual features extracting from OCR systems but also have semantic visual features that are not fully exploited and play a critical role in KIE. Too little work has been devoted to efficiently make full use of both textual and visual features of the documents. In this paper, we introduce PICK, a framework that is effective and robust in handling complex documents layout for KIE by combining graph learning with graph convolution operation, yielding a richer semantic representation containing the textual and visual features and global layout without ambiguity. Extensive experiments on real-world datasets have been conducted to show that our method outperforms baselines methods by significant margins. Our code is available at https://github.com/wenwenyu/PICK-pytorch.

Authors (5)
  1. Wenwen Yu (16 papers)
  2. Ning Lu (88 papers)
  3. Xianbiao Qi (38 papers)
  4. Ping Gong (12 papers)
  5. Rong Xiao (44 papers)
Citations (126)

Summary

The paper presents a novel approach to Key Information Extraction (KIE) from documents through a framework named PICK. The method tackles the persistent challenge of integrating textual and visual features within document layouts to improve extraction precision. This is an important problem: real-world document layouts are often complex, and robust methods are needed to handle that intricacy effectively.

Core Contributions

  1. Integration of Graph Learning: PICK incorporates an improved graph learning-convolutional network that learns a soft adjacency matrix capturing relationships between nodes (text segments) without manually predefined connections. This allows the graph structure to be refined dynamically, enabling more accurate feature propagation.
  2. Robust Utilization of Textual and Visual Features: Unlike many traditional KIE methods that rely solely on textual attributes, PICK concurrently exploits visual features such as image content, layout, and positional information. This duality yields the richer semantic representation needed for precise KIE and mitigates the ambiguity that plagues text-only entity extraction.
  3. Enhanced Graph Convolution Operations: Graph convolutional networks (GCNs) in PICK are applied to extract and propagate non-local and non-sequential features, thus leveraging document layout and visual cues. The improved GCN model eschews conventional reliance on fully connected graphs, enhancing performance by selectively aggregating relevant node information.
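The graph-learning step above can be illustrated with a minimal sketch. The code below is not the paper's implementation (the real one is in the linked repository); all names and shapes are assumptions. It shows how a soft adjacency matrix might be learned from per-segment node features via row-wise softmax over pairwise scores, followed by one graph-convolution step:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_learning_conv(h, w_score, w_gcn):
    """One graph-learning + graph-convolution step (illustrative only).

    h:       (N, d) node features, one row per text segment
    w_score: (2d,)  hypothetical pairwise relation scorer
    w_gcn:   (d, d) hypothetical graph-convolution weight
    """
    n, d = h.shape
    # All pairwise concatenations of node features -> (N, N, 2d)
    pairs = np.concatenate(
        [np.repeat(h[:, None, :], n, axis=1),
         np.repeat(h[None, :, :], n, axis=0)], axis=-1)
    # Soft adjacency: row-wise softmax over learned pairwise scores
    adj = softmax(pairs @ w_score, axis=-1)          # (N, N)
    # Propagate: H' = ReLU(A H W)
    return np.maximum(adj @ h @ w_gcn, 0.0), adj

# Toy usage: 4 text-segment nodes with 8-dim features
h = rng.normal(size=(4, 8))
out, adj = graph_learning_conv(h, rng.normal(size=16), rng.normal(size=(8, 8)))
```

Because each row of the adjacency matrix is a softmax over learned scores, every node aggregates from all others with learned weights, rather than through manually predefined or uniformly fully connected hard edges.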

Experimental Results

The efficacy of the PICK framework is substantiated through extensive experiments on three datasets with diverse layouts: Medical Invoice, Train Ticket, and SROIE. Key findings include:

  • Medical Invoice Dataset: PICK outperforms baseline methods with an average improvement of 14.7% in mEF (mean entity F1-score). This demonstrates the framework's ability to exploit visual cues such as font color when extracting entities like invoice numbers.
  • Train Ticket Dataset: Achieving near-perfect mEF scores, PICK excels in fixed layout scenarios, highlighting its robustness and adaptability in handling structured document layouts.
  • SROIE Dataset: PICK achieves competitive results against models trained on larger datasets, suggesting it makes efficient use of the available data.

Implications and Future Work

The research makes a significant contribution to document information extraction by demonstrating a method that tightly integrates textual and visual information. PICK's ability to learn graph structures dynamically points toward adaptable and scalable KIE solutions, and its combination of graph learning with convolutional networks lays a foundation for future work on refining document analysis techniques.

Future developments could focus on improving scalability and efficiency for real-time applications. Extending the framework to multilingual documents, or integrating it with larger pre-trained models such as LayoutLM, could further advance the state of the art in KIE.

In summary, PICK represents a significant step towards sophisticated document understanding systems by bridging the gap between textual and visual information through advanced graph-based learning methodologies.