Towards Robust Visual Information Extraction in Real World: A New Dataset and Novel Solution
The paper under review introduces a comprehensive approach to Visual Information Extraction (VIE) in real-world scenarios, built around a robust framework termed VIES that is designed to tackle the challenges posed by complex document layouts and diverse text forms. The innovative aspect of this work lies in its unified, end-to-end trainable framework, which integrates the sub-tasks of text detection, recognition, and information extraction into a single cohesive process. This integration is achieved through two novel coordination mechanisms: the Vision Coordination Mechanism (VCM) and the Semantics Coordination Mechanism (SCM).
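To make the coordination idea concrete, the sketch below shows one plausible way such mechanisms could pass information between the spotting and extraction branches. The class names, feature dimensions, and gating scheme are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class VisionCoordination(nn.Module):
    """Illustrative stand-in for a VCM-style module: gates visual features
    from the detection branch before the extraction branch consumes them."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, det_features: torch.Tensor) -> torch.Tensor:
        # Emphasize visually salient channels for downstream extraction.
        return det_features * self.gate(det_features)

class SemanticsCoordination(nn.Module):
    """Illustrative stand-in for an SCM-style module: projects semantic cues
    from the recognition branch back into a shared feature space so that
    spotting and extraction can be optimized jointly."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, recog_features: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(recog_features))
```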
The authors highlight a significant limitation of previous VIE systems, which typically approach the problem in a decoupled manner, separating text spotting from information extraction. Such an approach fails to fully exploit the interconnected nature of these tasks. The VIES framework, in contrast, uses multimodal feature fusion to optimize all stages of the pipeline simultaneously, improving both accuracy and efficiency.
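In practice, this simultaneous optimization amounts to training all branches against a single multi-task objective, so that gradients from the extraction head also shape the detector and recognizer. The weighting scheme below is an assumed illustration, not the configuration reported in the paper.

```python
def joint_objective(det_loss, recog_loss, extract_loss,
                    w_det=1.0, w_recog=1.0, w_extract=1.0):
    """Single end-to-end loss: all three sub-tasks share gradients,
    so text spotting is tuned directly for extraction quality."""
    return w_det * det_loss + w_recog * recog_loss + w_extract * extract_loss
```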
A notable contribution of this paper is the introduction of the EPHOIE dataset, the first Chinese benchmark covering both text spotting and visual information extraction. It consists of real-world images of examination paper heads containing both handwritten and printed text. The dataset includes 1,494 images with a total of 15,771 annotated text instances, and its complex layouts and noisy backgrounds make it a challenging test case.
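For readers unfamiliar with this kind of benchmark, one plausible shape for a combined spotting-and-extraction annotation is sketched below; the field names and values are hypothetical and are not taken from the released EPHOIE files.

```python
# Hypothetical annotation record for a single text instance; the actual
# EPHOIE release may use different field names and file layout.
example_instance = {
    "image": "exam_paper_0001.jpg",
    "polygon": [[120, 34], [318, 34], [318, 62], [120, 62]],  # quadrilateral region
    "transcription": "Name: Zhang San",  # recognized text content
    "entity_label": "Name",              # key category for information extraction
    "handwritten": True,                 # dataset mixes handwritten and printed text
}
```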
The empirical results presented in the paper demonstrate the effectiveness of VIES. The system achieved a 9.01% improvement in F-score on the SROIE dataset under the end-to-end scenario, indicating a substantial performance gain over state-of-the-art methods. This improvement underscores the efficacy of the end-to-end optimization and the introduction of multimodal feature integration in VIES.
An interesting dimension of the proposed method is the adaptive feature fusion module (AFFM), which effectively combines features from various sources and levels, notably segment-level and token-level information, to generate robust representations. This approach enhances the system's ability to handle diverse data and varying input complexities.
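A minimal, assumed realization of such fusion is shown below: a learned gate balances a segment-level embedding against its pooled token-level embeddings. This is an illustrative sketch, not the AFFM specified in the paper.

```python
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Assumed sketch of adaptive fusion: learns per-dimension weights that
    blend segment-level and token-level features into one representation."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, segment_feat: torch.Tensor, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: (batch, num_tokens, dim); pool so both inputs describe one segment.
        pooled = token_feats.mean(dim=1)
        alpha = self.attn(torch.cat([segment_feat, pooled], dim=-1))
        return alpha * segment_feat + (1 - alpha) * pooled
```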
The research implications of this work are considerable, as it provides a structured pathway for developing more integrated and efficient VIE systems. It highlights the importance of leveraging the interconnectedness of various sub-tasks within VIE for improved performance. The presented dataset further contributes to the field by setting a benchmark for future works focusing on similar challenges in Chinese document processing.
In terms of practical implications, VIES can potentially be used in advanced applications such as automated document filing, intelligent educational tools, and automated marking systems. These applications could greatly benefit from the improved accuracy and efficiency provided by the end-to-end integration of text spotting and information extraction tasks.
Future research could explore the extension of the VIES framework to additional languages and script types, assessing the system's adaptability and efficacy across different cultural and regional document types. Further exploration into the optimization of VCM and SCM could yield even greater performance improvements in the domain of document understanding.
In conclusion, the work presents a well-structured solution to the challenges faced in visual information extraction, backed by thorough empirical validation and the release of a significant new dataset. The novel approach of integrating multiple sub-tasks in an end-to-end framework marks an advancement in the field, providing a foundation for future research and development efforts in VIE systems.