PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition (2207.14807v1)

Published 29 Jul 2022 in cs.CV

Abstract: Handwritten Chinese text recognition (HCTR) has been an active research topic for decades. However, most previous studies solely focus on the recognition of cropped text line images, ignoring the error caused by text line detection in real-world applications. Although some approaches aimed at page-level text recognition have been proposed in recent years, they either are limited to simple layouts or require very detailed annotations including expensive line-level and even character-level bounding boxes. To this end, we propose PageNet for end-to-end weakly supervised page-level HCTR. PageNet detects and recognizes characters and predicts the reading order between them, which is more robust and flexible when dealing with complex layouts including multi-directional and curved text lines. Utilizing the proposed weakly supervised learning framework, PageNet requires only transcripts to be annotated for real data; however, it can still output detection and recognition results at both the character and line levels, avoiding the labor and cost of labeling bounding boxes of characters and text lines. Extensive experiments conducted on five datasets demonstrate the superiority of PageNet over existing weakly supervised and fully supervised page-level methods. These experimental results may spark further research beyond the realms of existing methods based on connectionist temporal classification or attention. The source code is available at https://github.com/shannanyinxiang/PageNet.

Citations (21)

Summary

The paper introduces PageNet, a weakly supervised, bottom-up framework for page-level HCTR that utilizes character detection, recognition, and reading order prediction to handle complex document layouts.
PageNet demonstrates superior performance on various datasets, matching or exceeding fully supervised methods while significantly reducing the need for extensive character-level annotations.
This work provides a feasible solution for efficient handwritten Chinese text recognition, reducing annotation burdens and showing potential for broader application across diverse languages and document types.

Overview of PageNet: Towards End-to-End Weakly Supervised Page-Level Handwritten Chinese Text Recognition

The paper introduces PageNet, an approach to page-level handwritten Chinese text recognition (HCTR) trained under a weakly supervised learning framework. Rather than recognizing pre-cropped text lines, PageNet works bottom-up: it detects and recognizes individual characters and predicts the reading order between them. This lets it handle layouts with curved text lines and multidirectional reading orders, which are common in real document images.

Methodology

PageNet is composed of four parts: a backbone network for feature extraction, a detection and recognition module, a reading order module, and a graph-based decoding algorithm. The detection and recognition module uses three branches to output character-level bounding boxes, distribution maps, and classification probabilities. The reading order module does not assume a fixed left-to-right order. Instead, it predicts the reading sequence by identifying where each line starts, moving between grid cells in one of four directions, and signaling where each line ends.
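The reading order mechanism described above can be illustrated with a small sketch. It assumes the module's outputs have already been reduced to a per-cell direction label and a per-cell end-of-line flag; the names `trace_line`, `direction`, and `end_of_line` are hypothetical, and the sketch treats every cell on the path as part of the line, which may be simpler than what PageNet actually does.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column) of a grid cell

# The four allowed moves, as (row offset, column offset).
MOVES: Dict[str, Cell] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def trace_line(
    start: Cell,
    direction: List[List[str]],
    end_of_line: List[List[bool]],
) -> List[Cell]:
    """Follow per-cell directions from a start-of-line cell until end-of-line.

    Assumption: `direction` and `end_of_line` are hypothetical, already
    thresholded versions of the reading order module's outputs.
    """
    rows, cols = len(direction), len(direction[0])
    path: List[Cell] = []
    seen = set()
    r, c = start
    # Stop when leaving the grid or revisiting a cell (guards against cycles).
    while 0 <= r < rows and 0 <= c < cols and (r, c) not in seen:
        path.append((r, c))
        seen.add((r, c))
        if end_of_line[r][c]:
            break
        dr, dc = MOVES[direction[r][c]]
        r, c = r + dr, c + dc
    return path


# A bent line on a 2x3 grid: right, down, right, then stop.
direction = [
    ["right", "down", "right"],
    ["right", "right", "right"],
]
end_of_line = [
    [False, False, False],
    [False, False, True],
]
print(trace_line((0, 0), direction, end_of_line))
# [(0, 0), (0, 1), (1, 1), (1, 2)]
```

Because each step is a local move rather than a global left-to-right sort, the same loop can follow vertical, reversed, or bent lines without special cases.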

The backbone network processes the input page image and produces high-level feature maps at reduced spatial resolution for the downstream modules. The detection and recognition module then passes these feature maps through its separate branches to localize and classify characters on a grid.
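As a rough illustration of how three per-cell outputs could be combined into character predictions, the sketch below keeps a cell only if a confidence value passes a threshold and then takes the most probable class. It is a minimal sketch, not PageNet's implementation: it assumes the distribution map can be read as a per-cell confidence that a character is present, and `decode_grid`, `CharPrediction`, and the threshold of 0.5 are hypothetical. Duplicate suppression between neighboring cells is left out.

```python
from typing import List, NamedTuple, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


class CharPrediction(NamedTuple):
    cell: Tuple[int, int]  # (row, column) of the grid cell
    box: Box
    char: str
    score: float


def decode_grid(
    boxes: Sequence[Sequence[Box]],
    char_conf: Sequence[Sequence[float]],
    class_probs: Sequence[Sequence[Sequence[float]]],
    charset: str,
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> List[CharPrediction]:
    """Turn per-cell branch outputs into a list of character predictions."""
    predictions: List[CharPrediction] = []
    for r, row in enumerate(char_conf):
        for c, conf in enumerate(row):
            if conf < threshold:
                continue  # assumed: low confidence means no character here
            probs = class_probs[r][c]
            best = max(range(len(probs)), key=probs.__getitem__)
            predictions.append(
                CharPrediction((r, c), boxes[r][c], charset[best], conf * probs[best])
            )
    return predictions


# A 1x2 grid with a two-character vocabulary; only the first cell is kept.
preds = decode_grid(
    boxes=[[(0, 0, 10, 10), (10, 0, 20, 10)]],
    char_conf=[[0.9, 0.1]],
    class_probs=[[[0.2, 0.8], [0.5, 0.5]]],
    charset="天地",
)
print([(p.cell, p.char) for p in preds])
```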

The graph-based decoding algorithm combines the outputs of the previous modules into line-level and character-level results. It builds a graph in which nodes correspond to detected characters and edges encode the reading order predicted by the reading order module. Traversing this graph turns the raw, unordered grid predictions into ordered text lines, which is what allows PageNet to decode complex layouts.
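The graph view above can be sketched in a few lines. The example assumes each detected character already has at most one successor edge and that start-of-line nodes are known; `decode_lines`, `chars`, `successor`, and `starts` are hypothetical names, and the line box is simply the union of the character boxes, which is an assumption rather than the paper's stated procedure.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def decode_lines(
    chars: Dict[int, Tuple[str, Box]],
    successor: Dict[int, Optional[int]],
    starts: List[int],
) -> List[dict]:
    """Walk successor edges from each start node to assemble text lines."""
    lines: List[dict] = []
    for start in starts:
        node: Optional[int] = start
        visited = set()
        text: List[str] = []
        boxes: List[Box] = []
        # Follow edges until the line ends, a node is missing, or a cycle appears.
        while node is not None and node in chars and node not in visited:
            visited.add(node)
            char, box = chars[node]
            text.append(char)
            boxes.append(box)
            node = successor.get(node)
        if not boxes:
            continue  # start node had no detected character
        line_box = (
            min(b[0] for b in boxes),
            min(b[1] for b in boxes),
            max(b[2] for b in boxes),
            max(b[3] for b in boxes),
        )
        lines.append({"text": "".join(text), "char_boxes": boxes, "line_box": line_box})
    return lines


chars = {
    0: ("你", (0, 0, 10, 10)),
    1: ("好", (12, 1, 22, 11)),
    2: ("吗", (0, 20, 10, 30)),
}
successor = {0: 1, 1: None, 2: None}
lines = decode_lines(chars, successor, starts=[0, 2])
print([line["text"] for line in lines])
# ['你好', '吗']
```

Each result carries both the per-character boxes and a line-level box, mirroring the two levels of output the overview describes.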

Weakly Supervised Learning Framework

A significant cost in HCTR is annotation, in particular drawing a bounding box for every character on a page. PageNet reduces this cost with a weakly supervised framework that requires only line-level transcripts as annotations for real data. The framework trains on synthetic data together with real samples and repeats a loop of three steps: matching predictions against the transcripts, updating pseudo-labels, and optimizing the model parameters.

Through semantic and spatial matching algorithms, PageNet identifies which characters were predicted correctly and uses them to update the pseudo-labels iteratively. This strategy removes the need for costly bounding-box annotation and produces character-level bounding boxes that are accurate enough for practical applications.
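The idea of keeping only predictions that agree with the transcript can be sketched with Python's standard `difflib.SequenceMatcher`. This is a stand-in for illustration, not the matching algorithm from the paper: it covers only a sequence-alignment form of semantic matching, omits spatial matching entirely, and the name `match_to_transcript` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def match_to_transcript(
    pred_chars: str,
    pred_boxes: Sequence[Box],
    transcript: str,
) -> Dict[int, Box]:
    """Map transcript positions to predicted boxes where the characters agree.

    Positions in the transcript with no agreeing prediction get no
    pseudo-label in this round.
    """
    matcher = SequenceMatcher(None, pred_chars, transcript, autojunk=False)
    pseudo_labels: Dict[int, Box] = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            pseudo_labels[block.b + k] = pred_boxes[block.a + k]
    return pseudo_labels


# The model misreads the third character; the other four are trusted.
pred_chars = "今天夭气好"
pred_boxes = [(i * 10.0, 0.0, i * 10.0 + 9.0, 9.0) for i in range(5)]
transcript = "今天天气好"
pseudo = match_to_transcript(pred_chars, pred_boxes, transcript)
print(sorted(pseudo))
# [0, 1, 3, 4]
```

In an iterative scheme like the one described, unmatched positions such as index 2 here would stay without a pseudo-label until a later round produces an agreeing prediction.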

Experimental Evaluation

PageNet was evaluated on multiple datasets, including CASIA-HWDB, ICDAR2013, and others covering a range of document types and layouts. The experiments indicate that PageNet, despite being trained with weak supervision, performs comparably to or better than fully supervised methods. It also works on both handwritten and printed text, which suggests that the approach generalizes beyond a single document type.

The results show that PageNet maintains high accuracy on multilingual text (e.g., English and Chinese) and on challenging reading orders in multidirectional and curved lines. Additional experiments indicate that the pseudo-labels generated during training are of comparable or better quality than manual annotations, which suggests a possible use in automatic document annotation.

Implications and Future Research

PageNet represents a significant advancement in HCTR, bringing forth a framework that reduces annotation burdens while enhancing the capabilities of page-level recognition systems. This work paves the way for broader application across varied textual structures and document types, not only in Chinese but potentially extending to diverse linguistic corpora.

Future research might explore expanding PageNet's abilities to seamlessly integrate LLMs or address complex layouts common across historical and degraded documents, further enhancing its effectiveness. Moreover, investigating its application in live scenarios and dynamic textual content could streamline the development of automated reading systems for educational, legal, and archival purposes.

In summary, PageNet offers a feasible solution for realistic and efficient handwritten Chinese text recognition, handling complex document images with reduced supervision. The paper also clarifies the main challenges of page-level text recognition and gives later work a basis to build on.

GitHub

GitHub - shannanyinxiang/PageNet (81 stars)