A Multi-Object Rectified Attention Network for Scene Text Recognition (1901.03003v1)

Published 10 Jan 2019 in cs.CV

Abstract: Irregular text is widely used. However, it is considerably difficult to recognize because of its various shapes and distorted patterns. In this paper, we thus propose a multi-object rectified attention network (MORAN) for general scene text recognition. The MORAN consists of a multi-object rectification network and an attention-based sequence recognition network. The multi-object rectification network is designed for rectifying images that contain irregular text. It decreases the difficulty of recognition and enables the attention-based sequence recognition network to more easily read irregular text. It is trained in a weak supervision way, thus requiring only images and corresponding text labels. The attention-based sequence recognition network focuses on target characters and sequentially outputs the predictions. Moreover, to improve the sensitivity of the attention-based sequence recognition network, a fractional pickup method is proposed for an attention-based decoder in the training phase. With the rectification mechanism, the MORAN can read both regular and irregular scene text. Extensive experiments on various benchmarks are conducted, which show that the MORAN achieves state-of-the-art performance. The source code is available.


Overview of MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition

The paper introduces the Multi-Object Rectified Attention Network (MORAN), an approach to general scene text recognition that targets irregular text. The architecture combines a Multi-Object Rectification Network (MORN) with an Attention-Based Sequence Recognition Network (ASRN): MORN first corrects the distortion and varying shapes of scene text, and ASRN then reads the rectified image.

Key Contributions and Methodologies

Multi-Object Rectification Network (MORN): The rectification network preprocesses images containing irregular text so that the subsequent recognition task becomes easier. Unlike rectifiers based on affine transformations, MORN places no geometric constraints on the transformation, which gives a more general and flexible rectification method. This flexibility lets MORN handle a wider range of text distortions, which makes the model more applicable to real-world scenarios.
Attention-Based Sequence Recognition Network (ASRN): After rectification, ASRN applies a CNN-BLSTM framework followed by an attention-based decoder, which focuses on the target characters and outputs the predictions sequentially. The proposed Fractional Pickup (FP) method modifies the decoder's attention weights during training, making the attention mechanism more sensitive and more robust to noise and textual ambiguities.
Training Strategy: The paper also introduces a curriculum learning strategy to make training more efficient. Optimization is staged: MORN and ASRN are first trained separately and are then combined for joint optimization. Staging mitigates the interference that could arise if both networks were optimized together from the outset.
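The Fractional Pickup step described above can be illustrated with a small sketch. It assumes one common reading of the method: at a decoding step, a random pair of neighbouring attention weights is blended with a random coefficient, so the total attention mass is unchanged while the decoder is exposed to slightly shifted attention. The function name `fractional_pickup`, the pair-selection rule, and the sample weights are illustrative assumptions, not the authors' released code.

```python
import random


def fractional_pickup(alpha, rng=random):
    """Blend one random pair of neighbouring attention weights.

    alpha: list of attention weights for a single decoding step.
    rng:   object exposing randrange() and random(), e.g. random.Random(seed).
    Returns a new list; the input list is left unchanged.

    Assumption: the pair (k, k+1) and the coefficient beta are drawn
    uniformly at random. This is a sketch, not the reference implementation.
    """
    out = list(alpha)
    n = len(out)
    if n < 2:
        return out  # nothing to blend

    k = rng.randrange(n - 1)   # index of the left element of the pair
    beta = rng.random()        # mixing coefficient in [0, 1)

    a, b = alpha[k], alpha[k + 1]
    out[k] = beta * a + (1.0 - beta) * b
    out[k + 1] = (1.0 - beta) * a + beta * b
    return out


# Hypothetical attention weights over four feature positions.
rng = random.Random(0)
alpha = [0.05, 0.70, 0.20, 0.05]
mixed = fractional_pickup(alpha, rng)
```

Because the two updated weights still sum to the original pair's total, the attention distribution stays normalized. In a training loop, such a step would be applied only during training and skipped at inference time, in line with the description above.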

Experimental Evaluation

Experiments on standard benchmarks, including the IIIT5K, SVT, ICDAR2003, and CUTE80 datasets, show that MORAN improves recognition accuracy over existing methods, especially on irregular text, and reaches state-of-the-art performance on several challenging datasets. Because training is weakly supervised and needs only images and their text labels, and because the curriculum strategy limits interference between the two networks, the model is practical to train for diverse recognition tasks.

Theoretical and Practical Implications

MORAN's architecture and training method have implications for both research and deployment in text recognition and computer vision. By splitting irregular text recognition into rectification and sequence recognition, the model offers a template for future work, particularly on detection-agnostic recognition systems. In addition, MORAN needs no pixel-level or geometric labels, which is a considerable advantage where datasets evolve and detailed annotation is costly.

Future Directions

Future research could extend MORAN to arbitrary-oriented text recognition, which remains a challenging problem with many practical applications. Additionally, integrating MORAN with a robust scene text detector could yield an end-to-end system that improves overall OCR performance in complex real-world scenarios.

In sum, MORAN advances scene text recognition by addressing the challenges of irregular text through its network design and training strategies.


Authors (3)

Canjie Luo (20 papers)
Lianwen Jin (116 papers)
Zenghui Sun (4 papers)

Citations (244)
