A Novel Baseline for Irregular Text Recognition: Show, Attend and Read
The paper "Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition" presents a method for recognizing irregularly shaped text in natural images. Unlike traditional Optical Character Recognition (OCR) systems, which are primarily designed for regular, horizontal text, the method proposed by Li et al. focuses on the variability in text appearance arising from curvature, orientation, and distortion.
Methodology
The proposed methodology is notable for its use of off-the-shelf neural network components, yielding a strong yet straightforward baseline for irregular scene text recognition. The architecture combines a 31-layer ResNet for feature extraction, an LSTM-based encoder-decoder framework, and a two-dimensional (2D) attention module. The model is trained from word-level annotations alone, with no need for fine-grained character-level labels, which underscores its simplicity and ease of implementation.
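To make the encoder side of this pipeline concrete, the sketch below follows the paper's overall shape: the CNN backbone produces a 2D feature map, the height axis is collapsed by max-pooling, and a recurrent network reads the resulting columns left to right to produce a holistic feature for initializing the decoder. This is a minimal illustration, not the authors' implementation: a toy tanh RNN cell stands in for the LSTM, the weight matrices and dimensions are arbitrary placeholders, and the names (`rnn_step`, `encode`) are hypothetical.

```python
import numpy as np

def rnn_step(x, h, Wx, Wh):
    # Toy tanh RNN cell standing in for the paper's LSTM.
    return np.tanh(x @ Wx + h @ Wh)

def encode(features, Wx, Wh):
    """Encode a CNN feature map into a holistic vector.

    features: (H, W, C) feature map from the ResNet backbone
    Wx: (C, D), Wh: (D, D) recurrent weights (placeholders)
    returns: (D,) summary vector used to initialize the decoder
    """
    cols = features.max(axis=0)           # vertical max-pool -> (W, C)
    h = np.zeros(Wh.shape[0])             # initial hidden state
    for t in range(cols.shape[0]):        # scan columns in reading order
        h = rnn_step(cols[t], h, Wx, Wh)
    return h

# Illustrative dimensions only: a 6x40 feature map with 32 channels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 40, 32))
Wx = rng.normal(size=(32, 64)) * 0.1
Wh = rng.normal(size=(64, 64)) * 0.1
summary = encode(feats, Wx, Wh)           # shape (64,)
```

Because the height axis is pooled away before the recurrence, the encoder output is a single vector summarizing the word image; the spatial detail needed to localize characters is recovered later by the 2D attention module, which attends over the full feature map rather than this summary.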
At the core of the system is the encoder-decoder framework augmented by a tailored 2D attention mechanism. This module localizes individual characters during decoding without character-level annotations, effectively handling the spatial complexities inherent in irregular text.
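One decoding step of such a 2D attention module can be sketched as additive attention applied over every spatial position of the feature map: each location is scored against the current decoder state, the scores are normalized with a softmax over the whole map, and the weighted sum forms the glimpse fed to the character classifier. This is a simplified sketch under stated assumptions, not the paper's exact module (it omits, for instance, any local-neighborhood term), and the projection names (`Wv`, `Wh`, `w`) are illustrative placeholders.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat array.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_2d(features, hidden, Wv, Wh, w):
    """One 2D additive-attention step over a conv feature map.

    features: (H, W, C) encoder feature map
    hidden:   (D,) current decoder hidden state
    Wv: (C, A), Wh: (D, A), w: (A,) learned projections (placeholders)
    returns: glimpse (C,) and attention map (H, W)
    """
    H, W_, C = features.shape
    # Score each location: e_ij = w . tanh(Wv f_ij + Wh h)
    proj = np.tanh(features @ Wv + hidden @ Wh)        # (H, W, A)
    scores = proj @ w                                   # (H, W)
    # Normalize over ALL spatial positions, so attention can follow
    # curved or rotated text anywhere in the 2D map.
    alpha = softmax(scores.ravel()).reshape(H, W_)
    glimpse = (alpha[..., None] * features).sum(axis=(0, 1))  # (C,)
    return glimpse, alpha

# Toy dimensions for illustration.
rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 40, 32))
h = rng.normal(size=(64,))
Wv = rng.normal(size=(32, 16))
Wh = rng.normal(size=(64, 16))
w = rng.normal(size=(16,))
glimpse, alpha = attend_2d(feats, h, Wv, Wh, w)
```

Because the attention weights are computed per spatial location rather than per column of a 1D sequence, the decoder can place its mass on a character wherever it sits in the image, which is what lets the model handle curved layouts without rectifying the input first.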
Empirical Evaluation
Empirical results underscore the method's robustness, showing state-of-the-art performance on various benchmarks for irregular text recognition such as the ICDAR 2015 dataset, with a marked improvement over existing systems. On regular text datasets, the proposed approach demonstrates competitive results, further highlighting its versatility.
A notable aspect of this model is that it is trained end-to-end from scratch, without pre-training, allowing it to draw on both real and synthetic data during training. The paper provides substantial empirical results that reinforce the efficacy of the approach, and this mix of synthetic and real training data proves pivotal to the system's robustness across diverse text conditions.
Comparative Analysis
The paper effectively contrasts its approach with prevailing methods in the field, such as those employing rectification strategies that transform images before recognition and those utilizing multi-directional encoding. These methods, while innovative, often struggle with severely distorted or curved text. The proposed 2D attention-based method circumvents these issues by operating directly on the original image rather than transforming it first, allowing it to remain flexible and robust when encountering irregular text patterns.
Additionally, the method's independence from character-level annotations is highlighted as a significant improvement over approaches that require auxiliary tasks or complex frameworks to supervise the attention mechanism.
Future Directions
The paper suggests multiple avenues for future research. These include potential adaptations of the architecture using CNNs in place of LSTMs to further streamline the training process, explorations into more intricate graph structures as part of the attention mechanism for improved context capture, and enhancements integrating classification objectives to speed up and refine training processes.
Conclusion
This paper makes substantive contributions to the field of scene text recognition, specifically for irregular text, through its introduction of a simple yet potent architecture built from off-the-shelf neural components. It furnishes the community with an effective and less annotation-intensive approach to a problem that is increasingly pertinent with the proliferation of visually and contextually rich multimedia content. Future work is encouraged to build upon these findings, exploring more sophisticated implementations and additional functionalities that may further broaden its application scope.