An Analysis of the Focusing Attention Network for Scene Text Recognition
The paper "Focusing Attention: Towards Accurate Text Recognition in Natural Images" presents a novel approach to address the limitations of existing attention-based encoder-decoder frameworks in scene text recognition. This area remains a critical challenge within computer vision, primarily due to the variable quality and complexity of images encountered in natural settings. The authors identify a significant issue termed "attention drift," where the alignments between feature areas and target texts become inaccurate, degrading the performance of traditional attention mechanisms.
Key Contributions
The authors introduce the Focusing Attention Network (FAN), which augments a conventional attention network with a focusing mechanism. The primary components of FAN are:
- Attention Network (AN): This component plays the conventional decoder role in an encoder-decoder framework, recognizing character targets. At each decoding step, the attention mechanism computes alignment factors (attention weights) that determine how much each encoded image feature contributes to the current prediction (a minimal sketch of this computation follows the list).
- Focusing Network (FN): The novel contribution of FAN, this module corrects the attention drift of the AN. The FN first checks whether the AN's attention region for each predicted character is properly aligned with the corresponding text region, then rectifies the focus by predicting the character distribution over a glimpse around the computed attention center, adding an extra supervision signal that pulls the attention back onto the target region. This rectification is especially beneficial for complex or distorted images (the sketch after this list also illustrates the attention-center estimate used for such a check).
- ResNet-based CNN Encoder: Unlike prior implementations that rely on shallower CNNs, FAN uses a ResNet-based convolutional neural network (CNN) to produce deeper and more robust image representations, which further improves recognition accuracy.
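To make the alignment and focusing ideas concrete, the following is a minimal NumPy sketch of a Bahdanau-style alignment computation together with the attention-center estimate that a focusing step could use to detect drift. It is not the authors' implementation: the weight matrices `W_s`, `W_h`, `v`, the function names, and the receptive-field centers are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def alignment_factors(s_prev, H, W_s, W_h, v):
    """Bahdanau-style alignment: score each encoder feature h_j against the
    previous decoder state s_prev, then normalize into attention weights.
    H: (T, d_h) encoder feature sequence; s_prev: (d_s,) decoder state."""
    scores = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v   # (T,)
    return softmax(scores)                              # alpha_t, sums to 1

def attention_center(alpha, centers):
    """Estimate where the attention lands in the image by taking the
    alpha-weighted average of each feature's receptive-field center.
    centers: (T, 2) array of (x, y) receptive-field centers."""
    return alpha @ centers                              # expected (x, y)

# Illustrative shapes only: 26 feature columns, 256-dim features, toy weights.
rng = np.random.default_rng(0)
T, d_h, d_s = 26, 256, 256
H = rng.standard_normal((T, d_h))
s_prev = rng.standard_normal(d_s)
W_h = rng.standard_normal((d_s, d_h))
W_s = rng.standard_normal((d_s, d_s))
v = rng.standard_normal(d_s)

alpha = alignment_factors(s_prev, H, W_s, W_h, v)
centers = np.stack([np.linspace(4, 100, T), np.full(T, 16.0)], axis=1)
cx, cy = attention_center(alpha, centers)
# A focusing step would compare (cx, cy) with the ground-truth character
# region and add an extra classification loss on the glimpse around it,
# penalizing attention that has drifted away from the target.
```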
Experimental Validation
FAN was evaluated on widely used benchmarks, including IIIT5k, SVT, and several ICDAR datasets. The results show that FAN outperforms existing methods by clear margins in most settings; for instance, under lexicon-free conditions it reaches 87.4% accuracy on IIIT5k, well above prior state-of-the-art methods. The total normalized edit distance metric corroborates FAN's advantage in recognition quality.
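For reference, a total normalized edit distance can be computed as below. This is a minimal sketch of one common definition (each sample's Levenshtein distance normalized by the longer of the two strings, summed over the test set); the exact normalization used in the paper may differ, and the function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def total_normalized_edit_distance(predictions, ground_truths):
    """Sum of per-sample edit distances, each normalized by the longer string."""
    return sum(edit_distance(p, g) / max(len(p), len(g), 1)
               for p, g in zip(predictions, ground_truths))

# Example: lower is better; an exact match contributes 0.
print(total_normalized_edit_distance(["hello", "w0rld"], ["hello", "world"]))  # 0.2
```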
Implications and Future Prospects
Practically, FAN marks a significant step forward in scene text recognition by improving robustness to noise, distortion, and diverse text appearances in natural images. Theoretically, the notion of attention drift and its mitigation through a focusing network may open avenues in related areas, such as speech recognition, where similar alignment issues arise.
Looking forward, extending the ideas behind FAN to other computer vision tasks, such as object detection or video analysis, could be a fruitful direction. The authors also suggest that a similar focusing technique could improve text detection performance.
In conclusion, the Focusing Attention Network represents a well-constructed step towards more precise text recognition models, tackling longstanding challenges with innovative methodologies and reaffirming the importance of alignment correction in attention-based models.