Decoupled Attention Network for Text Recognition (1912.10205v1)

Published 21 Dec 2019 in cs.CV

Abstract: Text recognition has attracted considerable research interest because of its various applications. The cutting-edge text recognition methods are based on attention mechanisms. However, most attention methods suffer from a serious alignment problem due to their recurrent alignment operation, where the alignment relies on historical decoding results. To remedy this issue, we propose a decoupled attention network (DAN), which decouples the alignment operation from using historical decoding results. DAN is an effective, flexible and robust end-to-end text recognizer, which consists of three components: 1) a feature encoder that extracts visual features from the input image; 2) a convolutional alignment module that performs the alignment operation based on visual features from the encoder; and 3) a decoupled text decoder that makes the final prediction by jointly using the feature map and attention maps. Experimental results show that DAN achieves state-of-the-art performance on multiple text recognition tasks, including offline handwritten text recognition and regular/irregular scene text recognition.

Authors (8)

Tianwei Wang
Yuanzhi Zhu
Lianwen Jin
Canjie Luo
Xiaoxue Chen
Yaqiang Wu
Mingxiang Cai
QianYing Wang

Citations (236)

Summary

Decoupled Attention Network for Text Recognition

The paper "Decoupled Attention Network for Text Recognition" by Tianwei Wang et al. proposes a decoupled attention network (DAN) for text recognition. Traditional attention mechanisms compute the alignment at each step from historical decoding results, so a decoding error can carry over into later alignments. This often leads to misalignment, particularly on long text sequences. DAN decouples the alignment operation from these historical decoding results, aiming to improve robustness and accuracy.
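
The contrast between the two alignment styles can be written compactly. The block below is a sketch in standard additive-attention notation, not the paper's own formulation: the symbols $w$, $W_h$, $W_f$, $b$, the decoder state $h_{t-1}$, the feature vectors $f_j$, and the feature map $F$ are assumptions introduced for illustration.

```latex
\begin{align*}
% Conventional attention: the score at step t depends on the decoder state h_{t-1},
% which itself depends on earlier decoding results.
e_{t,j} &= w^{\top} \tanh\left(W_h h_{t-1} + W_f f_j + b\right), &
\alpha_{t,j} &= \frac{\exp(e_{t,j})}{\sum_{k} \exp(e_{t,k})} \\
% Decoupled attention: all T attention maps are computed from visual features alone.
(\alpha_1, \dots, \alpha_T) &= \mathrm{CAM}(F)
\end{align*}
```

In the first line, an error in an earlier step changes $h_{t-1}$ and therefore the alignment at step $t$. In the second line, the attention maps have no such dependency.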

Methodology

The proposed Decoupled Attention Network (DAN) consists of three main components:

Feature Encoder: Based on convolutional neural networks (CNN), the feature encoder extracts visual features from input images.
Convolutional Alignment Module (CAM): This module performs the alignment operation using the visual features extracted by the feature encoder, independent of historical decoding results. It employs a fully convolutional network (FCN) to generate attention maps, replacing the traditional score-based recurrent alignment.
Decoupled Text Decoder: This decoder makes the final prediction by jointly using the feature map and the attention maps. Each attention map weights the feature map to give the visual context for one decoding step, and a gated recurrent unit (GRU) performs the sequence prediction. The alignment itself comes from the CAM and does not depend on historical decoding results.
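
The way the decoder combines the feature map with the attention maps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `context_vectors`, the flattened list layout, and the toy numbers are all hypothetical, and the encoder, the CAM, and the GRU are omitted.

```python
from typing import List

Vector = List[float]


def context_vectors(feature_map: List[Vector],
                    attention_maps: List[List[float]]) -> List[Vector]:
    """Return one context vector per decoding step (hypothetical helper).

    feature_map:    P feature vectors of size C, spatial positions flattened.
    attention_maps: T maps, each holding one weight per position.
    Each context vector is the attention-weighted sum of the feature vectors.
    """
    channels = len(feature_map[0])
    contexts = []
    for weights in attention_maps:
        if len(weights) != len(feature_map):
            raise ValueError("each attention map needs one weight per position")
        c_t = [0.0] * channels
        for w, feat in zip(weights, feature_map):
            for k in range(channels):
                c_t[k] += w * feat[k]
        contexts.append(c_t)
    return contexts


# Toy data (made up): 3 positions, 2 channels, 2 decoding steps.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
attention = [[1.0, 0.0, 0.0],   # step 1 looks only at position 0
             [0.0, 0.5, 0.5]]   # step 2 splits between positions 1 and 2
contexts = context_vectors(features, attention)
print(contexts)
```

Because the attention maps are an input here rather than something recomputed from the decoder state, every context vector can be formed before any character is predicted. A recurrent unit such as a GRU would then consume these vectors one step at a time.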

Experimental Results

DAN is evaluated on offline handwritten text recognition and on both regular and irregular scene text recognition, and achieves state-of-the-art performance on multiple tasks. Key findings from the experiments include:

Handwritten Text Recognition: On the IAM dataset, DAN achieved a Character Error Rate (CER) of 6.4%, significantly improving upon previous attention-based methods. On the RIMES dataset, it achieved a 29% relative reduction in Word Error Rate (WER).
Scene Text Recognition: The network showed competitive or superior results across multiple datasets, such as IIIT5K, IC03, IC13, and CUTE80. Notably, the 2D form of DAN (denoted as DAN-2D) excelled in handling irregular text, which is common in scene text scenarios.
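
For readers unfamiliar with the metrics above, the following sketch shows one common way to compute CER, WER, and a relative error reduction. It assumes the usual edit-distance definition; the exact evaluation protocol used for each dataset may differ. The function names and the example strings and rates are hypothetical and are not results from the paper.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance over characters."""
    return error_rate(list(ref), list(hyp))


def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: edit distance over whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline - new) / baseline


# Made-up examples, not figures from the paper.
print(cer("hello world", "hallo world"))                 # one wrong character out of 11
print(wer("the quick brown fox", "the quick brow fox"))  # one wrong word out of 4
print(relative_reduction(0.10, 0.05))                    # error halved
```

A relative reduction compares the change against the baseline error, so it can be large even when the absolute change in percentage points is small.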

Implications and Future Work

Decoupling the alignment from decoding history improves the handling of long text sequences and irregular text, reducing the misalignments that affect traditional attention methods. This has practical implications for the robustness and accuracy of text recognition systems in real-world applications, such as OCR for handwritten documents or reading diverse text in natural scenes.

Future developments may explore further refinement of the component modules, particularly the CAM, to enhance alignment accuracy even under challenging conditions, such as distorted text or complex backgrounds. Additionally, adapting and testing the DAN framework in diverse application areas beyond the datasets explored could provide further insights into its versatility and efficacy.

Conclusion

The Decoupled Attention Network addresses a key limitation of existing attention-based approaches to text recognition. By decoupling alignment from decoding history, DAN improves the flexibility and robustness of text recognition systems and achieves state-of-the-art performance on multiple recognition tasks. The techniques set forth in this research may inform further work in the field of text recognition.
