MASTER: Multi-Aspect Non-local Network for Scene Text Recognition (1910.02562v3)

Published 7 Oct 2019 in cs.CV

Abstract: Attention-based scene text recognizers have achieved great success by leveraging a compact intermediate representation to learn 1D or 2D attention with an RNN-based encoder-decoder architecture. However, such methods suffer from the attention-drift problem, because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism; moreover, RNN-based methods parallelize poorly and are therefore inefficient. To overcome these problems, we propose MASTER, a self-attention based scene text recognizer that (1) encodes not only the input-output attention but also self-attention, capturing feature-feature and target-target relationships inside the encoder and decoder, (2) learns an intermediate representation that is more powerful and more robust to spatial distortion, and (3) offers high training efficiency through parallelization and fast inference through an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of MASTER on both regular and irregular scene text. PyTorch code can be found at https://github.com/wenwenyu/MASTER-pytorch, and TensorFlow code at https://github.com/jiangxiluning/MASTER-TF.

Overview of MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

The paper presents the MASTER model, a novel approach for scene text recognition that incorporates a multi-aspect non-local network framework. This method addresses several challenges associated with traditional attention-based scene text recognizers, particularly the "attention-drift" problem and inefficiencies in RNN-based architectures due to poor parallelization. MASTER utilizes a self-attention mechanism, which encodes both input-output attention and self-attention within the encoder and decoder, leading to improved robustness against spatial distortions and enhanced training efficiency.
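For concreteness, here is a minimal sketch of how such a CNN-encoder plus Transformer-style-decoder pipeline might be wired in PyTorch. It is illustrative only: the tiny convolutional backbone, layer sizes, and the `TextRecognizerSketch` class are placeholder assumptions rather than the authors' exact architecture, and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TextRecognizerSketch(nn.Module):
    """Hypothetical skeleton: CNN feature encoder + Transformer decoder.

    Illustrative only; the toy conv backbone and layer sizes are
    placeholders, and positional encodings are omitted for brevity.
    """

    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        # Toy CNN backbone producing a 2D feature map (placeholder for a
        # real backbone augmented with global-context blocks).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, images: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)              # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'*W', C)
        tgt = self.embed(targets)                  # (B, T, C)
        # Causal mask: each output position attends only to earlier targets.
        mask = nn.Transformer.generate_square_subsequent_mask(
            targets.size(1)).to(images.device)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.proj(out)                      # (B, T, vocab_size)
```

Because the decoder is purely attention-based, all target positions are trained in parallel with a single causal mask, which is the source of the training-efficiency gain over RNN decoders.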

Key Contributions

  1. Multi-Aspect Global Context Attention (MA-GCAttention): The paper introduces a novel multi-aspect non-local block integrated into the encoder. Each aspect models its own global context, capturing a different kind of spatial feature dependency, so the mechanism acts much like multi-head self-attention (see the sketch after this list).
  2. Efficient Memory-Cache Decoding: The paper proposes a memory-cache based decoding strategy that speeds up inference by caching intermediate results and avoiding redundant computation during autoregressive decoding.
  3. Performance Strengths: The method achieves state-of-the-art results across several benchmarks, on both regular and irregular scene text.
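
One plausible reading of the multi-aspect global context idea, in the spirit of GCNet-style global context blocks, is that each aspect computes its own softmax attention over all spatial positions and pools its own slice of the channels. The sketch below follows that reading; the `MultiAspectGCBlock` class and its hyperparameters (`aspects`, `ratio`) are illustrative assumptions, not the paper's reference implementation (see the linked repositories for that).

```python
import torch
import torch.nn as nn

class MultiAspectGCBlock(nn.Module):
    """Sketch of a multi-aspect global context (non-local) block.

    Each aspect has its own attention distribution over all H*W spatial
    positions, analogous to multi-head attention, and pools its own slice
    of the channel dimension. Hyperparameters are illustrative.
    """

    def __init__(self, channels: int, aspects: int = 8, ratio: float = 0.25):
        super().__init__()
        assert channels % aspects == 0
        self.aspects = aspects
        self.split = channels // aspects
        # One 1x1 conv per aspect -> per-position attention logits.
        self.context_mask = nn.Conv2d(channels, aspects, kernel_size=1)
        hidden = int(channels * ratio)
        # Bottleneck transform of the pooled context (as in GCNet).
        self.transform = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (B, aspects, H*W): one spatial attention distribution per aspect.
        mask = self.context_mask(x).flatten(2).softmax(dim=-1)
        # Split channels into per-aspect groups: (B, aspects, split, H*W).
        feats = x.view(b, self.aspects, self.split, h * w)
        # Attention-weighted pooling per aspect -> (B, aspects, split).
        context = torch.einsum('basn,ban->bas', feats, mask)
        # Broadcast the transformed context back onto every position.
        return x + self.transform(context.reshape(b, c, 1, 1))
```

A block like this can be dropped in after each stage of the CNN encoder; with `aspects=1` it reduces to a standard global context block, which is why the multi-aspect variant is described as a multi-head analogue.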

Experimental Results

The MASTER model outperforms existing methods on numerous standard benchmarks in both accuracy and efficiency. Notable results include:

  • Accuracy Improvements: Achieves near-top performance on datasets such as COCO-Text, particularly under case-sensitive evaluation, highlighting its robust handling of complex scenes and irregular text shapes.
  • Speed Enhancements: Parallel training and memory-cache based decoding yield faster training and inference than traditional RNN-based models such as SAR (see the decoding sketch below).
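
The following is a minimal sketch of the memory-cache idea for autoregressive decoding. The `decode_step` interface is a hypothetical assumption: it is taken to consume only the newest token plus cached per-layer keys and values, returning one step of logits and the grown cache, so each step avoids re-running the decoder over the whole prefix.

```python
import torch

@torch.no_grad()
def greedy_decode_with_cache(model, memory, max_len=50, sos_id=1, eos_id=2):
    """Greedy decoding with a key/value cache (hypothetical interface).

    Without a cache, step t re-runs the decoder over all t prefix tokens
    (O(T^2) work overall); with a cache, each step computes attention only
    for the newest position and appends its keys/values to the cache.
    """
    batch = memory.size(0)
    tokens = torch.full((batch, 1), sos_id,
                        dtype=torch.long, device=memory.device)
    cache = None  # assumed: per-layer (key, value) tensors, grown each step
    for _ in range(max_len):
        # Assumed interface: feed only the newest token plus the cache.
        logits, cache = model.decode_step(tokens[:, -1:], memory, cache)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return tokens
```

The same cache structure works unchanged for beam search, which is where redundant prefix recomputation is most expensive.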

Theoretical and Practical Implications

From a theoretical perspective, multi-aspect self-attention addresses the limitations of conventional attention mechanisms by modeling spatial dependencies from several complementary views. Practically, the MASTER model's efficiency makes it a strong candidate for real-time deployment, for example in robotic process automation, autonomous vehicle systems, and other applications requiring robust text recognition under challenging conditions.

Future Directions

Anticipated future developments in AI based on this research may explore deeper integrations of non-local attention mechanisms across various domains beyond text recognition. Additionally, extending this framework to accommodate multi-lingual or domain-specific text variations could broaden its applicability further.

In conclusion, the MASTER model stands as a significant advancement in the domain of scene text recognition by effectively combining robustness, accuracy, and computational efficiency, marking a step forward in addressing the challenges of contemporary text recognition systems.

Authors (7)
  1. Ning Lu (88 papers)
  2. Wenwen Yu (16 papers)
  3. Xianbiao Qi (38 papers)
  4. Yihao Chen (40 papers)
  5. Ping Gong (12 papers)
  6. Rong Xiao (44 papers)
  7. Xiang Bai (221 papers)
Citations (146)