Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes

Published 22 Aug 2019 in cs.CV | (1908.08207v1)

Abstract: Unifying text detection and text recognition in an end-to-end training fashion has become a new trend for reading text in the wild, as these two tasks are highly relevant and complementary. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network named as Mask TextSpotter is presented. Different from the previous text spotters that follow the pipeline consisting of a proposal generation network and a sequence-to-sequence recognition network, Mask TextSpotter enjoys a simple and smooth end-to-end learning procedure, in which both detection and recognition can be achieved directly from two-dimensional space via semantic segmentation. Further, a spatial attention module is proposed to enhance the performance and universality. Benefiting from the proposed two-dimensional representation on both detection and recognition, it easily handles text instances of irregular shapes, for instance, curved text. We evaluate it on four English datasets and one multi-language dataset, achieving consistently superior performance over state-of-the-art methods in both detection and end-to-end text recognition tasks. Moreover, we further investigate the recognition module of our method separately, which significantly outperforms state-of-the-art methods on both regular and irregular text datasets for scene text recognition.

Summary

The paper introduces an integrated end-to-end network that combines text detection and recognition to handle arbitrary text shapes.
It leverages a modified Mask R-CNN with a Feature Pyramid Network and spatial attentional modules for effective segmentation and recognition.
The model achieves state-of-the-art performance on benchmarks like ICDAR and COCO-Text, excelling especially in curved text detection.

Introduction

The paper introduces Mask TextSpotter, a framework for scene text spotting, a task that requires detecting and recognizing text simultaneously. Unlike approaches that treat detection and recognition as separate steps, Mask TextSpotter combines them in a single end-to-end trainable network and uses semantic segmentation to handle text of arbitrary shape. The integration aims to avoid the sub-optimal performance that often results when the two tasks are addressed independently.

Model Architecture

Mask TextSpotter builds on a modified Mask R-CNN architecture (Figure 1) that comprises a Feature Pyramid Network (FPN) backbone, a Region Proposal Network (RPN) that generates text proposals, and a mask branch that performs text instance segmentation and character segmentation. Detection and recognition both operate directly in two-dimensional space, and a spatial attention module is added to improve recognition. Because the text is not flattened into a one-dimensional sequence, the design handles irregular shapes such as curved text more effectively than sequence-to-sequence models.

Figure 1: Architecture of Mask TextSpotter. Solid arrows indicate data flow in both training and inference. Dashed blue arrows indicate data flow in the training stage, and dashed red arrows indicate data flow in the inference stage.

Text Instance and Character Segmentation

The mask branch of Mask TextSpotter performs two tasks: text instance segmentation and character segmentation. For character segmentation, the network outputs character maps in which each pixel is assigned to a character class, so text can be read directly from the segmentation maps. The instance segmentation output localizes text regions, and the character maps allow characters to be grouped into words.
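
The step from character maps to a word can be sketched as a simple voting procedure. The code below is a minimal illustration, not the authors' implementation: the function names, the background-map threshold, and the use of 4-connected regions are all assumptions made for the example. It treats low-background pixels as character pixels, lets each character class vote with its mean probability inside each connected region, and reads the winning characters from left to right.

```python
from collections import deque


def connected_regions(mask):
    """Return the 4-connected foreground regions of a boolean grid as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def decode_character_maps(char_maps, background, alphabet, bg_threshold=0.5):
    """Decode a word from per-class character probability maps (hypothetical helper).

    char_maps:  one H x W probability map per character in `alphabet`.
    background: H x W map giving the probability that a pixel is not a character.
    bg_threshold is an assumed value chosen for this example.
    """
    h, w = len(background), len(background[0])
    # Pixels with low background probability are treated as character pixels.
    foreground = [[background[r][c] < bg_threshold for c in range(w)] for r in range(h)]
    chars = []
    for region in connected_regions(foreground):
        # Each class votes with its mean probability over the region.
        scores = [sum(m[y][x] for y, x in region) / len(region) for m in char_maps]
        best = max(range(len(alphabet)), key=scores.__getitem__)
        center_x = sum(x for _, x in region) / len(region)
        chars.append((center_x, alphabet[best]))
    # Read the characters from left to right.
    return "".join(ch for _, ch in sorted(chars))
```

Sorting by horizontal centre is the simplest possible grouping rule and would not suit vertical text; it is used here only to keep the sketch short.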

Spatial Attentional Module (SAM)

The SAM further refines recognition. Using spatial attention, it predicts the label sequence of each word globally, irrespective of the word's shape. The module, illustrated in Figure 2, operates directly on the two-dimensional feature map, so it can draw on both local and global text information.
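
To make "attention over the two-dimensional feature map" concrete, the following sketch computes a single attention step. It is an illustration under stated assumptions rather than the paper's module: the function name is hypothetical, and dot-product scoring is used for brevity where a learned scoring function would normally appear. The point it shows is that the softmax is taken jointly over all spatial positions, so the weights can follow a curved word instead of a single row.

```python
import math


def spatial_attention_step(features, query):
    """One attention step over an H x W grid of C-dimensional feature vectors.

    features: H x W x C nested lists; query: length-C decoder state.
    Returns (weights, glimpse): H x W weights that sum to 1, and the
    C-dimensional weighted sum of the features.
    """
    h, w, c = len(features), len(features[0]), len(query)
    # Score every spatial position against the query (dot product is an assumption).
    scores = [[sum(f * q for f, q in zip(features[y][x], query)) for x in range(w)]
              for y in range(h)]
    # Softmax jointly over all H*W positions, with the max subtracted for stability.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    weights = [[e / total for e in row] for row in exps]
    # The glimpse is the attention-weighted average of the feature vectors.
    glimpse = [sum(weights[y][x] * features[y][x][k] for y in range(h) for x in range(w))
               for k in range(c)]
    return weights, glimpse
```

In a full decoder, a step like this would typically be repeated once per output character, with the glimpse used to predict the character and update the query.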

Figure 2: Architecture of the standalone recognition model, which uses a feature-pyramid structure with ResNet-50. Both modules provide a recognition result together with a confidence score, and the result with the higher confidence score is selected dynamically as the final output.
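
The dynamic selection described in the caption amounts to keeping whichever result is more confident. A minimal sketch, assuming each module returns a (text, confidence) pair and that the two modules are the character segmentation branch and the SAM; the function and argument names are hypothetical:

```python
def select_recognition(char_seg_result, sam_result):
    """Return the (text, confidence) pair with the higher confidence.

    On a tie, the first argument is returned.
    """
    return max(char_seg_result, sam_result, key=lambda result: result[1])
```

For example, `select_recognition(("HELLO", 0.62), ("HELL0", 0.91))` keeps the second pair because its confidence is higher.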

Training and Optimization

The network is trained in two stages: pre-training on the SynthText dataset, then fine-tuning on real-world datasets such as ICDAR 2013, ICDAR 2015, and COCO-Text. Data augmentation and multi-scale training are used to handle variation in text appearance. A multi-task loss balances the RPN, Fast R-CNN, and mask branch losses.
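
Two of these ingredients are easy to show in code. The sketch below is illustrative only: the function names, the default loss weights, and the candidate short-side lengths are assumptions for the example, not values taken from the paper. The first function combines the three loss terms as a weighted sum, and the second picks a random short-side length and rescales the image size while preserving the aspect ratio.

```python
import random


def total_loss(rpn_loss, rcnn_loss, mask_loss, rcnn_weight=1.0, mask_weight=1.0):
    """Weighted sum of the RPN, Fast R-CNN, and mask branch losses.

    The default weights are placeholders, not the paper's settings.
    """
    return rpn_loss + rcnn_weight * rcnn_loss + mask_weight * mask_loss


def multiscale_size(width, height, short_sides=(600, 800, 1000), rng=random):
    """Pick a random target for the short side and scale (width, height) to match.

    The candidate short sides are hypothetical example values.
    """
    target = rng.choice(short_sides)
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)
```

Calling `multiscale_size` once per training image gives each image a different resolution across epochs, which is the usual purpose of multi-scale training.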

Performance and Evaluation

Mask TextSpotter achieves state-of-the-art performance across a range of datasets and is strongest on text with arbitrary shapes (Table 1). Its margin over existing methods is largest on curved text, which indicates robustness in real-world scenes.

Conclusion

Mask TextSpotter integrates detection and recognition into a unified framework that handles diverse text shapes, which improves accuracy on scene text spotting. Future improvements may focus on optimizing inference speed and exploring alternative detection strategies to reduce computational overhead.

Figures Referenced

Figure 1: Detailed architecture of Mask TextSpotter, showing how detection and recognition are integrated.
Figure 2: The standalone recognition model architecture, with its feature-pyramid structure and dynamic result selection.
Figure 3: Illustrations of different text spotting methods, showing how Mask TextSpotter handles varied text orientations.

Together, these figures cover the overall pipeline, the standalone recognition model, and a comparison with other text spotting methods.
