
Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting (2007.09482v1)

Published 18 Jul 2020 in cs.CV

Abstract: Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, showed much progress. However, most of the current arbitrary-shape scene text spotters use region proposal networks (RPN) to produce proposals. RPN relies heavily on manually designed anchors and its proposals are represented with axis-aligned rectangles. The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances into a single proposal, in cases of densely oriented text. To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. Our SPN is anchor-free and gives accurate representations of arbitrary-shape proposals. It is therefore superior to RPN in detecting text instances of extreme aspect ratios or irregular shapes. Furthermore, the accurate proposals produced by SPN allow masked RoI features to be used for decoupling neighboring text instances. As a result, our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy won't be affected by nearby text or background noise. Specifically, we outperform state-of-the-art methods by 21.9 percent on the Rotated ICDAR 2013 dataset (rotation robustness), 5.9 percent on the Total-Text dataset (shape robustness), and achieve state-of-the-art performance on the MSRA-TD500 dataset (aspect ratio robustness). Code is available at: https://github.com/MhLiao/MaskTextSpotterV3

Overview of Mask TextSpotter v3: Enhancing Scene Text Spotting

The paper "Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting" introduces an end-to-end trainable framework for scene text spotting, the joint task of detecting and recognizing text in natural images. It replaces the Region Proposal Network (RPN) used by earlier arbitrary-shape text spotters with a Segmentation Proposal Network (SPN), addressing the RPN's limitations on text with varied orientations, aspect ratios, and shapes.

Methodological Advancements

The central innovation in Mask TextSpotter v3 is the replacement of the traditional RPN with an SPN, which represents text proposals as accurate arbitrary-shape regions rather than boxes. An RPN depends on manually designed anchors and axis-aligned rectangles; the SPN is anchor-free, so it adapts better to text instances of extreme aspect ratios and irregular shapes. This matters most for densely arranged, oriented text, where a single axis-aligned rectangle often covers several neighboring instances.
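To make the idea of segmentation-based proposals concrete, here is a minimal sketch, not the authors' implementation. It assumes the network has already produced a per-pixel text probability map, and it turns that map into one binary mask per proposal by thresholding and grouping 4-connected pixels. The function name `segmentation_to_proposals` and the threshold of 0.5 are illustrative assumptions, and any post-processing the paper applies to the grouped regions is omitted.

```python
from collections import deque

import numpy as np


def segmentation_to_proposals(prob_map, threshold=0.5):
    """Return one boolean mask per 4-connected region of above-threshold pixels.

    Hypothetical helper for illustration; the threshold value is an assumption.
    """
    binary = prob_map >= threshold
    visited = np.zeros(binary.shape, dtype=bool)
    height, width = binary.shape
    proposals = []
    for row in range(height):
        for col in range(width):
            if not binary[row, col] or visited[row, col]:
                continue
            # Breadth-first flood fill collects every pixel of this region.
            mask = np.zeros(binary.shape, dtype=bool)
            queue = deque([(row, col)])
            visited[row, col] = True
            while queue:
                y, x = queue.popleft()
                mask[y, x] = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            proposals.append(mask)
    return proposals


# Toy probability map with two separate text-like blobs.
prob_map = np.zeros((6, 8))
prob_map[1, 1:4] = 0.9    # thin horizontal blob, 3 pixels
prob_map[3:5, 5:7] = 0.8  # small square blob, 4 pixels
prob_map[5, 0] = 0.2      # below threshold, ignored
proposals = segmentation_to_proposals(prob_map)
```

Because each proposal is a mask rather than a box, no anchor shapes need to be chosen in advance, and two nearby instances stay separate as long as their pixels are not connected.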

The paper also introduces "hard RoI masking," a mechanism that removes noise from neighboring text instances and background during RoI feature extraction. It applies the binary polygon mask of each proposal directly to the RoI features, so that features from closely positioned text do not interfere with detection and recognition.
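The masking step can be sketched in a few lines. This is an illustrative NumPy version under stated assumptions, not the code from the authors' repository: it assumes the RoI features are a `(channels, height, width)` array and that the proposal's polygon has already been rasterized into a binary mask at the same spatial resolution. The name `hard_roi_masking` is chosen here for readability.

```python
import numpy as np


def hard_roi_masking(roi_features, polygon_mask):
    """Zero out RoI feature positions that fall outside the proposal polygon.

    roi_features: float array of shape (channels, height, width).
    polygon_mask: binary array of shape (height, width), 1 inside the polygon.
    """
    if roi_features.shape[1:] != polygon_mask.shape:
        raise ValueError("mask must match the spatial size of the RoI features")
    # Broadcast the single-channel mask across all feature channels.
    return roi_features * polygon_mask.astype(roi_features.dtype)[None, :, :]


# Toy example: a 2-channel 4x4 RoI and a lower-triangular "polygon" mask.
roi_features = np.ones((2, 4, 4), dtype=np.float32)
polygon_mask = np.tril(np.ones((4, 4), dtype=bool))
masked = hard_roi_masking(roi_features, polygon_mask)
```

The mask is "hard" in the sense that it is binary: positions outside the polygon contribute exactly zero, whatever the neighboring text or background looked like.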

Experimental Results

The paper reports gains on several benchmarks. On the Rotated ICDAR 2013 dataset, which tests rotation robustness, Mask TextSpotter v3 outperforms prior state-of-the-art methods by 21.9 percent in end-to-end recognition F-measure. On the Total-Text dataset, which tests shape robustness, it surpasses the state of the art by 5.9 percent. On the MSRA-TD500 dataset, which tests robustness to extreme aspect ratios, it achieves state-of-the-art performance.

Implications and Future Directions

An SPN-based framework lets a scene text spotter adapt to arbitrary text configurations, which is a step toward more comprehensive OCR solutions. The improvements seen in Mask TextSpotter v3 suggest applications in real-world scenarios where text does not follow standard orientations or forms, such as logos or creative designs.

Future work could extend the SPN concept to more complex scenarios, such as text in video or multilingual environments. Integrating the framework with networks more specialized in text recognition might further improve its accuracy. Because its proposal generation is not specific to text, the approach could also be adapted to general object instance segmentation.

In summary, Mask TextSpotter v3 signifies a meaningful step forward in scene text understanding, offering a practical and theoretical foundation for further advancements in AI-driven text recognition systems.

Authors (5)
  1. Minghui Liao (29 papers)
  2. Guan Pang (19 papers)
  3. Jing Huang (140 papers)
  4. Tal Hassner (48 papers)
  5. Xiang Bai (221 papers)
Citations (166)