
Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting

Published 18 Jul 2020 in cs.CV | (2007.09482v1)

Abstract: Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, showed much progress. However, most of the current arbitrary-shape scene text spotters use region proposal networks (RPN) to produce proposals. RPN relies heavily on manually designed anchors and its proposals are represented with axis-aligned rectangles. The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances into a single proposal, in cases of densely oriented text. To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. Our SPN is anchor-free and gives accurate representations of arbitrary-shape proposals. It is therefore superior to RPN in detecting text instances of extreme aspect ratios or irregular shapes. Furthermore, the accurate proposals produced by SPN allow masked RoI features to be used for decoupling neighboring text instances. As a result, our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy won't be affected by nearby text or background noise. Specifically, we outperform state-of-the-art methods by 21.9 percent on the Rotated ICDAR 2013 dataset (rotation robustness), 5.9 percent on the Total-Text dataset (shape robustness), and achieve state-of-the-art performance on the MSRA-TD500 dataset (aspect ratio robustness). Code is available at: https://github.com/MhLiao/MaskTextSpotterV3

Citations (166)

Summary

  • The paper introduces a Segmentation Proposal Network that replaces anchor-dependent methods with an anchor-free technique generating polygonal proposals.
  • It employs a U-Net structure to produce segmentation masks and utilizes hard RoI masking to eliminate background noise for enhanced text recognition.
  • Results showcase significant improvements in detecting rotated, extreme aspect ratio, and irregularly shaped text across multiple benchmarks.

Overview of "Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting"

Mask TextSpotter v3 presents a Segmentation Proposal Network (SPN), which is engineered to better handle challenges in scene text spotting by overcoming the limitations of Region Proposal Network (RPN)-based approaches. The paper introduces an end-to-end trainable approach, integrating the text detection and recognition tasks seamlessly and improving robustness to rotations, aspect ratios, and text shapes.

Segmentation Proposal Network

The Segmentation Proposal Network is designed to address two key issues with the traditional RPN: the dependency on manually designed anchors, which limits adaptability to text instances with extreme aspect ratios or irregular shapes, and the reliance on axis-aligned rectangles, which complicates handling densely oriented text. The SPN employs an anchor-free approach, generating precise polygonal representations of proposals that improve accuracy in both detection and recognition.

Figure 1: Comparisons between RPN and SPN. The SPN-based text spotter efficiently separates individual text instances using polygonal proposals.

Mask TextSpotter v3 achieves this by implementing a U-Net structure within the SPN, which generates proposals directly from the segmentation masks, thus facilitating the handling of arbitrary-shape text instances. This method not only maintains high precision under diverse aspect ratios but also improves handling of densely packed text and irregular shapes.
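The core idea of an anchor-free segmentation proposal can be illustrated with a minimal sketch: threshold the predicted text-probability map, then group foreground pixels into connected components, each of which becomes one proposal. This is a simplified stand-in for the paper's pipeline (which additionally converts components to polygons and un-shrinks them); the function name and thresholding details here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def proposals_from_segmentation(prob_map, thresh=0.5):
    """Threshold a text-probability map and group foreground pixels into
    connected components; each component acts as one text proposal.
    (Illustrative sketch: a real system would trace each component's
    contour into a polygon and rescale it to the original image.)"""
    fg = prob_map >= thresh
    h, w = fg.shape
    labels = np.zeros((h, w), dtype=int)
    current = 0
    for y in range(h):
        for x in range(w):
            if fg[y, x] and labels[y, x] == 0:
                # New component: flood-fill with 4-connectivity.
                current += 1
                labels[y, x] = current
                stack = [(y, x)]
                while stack:
                    cy, cx = stack.pop()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and fg[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = current
                            stack.append((ny, nx))
    # One binary mask per proposal, in scan order.
    return [labels == i for i in range(1, current + 1)]
```

Because proposals are traced directly from the segmentation map, their shape is unconstrained: curved or extremely elongated text yields a matching component rather than a loose axis-aligned box.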

Hard RoI Masking

Integral to the effectiveness of Mask TextSpotter v3 is the hard RoI masking technique. By utilizing binary polygon masks derived from SPN-generated proposals, hard RoI masking ensures that region-of-interest features are free from background noise and neighboring text interference.

Figure 2: Overview of Mask TextSpotter v3. Hard RoI masking excludes extraneous regions, enhancing recognition accuracy.

This method significantly improves robustness and mitigates errors in detection and recognition. The proposal-generated polygons allow precise application of masks without the inaccuracies associated with RPN-based approaches.
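The operation itself is a simple element-wise product between RoI-aligned features and the proposal's binary polygon mask, as in this minimal sketch (the function name and tensor layout are assumptions for illustration):

```python
import numpy as np

def hard_roi_masking(roi_features, polygon_mask):
    """Zero out feature activations outside the text polygon.

    roi_features: (C, H, W) RoI-aligned feature map
    polygon_mask: (H, W) binary mask, 1 inside the proposal polygon

    Multiplying by the mask forces downstream detection/recognition
    heads to see only the target instance, not neighbors or background.
    """
    return roi_features * polygon_mask[None, :, :]
```

Because the mask is hard (strictly 0 outside the polygon), features from adjacent instances cannot leak into the recognition head, which is what decouples densely packed text.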

Results and Performance

Mask TextSpotter v3 showcased superior performance across various datasets, substantiating its robustness to several challenging text variations:

  1. Rotation Robustness: On the Rotated ICDAR 2013 dataset, Mask TextSpotter v3 demonstrated a 21.9% improvement over state-of-the-art methods in end-to-end recognition, particularly excelling under rotations such as 45° and 60°.

    Figure 3: Detection and end-to-end recognition results reflect strong performance against varying rotation angles.

  2. Aspect Ratio Robustness: The MSRA-TD500 dataset results highlighted Mask TextSpotter v3's proficiency in handling text lines of extreme aspect ratios, achieving an F-measure of 83.5%.

    Figure 4: Qualitative results demonstrate detection of long text lines under challenging aspect ratios.

  3. Shape Robustness: On the Total-Text dataset, Mask TextSpotter v3 maintained high recognition accuracy, yielding a 5.9% improvement in F-measure for irregularly shaped text compared to previous methods.

    Figure 5: Mask TextSpotter v3 reveals its capability with text instances of varying shapes and orientations.

The proposed method outperforms existing solutions in complex real-world scenarios, such as handling small text samples in the IC15 benchmark.

Implications and Future Work

This research makes impactful contributions to the field of optical character recognition (OCR). SPN and hard RoI masking invite adoption in other domains that require sharper proposal generation and segmentation in high-noise environments.

While Mask TextSpotter v3 excels in multiple aspects, future work might focus on enhancing recognition mechanisms to further bolster rotation robustness, particularly in scenarios involving 90° rotations. Additionally, expanding SPN's application to generic object detection tasks might yield further insights.

Conclusion

In conclusion, Mask TextSpotter v3 introduces substantial improvements in scene text spotting by leveraging the Segmentation Proposal Network for more accurate proposal generation, alongside applying hard RoI masking for precise feature extraction. These innovations significantly enhance the system’s ability to manage arbitrary text shapes, rotations, and challenging aspect ratios, setting a new bar for robustness and performance across diverse benchmarks.

Explain it Like I'm 14

What is this paper about?

This paper explores how to build a computer system that can find and read text inside everyday photos—things like signs, menus, labels, or logos. This task is called “scene text spotting.” The authors introduce a new method, called Mask TextSpotter v3, that is better at handling hard cases: text that is tilted, very long, or shaped like curves.

What questions are the researchers trying to answer?

The authors focus on three simple questions:

  • How can we more accurately find text in photos when the text isn’t neatly straight or horizontal?
  • How can we avoid mixing up nearby words or letters when they are close together?
  • How can we build a single system that both finds the text and reads it accurately, even when the text is rotated, curved, or very long?

How does their method work?

To understand the approach, it helps to know how previous systems worked:

  • Older systems used a tool called a Region Proposal Network (RPN) to suggest where text might be. RPN draws many rectangles (with different sizes and shapes) over the image and tries to find the best ones that cover text. These rectangles are called “anchors.”
  • The problem: rectangles work fine for neat, horizontal words. But they struggle with curved text, long text lines (like sentences in Chinese), or words that are tilted. Rectangles also often grab multiple nearby words at once when text is dense.

The new idea in this paper:

  • Instead of guessing rectangles, the method paints the likely text areas directly on the image (this is called “segmentation”). Think of it like coloring in the exact shape of the text, not just putting a box around it.
  • This part of the system is called the Segmentation Proposal Network (SPN). It’s “anchor-free,” meaning it doesn’t rely on pre-made rectangles. It outlines text shapes as accurate polygons (multi-sided shapes) that fit curved or tilted text.
  • After finding these polygon shapes, the system extracts just the features inside each polygon (and blocks everything outside it). This step is called “hard RoI masking.” In simple terms, it uses the polygon like a stencil so the system only “sees” the text it’s supposed to read, not the background or neighboring words.

A simple analogy:

  • Imagine a messy chalkboard with lots of writing at different angles. An old method would try to place rectangular sticky notes over the writing. Sometimes one sticky note would cover parts of several lines. The new method first carefully traces the exact shapes of the words with a marker, then uses those outlines to read each word individually, ignoring everything outside the outline.

Under the hood, the system has several parts working together:

  • A “backbone” (ResNet-50) that looks at the image and builds useful features.
  • The SPN that predicts where text is by painting precise shapes.
  • A refinement step (Fast R-CNN and a text instance segmentation module) to fine-tune locations.
  • A recognition module (including an attention mechanism) that reads the letters and forms words.

Two extra tricks help it separate close text:

  • Shrink then grow: During training, the system slightly shrinks the labeled text areas to make sure neighboring words don’t merge. After finding them, it grows the areas back to full size.
  • Hard masking: It multiplies the image features by a “mask” that’s 1 inside the polygon and 0 outside. That forces the system to focus on the target word only.
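The "shrink then grow" trick can be sketched as scaling a polygon toward its centroid during training and scaling it back afterward. This is a simplified illustration (the paper actually shrinks via Vatti clipping with an offset derived from the polygon's area and perimeter, as in segmentation-based detectors); the function name and fixed ratio are assumptions.

```python
import numpy as np

def shrink_polygon(polygon, ratio=0.5):
    """Scale a polygon toward its centroid by `ratio`.

    Shrinking label polygons (ratio < 1) during training keeps adjacent
    text regions from merging in the segmentation map; detected regions
    are grown back (ratio > 1) before recognition. Illustrative sketch:
    the paper uses Vatti clipping with an area/perimeter-based offset.
    """
    poly = np.asarray(polygon, dtype=float)
    centroid = poly.mean(axis=0)
    return centroid + (poly - centroid) * ratio
```

Growing with the reciprocal ratio recovers the original outline, so the shrink is purely a training-time separation device, not a loss of localization accuracy.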

What did they find?

The method was tested on several standard benchmarks, each designed to stress a different challenge:

  • Rotation robustness (Rotated ICDAR 2013): Photos were turned at various angles (15°, 30°, 45°, 60°, 75°, 90°). Mask TextSpotter v3 stayed accurate even at large rotations, beating strong previous methods by a large margin. For example, at 45° rotation, it improved end-to-end (detect-and-read) accuracy by about 21.9%.
  • Aspect ratio robustness (MSRA-TD500): This dataset has long text lines (often multi-language). The new method handled these very well, improving over a leading prior method by 9.3% and achieving top performance.
  • Shape robustness (Total-Text): This dataset includes curved and irregular text. The new method improved end-to-end recognition by 5.9% compared to the previous best.
  • Small text (ICDAR 2015): Many images have small, low-resolution text. The new system showed better performance than the earlier version across different dictionary settings, especially with a large, realistic dictionary (“generic lexicon”).

Overall, the key improvements came from:

  • More accurate shapes (polygons instead of rectangles) for text regions.
  • The hard masking that removes distractions from nearby text and background.
  • Training with strong rotation augmentations so the system learns to handle tilted text.

Why is this important?

  • Real-life text is often messy: tilted signs, curved logos, long lines in different scripts, or small labels. A system that can handle all of these makes phone apps, assistive tools, and search engines more reliable.
  • Better accuracy means fewer mistakes when reading menus, street signs, product packages, and more.
  • The idea of using segmentation-based proposals (polygons) instead of rectangles could help other computer vision tasks where objects aren’t box-shaped.

What’s the takeaway?

Mask TextSpotter v3 shows that:

  • Drawing and using precise text shapes (polygons) is more powerful than guessing rectangles, especially for rotated, long, or curved text.
  • Strictly masking out everything outside the target text region makes recognition more accurate and less confused by clutter.
  • These changes lead to strong gains across multiple tough benchmarks.

In the future, the authors note the system still finds extremely rotated text (like 90°) a bit tricky for reading direction, and they plan to make the reader more rotation-aware. But overall, this work pushes scene text spotting forward and opens the door for better, more robust OCR in the real world.
