Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting
Abstract: Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, have shown much progress. However, most of the current arbitrary-shape scene text spotters use a region proposal network (RPN) to produce proposals. An RPN relies heavily on manually designed anchors, and its proposals are represented with axis-aligned rectangles. The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances in a single proposal in cases of densely oriented text. To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. Our SPN is anchor-free and gives accurate representations of arbitrary-shape proposals. It is therefore superior to an RPN in detecting text instances of extreme aspect ratios or irregular shapes. Furthermore, the accurate proposals produced by the SPN allow masked RoI features to be used for decoupling neighboring text instances. As a result, our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy is not affected by nearby text or background noise. Specifically, we outperform state-of-the-art methods by 21.9 percent on the Rotated ICDAR 2013 dataset (rotation robustness), by 5.9 percent on the Total-Text dataset (shape robustness), and achieve state-of-the-art performance on the MSRA-TD500 dataset (aspect ratio robustness). Code is available at: https://github.com/MhLiao/MaskTextSpotterV3
Explain it Like I'm 14
What is this paper about?
This paper explores how to build a computer system that can find and read text inside everyday photos—things like signs, menus, labels, or logos. This task is called “scene text spotting.” The authors introduce a new method, called Mask TextSpotter v3, that is better at handling hard cases: text that is tilted, very long, or shaped like curves.
What questions are the researchers trying to answer?
The authors focus on three simple questions:
- How can we more accurately find text in photos when the text isn’t neatly straight or horizontal?
- How can we avoid mixing up nearby words or letters when they are close together?
- How can we build a single system that both finds the text and reads it accurately, even when the text is rotated, curved, or very long?
How does their method work?
To understand the approach, it helps to know how previous systems worked:
- Older systems used a tool called a Region Proposal Network (RPN) to suggest where text might be. An RPN starts from many preset rectangles of different sizes and shapes, called "anchors," placed all over the image, and scores which ones best cover text (see the sketch after this list).
- The problem: rectangles work fine for neat, horizontal words. But they struggle with curved text, long text lines (like sentences in Chinese), or words that are tilted. Rectangles also often grab multiple nearby words at once when text is dense.
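To make the anchor idea concrete, here is a minimal sketch of how an RPN's preset rectangles could be generated. The scales and aspect ratios are the classic Faster R-CNN defaults, used purely for illustration; they are not values from this paper:

```python
import itertools

def make_anchors(base_size=16, scales=(8, 16, 32), aspect_ratios=(0.5, 1.0, 2.0)):
    """Generate the preset rectangles ("anchors") that an RPN scores at every
    feature-map location. Scales/ratios are the classic Faster R-CNN defaults,
    shown here only to illustrate the idea."""
    anchors = []
    for scale, ratio in itertools.product(scales, aspect_ratios):
        area = (base_size * scale) ** 2   # target box area
        w = (area / ratio) ** 0.5         # pick width so that h / w == ratio
        h = w * ratio
        anchors.append((-w / 2, -h / 2, w / 2, h / 2))  # axis-aligned (x1, y1, x2, y2)
    return anchors

# Only 9 fixed rectangles per location: none of them fits a 45-degree-tilted
# word, a tight curve, or a 20:1 text line well.
print(make_anchors())
```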
The new idea in this paper:
- Instead of guessing rectangles, the method paints the likely text areas directly on the image (this is called “segmentation”). Think of it like coloring in the exact shape of the text, not just putting a box around it.
- This part of the system is called the Segmentation Proposal Network (SPN). It’s “anchor-free,” meaning it doesn’t rely on pre-made rectangles. It outlines text shapes as accurate polygons (multi-sided shapes) that fit curved or tilted text.
- After finding these polygon shapes, the system extracts just the features inside each polygon (and blocks everything outside it). This step is called “hard RoI masking.” In simple terms, it uses the polygon like a stencil so the system only “sees” the text it’s supposed to read, not the background or neighboring words.
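A minimal sketch of the hard RoI masking idea, assuming the RoI features form a channels × height × width tensor and the polygon has already been rasterized into a binary mask of the same spatial size; this shows the stencil effect, not the authors' exact implementation:

```python
import torch

def hard_roi_masking(roi_features: torch.Tensor, polygon_mask: torch.Tensor) -> torch.Tensor:
    """Keep only the features inside the proposal's polygon.
    roi_features: (C, H, W) tensor, e.g. from RoI-Align.
    polygon_mask: (H, W) tensor with 1 inside the polygon, 0 outside.
    """
    return roi_features * polygon_mask.unsqueeze(0)  # broadcast the stencil over channels

# Toy example: a 3-channel 4x4 RoI where only the top-left 2x2 patch is text.
feats = torch.ones(3, 4, 4)
mask = torch.zeros(4, 4)
mask[:2, :2] = 1.0
masked = hard_roi_masking(feats, mask)  # background and neighboring-word features are zeroed
```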
A simple analogy:
- Imagine a messy chalkboard with lots of writing at different angles. An old method would try to place rectangular sticky notes over the writing. Sometimes one sticky note would cover parts of several lines. The new method first carefully traces the exact shapes of the words with a marker, then uses those outlines to read each word individually, ignoring everything outside the outline.
Under the hood, the system has several parts working together:
- A “backbone” (ResNet-50) that looks at the image and builds useful features.
- The SPN that predicts where text is by painting precise shapes.
- A refinement step (Fast R-CNN and a text instance segmentation module) to fine-tune locations.
- A recognition module (including an attention mechanism) that reads the letters and forms words.
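Here is an order-of-operations sketch of that pipeline. Every stage below is a placeholder module (nn.Identity), so only the data flow reflects the paper's description, not the real layer internals:

```python
import torch
import torch.nn as nn

class MaskTextSpotterV3Sketch(nn.Module):
    """Data-flow sketch of the pipeline described above; all stages are placeholders."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Identity()     # ResNet-50 feature extractor
        self.spn = nn.Identity()          # Segmentation Proposal Network: paints text shapes
        self.roi_masking = nn.Identity()  # hard RoI masking (the stencil step)
        self.refine = nn.Identity()       # Fast R-CNN + text instance segmentation
        self.recognizer = nn.Identity()   # attention-based reading module

    def forward(self, image):
        feats = self.backbone(image)      # shared convolutional features
        proposals = self.spn(feats)       # anchor-free, arbitrary-shape polygon proposals
        # Hard RoI masking: crop features per proposal, zero everything outside the polygon.
        rois = self.roi_masking((feats, proposals))
        locations = self.refine(rois)     # fine-tuned detections
        words = self.recognizer(rois)     # decoded characters/words per instance
        return locations, words

model = MaskTextSpotterV3Sketch()
locations, words = model(torch.randn(1, 3, 224, 224))  # input shape is illustrative
```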
Two extra tricks help it separate close text:
- Shrink then grow: During training, the system slightly shrinks the labeled text areas so that neighboring words don't merge into one region. After finding them, it grows the areas back to full size (see the sketch after this list).
- Hard masking: It multiplies the image features by a “mask” that’s 1 inside the polygon and 0 outside. That forces the system to focus on the target word only.
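A minimal sketch of the shrinking step, assuming the Vatti-clipping offset formula D = A(1 - r²)/L used by segmentation-based detectors such as PSENet and DB, where A is the polygon's area, L its perimeter, and r the shrink ratio; the ratio value here is an illustrative assumption, not necessarily the paper's exact setting:

```python
import pyclipper                       # pip install pyclipper
from shapely.geometry import Polygon   # pip install shapely

def shrink_polygon(points, ratio=0.4):
    """Shrink a text polygon so touching neighbors stay separated in the label map.
    points: list of (x, y) vertices. Returns the shrunk polygon(s).
    Growing back to full size is the same call with a positive offset."""
    poly = Polygon(points)
    offset = poly.area * (1 - ratio ** 2) / poly.length  # D = A * (1 - r^2) / L
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(points, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    return pco.Execute(-offset)  # negative offset shrinks; a positive one grows back

shrunk = shrink_polygon([(0, 0), (100, 0), (100, 20), (0, 20)])
```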
What did they find?
The method was tested on several standard benchmarks, each designed to stress a different challenge:
- Rotation robustness (Rotated ICDAR 2013): Photos were turned at various angles (15°, 30°, 45°, 60°, 75°, 90°). Mask TextSpotter v3 stayed accurate even at large rotations, beating strong previous methods by a large margin. For example, at 45° rotation, it improved end-to-end (detect-and-read) accuracy by about 21.9%.
- Aspect ratio robustness (MSRA-TD500): This dataset has long text lines (often multi-language). The new method handled these very well, improving over a leading prior method by 9.3% and achieving top performance.
- Shape robustness (Total-Text): This dataset includes curved and irregular text. The new method improved end-to-end recognition by 5.9% compared to the previous best.
- Small text (ICDAR 2015): Many images have small, low-resolution text. The new system showed better performance than the earlier version across different dictionary settings, especially with a large, realistic dictionary (“generic lexicon”).
Overall, the key improvements came from:
- More accurate shapes (polygons instead of rectangles) for text regions.
- The hard masking that removes distractions from nearby text and background.
- Training with strong rotation augmentations so the system learns to handle tilted text.
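A minimal sketch of that rotation augmentation, assuming a simple rotate-about-center affine transform applied identically to the image and its polygon labels; the angle used here is illustrative, not the paper's exact sampling recipe:

```python
import numpy as np
import cv2  # pip install opencv-python

def rotate_sample(image, polygons, angle_deg):
    """Rotate an image and its polygon labels by the same angle.
    polygons: list of (N, 2) float arrays of (x, y) vertices."""
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)  # 2x3 affine matrix
    rotated = cv2.warpAffine(image, M, (w, h))
    rotated_polys = [poly @ M[:, :2].T + M[:, 2] for poly in polygons]  # same affine on labels
    return rotated, rotated_polys

img = np.zeros((240, 320, 3), dtype=np.uint8)
word = np.array([[10, 10], [60, 10], [60, 30], [10, 30]], dtype=np.float32)
aug_img, aug_polys = rotate_sample(img, [word], angle_deg=30)
```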
Why is this important?
- Real-life text is often messy: tilted signs, curved logos, long lines in different scripts, or small labels. A system that can handle all of these makes phone apps, assistive tools, and search engines more reliable.
- Better accuracy means fewer mistakes when reading menus, street signs, product packages, and more.
- The idea of using segmentation-based proposals (polygons) instead of rectangles could help other computer vision tasks where objects aren’t box-shaped.
What’s the takeaway?
Mask TextSpotter v3 shows that:
- Drawing and using precise text shapes (polygons) is more powerful than guessing rectangles, especially for rotated, long, or curved text.
- Strictly masking out everything outside the target text region makes recognition more accurate and less confused by clutter.
- These changes lead to strong gains across multiple tough benchmarks.
In the future, the authors note the system still finds extremely rotated text (like 90°) a bit tricky for reading direction, and they plan to make the reader more rotation-aware. But overall, this work pushes scene text spotting forward and opens the door for better, more robust OCR in the real world.