- The paper introduces a novel single-point annotation technique that simplifies text spotting and minimizes manual labeling requirements.
- It uses an Instance Assignment Decoder (IAD) for localization and a Parallel Recognition Decoder (PRD) for recognition, with both decoders sharing a single set of parameters.
- The framework achieves 19x faster inference with fewer parameters while outperforming prior methods on benchmarks such as ICDAR 2013, ICDAR 2015, and Total-Text.
Overview of SPTS v2: Single-Point Scene Text Spotting
The paper "SPTS v2: Single-Point Scene Text Spotting" presents an approach that spots text in images using only a single-point annotation per text instance. The SPTS v2 framework is designed to reduce manual annotation effort and to perform end-to-end scene text spotting more efficiently than previous methods. Its two central components, an Instance Assignment Decoder (IAD) and a Parallel Recognition Decoder (PRD), allow the system to operate with fewer parameters and at significantly higher speed.
Key Contributions
Simplified Annotation and Efficiency Gains
Traditionally, scene text spotting methods rely on bounding box annotations, which are labor-intensive and complex to produce. The authors propose using single-point annotations instead, drastically reducing the annotation workload while maintaining competitive performance; an illustrative comparison of the two formats follows Figure 1.
Figure 1: Existing OCR methods typically use bounding boxes to represent the text area. However, inspired by how humans can intuitively read texts without such a defined region, this paper demonstrates that a single point is sufficient for guiding the model to learn a strong scene text spotter.
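To make the annotation difference concrete, the snippet below contrasts the two formats. The dictionary schema here is hypothetical and purely illustrative; real benchmarks such as Total-Text define their own polygon formats.

```python
# Hypothetical annotation schemas, for illustration only (not the exact
# formats used by the SPTS v2 benchmarks).

# Conventional annotation: a polygon or box with several vertices per instance.
polygon_annotation = {
    "transcription": "COFFEE",
    "points": [(120, 80), (260, 76), (264, 130), (118, 134)],  # 4+ clicks
}

# Single-point annotation: one indicated location per instance.
single_point_annotation = {
    "transcription": "COFFEE",
    "point": (192, 105),  # a single (x, y) click, e.g., near the text center
}
```

The labeling cost per instance drops from clicking every vertex to a single click, while the transcription label stays unchanged.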
SPTS v2 achieves substantial efficiency gains by significantly reducing both the parameter count and the inference time, running 19x faster than previous state-of-the-art single-point text spotting techniques.
Architecture: Instance Assignment and Parallel Recognition Decoders
The proposed architecture relies on an innovative dual-decoder strategy, which uses a shared set of decoder parameters for both text localization and recognition tasks:
- Instance Assignment Decoder (IAD): Auto-regressively predicts the center-point coordinates of all text instances as a single sequence of point tokens.
- Parallel Recognition Decoder (PRD): Recognizes the text of all instances simultaneously, using the predicted center points as guides. Parallelizing recognition across instances significantly decreases processing time; a minimal sketch of this two-stage decoding follows this list.
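The PyTorch-style sketch below illustrates the shared-parameter idea rather than the authors' implementation: the class name, layer sizes, vocabulary layout, and greedy decoding loop are all assumptions, and the real SPTS v2 sequence construction follows the paper.

```python
import torch
import torch.nn as nn

class SharedDecoderSpotter(nn.Module):
    """Sketch of IAD/PRD decoding with a single shared Transformer decoder.

    All hyperparameters (vocabulary size, depth, sequence lengths) are
    illustrative assumptions, not the paper's configuration.
    """

    def __init__(self, d_model=256, vocab_size=1100, max_text_len=25):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)  # shared weights
        self.embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)
        self.max_text_len = max_text_len

    @torch.no_grad()
    def iad(self, memory, num_instances):
        """Instance Assignment Decoder: auto-regressively emit point tokens.

        Each instance contributes one quantized (x, y) token pair, so the
        greedy loop runs 2 * num_instances steps.
        """
        B = memory.size(0)
        tokens = torch.zeros(B, 1, dtype=torch.long, device=memory.device)
        for _ in range(2 * num_instances):
            h = self.decoder(self.embed(tokens), memory)
            nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens[:, 1:]  # (B, 2 * num_instances) coordinate tokens

    @torch.no_grad()
    def prd(self, memory, point_tokens):
        """Parallel Recognition Decoder: decode all transcriptions at once.

        Instances are folded into the batch dimension, so every instance's
        text is decoded in parallel with the *same* decoder weights as iad().
        """
        B, n2 = point_tokens.shape
        n = n2 // 2
        seqs = point_tokens.reshape(B * n, 2)        # each point pair seeds a sequence
        mem = memory.repeat_interleave(n, dim=0)     # replicate image features per instance
        for _ in range(self.max_text_len):
            h = self.decoder(self.embed(seqs), mem)
            nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)
            seqs = torch.cat([seqs, nxt], dim=1)
        return seqs[:, 2:].reshape(B, n, self.max_text_len)

# Toy usage: 2 images, 100 feature tokens each, 3 text instances per image.
model = SharedDecoderSpotter()
memory = torch.randn(2, 100, 256)            # encoder output (B, S, d_model)
points = model.iad(memory, num_instances=3)  # (2, 6) point tokens
texts = model.prd(memory, points)            # (2, 3, 25) character tokens
```

The design point worth noting is that the PRD introduces no new weights: instances are folded into the batch dimension, so the same decoder that emitted the point tokens decodes every transcription at once.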
Figure 2: Overall framework of the proposed SPTS v2. Visual and contextual features are first extracted by a series of CNN and Transformer encoders, then decoded into a sequence containing localization and recognition information through the IAD and PRD, respectively. The IAD auto-regressively predicts the center-point coordinates of all text instances within a single sequence, while the PRD predicts the recognition results in parallel. Note that the IAD shares identical parameters with the PRD, so the PRD stage introduces no additional parameters.
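Since the decoder emits discrete tokens, continuous point coordinates must first be quantized into bins, in the style of the Pix2Seq-like sequence models this line of work builds on. The sketch below shows one plausible discretization; the bin count and rounding scheme are assumptions for illustration, not necessarily the paper's exact values.

```python
NUM_BINS = 1000  # assumed number of quantization bins per axis

def point_to_tokens(x, y, img_w, img_h, num_bins=NUM_BINS):
    """Map a continuous (x, y) center point to two discrete sequence tokens."""
    tx = min(int(x / img_w * num_bins), num_bins - 1)
    ty = min(int(y / img_h * num_bins), num_bins - 1)
    return tx, ty

def tokens_to_point(tx, ty, img_w, img_h, num_bins=NUM_BINS):
    """Invert the quantization (up to bin resolution) at decoding time."""
    return (tx + 0.5) / num_bins * img_w, (ty + 0.5) / num_bins * img_h

# Example: a center point at (192, 105) in a 640x480 image.
print(point_to_tokens(192, 105, 640, 480))  # -> (300, 218)
```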
Experimental Results
Experiments show that SPTS v2 outperforms previous state-of-the-art models across several text spotting benchmarks, including Total-Text, SCUT-CTW1500, ICDAR 2013, and ICDAR 2015. Notably, SPTS v2 is strong at spotting arbitrarily shaped text while relying only on single-point annotations.
Figure 3: Qualitative results on the scene text benchmarks. Images are selected from Total-Text (first row), SCUT-CTW1500 (second row), ICDAR 2013 (third row), ICDAR 2015 (fourth row), and Inverse-Text (fifth row). Best viewed on screen.
Implications and Future Work
The shift towards single-point annotation has significant implications for the practical adoption of scene text spotting technologies. By minimizing manual annotation effort and improving speed, more resources can be devoted to enhancing model performance and exploring new application domains. Extending SPTS v2 to additional languages and more complex text configurations could broaden its applicability further.
Conclusion
SPTS v2 shows that it is possible to perform efficient and powerful text spotting using a minimalist annotation approach. This method not only suggests potential cost savings and efficiency improvements in creating datasets but also opens new doors for scalable real-world applications in document processing, autonomous driving, and augmented reality. The integration of deep learning with intuitive human-like recognition tasks marks an important step forward in AI-driven text processing technologies.