An Expert Overview of "SPTS: Single-Point Text Spotting"
The paper "SPTS: Single-Point Text Spotting" introduces an innovative approach to scene text spotting, where each text instance is represented by a single-point annotation rather than the conventional bounding box annotations that are typically more labor-intensive and costly. This method, Single-Point Text Spotting (SPTS), is situated within the context of current advancements in Optical Character Recognition (OCR) and aims to streamline the annotation process while maintaining competitive accuracy in detecting and recognizing textual content in images.
Traditionally, scene text spotting has relied heavily on bounding-box annotations at the text-line, word, or character level. This paper argues that a single point per instance is sufficient to train a scene text spotting model. The authors cast text spotting as a sequence prediction task: an auto-regressive Transformer predicts sequences that encode both the location of each text instance and its transcription. A sketch of one possible serialization follows.
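To make the formulation concrete, the following minimal sketch shows how a single-point annotation and its transcription might be serialized into one discrete token sequence, in the spirit of the paper's Pix2Seq-style formulation. The bin count, alphabet, and token layout here are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: serialize a (point, transcription) pair into discrete
# tokens. Bin count, alphabet, and token layout are assumptions for clarity.
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
NUM_BINS = 1000                 # coordinate quantization bins (assumed)
CHAR_OFFSET = NUM_BINS          # character tokens follow coordinate tokens
EOS = CHAR_OFFSET + len(CHARS)  # end-of-sequence token

def serialize(x, y, text, img_w, img_h):
    """Quantize the point into coordinate bins, then append transcription tokens."""
    tokens = [
        int(x / img_w * (NUM_BINS - 1)),  # quantized x coordinate
        int(y / img_h * (NUM_BINS - 1)),  # quantized y coordinate
    ]
    tokens += [CHAR_OFFSET + CHARS.index(c) for c in text.lower() if c in CHARS]
    return tokens + [EOS]

# Example: the word "STOP" annotated at point (412, 105) in a 640x480 image.
print(serialize(412, 105, "STOP", 640, 480))
```

At inference time the model simply emits such sequences token by token, so detection and recognition fall out of a single decoding pass rather than two separate heads.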
Several key attributes distinguish SPTS from previous methods:
- Annotation Simplicity: Reducing supervision to a single point per instance is a significant simplification. Empirically, the authors report that single-point annotation is considerably faster, for example roughly 50 times faster than character-level annotation.
- Model Architecture: SPTS employs a Transformer-based architecture: a CNN extracts image features, a Transformer encoder encodes them, and an auto-regressive decoder emits a sequence of discrete tokens representing coordinates and transcriptions. This design leverages the sequence-modeling strength of Transformers, recently popularized through models such as Pix2Seq (a schematic sketch of this pipeline appears after this list).
- Benchmark Performance: The paper demonstrates that, despite its simplicity, SPTS achieves state-of-the-art performance across benchmarks including ICDAR 2013, ICDAR 2015, Total-Text, and SCUT-CTW1500. It performs especially well on arbitrarily shaped text, likely because its predictions do not depend on detection quality, which often bottlenecks prior models.
- Weak Supervision Capability: By showing the efficacy of using single-point annotations, the paper opens avenues for weakly supervised scene text spotting, which is crucial for scaling annotations to larger datasets.
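As a rough illustration of the model-architecture bullet above, here is a hedged PyTorch sketch of the CNN-plus-Transformer pipeline: features extracted by a CNN, flattened into a token grid for the Transformer encoder, and decoded auto-regressively into discrete tokens. The backbone choice, layer sizes, vocabulary size, and the omission of positional encodings are simplifying assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SPTSLikeModel(nn.Module):
    """Schematic CNN + Transformer encoder-decoder; all sizes are assumed."""

    def __init__(self, vocab_size=1037, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep only the convolutional stages (drop avgpool and fc).
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, target_tokens):
        # images: (B, 3, H, W); target_tokens: (B, T), shifted right for teacher forcing.
        feats = self.proj(self.cnn(images))        # (B, d_model, h, w)
        # Flatten the feature map into a token grid for the encoder.
        # Positional encodings are omitted here for brevity.
        src = feats.flatten(2).transpose(1, 2)     # (B, h*w, d_model)
        tgt = self.token_embed(target_tokens)      # (B, T, d_model)
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)                      # (B, T, vocab_size) logits

model = SPTSLikeModel()
images = torch.randn(2, 3, 256, 256)
tokens = torch.randint(0, 1037, (2, 16))
logits = model(images, tokens)   # torch.Size([2, 16, 1037])
```

The assumed vocabulary size of 1037 matches the serialization sketch earlier (1000 coordinate bins, 36 characters, and one end-of-sequence token); at inference the decoder would be run step by step, feeding each predicted token back in.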
The authors acknowledge a limitation in handling tiny text instances, where accuracy remains a challenge, possibly because predictions are made directly from lower-resolution features without the RoI operations typical of bounding-box-based methods.
Implications and Future Directions
The implications of this approach extend beyond efficiency in data annotation:
- Scalability: Significantly reducing annotation effort enables larger-scale dataset creation, which could particularly benefit languages and scripts underrepresented in current datasets.
- Cross-domain Applications: The single-point formulation could transfer to other computer-vision domains, particularly those where small or occluded objects are common. The authors report preliminary object-detection experiments using single-point annotations as a first step in this direction.
- Further Leveraging Transformer Architectures: As Transformer models continue to improve in contextual representation, they could further boost sequence prediction approaches like SPTS, particularly through better scaling behavior and faster inference.
- Minimal Annotation Studies: Future exploration of even more minimal annotation techniques, such as the annotation-free setting suggested by the No-Point Text Spotting variant, could push the boundaries of model training paradigms further.
Overall, the "SPTS: Single-Point Text Spotting" paper proposes a compelling simplification of scene text spotting. Although challenges such as tiny text remain, it marks significant progress toward more efficient and scalable OCR systems.