- The paper introduces FastTextSpotter, a framework that integrates a Swin Transformer backbone, dual-decoder Transformer, and SAC2 attention to boost multilingual scene text spotting efficiency.
- It employs a dynamic query formulation and reference point sampling strategy to precisely detect and recognize text across various orientations and languages.
- Experimental results on ICDAR2015, TotalText, CTW1500, and VinText show superior accuracy and recall compared to existing state-of-the-art models.
FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting
The paper presents FastTextSpotter, a novel framework geared towards efficient multilingual scene text spotting, leveraging advanced transformer architectures to enhance processing speed without sacrificing accuracy. This work integrates a Swin Transformer visual backbone with a Transformer Encoder-Decoder setup, enriched by the SAC2 attention module to bolster both speed and adaptability.
Introduction
Scene text spotting entails the localization and recognition of text within natural scenes, often presenting challenges due to diverse orientations, languages, and annotation styles. Existing state-of-the-art models have significantly bolstered text detection capabilities using CNNs and Transformers, yet struggle to balance precision with processing efficiency, particularly in time-sensitive environments. The novel FastTextSpotter framework, detailed in this paper, aims to address these challenges, demonstrating improved scene text detection and recognition across varied datasets.
FastTextSpotter is built upon a flexible architecture, incorporating a Swin Transformer backbone alongside a dual-decoder Transformer setup. The Swin Transformer provides the robust feature extraction needed to recognize arbitrarily shaped text, and is complemented by the SAC2 attention module, a faster self-attention unit. This module is pivotal in accelerating processing, making FastTextSpotter well suited to real-world applications requiring multilingual text spotting.
Figure 1: Overview of FastTextSpotter illustrating a Swin Transformer visual backbone with a Transformer Encoder-Decoder framework. Key features include the SAC2 attention module, dual decoders for accurate text localization and recognition, and the Reference Point Sampling system for effective text detection across various shapes and languages.
Methodology
Model Architecture
FastTextSpotter’s architecture centers around a Swin Transformer backbone, ensuring efficient visual feature extraction. Following this, a dual-decoder Transformer setup addresses text detection and recognition simultaneously, managing the complex localization of text regions across varying orientations and shapes. This setup is essential for maintaining accuracy while reducing computational demands.
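To make the pipeline concrete, below is a minimal PyTorch sketch of this layout: a Swin backbone feeding a shared Transformer encoder and two task-specific decoders. The module names, dimensions, query count, and the use of torchvision's Swin-T are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Swin backbone + dual-decoder Transformer spotter.
# Dimensions, layer counts, and query count are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import swin_t

class DualDecoderSpotter(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=6, num_queries=100):
        super().__init__()
        self.backbone = swin_t(weights=None).features   # outputs (B, H/32, W/32, 768)
        self.proj = nn.Conv2d(768, d_model, kernel_size=1)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.det_decoder = nn.TransformerDecoder(dec_layer, num_layers)  # localization
        self.rec_decoder = nn.TransformerDecoder(dec_layer, num_layers)  # recognition
        self.queries = nn.Embedding(num_queries, d_model)

    def forward(self, images):
        feats = self.backbone(images).permute(0, 3, 1, 2)      # to channels-first
        tokens = self.proj(feats).flatten(2).transpose(1, 2)   # (B, HW, d_model)
        memory = self.encoder(tokens)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        det = self.det_decoder(q, memory)   # features for text polygons/boxes
        rec = self.rec_decoder(q, memory)   # features for character sequences
        return det, rec

det, rec = DualDecoderSpotter()(torch.randn(1, 3, 224, 224))
```

In this reading, the two decoders share the encoder's memory but specialize through their own queries' supervision, which is what lets detection and recognition proceed simultaneously rather than as a cascaded pipeline.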
SAC2 Attention Module
The SAC2 attention module represents a pivotal enhancement, facilitating faster training and model convergence. By integrating circular convolution into the self-attention mechanism, it handles curved and arbitrarily shaped text robustly, refining text spotting precision (a minimal sketch follows).
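The sketch below is an assumption about how such a design could look, not the authors' code: points along a closed text contour first exchange local context via a depthwise convolution with wrap-around padding, then interact globally through standard multi-head attention.

```python
# Hedged sketch of circular convolution mixed into self-attention: wrap-around
# padding lets the depthwise conv treat the point sequence as a closed loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularConvAttention(nn.Module):
    def __init__(self, d_model=256, nhead=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)
        self.pad = kernel_size // 2

    def forward(self, x):
        # x: (B, N, d_model), where the N points trace a closed contour
        c = F.pad(x.transpose(1, 2), (self.pad, self.pad), mode="circular")
        local = self.conv(c).transpose(1, 2)     # length N preserved by padding
        out, _ = self.attn(local, local, local)  # global self-attention
        return out + x                           # residual connection

y = CircularConvAttention()(torch.randn(2, 16, 256))
```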
A dynamic point update strategy, coupled with reference point sampling, underpins the system's efficiency. By systematically resampling points and refining positional queries through anchor boxes, FastTextSpotter detects text precisely within scenes, improving both learning and inference.
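A minimal sketch of this sampling-and-update loop follows, under an assumed (cx, cy, w, h) box parameterization; the offset head that drives refinement is hypothetical and stubbed out with zeros here.

```python
# Illustrative reference-point sampling and one dynamic update step.
# Boxes are assumed normalized to [0, 1]; the offset predictor is hypothetical.
import torch

def sample_reference_points(boxes, num_points=16):
    """Sample points uniformly along each box's horizontal centerline.
    boxes: (Q, 4) as (cx, cy, w, h) -> points: (Q, num_points, 2)."""
    t = torch.linspace(0.0, 1.0, num_points, device=boxes.device)
    cx, cy, w, h = boxes.unbind(-1)
    xs = cx[:, None] - w[:, None] / 2 + t[None, :] * w[:, None]
    ys = cy[:, None].expand(-1, num_points)
    return torch.stack([xs, ys], dim=-1)

def refine_points(points, offsets):
    """One dynamic update: apply predicted offsets, stay inside the image."""
    return (points + offsets).clamp(0.0, 1.0)

boxes = torch.tensor([[0.5, 0.5, 0.4, 0.2]])              # one anchored query box
points = sample_reference_points(boxes)
points = refine_points(points, torch.zeros_like(points))  # offsets from a head
```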
Figure 2: Text spotting performance (h-mean) versus number of training iterations. The blue curve shows the model without the SAC2 attention module, while the orange curve shows performance with the proposed SAC2 module.
Experimental Results
FastTextSpotter demonstrates notable improvements across benchmark datasets, including ICDAR2015, TotalText, CTW1500, and VinText. Compared to prior models, FastTextSpotter achieves superior accuracy in end-to-end recognition tasks on these benchmarks, particularly outperforming competitors in recall metrics.
Figure 3: Qualitative results of our method on different datasets; zoom in for better visualization. The first two images are from Total-Text, the third and fourth from CTW1500, the fifth and sixth from ICDAR2015, and the last two from VinText.
When compared with state-of-the-art text spotting models, FastTextSpotter shows the best performance in detection precision and efficiency metrics. Its advantage lies in the use of optimized attention mechanisms and dynamic query adjustments, substantially accelerating processing speeds while maintaining high detection accuracy across complex language scenarios.
Conclusion
FastTextSpotter establishes itself as a robust and efficient solution for multilingual scene text spotting, delivering operational efficiency and accuracy improvements over existing models. The framework's versatility across diverse linguistic contexts, combined with its enhanced processing speed, offers significant benefits for real-world OCR tasks. Future work envisages extending FastTextSpotter's capabilities to a wider array of languages and script complexities, further broadening its utility in global text spotting applications.