A Text Attention Network for Spatial Deformation Robust Scene Text Image Super-resolution
The paper introduces a CNN-based Text Attention Network (TATT) designed to address the limitations of existing deep-learning methods in super-resolving scene text images that exhibit spatial deformations such as rotations and curved layouts. Scene Text Image Super-resolution (STISR) enhances the resolution and readability of low-resolution text images, which in turn improves downstream tasks such as scene text recognition.
Overview of Methods
The TATT model integrates a text recognition module that extracts text semantics, which serve as prior information to guide text reconstruction. Unlike conventional CNNs, whose local operations cannot effectively handle spatial deformations, TATT leverages a novel transformer-based module built on global attention. This module enables long-range interaction between text semantics and image features, improving robustness to spatial variations in text images.
The methodology is structured into three key stages:
- Text Prior Extraction: A text recognition module first generates text prior information that encapsulates the semantic content of the low-resolution image (see the first sketch after this list).
- Global Attention Mechanism: The core innovation is the TP Interpreter, a transformer-based module that applies global cross-attention, enabling rich, long-range interaction between the extracted text prior and image features in the spatial domain; this makes reconstruction robust on deformed text images (second sketch below).
- Text Structure Consistency Loss: This loss refines the visual appearance of the output by enforcing structural consistency across reconstructions of regular and deformed versions of the same text image, improving visual quality and disambiguating text content in distorted inputs (third sketch below).
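To make the pipeline concrete, here is a minimal PyTorch sketch of the text-prior branch. It assumes a CRNN-style recognizer that emits a sequence of per-position character probabilities; all names, layer sizes, and the 37-class alphabet are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TextPriorExtractor(nn.Module):
    """Hypothetical sketch of the text-prior branch: a lightweight CRNN-style
    recognizer mapping a low-resolution text image to a sequence of per-step
    character probabilities (the "text prior")."""

    def __init__(self, num_classes=37, hidden=256):  # 26 letters + 10 digits + blank (assumed)
        super().__init__()
        # Shallow conv stack that collapses height, keeping width as the sequence axis.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),          # (B, hidden, 1, W')
        )
        self.rnn = nn.GRU(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, lr_image):                      # (B, 3, H, W)
        feat = self.backbone(lr_image).squeeze(2)     # (B, hidden, W')
        seq, _ = self.rnn(feat.transpose(1, 2))       # (B, W', hidden)
        return self.head(seq).softmax(dim=-1)         # (B, W', num_classes): the text prior
```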
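The TP Interpreter can likewise be sketched as a single cross-attention block in which flattened image features query the text-prior sequence, so every spatial location can consult the recognized semantics no matter where the characters actually sit in the image. Dimensions, layer counts, and the query/key assignment are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TPInterpreter(nn.Module):
    """Minimal sketch of the transformer-based TP Interpreter: image features
    attend globally to the text prior via cross-attention. The channel width
    of img_feat is assumed to equal `dim`."""

    def __init__(self, dim=64, num_classes=37, heads=4):
        super().__init__()
        self.embed_tp = nn.Linear(num_classes, dim)   # lift the text prior into feature space
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, img_feat, text_prior):
        # img_feat: (B, C, H, W) spatial features; text_prior: (B, L, num_classes)
        b, c, h, w = img_feat.shape
        q = img_feat.flatten(2).transpose(1, 2)       # (B, H*W, C): queries from the image
        kv = self.embed_tp(text_prior)                # (B, L, C): keys/values from the prior
        attended, _ = self.cross_attn(q, kv, kv)      # global semantic guidance
        x = self.norm1(q + attended)
        x = self.norm2(x + self.ffn(x))
        return x.transpose(1, 2).reshape(b, c, h, w)  # back to the spatial layout
```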
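Finally, a hedged sketch of a text structure consistency term: super-resolving a deformed (here, rotated) input should roughly agree with deforming the super-resolution of the original input. The choice of rotation as the deformation and an L1 penalty as the structural measure are stand-ins; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate

def text_structure_consistency_loss(model, lr_image, angle=15.0):
    """Illustrative consistency term: the SR of a rotated input should match
    the rotated SR of the original input. `angle` and the L1 penalty are
    placeholder choices, not the paper's exact terms."""
    sr = model(lr_image)                              # SR of the regular image
    sr_of_rot = model(rotate(lr_image, angle))        # SR of the deformed input
    rot_of_sr = rotate(sr, angle)                     # deformed copy of the SR output
    return F.l1_loss(sr_of_rot, rot_of_sr)
```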
Experimental Results
TATT is evaluated on the TextZoom dataset, where it achieves superior PSNR and SSIM scores along with notable improvements in downstream text recognition accuracy. It reaches state-of-the-art performance, even outperforming multi-stage models such as TPGSR. Adding the Text Structure Consistency Loss further improves the reconstruction of spatially deformed text, which is particularly beneficial in challenging scenarios such as curved or rotated text.
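For reference, PSNR, one of the two fidelity metrics reported above, reduces to a few lines of PyTorch; this is the generic definition, not code from the paper.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio (dB) between a super-resolved image and its
    high-resolution ground truth; pixel values assumed scaled to [0, max_val]."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```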
Implications and Future Directions
The practical impact of TATT lies in its ability to boost OCR accuracy in real-world scenarios, where text images often present distortions and challenging perspectives. By using transformer-based global attention, TATT opens avenues not only for improved text image super-resolution but also for applications in domains that require handling spatially complex data, such as medical imaging or augmented reality systems.
Theoretically, the work extends the existing literature by merging text recognition and super-resolution in a unified framework, which can inspire novel transformer architectures beyond traditional CNN designs.
Looking forward, challenges remain in reducing computational demands and improving robustness to extreme image distortions. Lowering the cost of transformer-based global attention in particular could extend applicability to real-time or resource-constrained settings. Further evaluation of TATT on diverse datasets and at higher resolutions may also offer insights into its transfer learning capabilities and scaling behavior.
Overall, the paper delivers significant advancements in STISR through an innovative application of attention mechanisms, laying groundwork for future research at the intersection of super-resolution and semantic image processing.