Data Augmentation for Scene Text Recognition (2108.06949v1)

Published 16 Aug 2021 in cs.CV

Abstract: Scene text recognition (STR) is a challenging task in computer vision due to the large number of possible text appearances in natural scenes. Most STR models rely on synthetic datasets for training since there are no sufficiently big and publicly available labelled real datasets. Since STR models are evaluated using real data, the mismatch between training and testing data distributions results in poor model performance, especially on challenging text affected by noise, artifacts, geometry, structure, etc. In this paper, we introduce STRAug, which is made of 36 image augmentation functions designed for STR. Each function mimics certain text image properties that can be found in natural scenes, caused by camera sensors, or induced by signal processing operations but poorly represented in the training dataset. When applied to strong baseline models using RandAugment, STRAug significantly increases the overall absolute accuracy of STR models across regular and irregular test datasets by as much as 2.10% on Rosetta, 1.48% on R2AM, 1.30% on CRNN, 1.35% on RARE, 1.06% on TRBA and 0.89% on GCRNN. The diversity and simplicity of the API provided by STRAug functions enable easy replication and validation of existing data augmentation methods for STR. STRAug is available at https://github.com/roatienza/straug.

Data Augmentation for Scene Text Recognition

The paper "Data Augmentation for Scene Text Recognition" examines the challenges of Scene Text Recognition (STR), the task of reading text embedded in natural images. The primary challenge is the diverse and often unpredictable appearance of text, which varies in geometry, noise, artifacts, and other factors introduced by natural environments. Because no sufficiently large labeled real-world datasets are publicly available, most STR models are trained on synthetic data and then evaluated on real data. To narrow this gap, the paper proposes a data augmentation library, STRAug, that improves STR model performance.

STRAug and Evaluation

STRAug is a library of 36 image augmentation functions designed explicitly for STR. Each function simulates a real-world condition or imperfection that is poorly represented in synthetic training datasets. Applying STRAug through a RandAugment strategy, the authors report improvements in overall absolute accuracy across several established STR models: as much as 2.10% on Rosetta, 1.48% on R2AM, 1.35% on RARE, 1.30% on CRNN, 1.06% on TRBA, and 0.89% on GCRNN. These gains are measured on regular and irregular text test datasets such as ICDAR and CUTE80.
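To make the RandAugment idea concrete, here is a minimal sketch of such a policy: draw a fixed number of operations at random from a pool and apply each at a shared magnitude. It does not use the STRAug API. The three operations (`invert`, `brighten`, `shift_right`), the nested-list image format, and the parameter names `n_ops` and `magnitude` are all assumptions made for illustration.

```python
import random

# Images are modelled as 2D lists of grayscale values in [0, 255].
# The three ops below are hypothetical stand-ins, NOT STRAug functions.

def invert(img, mag):
    """Blend each pixel toward its inverse; mag is expected in [0, 1]."""
    return [[round(p + mag * ((255 - p) - p)) for p in row] for row in img]

def brighten(img, mag):
    """Add a magnitude-dependent offset, clipped to 255."""
    return [[min(255, round(p + mag * 255)) for p in row] for row in img]

def shift_right(img, mag):
    """Shift columns right by a magnitude-dependent offset, padding with 0."""
    k = round(mag * len(img[0]))
    return [[0] * k + row[:len(row) - k] for row in img]

OPS = [invert, brighten, shift_right]

def rand_augment(img, n_ops=2, magnitude=0.3, rng=None):
    """RandAugment-style policy: sample n_ops ops uniformly (with
    replacement) and apply them in sequence at the same magnitude."""
    rng = rng or random.Random()
    out = img
    for op in rng.choices(OPS, k=n_ops):
        out = op(out, magnitude)  # each op returns a new image
    return out
```

Calling `rand_augment(img, rng=random.Random(0))` with a seeded generator makes the sampled policy reproducible. In an actual training setup the pool would presumably hold the library's augmentation functions instead of these stand-ins.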

Analysis of Augmentation Groups

STRAug's functions are categorized into 8 logical groups: Warp, Geometry, Noise, Blur, Weather, Camera, Pattern, and Process. Each group targets a different aspect of the text image variability observed in natural scenes. The Warp group simulates geometric deformations such as curved or distorted text, whereas the Blur group simulates effects like motion blur that might arise from camera imperfections or weather conditions. Through an ablation study using the RARE model, the paper investigates the contribution of each augmentation category, finding notable improvements particularly with the Blur, Noise, and Geometry groups.
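The grouping above can be sketched as a registry keyed by group name. The group names come from the text. Everything else is an assumption for illustration: the placeholder op names, the two-per-group split, the rule of sampling distinct groups and then one function per group, and the one-group-at-a-time reading of the ablation.

```python
import random

# Group names as listed in the text.
GROUP_NAMES = ["Warp", "Geometry", "Noise", "Blur",
               "Weather", "Camera", "Pattern", "Process"]

# Hypothetical placeholder op names, two per group. The real library has
# 36 functions; their per-group split is not stated here.
REGISTRY = {g: [f"{g.lower()}_op_{i}" for i in (1, 2)] for g in GROUP_NAMES}

def sample_policy(registry, n_groups, rng):
    """Pick n_groups distinct groups, then one op from each (assumed rule)."""
    groups = rng.sample(sorted(registry), n_groups)
    return [(g, rng.choice(registry[g])) for g in groups]

def single_group_registries(registry):
    """Yield (group, registry restricted to that group), one per group,
    as the pools a per-group ablation might train with."""
    for g in registry:
        yield g, {g: registry[g]}
```

Sampling at the group level first keeps a single large group from dominating the policy, which is one plausible reason to organize augmentations this way.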

Comparative Study

STRAug was benchmarked against data augmentation strategies from recent STR works, namely the SRN and PP-OCR augmentation methods. With models trained on synthetic datasets (MJSynth and SynthText) and evaluated on real-world datasets (ICDAR, SVT, CUTE80), STRAug consistently delivered larger accuracy improvements. The summary attributes this to its extensive and fine-grained set of augmentations, which more closely mimic the conditions encountered in real-world text recognition.

Implications and Future Directions

STRAug has implications for both practical applications and theoretical work in STR. Practically, it lets researchers train on data that aligns more closely with real-world evaluation scenarios, which may lead to more reliable deployed text recognition systems. Theoretically, it raises the question of how far data augmentation can bridge the domain gap between synthetic and real data, which remains an open challenge in computer vision.

Future developments in AI and STR might explore adaptive augmentation techniques where models dynamically adjust the types and intensities of augmentations based on the evaluation data. Such advancements could further reduce the discrepancies between training and testing distributions, pushing the boundaries of STR capabilities.

In conclusion, this paper offers a comprehensive framework for improving STR model performance through a nuanced understanding and application of data augmentation techniques, evidenced by STRAug's significant performance gains across multiple baselines.

Authors

Rowel Atienza

GitHub

roatienza/straug: Image transformations designed for Scene Text Recognition (STR) data augmentation. Published at ICCV 2021 Workshop on Interactive Labeling and Data Augmentation for Vision.