UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World
UnrealText presents a significant advance in synthesizing realistic scene text images by leveraging a 3D graphics engine. It addresses a long-standing limitation of existing synthesis methods: in scene text detection especially, synthetic data has been ineffective enough that models have continued to rely heavily on real-world, manually annotated data.
Overview of UnrealText Synthesis Method
UnrealText uses Unreal Engine 4 (UE4) to synthesize scene text images that are visually realistic, integrating text and scene cohesively. This approach overcomes the limitations of previous methods, which embed text into 2D images without accounting for scene-level factors such as lighting, occlusion, and perspective transformation. By treating text as part of the three-dimensional environment, UnrealText produces rendered images with photo-realistic properties that resemble natural scenes.
Key Components and Innovations
- Viewfinder Module: Systematically explores the virtual 3D scenes to find diverse, viable camera viewpoints, discarding trivial ones that would add little diversity to the dataset.
- Environment Randomization: Lighting and other environmental factors are randomly varied to mimic real-world conditions, further increasing the diversity and realism of the synthetic data (a minimal sketch of this sampling-and-randomization loop follows the list).
- Text Region Generation: With direct access to 3D mesh information, text regions are proposed precisely on object surfaces, so text integrates naturally into the scene. This contrasts with earlier methods, which often positioned text inaccurately because they relied on imprecise estimates of scene geometry.
- Text Rendering: Text is rendered as textures applied to planar meshes aligned with the proposed regions, so it blends seamlessly into the 3D environment; because the mesh corners are known exactly, tight 2D ground truth can be obtained by projection, as in the second sketch after this list.
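The viewfinder and environment-randomization steps can be summarized as a sampling loop: propose a camera pose, reject trivial views, randomize lighting, then capture. The sketch below is a minimal, engine-agnostic illustration of that loop in Python; the `Engine` interface, its method names, and all parameter ranges are hypothetical stand-ins for illustration, not UE4 or UnrealText APIs.

```python
import random
from dataclasses import dataclass

@dataclass
class CameraPose:
    position: tuple   # (x, y, z) in scene units
    rotation: tuple   # (pitch, yaw, roll) in degrees

def sample_camera_pose(anchor, max_offset=200.0):
    """Randomly perturb a predefined anchor pose to get a new viewpoint."""
    x, y, z = anchor.position
    position = (x + random.uniform(-max_offset, max_offset),
                y + random.uniform(-max_offset, max_offset),
                z + random.uniform(-max_offset, max_offset))
    rotation = (random.uniform(-30, 30),    # pitch
                random.uniform(0, 360),     # yaw
                random.uniform(-10, 10))    # roll
    return CameraPose(position, rotation)

def is_valid_view(engine, pose, min_depth=50.0, min_depth_std=10.0):
    """Reject trivial viewpoints: a camera stuck inside geometry or staring
    at a flat, featureless surface (judged from the rendered depth map)."""
    depth = engine.render_depth(pose)           # hypothetical engine call
    return depth.mean() > min_depth and depth.std() > min_depth_std

def randomize_environment(engine):
    """Mimic real-world variation by jittering the scene's light sources."""
    for light in engine.get_lights():           # hypothetical engine call
        light.set_intensity(random.uniform(0.3, 3.0))
        light.set_color((random.uniform(0.7, 1.0),
                         random.uniform(0.7, 1.0),
                         random.uniform(0.7, 1.0)))

def generate_views(engine, anchors, num_images):
    """Main loop: sample viewpoints, randomize lighting, capture images."""
    images = []
    while len(images) < num_images:
        pose = sample_camera_pose(random.choice(anchors))
        if not is_valid_view(engine, pose):
            continue                            # skip trivial viewpoints
        randomize_environment(engine)
        images.append(engine.capture(pose))     # hypothetical engine call
    return images
```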
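Because text is placed on planar meshes whose 3D corner coordinates are known exactly, tight 2D ground-truth boxes can be obtained by projecting those corners through the camera rather than estimating them from the image. The snippet below is a minimal sketch of that projection under a standard pinhole camera model; the intrinsics and the world-to-camera transform are assumed inputs, and the example values are illustrative only.

```python
import numpy as np

def project_text_region(corners_world, world_to_camera, fx, fy, cx, cy):
    """Project the four 3D corners of a text plane (world coordinates)
    to pixel coordinates, yielding an exact quadrilateral ground-truth box.

    corners_world   : (4, 3) array of 3D corner positions
    world_to_camera : (4, 4) rigid transform from world to camera frame
    fx, fy, cx, cy  : pinhole intrinsics (focal lengths, principal point)
    """
    corners = np.asarray(corners_world, dtype=float)
    homogeneous = np.hstack([corners, np.ones((4, 1))])        # (4, 4)
    cam = (world_to_camera @ homogeneous.T).T[:, :3]           # camera frame

    # Perspective division; z must be positive (in front of the camera).
    assert np.all(cam[:, 2] > 0), "text plane is behind the camera"
    u = fx * cam[:, 0] / cam[:, 2] + cx
    v = fy * cam[:, 1] / cam[:, 2] + cy
    return np.stack([u, v], axis=1)                            # (4, 2) pixels

# Example: a 2x1 (scene units) text plane five units in front of the camera.
corners = [[-1, -0.5, 5], [1, -0.5, 5], [1, 0.5, 5], [-1, 0.5, 5]]
identity_pose = np.eye(4)                      # camera at the world origin
quad = project_text_region(corners, identity_pose, 800, 800, 320, 240)
print(quad)   # pixel-space quadrilateral used as detection ground truth
```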
Empirical Evaluation
The efficacy of UnrealText was validated through extensive experiments on scene text detection and recognition. Evaluated on benchmarks such as ICDAR 2013, ICDAR 2015, and MLT 2017, models trained on UnrealText data showed notable improvements over those trained on earlier synthetic datasets such as SynthText and VISD. For example, with 10,000 training images, UnrealText achieved an F1 score of 65.2% on ICDAR 2015, outperforming SynthText (46.3%) and VISD (64.3%).
For recognition, UnrealText also delivered superior performance, particularly in scenarios with challenging text attributes. Mean accuracy across the evaluated datasets increased substantially when models were trained on UnrealText, confirming that the synthesized data is robust and applicable for training state-of-the-art recognition systems.
Implications and Future Directions
UnrealText's approach to synthesizing text images paves the way for enriching training datasets without the prohibitive cost and effort associated with manual data collection and annotation. The integration of multilingual datasets further broadens the applicability of UnrealText across diverse language contexts, supporting applications from autonomous vehicle navigation to real-time language translation.
Future developments could focus on further enhancing realism by integrating more complex environmental interactions or exploring procedural generation techniques to autonomously create scene diversity. Additionally, combining UnrealText with real data could improve domain adaptation capabilities, reducing performance gaps between training on synthetic versus real-world data.
In conclusion, UnrealText represents a pivotal development in the creation of synthetic datasets for scene text detection and recognition, offering a scalable and versatile way to meet the data demands of contemporary machine learning models. These advances hold significant potential for expanding research in multilingual scene text recognition and broader computer vision tasks.