Synthetic Data for Text Localisation in Natural Images (1604.06646v1)

Published 22 Apr 2016 in cs.CV

Abstract: In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.

Authors (3)
  1. Ankush Gupta (19 papers)
  2. Andrea Vedaldi (195 papers)
  3. Andrew Zisserman (248 papers)
Citations (1,374)

Summary

Synthetic Data for Text Localization in Natural Images: An Expert Overview

In this paper, Gupta, Vedaldi, and Zisserman present two contributions to text localization in natural images: a scalable synthetic data generation engine and a new deep learning architecture termed the Fully-Convolutional Regression Network (FCRN). The method departs from traditional text-spotting pipelines by training on synthetic images that simulate natural scenes and by using a deep model optimized for fast, accurate text detection.

Synthesis of Training Data

The paper’s first major contribution is the development of a scalable engine that creates synthetic images of text integrated into various natural background scenes. The synthesis process involves several steps:

  1. Text and Background Sampling: Extracting text from the Newsgroup20 dataset and selecting clean background images from Google Image Search, ensuring no pre-existing text.
  2. Segmentation and Depth Estimation: Employing the gPb-UCM contour detector and a CNN to predict depth maps, allowing the text to be aligned with the 3D geometry of the scene.
  3. Text Rendering and Blending: Text colors are selected based on background regions, and rendering exploits Poisson image editing to blend text naturally into the scene.

These steps ensure that generated images respect the physical constraints of natural environments, including local color and texture coherence, and 3D perspectives. They result in SynthText in the Wild, a synthetic dataset comprising 800,000 images annotated at the word level.
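
To make the blending step concrete, the sketch below composites a pre-rendered text image into a chosen background region using Poisson (seamless) blending via OpenCV. This is an illustrative simplification under stated assumptions, not the authors' released engine: the helper name, the toy inputs, and the use of cv2.seamlessClone are choices made here for brevity, and the real pipeline additionally warps the text onto the plane estimated from the CNN-predicted depth map and selects host regions from the gPb-UCM segmentation.

```python
import numpy as np
import cv2  # OpenCV; used here only for Poisson (seamless) blending


def render_synthetic_text(background_bgr, region_mask, text_image_bgr):
    """Blend a pre-rendered text image into a chosen background region.

    background_bgr : HxWx3 uint8 background photo (assumed text-free)
    region_mask    : HxW uint8 mask of the segment chosen to host the text
    text_image_bgr : HxWx3 uint8 rendering of the text (perspective warp to
                     the region's estimated plane is not shown here)
    """
    # Centre of the hosting region, required by OpenCV's seamless cloning.
    ys, xs = np.nonzero(region_mask)
    centre = (int(xs.mean()), int(ys.mean()))

    # Poisson image editing: blends text gradients into the background so the
    # result respects local colour and illumination.
    return cv2.seamlessClone(text_image_bgr, background_bgr,
                             region_mask, centre, cv2.NORMAL_CLONE)


if __name__ == "__main__":
    # Toy example with synthetic inputs; in the real engine the background
    # comes from an image collection, the mask from segmentation filtered by
    # depth, and the text rendering from a font library.
    bg = np.full((240, 320, 3), 180, dtype=np.uint8)
    mask = np.zeros((240, 320), dtype=np.uint8)
    mask[80:160, 100:220] = 255
    txt = bg.copy()
    cv2.putText(txt, "SynthText", (105, 130),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (30, 30, 30), 2)
    out = render_synthetic_text(bg, mask, txt)
    print(out.shape)
```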

Fully-Convolutional Regression Network (FCRN)

The second contribution lies in the text detection methodology via a new model, FCRN. This network is designed to predict text presence and bounding boxes across multiple scales in a single forward pass over the image, significantly enhancing efficiency. The architecture includes:

  • Feature Extraction: Using a stack of convolutional layers analogous to the VGG-16 network, albeit much smaller to enhance efficiency.
  • Dense Regression: Each grid cell predicts object presence and bounding box parameters, inspired by the YOLO (You Only Look Once) framework but differing in that FCRN employs shared weights among predictors for translation invariance.
  • Multi-Scale Detection: To handle text instances of varying sizes, the model processes images at multiple scaled resolutions, merging results via non-maximal suppression.
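
The sketch below illustrates the dense-regression idea in PyTorch: a small convolutional feature stack followed by a 1x1 convolutional predictor that outputs, at every grid cell, a text-presence confidence plus a vector of bounding-box parameters, with weights shared across all spatial locations. The layer sizes and the number of box parameters are placeholders chosen here for brevity, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TinyFCRNHead(nn.Module):
    """Illustrative fully-convolutional detection head (not the paper's exact
    architecture): a VGG-style feature stack followed by a shared 1x1
    convolutional predictor applied at every grid cell."""

    def __init__(self, box_params=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        # One prediction vector per grid cell: 1 confidence + box parameters.
        self.predictor = nn.Conv2d(128, 1 + box_params, kernel_size=1)

    def forward(self, images):
        grid = self.features(images)     # B x 128 x H/8 x W/8
        out = self.predictor(grid)       # B x (1 + box_params) x H/8 x W/8
        confidence = torch.sigmoid(out[:, :1])
        boxes = out[:, 1:]
        return confidence, boxes


if __name__ == "__main__":
    model = TinyFCRNHead()
    conf, boxes = model(torch.randn(1, 3, 256, 256))
    print(conf.shape, boxes.shape)  # (1, 1, 32, 32) and (1, 6, 32, 32)
```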

Empirical Evaluation

The system's performance is validated on multiple benchmarks: ICDAR 2011, ICDAR 2013, and the SVT dataset. Key results include:

  • Text Localization: The FCRN outperforms existing methods, achieving an F-measure of 84.2% on the ICDAR 2013 benchmark. Notably, the multi-scale application of FCRN improves recall compared to single-scale detection.
  • Speed: Multi-scale FCRN processes up to 15 images per second on a GPU. This is a significant improvement over prior methods, which typically take seconds per image.
  • Ablation Studies: Experiments with different synthetic data configurations show that blending text so that it respects the color and texture boundaries of local regions is critical for realistic data generation and substantially boosts detection performance.
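
The multi-scale evaluation pools FCRN detections from several rescaled copies of the image and removes duplicates with non-maximum suppression. The following is a minimal sketch of such a merging step; it assumes axis-aligned boxes and a fixed IoU threshold for illustration, rather than reproducing the paper's exact procedure.

```python
import numpy as np


def merge_multiscale_detections(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over detections pooled from all scales.

    boxes  : N x 4 array of (x1, y1, x2, y2) in original-image coordinates
             (per-scale outputs are assumed to be rescaled before pooling)
    scores : N text-presence confidences from the FCRN grid cells
    """
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # IoU of the best box against the remaining candidates.
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter + 1e-9)
        order = order[1:][iou < iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = np.array([[10, 10, 60, 30], [12, 11, 62, 31], [100, 50, 180, 80]], dtype=float)
    scores = np.array([0.9, 0.8, 0.7])
    print(merge_multiscale_detections(boxes, scores))  # -> [0, 2]
```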

Theoretical and Practical Implications

The work has several implications:

  1. Synthetic Data Viability: The successful application of synthetic data to train complex models marks a significant shift, suggesting that real-world-like training samples can be synthetically generated at scale.
  2. End-to-End Text Spotting: By integrating their detection pipeline with a recognition model, end-to-end text spotting performance is substantially enhanced, as reflected by higher F-measures compared to previous methods.
  3. Future Developments: This paper opens avenues for further refinement in synthetic data techniques and architectural optimizations for related applications in computer vision, such as object detection and scene understanding.

Conclusion

Gupta, Vedaldi, and Zisserman's paper pushes the envelope in text localization through synthetic data and an innovative FCRN model, offering a robust solution with improved speed and accuracy. The approach exemplifies the potential of synthetic datasets for overcoming the scarcity of annotated data, and it suggests that future research in AI may increasingly leverage synthetic yet realistic training data for diverse applications.