Reading Text in the Wild with Convolutional Neural Networks (1412.1842v1)

Published 4 Dec 2014 in cs.CV

Abstract: In this work we present an end-to-end system for text spotting -- localising and recognising text in natural scene images -- and text based image retrieval. This system is based on a region proposal mechanism for detection and deep convolutional neural networks for recognition. Our pipeline uses a novel combination of complementary proposal generation techniques to ensure high recall, and a fast subsequent filtering stage for improving precision. For the recognition and ranking of proposals, we train very large convolutional neural networks to perform word recognition on the whole proposal region at the same time, departing from the character classifier based systems of the past. These networks are trained solely on data produced by a synthetic text generation engine, requiring no human labelled data. Analysing the stages of our pipeline, we show state-of-the-art performance throughout. We perform rigorous experiments across a number of standard end-to-end text spotting benchmarks and text-based image retrieval datasets, showing a large improvement over all previous methods. Finally, we demonstrate a real-world application of our text spotting system to allow thousands of hours of news footage to be instantly searchable via a text query.


Overview of "Reading Text in the Wild with Convolutional Neural Networks"

The paper "Reading Text in the Wild with Convolutional Neural Networks" was authored by Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. The authors propose an end-to-end system designed for text spotting—localizing and recognizing text within natural scene images. This pipeline capitalizes on a combination of region proposal mechanisms and deep Convolutional Neural Networks (CNNs) to tackle the dual tasks of text detection and word recognition.

Key Contributions

Novel Text Recognition Method: The paper presents a deep CNN focused on whole-word image recognition via multi-class classification across a 90k-word dictionary. This model departs from traditional character-based recognition approaches. Training is performed exclusively on synthetic data, eliminating the need for human-labeled datasets.
Efficient Detection Strategy: The proposed system combines two complementary region proposal techniques, Edge Boxes and a trained Aggregate Channel Features (ACF) detector, to ensure high recall for word bounding boxes. The proposals are then filtered with a random forest classifier and refined with CNN-based bounding box regression.
Application to Large-Scale Visual Search: A practical application of the pipeline is demonstrated for searching text in a large corpus of archived news footage. The system can retrieve relevant images or video frames in less than a second based on text queries.
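
The whole-word classification idea from the first contribution can be illustrated with a minimal sketch. It assumes the network has already produced one score per dictionary word; the four-word dictionary, the scores, and the optional `lexicon` argument are hypothetical stand-ins for illustration and are not taken from the paper.

```python
import math

def softmax(scores):
    """Turn raw per-word scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_word(scores, dictionary, lexicon=None):
    """Pick the most probable dictionary word for one cropped word image.

    scores     -- one raw score per dictionary entry (assumed network output)
    dictionary -- list of words, same order as `scores`
    lexicon    -- optional subset of words to restrict the answer to
                  (an illustrative assumption, not a detail from the summary)
    """
    probs = softmax(scores)
    candidates = list(range(len(dictionary)))
    if lexicon is not None:
        allowed = set(lexicon)
        candidates = [i for i in candidates if dictionary[i] in allowed]
        if not candidates:
            raise ValueError("no lexicon word is present in the dictionary")
    best = max(candidates, key=lambda i: probs[i])
    return dictionary[best], probs[best]

# Toy stand-in for the 90k-word dictionary and the network output.
toy_dictionary = ["cafe", "exit", "hotel", "open"]
toy_scores = [0.2, 2.5, 0.1, 1.9]

word, prob = classify_word(toy_scores, toy_dictionary)
print(word, round(prob, 3))
```

Because each word is its own class, recognition reduces to a single argmax over the dictionary rather than a sequence of per-character decisions.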

Methodology

The approach consists of several stages:

Word Bounding Box Proposal: The system generates candidate word bounding boxes using the Edge Boxes method and an ACF detector. These methods are combined to achieve a high recall rate, around 98% on standard datasets like ICDAR 2003 and Street View Text (SVT).
Filtering and Refinement: Proposals are filtered using a random forest classifier to reduce false positives. A CNN then performs bounding box regression, adjusting the coordinates of each surviving proposal so that it fits the word more tightly.
Text Recognition: The text recognition stage employs a deep CNN trained on synthetic data. The network performs word classification across a large dictionary, providing state-of-the-art performance on real-world text recognition tasks without the need for real-world labeled training data.
Merging and Ranking: Detections are merged by using the recognition output to score them, then applying non-maximal suppression (NMS) and bounding box regression iteratively. This yields the final localized and recognized words.
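
The merging stage above can be sketched as a greedy non-maximal suppression over recognized detections. This is a simplified illustration, not the paper's exact procedure: it omits the iterative bounding box regression, and the `(box, word, score)` tuple layout and the `iou_threshold` default of 0.3 are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.3):
    """Greedy NMS over (box, word, score) detections.

    Detections are visited from highest to lowest recognition score.
    A detection is dropped if it overlaps an already kept detection
    of the same word by more than `iou_threshold`.
    """
    kept = []
    for box, word, score in sorted(detections, key=lambda d: d[2], reverse=True):
        overlaps_kept = any(
            word == k_word and iou(box, k_box) > iou_threshold
            for k_box, k_word, _ in kept
        )
        if not overlaps_kept:
            kept.append((box, word, score))
    return kept

# Toy detections: two overlapping boxes read as "exit", one separate "open".
detections = [
    ((0, 0, 10, 10), "exit", 0.9),
    ((1, 1, 11, 11), "exit", 0.8),
    ((50, 50, 60, 60), "open", 0.7),
]
kept = nms(detections)
print([(word, score) for _, word, score in kept])
```

Using the recognition score as the ranking signal means that, among overlapping candidates, the box the recognizer is most confident about is the one that survives.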

Experimental Results

The system's efficacy is validated through rigorous testing on various benchmarks:

Text Spotting:

The pipeline outperforms previous methods across multiple datasets such as ICDAR 2003, SVT, and IIIT5k. For example, it reaches an F-measure of 90% on IC03-50 and 76% on SVT-50.
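
For readers unfamiliar with the metric, the F-measure reported above is the harmonic mean of precision and recall. A minimal sketch of the computation follows; the counts in the usage line are made-up toy values, not results from the paper, and the benchmark-specific rules for matching a detection to a ground-truth word are not modeled here.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall from raw detection counts."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy counts for illustration only.
score = f_measure(true_positives=90, false_positives=10, false_negatives=10)
print(round(score, 3))
```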

Text Recognition:

The text recognition CNN achieves high accuracy across datasets even though it is trained solely on synthetic data. It attains a recognition accuracy of 98.7% on IC03-50 and 95.4% on SVT-50.

Implications and Future Directions

The main practical implications of this research are for large-scale image retrieval and video search by text. The ability to index large image and video collections and answer text queries quickly opens new avenues in digital archiving and media management.

More broadly, the ability to train effective models purely on synthetic data points toward new directions in data generation and augmentation for deep learning. This could significantly reduce the dependency on costly and time-consuming human labeling.

Future developments could include extending the recognition model to support multiple languages and scripts, thereby making it more universally applicable. Exploring new detection methods to handle extreme variations in text orientations and styles would also be beneficial.

In conclusion, this paper presents an effective and scalable solution to the text spotting problem, achieving significant benchmark improvements. The use of synthetic data for training and the integration of efficient region proposal techniques and CNNs are pivotal elements of its success.


