COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images

Published 26 Jan 2016 in cs.CV | (1601.07140v2)

Abstract: This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of complex everyday scenes. The images were not collected with text in mind and thus contain a broad variety of text instances. To reflect the diversity of text in natural scenes, we annotate text with (a) location in terms of a bounding box, (b) fine-grained classification into machine printed text and handwritten text, (c) classification into legible and illegible text, (d) script of the text and (e) transcriptions of legible text. The dataset contains over 173k text annotations in over 63k images. We provide a statistical analysis of the accuracy of our annotations. In addition, we present an analysis of three leading state-of-the-art photo Optical Character Recognition (OCR) approaches on our dataset. While scene text detection and recognition enjoys strong advances in recent years, we identify significant shortcomings motivating future work.





Summary

The paper introduces COCO-Text, a new dataset derived from MS COCO, significantly expanding the scale and detail of text annotations in diverse natural scenes.
The study employs a dual annotation approach that integrates state-of-the-art OCR outputs with crowdsourced corrections to address complex challenges in text detection.
The evaluation reveals strong detection precision but limited recall for handwritten and illegible text, highlighting key areas for future research in robust text recognition.


The paper "COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images" by Andreas Veit et al. introduces a large-scale dataset intended to advance the state of the art in text detection and recognition in natural scenes. COCO-Text is built on the MS COCO dataset, which was originally created for object recognition in complex, real-world environments. Because the MS COCO images were not collected with text in mind, the text they contain is highly varied, and annotating it gives the field a dataset that earlier, text-focused collections did not provide.

Annotated Dataset

The COCO-Text dataset contains 63,686 images and 173,589 text annotations, making it significantly larger than previously available datasets such as ICDAR 03 and ICDAR 15. Each text instance is annotated with several attributes: location as a bounding box, legibility (legible or illegible), category (machine printed or handwritten), script (English or not English), and a full transcription if the text is legible. This combination of size and annotation detail makes COCO-Text well suited to developing and evaluating text detection and recognition algorithms.
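To make the annotation attributes concrete, the following is a minimal sketch of how one text instance could be represented in Python. The class name, field names, and the (x, y, width, height) box layout are assumptions chosen for illustration; they are not taken from the official COCO-Text release or its API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TextAnnotation:
    """Hypothetical record for one annotated text instance."""

    image_id: int
    # Assumed box layout: (x, y, width, height) in pixels.
    bbox: Tuple[float, float, float, float]
    legible: bool            # legible vs. illegible
    machine_printed: bool    # machine printed vs. handwritten
    english: bool            # English vs. not English
    # Transcriptions exist only for legible text.
    transcription: Optional[str] = None

    def __post_init__(self) -> None:
        if not self.legible and self.transcription is not None:
            raise ValueError("illegible text should not carry a transcription")


def legible_english(annotations: List[TextAnnotation]) -> List[TextAnnotation]:
    """Select the instances that are both legible and English."""
    return [a for a in annotations if a.legible and a.english]


sample = [
    TextAnnotation(1, (10.0, 20.0, 50.0, 12.0), True, True, True, "STOP"),
    TextAnnotation(1, (80.0, 40.0, 30.0, 10.0), False, False, True),
    TextAnnotation(2, (5.0, 5.0, 40.0, 15.0), True, True, False, "sortie"),
]
readable = legible_english(sample)
```

Filtering on these attributes is one way to build task-specific subsets, for example restricting a recognition experiment to instances that actually have a transcription.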

Annotation and Challenges

The dataset was annotated with a pipeline that combines the outputs of state-of-the-art OCR systems with crowdsourced human annotation. This dual approach helps manage the complexity of text in natural images, where instances vary widely in appearance, may be occluded, and occur in diverse contexts. The accuracy of current OCR methods remains limited, however, and leaves significant room for improvement, particularly on illegible and handwritten text.

Performance Evaluation

The authors evaluate three leading OCR algorithms on the COCO-Text dataset. Detection precision is high, reaching up to 83.78%, but recall is limited, particularly for handwritten and illegible text. This gap between precision and recall highlights current shortcomings in the field and suggests directions for future research, such as developing algorithms that handle a broader range of text types more robustly.
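The relationship between high precision and low recall can be illustrated with a short sketch of detection scoring. It assumes greedy one-to-one matching of predicted to ground-truth boxes at an intersection-over-union (IoU) threshold of 0.5, a common convention; the function names, the box layout, and the matching rule are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predictions: List[Box], ground_truth: List[Box], threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_index, best_score = None, threshold
        for index, truth in enumerate(ground_truth):
            if index in matched:
                continue
            score = iou(pred, truth)
            if score >= best_score:
                best_index, best_score = index, score
        if best_index is not None:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Three annotated instances, but the detector only proposes two boxes.
truth_boxes = [(0, 0, 10, 10), (20, 20, 10, 10), (50, 50, 10, 10)]
predicted_boxes = [(0, 0, 10, 10), (21, 21, 10, 10)]
precision, recall = precision_recall(predicted_boxes, truth_boxes)
```

In this toy case both predictions match a ground-truth box, so precision is 1.0, while one annotated instance goes undetected, so recall is 2/3. A detector that proposes few, confident boxes shows the same pattern.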

Implications for Future Research

COCO-Text gives researchers a benchmark and a diverse set of text images on which to measure the strengths and weaknesses of text detection and recognition algorithms, which in turn can guide future improvements. Because the annotations sit on top of MS COCO images, the dataset also encourages work on the contextual relationships between text and objects in a scene, which could improve scene understanding systems.

Conclusion

The COCO-Text dataset is a valuable resource for the computer vision community, combining scale, diversity, and detailed annotation to support progress in text detection and recognition. The evaluated methods perform well on some parts of the task, but the dataset's scope and design expose the challenges that remain, and it offers a foundation for further research. Addressing these challenges should improve the real-world applicability of text recognition systems.
