
Rosetta: Large scale system for text detection and recognition in images (1910.05085v1)

Published 11 Oct 2019 in cs.CV

Abstract: In this paper we present a deployed, scalable optical character recognition (OCR) system, which we call Rosetta, designed to process images uploaded daily at Facebook scale. Sharing of image content has become one of the primary ways to communicate information among internet users within social networks such as Facebook and Instagram, and the understanding of such media, including its textual information, is of paramount importance to facilitate search and recommendation applications. We present modeling techniques for efficient detection and recognition of text in images and describe Rosetta's system architecture. We perform extensive evaluation of presented technologies, explain useful practical approaches to build an OCR system at scale, and provide insightful intuitions as to why and how certain components work based on the lessons learnt during the development and deployment of the system.

Comprehensive Analysis of Facebook's Rosetta: A Scalable OCR System

The paper "Rosetta: Large scale system for text detection and recognition in images" presents an optical character recognition (OCR) system developed and deployed by Facebook. The system is designed to process the images uploaded daily at Facebook scale, so scalability and efficiency are central design goals for social media environments like Facebook and Instagram.

Overview of Rosetta's Architecture

The architecture of Rosetta is split into two primary components, text detection and text recognition, each independently optimized for processing large quantities of visual data. This two-stage split is advantageous: the stages can run in parallel, and each model can be trained and updated separately.

  1. Text Detection: Using a Faster-RCNN based framework, Rosetta detects regions in an image that are likely to contain text. This stage uses a convolutional neural network (CNN), ShuffleNet, as its detection backbone, and is trained in stages on a mix of synthetic data, COCO-Text, and human-annotated datasets.
  2. Text Recognition: After detection, a character-based recognition model processes the detected text regions. It is a fully-convolutional model trained with a Connectionist Temporal Classification (CTC) loss. Notably, the model recognizes text in a lexicon-free manner, which gives it the flexibility to handle text across a variety of languages and formats, a vital property for a platform like Facebook.
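To make the lexicon-free recognition step concrete, here is a minimal sketch of greedy (best-path) CTC decoding, the simplest way to turn per-frame character scores into a string without a dictionary. It is an illustration only, not Rosetta's implementation: the function name `ctc_greedy_decode`, the `charset` argument, and the convention that label 0 is the CTC blank are all assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumption for this sketch: label 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, charset):
    """Greedy CTC decoding of a sequence of per-frame score vectors.

    frame_scores: list of score lists, one per horizontal frame; index 0 is
                  the blank, index k > 0 corresponds to charset[k - 1].
    charset:      string of recognizable characters (hypothetical alphabet).
    """
    # 1. Take the highest-scoring label in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining labels to characters.
    return "".join(charset[label - 1] for label in collapsed if label != BLANK)
```

Because the output is assembled character by character, any string over the character set can be produced, which is what makes the approach independent of a fixed word list. Note that a blank between two identical labels is what allows doubled letters (such as the "ll" in "hello") to survive the merge step.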

Performance and Empirical Findings

The system's performance is measured with mean average precision (mAP) for detection and with accuracy and edit distance for recognition. Rosetta's training recipe, synthetic data pre-training followed by fine-tuning on COCO-Text and human-annotated datasets, substantially improved detection precision, raising mAP by 57% over base models trained solely on synthetic data.
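The recognition metrics mentioned above can be illustrated with a short sketch. It computes exact-match accuracy and mean Levenshtein edit distance over a set of predictions. The function names and the choice to average the raw (unnormalized) distance are assumptions for this example; the paper's exact evaluation protocol may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def recognition_metrics(predictions, references):
    """Return (exact-match accuracy, mean edit distance) for paired strings."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")
    n = len(references)
    exact = sum(p == r for p, r in zip(predictions, references))
    total = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return exact / n, total / n
```

Accuracy only rewards fully correct words, while edit distance gives partial credit: a prediction of "wor1d" for "world" counts as a miss for accuracy but contributes a distance of only 1.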

Further technical adjustments, like replacing standard Non-Maximum Suppression (NMS) with SoftNMS, led to incremental mAP improvements. The text recognition model's transition to employing a fully-convolutional CTC architecture marks a notable evolution over traditional character model frameworks, significantly reducing the model's parameter complexity and enhancing recognition accuracy by 48.06%.
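The difference between standard NMS and SoftNMS can be sketched as follows. Standard NMS discards every box whose overlap with a higher-scoring box exceeds a threshold, whereas SoftNMS lowers the scores of overlapping boxes instead. The sketch below uses the Gaussian decay variant; which variant and which parameters Rosetta uses is not stated here, so the decay form, `sigma`, `score_threshold`, and all function names are assumptions.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS.

    Returns a list of (box_index, final_score) in the order boxes were
    selected. Overlapping boxes are down-weighted, not removed outright.
    """
    candidates = [(score, idx) for idx, score in enumerate(scores)]
    kept = []
    while candidates:
        # Select the candidate with the highest current score.
        best_score, best_idx = max(candidates)
        candidates.remove((best_score, best_idx))
        kept.append((best_idx, best_score))
        # Decay the scores of the remaining boxes by their overlap with it.
        rescored = []
        for score, idx in candidates:
            overlap = iou(boxes[best_idx], boxes[idx])
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:  # drop boxes whose score has vanished
                rescored.append((score, idx))
        candidates = rescored
    return kept
```

For text detection, the soft decay is useful when words sit close together: a neighbouring word's box that heavily overlaps a confident detection is kept with a reduced score rather than being suppressed completely.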

Implications and Future Developments

Practically, Rosetta offers a blueprint for building OCR systems at scale, combining high-throughput processing with the responsiveness that modern social media platforms require. Its ability to handle a broad range of text styles and languages without a fixed dictionary demonstrates the system's adaptability.

In theoretical terms, Rosetta shows that efficient CNN-based models can remain effective even under extensive input variability. The use of curriculum learning when training the CTC model also suggests improved training procedures for sequence prediction tasks beyond OCR.

Looking forward, the advancement of such systems is likely to intersect with ongoing work on integrating natural language processing and computer vision. As Facebook's computational capacity grows, it becomes possible to integrate more sophisticated language and context understanding into visual recognition systems. This could lead not only to better text recognition but also to deeper semantic interpretation of content across multimedia platforms.

In conclusion, Rosetta exemplifies a state-of-the-art OCR system balancing accuracy, speed, and scalability, driving further exploration into optimizing OCR for varied applications within vast datasets inherent in social media ecosystems.

Authors (3)
  1. Fedor Borisyuk (13 papers)
  2. Albert Gordo (18 papers)
  3. Viswanath Sivakumar (6 papers)
Citations (282)