
ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition -- RRC-MLT-2019 (1907.00945v1)

Published 1 Jul 2019 in cs.CV

Abstract: With the growing cosmopolitan culture of modern cities, the need of robust Multi-Lingual scene Text (MLT) detection and recognition systems has never been more immense. With the goal to systematically benchmark and push the state-of-the-art forward, the proposed competition builds on top of the RRC-MLT-2017 with an additional end-to-end task, an additional language in the real images dataset, a large scale multi-lingual synthetic dataset to assist the training, and a baseline End-to-End recognition method. The real dataset consists of 20,000 images containing text from 10 languages. The challenge has 4 tasks covering various aspects of multi-lingual scene text: (a) text detection, (b) cropped word script classification, (c) joint text detection and script classification and (d) end-to-end detection and recognition. In total, the competition received 60 submissions from the research and industrial communities. This paper presents the dataset, the tasks and the findings of the presented RRC-MLT-2019 challenge.

An Overview of the ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition

The paper presents an extensive analysis of the ICDAR2019 Robust Reading Challenge (RRC-MLT-2019), focusing on multi-lingual scene text detection and recognition tasks. It is an extension of the 2017 edition, enhancing its scope by including an additional language and introducing an end-to-end recognition task. The main objective of the competition was to benchmark and advance the state-of-the-art in multi-lingual text detection and recognition by leveraging both real and synthetic datasets.

Dataset Overview

The paper introduces a comprehensive dataset comprising 20,000 real images featuring text from ten languages: Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean. To aid in training, a large-scale synthetic dataset of 277,000 images is also provided. This dual structure targets the training needs of robust multi-lingual scene text recognition systems. The real dataset is evenly split between training and testing, with annotations at the word level, while the synthetic data is designed to complement the real dataset, specifically to support the newly introduced end-to-end recognition task.
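Word-level ground truth in the RRC-MLT challenges is conventionally stored as one line per word: eight comma-separated quadrilateral corner coordinates, followed by a script label and the transcription. The sketch below assumes that layout (the exact field order is an assumption based on the RRC ground-truth convention, not quoted from this paper):

```python
# Hypothetical parser for an RRC-MLT-style ground-truth line.
# Assumed layout (one word per line): x1,y1,x2,y2,x3,y3,x4,y4,script,transcription
def parse_gt_line(line):
    parts = line.strip().split(",", 9)  # transcription may itself contain commas
    coords = [int(p) for p in parts[:8]]
    quad = list(zip(coords[0::2], coords[1::2]))  # four (x, y) corners
    return {"quad": quad, "script": parts[8], "transcription": parts[9]}

# Illustrative sample line (made-up coordinates and word):
sample = "377,117,463,117,465,130,378,130,Latin,Fossil"
word = parse_gt_line(sample)
```

Splitting with a maximum of nine splits keeps any commas inside the transcription intact, which matters for punctuation-bearing words.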

Challenge Tasks and Evaluation

The challenge consisted of four tasks:

  1. Multi-lingual Text Detection: Focused on accurately localizing text at the word level. The evaluation was based on the f-measure derived from precision and recall metrics.
  2. Cropped Word Script Identification: A classification task where the aim was to determine the script of a cropped word image from a set of eight distinct scripts. Classification accuracy was the evaluation metric employed.
  3. Joint Text Detection and Script Identification: This task combined detection and script classification, relying on both accurate localization and correct script identification to score a match.
  4. End-to-End Text Detection and Recognition: A challenging task where systems needed to both detect and recognize text, building on the capabilities required by the preceding tasks.
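The detection f-measure in tasks like these is typically computed by one-to-one matching of predicted and ground-truth regions at an IoU threshold of 0.5. A minimal sketch, using axis-aligned boxes for simplicity (the challenge itself scores arbitrary quadrilaterals):

```python
def iou(a, b):
    # a, b: axis-aligned boxes (x1, y1, x2, y2); a simplification of the
    # quadrilateral overlap used in the actual evaluation
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def detection_fmeasure(preds, gts, thresh=0.5):
    # Greedy one-to-one matching: each ground-truth box is consumed by at
    # most one prediction, so duplicate detections count against precision.
    matched_gt = set()
    tp = 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched_gt and iou(p, g) >= thresh:
                matched_gt.add(i)
                tp += 1
                break
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

The one-to-one constraint is what makes the metric penalize both missed words (recall) and spurious or duplicated boxes (precision).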

The evaluation of each task was carefully designed to ensure robust assessment of participants' methods, leveraging established metrics such as f-measure and classification accuracy. This methodological rigor provides a reliable benchmark for current and future methodologies in the field.
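For the end-to-end task, a common protocol (used here as an illustrative assumption rather than the official evaluation script) counts a prediction as correct only when it both localizes a ground-truth word and reproduces its transcription exactly:

```python
def e2e_fmeasure(preds, gts, thresh=0.5):
    # preds/gts: lists of (box, text) with axis-aligned boxes (x1, y1, x2, y2).
    # A prediction scores only if it both localizes a word (IoU >= thresh)
    # and transcribes it exactly -- a simplified end-to-end protocol.
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        ua = ((a[2] - a[0]) * (a[3] - a[1])
              + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / ua if ua else 0.0

    used, tp = set(), 0
    for pbox, ptext in preds:
        for i, (gbox, gtext) in enumerate(gts):
            if i not in used and iou(pbox, gbox) >= thresh and ptext == gtext:
                used.add(i)
                tp += 1
                break
    p = tp / len(preds) if preds else 0.0
    r = tp / len(gts) if gts else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Coupling localization and transcription in a single true-positive test is what makes end-to-end scores much lower than detection-only scores on the same images.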

Participants and Results

The competition attracted significant participation with a total of 60 submissions across the tasks. Notably, methods based on R-CNN derivatives, such as Mask-RCNN, and integration with deep learning architectures like ResNet, VGG16, and Seq2Seq dominated the top ranks. The results emphasize the effectiveness of combining detection and recognition pipelines with modern neural architectures for scene text interpretation.

Across all tasks, the "Tencent-DPPR Team" frequently emerged as a leader, reflecting the competitive edge of their ensemble approach, which fused multiple recognition models with robust statistical analyses for script identification.

Implications and Future Directions

The insights gained from RRC-MLT-2019 underscore the growing complexity of, and need for advances in, multi-lingual text recognition systems. The inclusion of an end-to-end recognition task and the provision of synthetic data mark significant steps towards versatile OCR systems capable of handling multi-script challenges in dynamic, real-world scenarios.

Future research should expand upon this framework by integrating more languages and scripts, enhancing dataset diversity and complexity. Furthermore, evolving the evaluation protocols to handle unfocused or distorted text could enhance the robustness of future models. This paper lays the groundwork for such advancement by providing established benchmarks and opening new avenues for research in the field of robust, multi-lingual scene text recognition.

Authors (11)
  1. Nibal Nayef
  2. Yash Patel
  3. Pinaki Nath Chowdhury
  4. Dimosthenis Karatzas
  5. Wafa Khlif
  6. Jiri Matas
  7. Umapada Pal
  8. Jean-Christophe Burie
  9. Jean-Marc Ogier
  10. Cheng-Lin Liu
  11. Michal Busta
Citations (227)