An Overview of the ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition
The paper presents an extensive analysis of the ICDAR2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition (RRC-MLT-2019). The challenge extends the 2017 edition, broadening its scope with an additional language and a new end-to-end recognition task. Its main objective was to benchmark and advance the state of the art in multi-lingual text detection and recognition by leveraging both real and synthetic datasets.
Dataset Overview
The paper introduces a comprehensive dataset comprising 20,000 real images featuring text from ten languages: Arabic, Bangla, Chinese, Devanagari, English, French, German, Italian, Japanese, and Korean. To aid in training, a large-scale synthetic dataset of 277,000 images is also provided. This dual structure targets the training needs of robust multi-lingual scene text systems. The real dataset is evenly split between training and testing, with annotations at the word level, while the synthetic data is designed to complement the real dataset, specifically aiding the newly introduced recognition task.
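As a concrete illustration of the word-level annotations, the sketch below parses one line of an MLT-style ground-truth file. The comma-separated field layout (four corner points, a script label, a transcription) is an assumption based on the publicly released RRC-MLT format, not something this summary specifies.

```python
from dataclasses import dataclass


@dataclass
class WordAnnotation:
    points: list[tuple[float, float]]  # four corners of the word quadrilateral
    script: str                        # e.g. "Latin", "Arabic", "Bangla"
    transcription: str                 # ground-truth text; "###" typically marks unreadable words


def parse_gt_line(line: str) -> WordAnnotation:
    """Parse one annotation line, assumed to follow the public RRC-MLT layout:
    x1,y1,x2,y2,x3,y3,x4,y4,script,transcription
    """
    parts = line.rstrip("\n").split(",", 9)  # the transcription itself may contain commas
    coords = [float(v) for v in parts[:8]]
    points = list(zip(coords[0::2], coords[1::2]))
    return WordAnnotation(points=points, script=parts[8], transcription=parts[9])
```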
Challenge Tasks and Evaluation
The challenge consisted of four tasks:
- Multi-lingual Text Detection: Focused on accurately localizing text at the word level. Evaluation used the f-measure derived from precision and recall (see the scoring sketch after this list).
- Cropped Word Script Identification: A classification task in which the aim was to determine the script of a cropped word image from a set of eight distinct scripts. Classification accuracy was the evaluation metric.
- Joint Text Detection and Script Identification: This task combined detection and script classification, relying on both accurate localization and correct script identification to score a match.
- End-to-End Text Detection and Recognition: A challenging task where systems needed to both detect and recognize text, integrating the capabilities exercised by the previous tasks.
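To make the detection metric concrete, here is a minimal scoring sketch assuming axis-aligned boxes and greedy one-to-one matching at an IoU threshold of 0.5. The official protocol matches quadrilateral regions, so this is an illustrative simplification rather than the challenge's exact procedure.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
             + (box_b[2] - box_b[0]) * (box_b[3] - box_b[1]) - inter)
    return inter / union if union > 0 else 0.0


def detection_fmeasure(detections, ground_truths, iou_thresh=0.5):
    """Score detections against ground truth with greedy one-to-one matching."""
    matched = set()
    true_pos = 0
    for det in detections:
        best_j, best_score = -1, iou_thresh
        for j, gt in enumerate(ground_truths):
            if j in matched:
                continue
            score = iou(det, gt)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j >= 0:
            matched.add(best_j)
            true_pos += 1
    precision = true_pos / len(detections) if detections else 0.0
    recall = true_pos / len(ground_truths) if ground_truths else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

Precision is the fraction of detections that match a ground-truth word, recall is the fraction of ground-truth words that are found, and the f-measure is their harmonic mean.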
The evaluation of each task relied on established metrics such as f-measure and classification accuracy, chosen to give a robust assessment of participants' methods and a reliable benchmark for current and future work in the field.
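Building on the sketch above, an end-to-end prediction can be scored by requiring both a localization match and a correct transcription. The case-insensitive string comparison used here is an assumption for illustration; the official protocol defines its own text-normalization rules.

```python
def end_to_end_match(pred_box, pred_text, gt_box, gt_text, iou_thresh=0.5):
    """A prediction counts only if localization AND transcription both succeed.

    Reuses iou() from the detection sketch above; exact-match comparison after
    lowercasing is an assumed normalization, not the challenge's official rule.
    """
    localized = iou(pred_box, gt_box) >= iou_thresh
    recognized = pred_text.strip().lower() == gt_text.strip().lower()
    return localized and recognized
```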
Participants and Results
The competition attracted significant participation, with a total of 60 submissions across the tasks. Notably, methods built on R-CNN derivatives such as Mask R-CNN, combined with backbones like ResNet and VGG16 and sequence-to-sequence (Seq2Seq) recognition models, dominated the top ranks. The results underline the effectiveness of coupling detection and recognition pipelines with modern neural architectures for scene text interpretation.
Across all tasks, the "Tencent-DPPR Team" frequently emerged as a leader, reflecting the competitive edge of their ensemble approach, which fused multiple recognition models with robust statistical analyses for script identification.
Implications and Future Directions
The insights gained from RRC-MLT-2019 underscore both the growing complexity of multi-lingual scene text recognition and the need for continued advances. The inclusion of an end-to-end recognition task and the provision of synthetic training data mark significant steps toward versatile OCR systems capable of handling multi-script challenges in dynamic, real-world scenarios.
Future research should build on this framework by integrating more languages and scripts, thereby increasing dataset diversity and complexity. Evolving the evaluation protocols to handle unfocused or distorted text could further improve the robustness of future models. The paper lays the groundwork for such advances by providing established benchmarks and opening new avenues for research in robust, multi-lingual scene text recognition.