Revisiting Scene Text Recognition: A Data Perspective (2307.08723v2)

Published 17 Jul 2023 in cs.CV

Abstract: This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation, whereby only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, we argue that this is primarily due to the less challenging nature of the common benchmarks, thus concealing the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models can only achieve an average accuracy of 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR models in real-world scenarios and lead to state-of-the-art performance.

The paper reassesses the state of scene text recognition (STR) from a data-oriented standpoint. The authors argue that progress measured on well-established benchmarks may not reflect how STR models perform in real-world applications.

Examination of Saturation in Common Benchmarks

The authors first examine performance saturation on six commonly used STR benchmarks. Across 7,672 test images, only 2.91% could not be recognized correctly by an ensemble of 13 widely adopted STR models. The authors take this as a sign that these benchmarks are no longer challenging enough to drive further innovation, and that their low difficulty could lead to the premature conclusion that STR is solved.
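The saturation figure above can be read as the share of images that every model in the ensemble gets wrong. The following is a minimal sketch of that computation, not the authors' evaluation code: the function name, the toy data, and the use of exact string matching are assumptions made for illustration (the paper's matching rules, such as case handling, are not stated here).

```python
from typing import Dict, List


def unsolved_fraction(labels: List[str], predictions: Dict[str, List[str]]) -> float:
    """Fraction of images that no model in the ensemble transcribes correctly."""
    if not labels:
        raise ValueError("labels must not be empty")
    unsolved = 0
    for i, truth in enumerate(labels):
        # An image counts as solved if at least one model matches the label.
        # Exact string equality is an assumption of this sketch.
        if not any(preds[i] == truth for preds in predictions.values()):
            unsolved += 1
    return unsolved / len(labels)


# Toy data: three hypothetical models, four images.
labels = ["exit", "cafe", "open", "stop"]
predictions = {
    "model_a": ["exit", "cafe", "0pen", "st0p"],
    "model_b": ["exit", "cale", "open", "slop"],
    "model_c": ["exlt", "cafe", "open", "shop"],
}
toy_rate = unsolved_fraction(labels, predictions)  # only "stop" is missed by all

# Back-of-the-envelope count implied by the figures quoted above.
approx_unsolved_images = round(0.0291 * 7672)
```

Under this reading, 2.91% of 7,672 images corresponds to roughly 223 images that none of the 13 models recognizes.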

Introduction of Union14M Dataset

To address the inadequacies of existing benchmarks, the paper introduces the Union14M dataset. This large-scale, real-world dataset comprises 4 million labeled and 10 million unlabeled images. Union14M is intended to better capture the complexity and variability of real-world scene text, and so to offer a more rigorous evaluation and training ground for STR models. On the 4 million labeled images, the same 13 STR models achieve an average accuracy of only 66.53%, which shows how much difficulty these models still have with real-world data.
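As a rough illustration of how an average accuracy like the one above could be obtained, the sketch below computes word-level accuracy per model and then takes the unweighted mean over models. That the reported figure is such a mean of per-model accuracies is an assumption, as are the function names and the toy predictions.

```python
from statistics import mean
from typing import Dict, List


def word_accuracy(labels: List[str], preds: List[str]) -> float:
    """Share of images whose predicted string equals the label exactly."""
    if not labels or len(labels) != len(preds):
        raise ValueError("labels and preds must be non-empty and equally long")
    correct = sum(p == t for p, t in zip(preds, labels))
    return correct / len(labels)


def mean_model_accuracy(labels: List[str], predictions: Dict[str, List[str]]) -> float:
    """Unweighted mean of per-model word accuracies (assumed averaging scheme)."""
    return mean(word_accuracy(labels, preds) for preds in predictions.values())


# Toy data: two hypothetical models evaluated on four labeled images.
labels = ["hotel", "pizza", "taxi", "bank"]
predictions = {
    "model_a": ["hotel", "pizza", "taxi", "bonk"],   # 3 of 4 correct
    "model_b": ["hotel", "pizzo", "taxi", "bark"],   # 2 of 4 correct
}
average = mean_model_accuracy(labels, predictions)
```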

Identified Challenges and Benchmarking

The paper identifies seven open challenges for STR models, derived from the models' error patterns: curve text, multi-oriented text, contextless text, artistic text, multi-words text, salient text, and incomplete text. Standard benchmarks do not cover these cases adequately. The authors therefore propose a challenge-driven benchmark of eight curated subsets of real-world text, so that research can target each difficulty directly.

Exploiting Unlabeled Data for Model Enhancement

The paper also explores the use of unlabeled images through self-supervised pre-training. By pre-training a Vision Transformer model on the 10 million unlabeled images of Union14M, the authors demonstrate considerable improvements in robustness and achieve state-of-the-art results on the more challenging real-world evaluations. This suggests that self-supervised learning on unlabeled data is a practical route to better STR models.
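The summary above does not say which self-supervised objective is used. One common choice for Vision Transformers is masked image modeling, where a large share of image patches is hidden and the model learns to reconstruct them. The sketch below shows only the patch-masking step, and every specific in it is an assumption for illustration: the function name, the 32x128 crop size, the 4x4 patch size, and the 75% mask ratio.

```python
import random
from typing import List, Tuple


def split_patches(num_patches: int, mask_ratio: float, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly split patch indices into (visible, masked) for one image."""
    if num_patches <= 0:
        raise ValueError("num_patches must be positive")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)          # seeded for reproducibility
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])    # hidden from the encoder
    visible = sorted(order[num_masked:])   # the only patches the encoder sees
    return visible, masked


# Hypothetical setup: a 32x128 text crop cut into 4x4 patches gives 8 * 32 = 256 patches.
num_patches = (32 // 4) * (128 // 4)
visible, masked = split_patches(num_patches, mask_ratio=0.75)
```

In a full pipeline, an encoder would process only the visible patches and a lightweight decoder would predict the pixels of the masked ones. No labels are needed, which is what makes the unlabeled images usable for pre-training.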

Implications and Future Directions

The analysis exposes the limitations of current STR models and shows the need for richer, more diverse datasets that mirror real-world complexity. The findings also argue for revising benchmarking practices, so that reported gains reflect genuine progress rather than improvements on saturated test sets. In addition, the success of self-supervised pre-training points to a promising research direction: using unannotated data to improve model robustness and applicability.

Overall, the paper encourages the development of more resilient STR models and invites the community to move beyond the traditional benchmarks by adopting new datasets and self-supervised methods for real-world scene text recognition.

Authors (5)
  1. Qing Jiang (30 papers)
  2. Jiapeng Wang (22 papers)
  3. Dezhi Peng (21 papers)
  4. Chongyu Liu (12 papers)
  5. Lianwen Jin (116 papers)
Citations (32)