Multi-Granularity Prediction for Scene Text Recognition (2209.03592v2)

Published 8 Sep 2022 in cs.CV

Abstract: Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR.

Multi-Granularity Prediction for Scene Text Recognition

Scene Text Recognition (STR) remains a difficult problem in computer vision. The paper "Multi-Granularity Prediction for Scene Text Recognition" by Peng Wang et al. builds an STR model on the Vision Transformer (ViT) and introduces linguistic knowledge implicitly through subword tokenization. The method outperforms previous state-of-the-art models, both pure vision models and language-augmented ones, and it simplifies the integration of linguistic knowledge by avoiding an independent language model.

Methodological Insights

The core proposition of this research is a vision-based STR model that builds on ViT and adds a specially designed Adaptive Addressing and Aggregation (A$^3$) module. This model is already a strong baseline: it exceeds the performance of preceding methods that rely solely on vision features. The main novelty is the Multi-Granularity Prediction (MGP) strategy. By adding subword representations, namely Byte-Pair Encoding (BPE) and WordPiece, to the output space alongside characters, MGP-STR models linguistic information implicitly, without the overhead of an explicit language model.
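To make the idea of multiple output granularities concrete, the following minimal sketch shows how one ground-truth word could be turned into a character-level target and a subword-level target. It is an illustration only, not the authors' code: the function names, the `##` continuation prefix, and `TOY_VOCAB` are assumptions made for this example, and the greedy longest-match segmentation stands in for a real WordPiece tokenizer (BPE would instead apply learned merge rules).

```python
def char_targets(word):
    """Character-level target: one symbol per character."""
    return list(word)


def greedy_subword_targets(word, vocab, unk="[UNK]"):
    """WordPiece-style greedy longest-match-first segmentation.

    Pieces that continue a word carry a '##' prefix. If some part of the
    word cannot be matched, the whole word maps to the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, invented for this illustration.
TOY_VOCAB = {"coffee", "co", "##ffee", "shop", "##shop", "##s"}

word = "coffeeshop"
print(char_targets(word))                       # ten single-character targets
print(greedy_subword_targets(word, TOY_VOCAB))  # two subword targets
```

The same image therefore supervises a long character sequence and a much shorter subword sequence, and the subword targets are where word-level regularities can enter the model without a separate language model.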

Key Components

Vision Transformer Backbone: ViT serves as the backbone and extracts image features by processing image patches directly, which preserves both spatial structure and sequence order. The model uses smaller patches than are typical for ViT, which suits the fine detail of text images.
Adaptive Addressing and Aggregation (A$^3$) Module: The A$^3$ module uses spatial attention to selectively aggregate ViT tokens, so that each output position captures the information of one character (or subword).
Multi-Granularity Prediction: Rather than predicting characters only, the model outputs predictions at several granularities: the character level plus two subword levels (BPE and WordPiece). This enriches the model's linguistic knowledge and makes it more robust to low-quality text images.
Fusion Strategy: The method adopts a decision-level fusion strategy based on prediction confidences (mean and cumulative product), enhancing accuracy by integrating multi-granular outputs.
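The aggregation and fusion steps in the list above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: `aggregate_tokens` takes the attention scores as given (in the real module they come from learned layers, which are omitted here), and `fuse_predictions` assumes that each granularity is scored by the mean or the cumulative product of its per-token confidences and that the highest-scoring prediction is kept. All names and numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_tokens(token_features, position_scores):
    """Spatial-attention pooling of ViT tokens.

    token_features: N tokens, each a list of D floats.
    position_scores: T output positions, each a list of N raw scores.
    Returns T pooled feature vectors of length D.
    """
    dim = len(token_features[0])
    pooled = []
    for scores in position_scores:
        weights = softmax(scores)  # attention over all N tokens
        pooled.append([
            sum(w * tok[d] for w, tok in zip(weights, token_features))
            for d in range(dim)
        ])
    return pooled


def fuse_predictions(predictions, mode="cumprod"):
    """Decision-level fusion over granularities.

    predictions: name -> (text, non-empty list of per-token confidences).
    Returns (name, text, score) of the highest-scoring prediction.
    """
    def score(confidences):
        if mode == "mean":
            return sum(confidences) / len(confidences)
        if mode == "cumprod":
            return math.prod(confidences)
        raise ValueError(f"unknown fusion mode: {mode}")

    best = max(predictions, key=lambda name: score(predictions[name][1]))
    text, confidences = predictions[best]
    return best, text, score(confidences)


# Made-up outputs for a blurry image of the word "coffee".
example = {
    "char": ("c0ffee", [0.9, 0.4, 0.9, 0.9, 0.9, 0.9]),
    "bpe": ("coffee", [0.95, 0.9]),
    "wordpiece": ("coffee", [0.85]),
}
print(fuse_predictions(example, mode="cumprod"))
print(fuse_predictions(example, mode="mean"))
```

In this made-up case the character head misreads one blurry glyph with low confidence, so both scoring modes prefer a subword head. Note that the cumulative product penalizes long sequences more strongly than the mean does, which is one reason the choice of scoring function matters.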

Evaluation and Results

The model was evaluated on standard STR benchmarks, where it achieves an average recognition accuracy of 93.35%. Compared with established STR methods such as ABINet and SRN, MGP-STR performs better across a variety of datasets, covering both regular and irregular text. The paper also reports several configurations built on different ViT backbones, which shows how the approach trades accuracy against computational cost.

Implications and Future Directions

This work matters for STR because it integrates linguistic cues without the complexity of a dedicated language model. It also highlights the potential of Transformer-based architectures for text recognition, beyond their original natural language processing applications.

Future developments could explore the extension of multi-granular strategies to other areas in visual recognition tasks, as well as refining token aggregation techniques to optimize both efficiency and recognition capability. The adaptability of this method may also lead to breakthroughs in text recognition in more diverse and challenging environments, potentially influencing areas such as augmented reality and autonomous systems.

In conclusion, "Multi-Granularity Prediction for Scene Text Recognition" presents a well-founded and effective approach that advances both the understanding and the practical application of STR, making it a valuable contribution to the field.

Authors (3)

Peng Wang (831 papers)
Cheng Da (7 papers)
Cong Yao (70 papers)

Citations (44)
