Edit Probability for Scene Text Recognition (1805.03384v1)

Published 9 May 2018 in cs.CV

Abstract: We consider the scene text recognition problem under the attention-based encoder-decoder framework, which is the state of the art. The existing methods usually employ a frame-wise maximal likelihood loss to optimize the models. When we train the model, the misalignment between the ground truth strings and the attention's output sequences of probability distribution, which is caused by missing or superfluous characters, will confuse and mislead the training process, and consequently make the training costly and degrade the recognition accuracy. To handle this problem, we propose a novel method called edit probability (EP) for scene text recognition. EP tries to effectively estimate the probability of generating a string from the output sequence of probability distribution conditioned on the input image, while considering the possible occurrences of missing/superfluous characters. The advantage lies in that the training process can focus on the missing, superfluous and unrecognized characters, and thus the impact of the misalignment problem can be alleviated or even overcome. We conduct extensive experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets. Experimental results show that the EP can substantially boost scene text recognition performance.

Authors (5)

Fan Bai
Zhanzhan Cheng
Yi Niu
Shiliang Pu
Shuigeng Zhou

Citations (161)


Summary

Edit Probability for Scene Text Recognition: A Technical Overview

In the paper titled "Edit Probability for Scene Text Recognition," the authors tackle a prevalent challenge in the field of scene text recognition, particularly within the attention-based encoder-decoder framework. The core issue identified is the misalignment between ground-truth sequences and attention outputs, which disrupts model training by introducing errors due to missing or superfluous characters. The authors propose a novel method called Edit Probability (EP) designed to address this issue and enhance recognition performance.

Motivation and Problem Statement

Scene text recognition is a core computer vision task, driven by applications that must read text in natural environments. Existing attention-based methods are typically trained with a frame-wise maximum likelihood loss, which compares the output at each decoding step with the ground-truth character at the same position. This loss handles misalignment poorly: when the decoder omits a character or emits a superfluous one, the outputs that follow are compared against the wrong ground-truth characters, even if they are otherwise correct.
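The misalignment effect can be illustrated with a small sketch. The helper names (`peaked`, `framewise_nll`), the three-letter alphabet, and the probability values are assumptions made for illustration and do not come from the paper. The sketch compares a frame-wise negative log-likelihood for a well-aligned output sequence with one where a single superfluous character shifts every later frame.

```python
import math

ALPHABET = "act"  # hypothetical toy alphabet

def peaked(ch, p=0.9):
    """Distribution over ALPHABET that puts probability p on ch (illustrative)."""
    rest = (1.0 - p) / (len(ALPHABET) - 1)
    return {c: (p if c == ch else rest) for c in ALPHABET}

def framewise_nll(target, char_probs, eps=1e-12):
    """Sum of -log p_t(target[t]): frame t is always scored against target[t]."""
    return -sum(
        math.log(max(char_probs[t][c], eps)) for t, c in enumerate(target)
    )

target = "cat"
# Decoder output that lines up with the ground truth.
aligned = [peaked("c"), peaked("a"), peaked("t")]
# Same reading, but with one superfluous "c" at the second step.
shifted = [peaked("c"), peaked("c"), peaked("a"), peaked("t")]

loss_aligned = framewise_nll(target, aligned)
loss_shifted = framewise_nll(target, shifted)
print(loss_aligned, loss_shifted)
```

In the shifted case the frames for "a" and "t" are confident and correct, yet they are penalized because they are scored against the wrong positions. This is the kind of training signal the paper describes as confusing and misleading.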

The paper cites prior work, such as the Focusing Attention Network (FAN) introduced by Cheng et al., which addressed attention drift but at the cost of requiring additional pixel-wise supervision. This limitation prompted the need for an alternative approach that minimizes training costs and improves accuracy without additional annotation burdens.

Introducing Edit Probability (EP)

The authors introduce EP as a solution to better estimate the probability of generating strings from output sequences by explicitly accounting for possible misalignments. The EP framework comprises several innovative components:

EP-Based Attention Decoder: The decoder is extended so that, in addition to character distributions, it predicts the probabilities that a character is missing or superfluous.
Edit Operations: Three operations (Consumption, Insertion, and Deletion) are defined, each with an associated probability that contributes to the overall likelihood of generating the target sequence from the input image.
Dynamic Programming Approach: The paper describes a dynamic programming strategy that computes the edit probability efficiently, avoiding the cost of enumerating every possible edit path.
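The components above can be sketched as a forward dynamic program. This is a simplified illustration, not the paper's exact formulation: the function name `edit_probability`, the per-frame `(consume, insert, delete)` triples, the separate insertion distribution `ins_probs`, and the rule that insertions happen only before a frame is processed are all assumptions made here. The table entry `ep[i][j]` accumulates the probability of producing the first `j` target characters after processing the first `i` output frames, summed over all edit paths.

```python
def edit_probability(target, char_probs, ins_probs, op_probs):
    """Sum over all edit paths of the probability of producing `target`.

    char_probs[i][c]: probability that frame i emits character c (Consumption).
    ins_probs[i][c]:  probability of inserting c before frame i (Insertion).
    op_probs[i]:      (consume, insert, delete) probabilities for frame i.
    All three inputs are hypothetical stand-ins for decoder outputs.
    """
    n, m = len(char_probs), len(target)
    ep = [[0.0] * (m + 1) for _ in range(n + 1)]
    ep[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            mass = ep[i][j]
            if mass == 0.0 or i == n:
                continue  # nothing to propagate, or no frames left
            consume, insert, delete = op_probs[i]
            # Deletion: frame i is superfluous and is dropped.
            ep[i + 1][j] += mass * delete
            if j < m:
                c = target[j]
                # Consumption: frame i emits the next target character.
                ep[i + 1][j + 1] += mass * consume * char_probs[i][c]
                # Insertion: target character c was missed; add it
                # without using up frame i.
                ep[i][j + 1] += mass * insert * ins_probs[i][c]
    return ep[n][m]

# Two frames, target "a": either frame may be the superfluous one.
frames = [{"a": 1.0, "b": 0.0}, {"a": 0.2, "b": 0.8}]
ops = [(0.5, 0.0, 0.5), (0.5, 0.0, 0.5)]
p = edit_probability("a", frames, frames, ops)
```

The table has (n + 1) x (m + 1) entries and each is updated a constant number of times, so the cost is proportional to the product of the number of frames and the target length rather than to the number of edit paths. A training loss could then be taken as the negative logarithm of the returned value, again as an assumption of this sketch.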

Experimental Evaluation

The EP method was evaluated on standard benchmarks, including IIIT-5K, Street View Text (SVT), and the ICDAR datasets, and showed substantial improvements in recognition accuracy. The gains were clear when EP was compared with baseline models trained using frame-wise probability estimation. Notably, the EP-enhanced models performed well in lexicon-free settings and remained strong when a lexicon was used.

Key findings include:

Enhanced accuracy on major datasets, often outperforming the state-of-the-art FAN model without requiring additional pixel-wise supervision.
Significant accuracy improvements attributed to focusing the learning process on resolving character discrepancies.

Implications and Future Work

The implications of integrating EP in scene text recognition are broad. The method holds potential for improving other sequence-based problems like speech recognition and machine translation, given its ability to manage sequence misalignments robustly. The approach could scale to multi-modal scenarios, potentially benefiting applications in video understanding and augmented reality.

Future research directions could explore fine-tuning EP for specific languages or writing systems, optimizing computational efficiency further, or enhancing robustness in noisy environments typical in street-level imagery.

The authors acknowledge partial support from the Science and Technology Innovation Action Program of the STCSM.
