An Overview of ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification
The paper "ESIR: End-to-end Scene Text Recognition via Iterative Image Rectification" addresses challenges inherent to automated scene text recognition, particularly those resulting from arbitrary perspective distortions and complex text curvatures. Scene text recognition is pivotal in various applications, such as navigation and content-based image retrieval, yet it remains hindered by variations in text appearance, background complexity, and imaging artifacts. Contemporary deep learning models have significantly advanced the field but still struggle with texts affected by extreme perspective or curvature distortions.
The authors present ESIR, a novel end-to-end trainable system that iteratively rectifies perspective distortion and text curvature to improve recognition performance. A dedicated rectification network employs a line-fitting transformation to estimate the pose of each text line, coupled with an iterative rectification process that progressively corrects distortions towards a fronto-parallel view. The system is designed to be robust to parameter initialization and requires only scene text images with word-level annotations for training.
Key Methodological Contributions
- Line-Fitting Transformation: The paper introduces a line-fitting transformation that combines a polynomial modeling the middle line of the scene text with a set of line segments that capture orientation and boundary information. The transformation is robust and flexible, modeling the poses of both straight and curved text lines and thereby enabling effective distortion correction (a short numerical sketch follows this list).
- Iterative Rectification Pipeline: ESIR corrects distortions progressively rather than in a single step. Unlike traditional methods that apply one correction, the pipeline refines the estimated rectification over several iterations, driven by recognition feedback during end-to-end training. This iterative design improves correction accuracy, reduces boundary effects, and preserves image clarity by consistently referencing the original image when estimating the distortion.
- Recognition Network: The recognition network is a sequence-to-sequence model with attention. Bidirectional LSTM layers encode the visual feature sequence extracted from the rectified image, and a Luong-style attention decoder generates the output character sequence (a minimal model sketch also follows this list).
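To make the line-fitting transformation concrete, the sketch below shows how a polynomial middle line, together with per-segment orientations and lengths, could be converted into boundary control points for a spatial transform such as a thin-plate spline. The function name, the normalized coordinate convention, and the parameter shapes are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of the line-fitting idea (illustrative, not the authors' code).
# Assumed setup: a rectification head predicts K+1 polynomial coefficients for
# the text middle line plus, for each of L line segments, an orientation and a
# half-length. These yield 2*L control points on the upper and lower text
# boundaries, which a spatial transform (e.g. a thin-plate spline) can then use.
# In ESIR these pose parameters are refined over several iterations rather than
# estimated in a single shot.
import numpy as np

def control_points_from_line_fit(poly_coeffs, angles, lengths, width=1.0):
    """poly_coeffs: (K+1,) coefficients a_0..a_K of y = sum_k a_k * x**k.
    angles, lengths: (L,) per-segment orientation (radians) and half-length.
    Returns a (2*L, 2) array of (x, y) control points in normalized coordinates."""
    num_segments = len(angles)
    xs = np.linspace(0.0, width, num_segments)     # sample L positions along x
    ys = np.polyval(poly_coeffs[::-1], xs)         # middle-line height at each x
    dx = lengths * np.sin(angles)                  # endpoint offset of each segment
    dy = lengths * np.cos(angles)
    upper = np.stack([xs - dx, ys - dy], axis=1)   # one endpoint of each segment
    lower = np.stack([xs + dx, ys + dy], axis=1)   # the opposite endpoint
    return np.concatenate([upper, lower], axis=0)

# Example: a gently curved middle line (quadratic) with vertical segments.
pts = control_points_from_line_fit(
    poly_coeffs=np.array([0.5, 0.1, -0.2]),        # a_0, a_1, a_2
    angles=np.zeros(10),
    lengths=np.full(10, 0.15))
print(pts.shape)  # (20, 2)
```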
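The recognition side can be pictured with the following minimal PyTorch sketch: a stacked bidirectional LSTM encodes a sequence of CNN features from the rectified image, and a Luong-style attention decoder emits one character per step. Layer sizes, the class count, and all names are assumptions for illustration and do not reproduce the paper's exact architecture.

```python
# Hypothetical recognizer sketch (assumed shapes and names, not the paper's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnRecognizer(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=97):
        super().__init__()
        # Two stacked BiLSTM layers encode the left-to-right visual feature sequence.
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.attn_proj = nn.Linear(2 * hidden, 2 * hidden)   # Luong "general" score
        self.decoder = nn.LSTMCell(num_classes + 2 * hidden, 2 * hidden)
        self.classifier = nn.Linear(2 * hidden, num_classes)
        self.num_classes = num_classes

    def forward(self, visual_feats, max_len=25):
        # visual_feats: (B, T, feat_dim) feature columns from the rectified image.
        enc, _ = self.encoder(visual_feats)                   # (B, T, 2*hidden)
        B = enc.size(0)
        h = enc.new_zeros(B, enc.size(2))
        c = enc.new_zeros(B, enc.size(2))
        prev = enc.new_zeros(B, self.num_classes)             # previous symbol (one-hot)
        logits = []
        for _ in range(max_len):
            # Luong-style attention: score encoder states against the decoder state.
            scores = torch.bmm(self.attn_proj(enc), h.unsqueeze(2)).squeeze(2)   # (B, T)
            context = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), enc).squeeze(1)
            h, c = self.decoder(torch.cat([prev, context], dim=1), (h, c))
            step = self.classifier(h)                         # (B, num_classes)
            logits.append(step)
            prev = F.one_hot(step.argmax(1), self.num_classes).float()
        return torch.stack(logits, dim=1)                     # (B, max_len, num_classes)

# Example: decode a batch of 2 feature sequences of length 32.
out = AttnRecognizer()(torch.randn(2, 32, 512))
print(out.shape)  # torch.Size([2, 25, 97])
```

At training time the previous ground-truth character would typically replace the greedy argmax (teacher forcing); greedy decoding is used here only to keep the sketch self-contained.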
Experimental Results and Implications
Experiments on public benchmarks, including ICDAR2013, ICDAR2015, IIIT5K, SVT, SVTP, and CUTE, demonstrate ESIR's competitive advantage, particularly on SVTP and CUTE, whose images exhibit heavy perspective distortion and curvature respectively. The iterative rectification corrects distorted text more effectively and yields higher recognition accuracy than prior state-of-the-art methods.
ESIR's approach could influence both practical applications and theoretical research. Practically, it improves the accuracy and robustness of real-world scene text recognition systems, particularly in environments with severe distortion. Theoretically, it points to iterative refinement as a promising direction in computer vision, one that could enrich model training dynamics and inspire similar iterative methodologies in other recognition tasks.
Future Directions
Looking ahead, the integration of ESIR with scene text detection models to create a fully optimized end-to-end reading system is proposed as a future development. Furthermore, exploration into refining the efficiency of the iterative process and extending this methodology to encompass detection tasks could yield significant advancements in automated text processing pipelines.
ESIR sets a precedent for approaching complex recognition tasks with iterative methodologies, showcasing substantial gains in scenarios where traditional approaches encounter limitations. As such, it represents a valuable contribution to the ongoing development of robust scene text recognition solutions in the field of computer vision.