Analysis of Scene Text Detection with Supervised Pyramid Context Network
The paper "Scene Text Detection with Supervised Pyramid Context Network" introduces an approach to two persistent challenges in scene text detection. The method, termed the Supervised Pyramid Context Network (SPCNET), aims to suppress false positives while remaining flexible enough to detect text of arbitrary shapes in natural scenes.
Problem Statement and Challenges
Detecting text in natural scenes is difficult because text varies widely in shape, color, font, orientation, and scale, and environmental factors such as lighting and occlusion further complicate the task. While previous deep learning methods have achieved substantial improvements, a significant drawback remains: a high rate of false positives in complex scenes. Applications such as autonomous driving demand precise text localization, so reducing false positives is critical. Furthermore, locating text of arbitrary shapes, including multi-oriented and curved instances, is only partially addressed by existing methods.
Proposed Method: SPCNET
SPCNET is inspired by contemporary instance segmentation techniques, specifically Mask R-CNN, and builds on a Feature Pyramid Network (FPN) backbone. The approach adds two components: the Text Context Module (TCM) and the Re-Score mechanism.
- Text Context Module (TCM): This module enriches feature extraction with both attention and context. Its Pyramid Attention Module sharpens the discriminative power of text features through semantic segmentation supervision, while its Pyramid Fusion Module merges the attention-enhanced features back into the detection pipeline, supplying richer instance-level context that helps suppress false positives.
- Re-Score Mechanism: This mechanism redefines the scoring of detected text instances by fusing the classification score with an instance score derived from the semantic segmentation output. The composite score mitigates the unreliability of classification scores for rotated or irregular text and yields more trustworthy detection confidence for such instances.
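The attention idea behind the TCM can be illustrated with a minimal sketch: a text-probability map produced by a segmentation branch re-weights a feature map so that text regions are emphasized before detection. The function name and the residual-style weighting below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pyramid_attention_fuse(feat, text_prob):
    """Illustrative TCM-style attention fusion (names assumed).

    feat:      (C, H, W) feature map from one FPN level
    text_prob: (H, W) text-probability map in [0, 1] from a
               semantic segmentation branch
    """
    # Residual-style weighting: text regions are amplified,
    # background features pass through unchanged.
    return feat * (1.0 + text_prob)
```

With a zero probability map the features are unchanged; with a saturated map, text-region responses are doubled, which is one simple way an attention branch can bias downstream detection toward text.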
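The Re-Score idea can likewise be sketched: fuse the classifier's confidence with the mean segmentation confidence inside the predicted instance mask. The geometric mean used here is one plausible fusion choice and the function names are assumptions; the paper's exact formula may differ.

```python
import numpy as np

def rescore(cls_score, seg_prob, mask):
    """Illustrative Re-Score-style fusion (names assumed).

    cls_score: classification score from the detection head
    seg_prob:  (H, W) semantic segmentation probability map
    mask:      (H, W) boolean mask of the predicted instance
    """
    # Instance score: average segmentation confidence over the mask.
    ins_score = seg_prob[mask].mean() if mask.any() else 0.0
    # Geometric mean penalizes instances the segmentation
    # branch does not support, even if the classifier is confident.
    return float(np.sqrt(cls_score * ins_score))
```

A box that the classifier rates highly but that covers little segmented text receives a low fused score, which is the mechanism by which re-scoring suppresses false positives.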
Experimental Evaluation
The paper presents a robust experimental evaluation across a suite of standard text detection benchmarks: ICDAR2013, ICDAR2015, ICDAR2017 MLT, and Total-Text. SPCNET demonstrated significant performance gains, achieving an F-measure of 92.1% on ICDAR2013, 87.2% on ICDAR2015, 74.1% on ICDAR2017 MLT, and 82.9% on Total-Text. These results reflect SPCNET’s capacity to outperform current state-of-the-art methods, not only in terms of precision but also in effectively minimizing false positives across varying types of text benchmarks, from horizontal to multi-lingual and curved text instances.
Implications and Future Directions
The advancements introduced by SPCNET have several implications. Practically, reducing false positives while enhancing detection accuracy can improve the reliability of systems that depend on scene text detection, such as navigation systems and automated content analysis. Theoretically, the integration of contextual cues via TCM, alongside dynamic re-scoring strategies, paves the way for improved models in object detection and segmentation tasks beyond text detection.
Looking forward, the authors intend to refine the re-scoring mechanism for tighter end-to-end training and to explore the method's application to other domains requiring orientation-invariant object detection, such as aerial imagery. Additionally, lightweight variants of the architecture could enable deployment on mobile and edge devices, broadening its applicability.
In conclusion, the research presented in this paper makes a valuable contribution to scene text detection by proposing a method that substantially reduces false positives and adapts to the diverse shapes and orientations of text in natural scenes.