Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes (1904.06535v1)

Published 13 Apr 2019 in cs.CV

Abstract: Previous scene text detection methods have progressed substantially over the past years. However, limited by the receptive field of CNNs and the simple representations like rectangle bounding box or quadrangle adopted to describe text, previous methods may fall short when dealing with more challenging text instances, such as extremely long text and arbitrarily shaped text. To address these two problems, we present a novel text detector named LOMO, which localizes the text progressively multiple times (or in other words, LOok More than Once). LOMO consists of a direct regressor (DR), an iterative refinement module (IRM) and a shape expression module (SEM). At first, text proposals in the form of quadrangles are generated by the DR branch. Next, the IRM progressively perceives the entire long text by iterative refinement based on the extracted feature blocks of preliminary proposals. Finally, an SEM is introduced to reconstruct a more precise representation of irregular text by considering the geometric properties of the text instance, including text region, text center line and border offsets. The state-of-the-art results on several public benchmarks including ICDAR2017-RCTW, SCUT-CTW1500, Total-Text, ICDAR2015 and ICDAR17-MLT confirm the striking robustness and effectiveness of LOMO.

Authors (7)

Chengquan Zhang (29 papers)
Borong Liang (5 papers)
Zuming Huang (5 papers)
Mengyi En (2 papers)
Junyu Han (53 papers)
Errui Ding (156 papers)
Xinghao Ding (66 papers)

Citations (225)

Summary

Overview of "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes"

The paper "Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes" addresses scene text detection for text that is extremely long or arbitrarily shaped. Earlier methods often fall short on such text for two reasons: the limited receptive field of Convolutional Neural Networks (CNNs), and simple text representations such as rectangular bounding boxes or quadrangles. The paper introduces the LOMO detector, which tackles both problems with three cooperating components: a Direct Regressor (DR), an Iterative Refinement Module (IRM), and a Shape Expression Module (SEM).

LOMO's core contribution is the combination of an iterative refinement process with a richer shape representation, which together improve the detection and localization of long and irregular text. Rather than localizing text in a single pass, LOMO localizes it progressively over several passes, loosely as a person might look at a scene more than once, hence the name "LOok More than Once."
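The idea of refining a proposal over several passes can be illustrated with a minimal Python sketch. This is not the paper's IRM: the names `refine_quad` and `make_toy_predictor` are hypothetical, and the toy predictor below simply moves each corner halfway toward a known target, standing in for a learned regressor that would predict corner offsets from image features.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Quad = List[Point]  # four corners, e.g. clockwise from top-left


def refine_quad(
    quad: Quad,
    predict_offsets: Callable[[Quad], List[Point]],
    iterations: int = 2,
) -> Quad:
    """Repeatedly apply a corner-offset predictor to a quadrangle."""
    current = list(quad)
    for _ in range(iterations):
        offsets = predict_offsets(current)
        # Shift every corner by its predicted (dx, dy).
        current = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(current, offsets)]
    return current


def make_toy_predictor(target: Quad) -> Callable[[Quad], List[Point]]:
    """Assumption: a stand-in for a learned model that closes half the gap per pass."""
    def predict(quad: Quad) -> List[Point]:
        return [((tx - x) * 0.5, (ty - y) * 0.5) for (x, y), (tx, ty) in zip(quad, target)]
    return predict


def max_corner_error(a: Quad, b: Quad) -> float:
    """Largest per-coordinate distance between corresponding corners."""
    return max(max(abs(ax - bx), abs(ay - by)) for (ax, ay), (bx, by) in zip(a, b))


# Illustrative data: a long text line whose first proposal covers only the middle.
target = [(0.0, 0.0), (100.0, 0.0), (100.0, 10.0), (0.0, 10.0)]
initial = [(20.0, 0.0), (60.0, 0.0), (60.0, 10.0), (20.0, 10.0)]
predictor = make_toy_predictor(target)

for n in (1, 2, 3):
    refined = refine_quad(initial, predictor, iterations=n)
    print(n, max_corner_error(refined, target))
```

With this toy predictor the corner error halves on each pass, which mirrors in a simplified way why several refinement steps can recover a long text line that a single pass only partly covers.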

Technical Contributions

Direct Regressor (DR): This branch produces the initial text proposals in the form of quadrangles, which the two modules below then refine.
Iterative Refinement Module (IRM): This component improves the initial localization by progressively refining the quadrangles output by the DR, using feature blocks extracted from the preliminary proposals. Because each pass works on the features of the current proposal, the IRM can perceive the full extent of a long text line that a single pass would only partly cover. The paper reports higher detection performance with multiple iterations, suggesting that the IRM mitigates the receptive field limitation of standard CNNs.
Shape Expression Module (SEM): Building on ideas from instance segmentation, the SEM reconstructs a more precise representation of irregular text. It considers three geometric properties of a text instance: the text region, the text center line, and the border offsets. From these it produces a polygon rather than a quadrangle, which suits curved and wavy text, and the paper reports substantial improvements on datasets dominated by such text.
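How a center line and border offsets can describe an irregular text shape is easier to see in a small Python sketch. The function names and the input layout are assumptions made for illustration, not the paper's implementation: each sampled center-line point carries one offset to the upper border and one to the lower border, and the polygon is formed by walking the upper border left to right and the lower border right to left.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def polygon_from_center_line(
    center_points: Sequence[Point],
    upper_offsets: Sequence[Point],
    lower_offsets: Sequence[Point],
) -> List[Point]:
    """Build a closed polygon from center-line samples and border offsets."""
    if not (len(center_points) == len(upper_offsets) == len(lower_offsets)):
        raise ValueError("each center point needs one upper and one lower offset")
    upper = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(center_points, upper_offsets)]
    lower = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(center_points, lower_offsets)]
    # Upper border left to right, then lower border right to left.
    return upper + lower[::-1]


def shoelace_area(polygon: Sequence[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Illustrative data: a straight, 20-pixel-long, 10-pixel-high text line
# (image coordinates, y grows downward, so "upper" offsets are negative in y).
centers = [(0.0, 5.0), (10.0, 5.0), (20.0, 5.0)]
ups = [(0.0, -5.0)] * 3
downs = [(0.0, 5.0)] * 3

polygon = polygon_from_center_line(centers, ups, downs)
print(polygon)
print(shoelace_area(polygon))
```

Sampling more center-line points, with offsets that vary along the line, lets the same construction follow curved text that a four-corner quadrangle cannot describe.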

Results and Comparisons

LOMO was evaluated on several established benchmarks, including ICDAR2017-RCTW and SCUT-CTW1500, which feature long and curved text, respectively, as well as Total-Text, ICDAR2015 and ICDAR17-MLT. The paper reports state-of-the-art results on these benchmarks. A notable result is a marked gain in Hmean (the harmonic mean of precision and recall) on ICDAR2017-RCTW, which supports LOMO's effectiveness on long text.
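For readers unfamiliar with the metric, Hmean is the harmonic mean of precision and recall, the same quantity as the F1 score. The Python sketch below computes it from detection counts. The function names are hypothetical and the counts are made-up illustrative numbers, not results from the paper; how a detection is matched to a ground-truth instance depends on each benchmark's protocol and is not modeled here.

```python
from typing import Tuple


def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def detection_scores(
    true_positives: int, num_detections: int, num_ground_truth: int
) -> Tuple[float, float, float]:
    """Return (precision, recall, hmean) from matched-detection counts."""
    precision = true_positives / num_detections if num_detections else 0.0
    recall = true_positives / num_ground_truth if num_ground_truth else 0.0
    return precision, recall, hmean(precision, recall)


# Made-up counts for illustration only.
p, r, h = detection_scores(true_positives=80, num_detections=100, num_ground_truth=160)
print(f"precision={p:.3f} recall={r:.3f} hmean={h:.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a detector cannot score a high Hmean by trading away recall for precision or the reverse.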

Implications and Future Directions

The LOMO framework is a step toward general-purpose text detection, addressing limitations of prior methods with respect to text length, orientation, and shape. Applications that depend on robust text detection, such as OCR systems, autonomous vehicle navigation, and augmented reality interfaces, stand to benefit.

Future work could integrate LOMO with end-to-end text recognition systems, so that detection and recognition are handled together in complex scenes. Extending the approach to multi-lingual text recognition within a single framework would be another step toward text-centric applications that work across languages. More broadly, the work provides a basis for further study of detectors that refine their output over several passes, in a way loosely comparable to human visual processing.
