Shape Robust Text Detection with Progressive Scale Expansion Network
The paper introduces the Progressive Scale Expansion Network (PSENet), a segmentation-based approach to text detection in natural scenes that targets text of arbitrary shape. The authors identify two main limitations of existing methods: detectors based on quadrangular bounding boxes cannot tightly enclose non-rectangular text, and segmentation-based detectors struggle to separate text instances that lie close together.
Methodology
PSENet addresses both limitations with progressive scale expansion. For each text instance, the network produces multiple predictions at different scales, ranging from small, kernel-like segments to the full text shape. Expanding the smallest prediction step by step through the larger ones lets PSENet separate closely packed text instances while still adapting to arbitrary text geometries.
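To make the idea of kernels at several scales concrete, the following is a minimal sketch of how a set of scale ratios and the matching shrink distances for one text polygon could be computed. It assumes a linear schedule of ratios between a minimum ratio `m` and 1, and a shrink distance of area × (1 − r²) / perimeter. Neither formula is stated in this summary, so both should be read as assumptions. The function names and the example values `n = 6` and `m = 0.5` are illustrative only.

```python
import math


def scale_ratios(n, m):
    """Ratios r_1..r_n rising linearly from m (smallest kernel) to 1.0 (full shape).

    Assumed schedule: r_i = 1 - (1 - m) * (n - i) / (n - 1).
    """
    if n == 1:
        return [1.0]
    return [1.0 - (1.0 - m) * (n - i) / (n - 1) for i in range(1, n + 1)]


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:] + points[:1]))


def shrink_offsets(points, n, m):
    """Distance by which the polygon boundary is moved inward for each kernel.

    Assumed rule: d_i = area * (1 - r_i ** 2) / perimeter.
    """
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return [area * (1.0 - r * r) / perimeter for r in scale_ratios(n, m)]


# Hypothetical text box: 100 pixels wide, 20 pixels high.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offsets = shrink_offsets(box, n=6, m=0.5)
```

The largest ratio is 1, so its offset is zero and the last kernel coincides with the full text region. Shrinking the polygon by each offset would need a polygon clipping library, which is left out of this sketch.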
The key components of the approach include:
- Multiple Scale Kernels: Instead of predicting a single segmentation mask per text instance, the network predicts several kernels, each a shrunken version of the full text region that keeps its shape and position. The training labels for these kernels are produced by shrinking the annotated text boundary to various scales. This gives fine control over instance boundaries and allows text with unusual shapes to be handled effectively.
- Progressive Expansion: Starting from the smallest kernel, whose connected components identify the individual text instances, each instance is grown step by step into the next larger kernel using an algorithm based on Breadth-First Search (BFS), until the full text shape is reached. A pixel contested by two instances is assigned to whichever instance reaches it first, so adjacent instances stay separate instead of merging.
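- Robustness and Effectiveness: Because the smallest kernels of neighboring instances are separated by wide margins, PSENet can tell the instances apart before any expansion takes place, avoiding the common pitfall of merging text instances too early. Moreover, the gradual change in scale between successive kernels is intended to give the model smooth supervision during learning, which improves precision.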
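The expansion step described in the list above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation. It assumes the kernels are already available as binary grids ordered from smallest to largest, uses 4-connectivity, and resolves contested pixels in favor of whichever instance reaches them first. The function names are hypothetical.

```python
from collections import deque

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity


def label_components(mask):
    """Give each 4-connected region of a binary grid its own label (1, 2, ...)."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and labels[sy][sx] == 0:
                count += 1
                labels[sy][sx] = count
                queue = deque([(sy, sx)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in NEIGHBOURS:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels


def expand(labels, kernel):
    """Grow the labelled regions into the next larger kernel by multi-source BFS."""
    h, w = len(kernel), len(kernel[0])
    out = [row[:] for row in labels]
    # Every already-labelled pixel is a BFS source.
    queue = deque((y, x) for y in range(h) for x in range(w) if out[y][x])
    while queue:
        y, x = queue.popleft()
        for dy, dx in NEIGHBOURS:
            ny, nx = y + dy, x + dx
            # An unlabelled pixel inside the kernel goes to the first region to reach it.
            if (0 <= ny < h and 0 <= nx < w
                    and kernel[ny][nx] and out[ny][nx] == 0):
                out[ny][nx] = out[y][x]
                queue.append((ny, nx))
    return out


def progressive_scale_expansion(kernels):
    """kernels: binary grids of equal size, ordered from smallest to largest."""
    labels = label_components(kernels[0])
    for kernel in kernels[1:]:
        labels = expand(labels, kernel)
    return labels


# Toy example: two text lines whose full-size masks touch each other.
smallest = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
largest = [[1] * 7 for _ in range(5)]
instances = progressive_scale_expansion([smallest, largest])
```

In this toy example the largest map is a single connected blob, so labelling it directly would yield one region. Expanding from the smallest kernels instead keeps two distinct labels while still covering every pixel of the largest map.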
Results and Contributions
PSENet reports state-of-the-art results on multiple benchmarks, including ICDAR 2015, ICDAR 2017 MLT, and SCUT-CTW1500, with particularly strong performance on datasets featuring curved text. Notably, on the SCUT-CTW1500 dataset, PSENet achieves an absolute improvement of 6.37% over the previous best result. The method also maintains a good balance between precision and recall relative to existing approaches.
Implications and Future Directions
From a practical standpoint, more accurate detection of irregular and densely packed text could benefit applications that must read text in varied environments, such as autonomous driving and augmented reality. Theoretically, this work suggests new directions for object detection problems involving irregularly shaped or densely packed instances.
Future research could explore end-to-end learning integration for the scale expansion process, potentially leading to performance enhancements and reduced computational costs. Additionally, adapting the progressive scale expansion methodology to other instance segmentation tasks may offer solutions to similar challenges in crowded object scenes.
In summary, PSENet is a notable advance in text detection, providing a robust framework for challenging scenarios that involve complex text shapes and closely spaced instances. The approach balances innovation with practicality and sets a precedent for future work on both text detection and broader instance-level segmentation tasks.