Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion (2202.10304v1)

Published 21 Feb 2022 in cs.CV

Abstract: Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.

The paper "Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion" proposes a framework for improving segmentation-based scene text detection. It introduces two components, a Differentiable Binarization (DB) module and an Adaptive Scale Fusion (ASF) module, which target the accuracy and efficiency limitations of existing approaches.

The primary focus of the paper is the DB module, which brings the binarization step, normally a separate post-processing operation, into the training of the segmentation network. Because standard binarization is a hard threshold and therefore not differentiable, the DB module replaces it with an approximate binarization function that is. The network can then be optimized jointly with the binarization step, which yields more accurate text detection, including for text of arbitrary shape or extreme aspect ratio. The key element is an adaptive threshold map, learned during training, that supplies a separate threshold for each image region, improving detection precision without increasing inference time.
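The summary above does not give the approximate binarization function. The sketch below assumes the commonly cited form, a steep sigmoid of the difference between the probability map P and the threshold map T, with an amplifying factor k. The value k = 50, the function names, and the toy maps are illustrative assumptions, not details confirmed by this page.

```python
import math


def hard_binarization(prob, thresh):
    """Standard step function: not differentiable at prob == thresh."""
    return 1.0 if prob >= thresh else 0.0


def differentiable_binarization(prob, thresh, k=50.0):
    """Smooth approximation of the step function.

    Assumed form: 1 / (1 + exp(-k * (P - T))).
    k controls steepness; k = 50 is an assumed default.
    """
    return 1.0 / (1.0 + math.exp(-k * (prob - thresh)))


def db_map(prob_map, thresh_map, k=50.0):
    """Apply the approximation pixel by pixel to two equally sized 2D maps."""
    return [
        [differentiable_binarization(p, t, k) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]


# Toy example: a 2x2 probability map with a per-pixel (adaptive) threshold map.
prob_map = [[0.9, 0.2], [0.6, 0.4]]
thresh_map = [[0.3, 0.3], [0.5, 0.5]]
approx = db_map(prob_map, thresh_map)
```

With a large k the output is close to the hard 0/1 decision wherever P and T differ clearly, yet it still has a usable gradient near P = T, which is what allows the threshold map to be trained together with the segmentation network.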

The ASF module improves the scale robustness of the segmentation network. Earlier methods typically fuse multi-scale feature maps directly, without adapting the fusion to text instances of different sizes. The ASF module instead uses a spatial attention mechanism that learns attention weights along both the stage-wise (per-scale) and spatial dimensions, so that each location can draw more on the scales that suit it. This leads to more effective fusion of multi-scale features and improved performance across diverse text datasets.
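The fusion step described above can be sketched in a simplified form. The code assumes the attention weights are already available as one weight per scale and per pixel; in the actual module they would be produced by learned layers, which are omitted here. The function name, the single-channel feature maps, and the toy values are hypothetical.

```python
def adaptive_scale_fusion(features, attention):
    """Weight each scale's feature map by its own spatial attention map.

    features:  list of N feature maps, each an H x W list of floats
               (single channel per scale, for simplicity).
    attention: list of N weight maps of the same shape, values in [0, 1];
               assumed to be given rather than learned in this sketch.
    Returns the N weighted maps, which a detector would then concatenate
    and pass on to its prediction head.
    """
    fused = []
    for feat, attn in zip(features, attention):
        weighted = [
            [a * f for f, a in zip(f_row, a_row)]
            for f_row, a_row in zip(feat, attn)
        ]
        fused.append(weighted)
    return fused


# Two scales, each a 1x2 feature map.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
# Scale 0 is trusted at the left pixel only; scale 1 is half-weighted everywhere.
attention = [[[1.0, 0.0]], [[0.5, 0.5]]]
fused = adaptive_scale_fusion(features, attention)
```

Setting every weight to 1.0 recovers direct fusion, which makes the contrast with non-adaptive multi-scale fusion explicit.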

Experimental results show that the proposed scene text detector, named DBNet++, achieves state-of-the-art accuracy and speed on five standard benchmarks, including MSRA-TD500, CTW1500, and Total-Text. It consistently surpasses previous methods on horizontal, multi-oriented, and curved text. In particular, it offers a strong trade-off between effectiveness and efficiency, improving both precision and recall relative to competing approaches.

These advances have both practical and theoretical implications. Practically, real-time scene text detection becomes feasible in applications that must handle varying text shapes and sizes, such as automated reading systems, augmented reality, and navigation aids. Theoretically, the paper shows how a key post-processing step can be integrated into the training of a neural network, which may influence future work on combining segmentation with post-processing.

Future work could build on this foundation for object detection in other domains, applying the same principles of differentiable post-processing and adaptive feature fusion. Differentiable binarization may also be useful beyond text detection, in general image processing tasks that involve irregular object boundaries. Further refinement of feature fusion techniques may bring additional gains in efficiency and accuracy for real-time applications.

In summary, this paper contributes significantly to the scene text detection field, offering refined techniques that effectively balance detection accuracy and operational efficiency. The integration of differentiable binarization and adaptive scale fusion provides a robust framework for real-time applications, paving the way for future innovations in text detection and related image processing domains.

Authors (5)

Minghui Liao (29 papers)
Zhisheng Zou (1 paper)
Zhaoyi Wan (9 papers)
Cong Yao (70 papers)
Xiang Bai (222 papers)

Citations (180)
