Real-time Scene Text Detection with Differentiable Binarization
(1911.08947v2)
Published 20 Nov 2019 in cs.CV
Abstract: Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: https://github.com/MhLiao/DB
The paper "Real-time Scene Text Detection with Differentiable Binarization" (Liao et al., 2019) addresses the challenge of efficiently and accurately detecting scene text, particularly arbitrary shapes like curved text. While segmentation-based methods excel at capturing complex shapes, they traditionally suffer from complex and slow post-processing steps involving binarization and grouping. The core contribution of this paper is the introduction of Differentiable Binarization (DB), a technique that makes the binarization process end-to-end trainable within a deep learning framework, significantly simplifying post-processing and improving performance and speed.
The standard binarization process converts a probability map $P$ (the output of a segmentation network) into a binary map $B$ using a fixed threshold $t$:

$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \geq t \\ 0 & \text{otherwise} \end{cases}$$
This step function is non-differentiable, preventing it from being optimized jointly with the segmentation network during training. DB replaces it with a differentiable, sigmoid-shaped approximation:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}}$$
Here, $\hat{B}$ is the approximate binary map, $P$ is the probability map, $T$ is a learned adaptive threshold map, and $k$ is an amplifying factor (empirically set to 50) that steepens the curve, making the approximation closer to the step function. This function is differentiable, allowing gradients to flow back through the binarization process to update the network weights, including those predicting the adaptive threshold map $T$. The authors show that this differentiability and the gradient amplification around the threshold facilitate training and lead to more distinctive predictions.
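As a concrete illustration, here is a minimal PyTorch sketch of the DB function (the official implementation is PyTorch-based, but the function names and the hard-binarization contrast below are ours):

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate binary map: B_hat = 1 / (1 + exp(-k * (P - T))).

    Equivalent to sigmoid(k * (P - T)); differentiable everywhere, so
    gradients reach both the probability and the threshold branches.
    """
    return torch.sigmoid(k * (prob_map - thresh_map))

def hard_binarization(prob_map: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Standard step-function binarization, shown for contrast: the
    comparison has zero gradient almost everywhere, so it cannot be
    trained through."""
    return (prob_map >= t).float()
```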
The proposed architecture consists of a feature-pyramid backbone (e.g., ResNet-18 or ResNet-50), followed by a neck that fuses features from different scales. The fused feature map $F$ is fed into two prediction heads: one predicting the probability map $P$ and the other predicting the threshold map $T$. The differentiable binarization module then combines $P$ and $T$ to produce the approximate binary map $\hat{B}$.
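A schematic sketch of the two heads in PyTorch follows. Each head here is a convolution followed by two transposed convolutions that upsample back toward input resolution; the channel counts and the `DBHead` name are illustrative assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

class DBHead(nn.Module):
    """Two parallel heads on the fused feature map F: one predicts the
    probability map P, the other the threshold map T; the DB module
    combines them into the approximate binary map."""
    def __init__(self, in_channels: int = 256, k: float = 50.0):
        super().__init__()
        self.k = k
        self.prob_head = self._make_head(in_channels)
        self.thresh_head = self._make_head(in_channels)

    @staticmethod
    def _make_head(c: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(c, c // 4, 3, padding=1, bias=False),
            nn.BatchNorm2d(c // 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c // 4, c // 4, 2, stride=2),  # 2x upsample
            nn.BatchNorm2d(c // 4),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c // 4, 1, 2, stride=2),       # 2x upsample
            nn.Sigmoid(),
        )

    def forward(self, fused: torch.Tensor):
        prob = self.prob_head(fused)
        thresh = self.thresh_head(fused)
        binary = torch.sigmoid(self.k * (prob - thresh))  # DB module
        return prob, thresh, binary
```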
For training, supervision is applied to the probability map $P$, the approximate binary map $\hat{B}$, and the threshold map $T$; $P$ and $\hat{B}$ share the same ground-truth labels. The loss is a weighted sum $L = L_s + \alpha L_b + \beta L_t$, where $L_s$ and $L_b$ are binary cross-entropy losses applied to $P$ and $\hat{B}$ respectively, with hard negative mining to handle class imbalance, and $L_t$ is an L1 loss applied to the predicted threshold map $T$ within a border region around the text instances. The weights are $\alpha = 1.0$ and $\beta = 10$.
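A simplified sketch of this loss follows; it substitutes plain BCE for the paper's hard-negative-mined BCE, and assumes a `thresh_mask` tensor marking the border region where $L_t$ applies (label generation is described in the next paragraph):

```python
import torch
import torch.nn.functional as F

def db_loss(prob, binary, thresh,                # network outputs
            gt_shrink, gt_thresh, thresh_mask,   # ground-truth labels
            alpha: float = 1.0, beta: float = 10.0) -> torch.Tensor:
    """Weighted sum L = Ls + alpha * Lb + beta * Lt (simplified: no
    hard negative mining)."""
    ls = F.binary_cross_entropy(prob, gt_shrink)
    lb = F.binary_cross_entropy(binary, gt_shrink)  # P and B_hat share labels
    # L1 loss on the threshold map, restricted to the border region.
    lt = (torch.abs(thresh - gt_thresh) * thresh_mask).sum() \
         / (thresh_mask.sum() + 1e-6)
    return ls + alpha * lb + beta * lt
```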
Label generation is crucial. For a text polygon $G$, the ground truth for the probability map is a shrunk polygon $G_s$. The shrinking offset $D$ is computed from the polygon's perimeter $L$ and area $A$ as $D = \frac{A(1 - r^2)}{L}$, where $r$ is the shrink ratio (0.4). The ground truth for the threshold map $T$ is generated in the border region between $G_s$ and a dilated polygon $G_d$, obtained by dilating $G$ with the same offset $D$. The threshold label within this border region is defined by the distance to the closest segment of the original polygon $G$.
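The polygon shrinking (and, with a positive offset, the dilation that yields $G_d$) can be implemented with the pyclipper library, a Python binding for a Vatti-style clipper, together with Shapely for area and perimeter; the helper below is a sketch and its name is ours:

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_polygon(points: np.ndarray, r: float = 0.4) -> list:
    """Shrink a text polygon G (an (N, 2) array of vertices) to the
    probability-map label G_s, using offset D = A * (1 - r^2) / L."""
    poly = Polygon(points)
    d = poly.area * (1 - r ** 2) / poly.length  # .length is the perimeter
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(points.round().astype(int).tolist(),
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    # A negative offset shrinks the polygon; +d would produce G_d instead.
    return [np.array(p) for p in pco.Execute(-d)]
```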
Deformable convolutions are incorporated into the backbone to handle the varying aspect ratios and shapes of text instances by providing a more flexible receptive field.
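The paper applies modulated deformable convolutions in the later stages of the ResNet backbone; the sketch below instead uses the plain (non-modulated) `DeformConv2d` available in torchvision, simply to convey the idea of learned sampling offsets:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A 3x3 deformable convolution: a plain conv predicts 2 offset
    values (dx, dy) per kernel tap and location, letting the receptive
    field stretch along long or curved text instances."""
    def __init__(self, channels: int):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))
```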
During inference, the network predicts the probability map $P$. The approximate binary map $\hat{B}$ could also be used, but $P$ is typically preferred for efficiency, since the threshold-prediction branch can then be discarded. $P$ is binarized with a constant threshold (e.g., 0.2), and connected components are found in the resulting binary map; these components correspond to the shrunk text regions. Finally, each shrunk region is dilated back to approximate the original text-instance shape using the Vatti clipping algorithm, with dilation offset $D' = \frac{A' \times r'}{L'}$, where $A'$ and $L'$ are the area and perimeter of the shrunk region and $r'$ is an empirical ratio (1.5). This simple connected-components-and-dilation post-processing is significantly faster than the complex clustering used in other segmentation-based approaches.
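A compact sketch of this post-processing with OpenCV and pyclipper follows; contour extraction stands in for connected-component labeling here, and the score- and size-based filtering applied in practice is omitted:

```python
import cv2
import numpy as np
import pyclipper

def postprocess(prob_map: np.ndarray, bin_thresh: float = 0.2,
                r_prime: float = 1.5) -> list:
    """Binarize P, extract shrunk text regions, and dilate each by
    D' = A' * r' / L' via Vatti clipping to recover full text shapes."""
    binary = (prob_map >= bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        if len(contour) < 4:  # too few points to form a polygon
            continue
        area = cv2.contourArea(contour)
        perimeter = cv2.arcLength(contour, closed=True)
        if perimeter == 0:
            continue
        d_prime = area * r_prime / perimeter
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(contour.reshape(-1, 2).tolist(),
                    pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        for dilated in pco.Execute(d_prime):  # positive offset dilates
            boxes.append(np.array(dilated))
    return boxes
```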
Practical Implementation and Performance:
The paper demonstrates the effectiveness and efficiency of DB across various benchmark datasets (Total-Text, CTW1500, ICDAR 2015, MSRA-TD500, MLT-2017), covering curved, multi-oriented, and multi-language text.
Performance Gains: Ablation studies show that DB significantly improves F-measure on MSRA-TD500 and CTW1500 for both ResNet-18 and ResNet-50 backbones. For example, on MSRA-TD500 with ResNet-18, DB boosts F-measure from 77.4% to 81.1% (without deformable conv) or 78.9% to 82.8% (with deformable conv). Deformable convolution also contributes noticeable performance gains. Supervising the threshold map provides a further boost in accuracy.
Speed: A major practical advantage is the high inference speed achieved due to the simplified post-processing. With a ResNet-18 backbone, the method achieves real-time speeds, e.g., 62 FPS on MSRA-TD500 (736 input height) and 55 FPS on CTW1500 (1024 input height), while maintaining competitive accuracy. With ResNet-50, it achieves higher accuracy (e.g., 84.9% F-measure on MSRA-TD500) at still impressive speeds (32 FPS). The paper highlights that the speed is significantly faster than many prior state-of-the-art segmentation-based methods like PSENet.
Backbone Choice: The paper provides a clear trade-off between accuracy and speed by offering configurations with ResNet-18 (faster, slightly lower accuracy) and ResNet-50 (slower, higher accuracy). This allows practitioners to choose based on their specific application requirements.
Real-World Applications: The robustness across various text shapes and languages, combined with real-time speed, makes DB suitable for applications requiring on-device text detection or processing video streams, such as mobile OCR, autonomous driving, and real-time video analysis.
Limitations:
The paper notes one limitation: the method may struggle with "text inside text" cases where one text instance is entirely contained within another. This is a common challenge for segmentation-based methods relying on connected components of shrunk regions.
In summary, the DB paper presents a practical and effective approach to scene text detection by integrating binarization into the training process, enabling end-to-end optimization. This leads to a detector that is accurate for arbitrary shapes and exceptionally fast, making it highly applicable for real-time systems.