
Real-time Scene Text Detection with Differentiable Binarization (1911.08947v2)

Published 20 Nov 2019 in cs.CV

Abstract: Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset. Code is available at: https://github.com/MhLiao/DB

Authors (5)
  1. Minghui Liao (29 papers)
  2. Zhaoyi Wan (9 papers)
  3. Cong Yao (70 papers)
  4. Kai Chen (512 papers)
  5. Xiang Bai (222 papers)
Citations (618)

Summary

The paper "Real-time Scene Text Detection with Differentiable Binarization" (Liao et al., 2019 ) addresses the challenge of efficiently and accurately detecting scene text, particularly arbitrary shapes like curved text. While segmentation-based methods excel at capturing complex shapes, they traditionally suffer from complex and slow post-processing steps involving binarization and grouping. The core contribution of this paper is the introduction of Differentiable Binarization (DB), a technique that makes the binarization process end-to-end trainable within a deep learning framework, significantly simplifying post-processing and improving performance and speed.

The standard binarization process converts a probability map $P$ (the output of a segmentation network) into a binary map $B$ using a fixed threshold $t$:

$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \ge t \\ 0 & \text{otherwise} \end{cases}$$

This step function is non-differentiable, preventing it from being directly optimized during training with the segmentation network. DB replaces this with an approximate, differentiable function based on a sigmoid-like shape:

$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k (P_{i,j} - T_{i,j})}}$$

Here, $\hat{B}$ is the approximate binary map, $P$ is the probability map, $T$ is a learned adaptive threshold map, and $k$ is an amplifying factor (empirically set to 50) that steepens the curve, making the approximation closer to the step function. This function is differentiable, allowing gradients to flow back through the binarization process to update the network weights, including those predicting the adaptive threshold map $T$. The authors show that this differentiability and the gradient amplification around the threshold facilitate training and lead to more distinctive predictions.
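
A minimal sketch of this approximation, assuming PyTorch (the released code is PyTorch-based); the function and variable names are illustrative:

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Soft, differentiable stand-in for the hard step B = (P >= t).

    1 / (1 + exp(-k * (P - T))) is exactly sigmoid(k * (P - T)).
    """
    return torch.sigmoid(k * (prob_map - thresh_map))

# For comparison, the hard (non-differentiable) binarization:
# binary_map = (prob_map >= t).float()
```

With $k = 50$ the sigmoid is nearly a step function, yet its gradient near $P_{i,j} \approx T_{i,j}$ is amplified by roughly a factor of $k$, which is what drives the more distinctive predictions noted above.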

The proposed architecture consists of a feature-pyramid backbone (such as ResNet-18 or ResNet-50), followed by a neck that fuses features from different scales. The fused feature map $F$ is then fed into two prediction heads: one predicting the probability map $P$ and another predicting the threshold map $T$. The differentiable binarization module then takes $P$ and $T$ to produce the approximate binary map $\hat{B}$.
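
An illustrative two-branch head in PyTorch; the layer configuration below is simplified and is not the authors' exact design:

```python
import torch
import torch.nn as nn

class DBHead(nn.Module):
    """Predicts probability map P and threshold map T from fused features F,
    then combines them with differentiable binarization (illustrative layers)."""

    def __init__(self, in_channels: int, k: float = 50.0):
        super().__init__()
        self.k = k

        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels // 4, 3, padding=1),
                nn.BatchNorm2d(in_channels // 4),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_channels // 4, 1, 1),
                nn.Sigmoid(),
            )

        self.prob_head = branch()    # predicts P
        self.thresh_head = branch()  # predicts T

    def forward(self, fused_features: torch.Tensor):
        prob_map = self.prob_head(fused_features)
        thresh_map = self.thresh_head(fused_features)
        approx_binary = torch.sigmoid(self.k * (prob_map - thresh_map))  # B_hat
        return prob_map, thresh_map, approx_binary
```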

For training, supervision is applied to the probability map $P$, the approximate binary map $\hat{B}$, and the threshold map $T$; $P$ and $\hat{B}$ share the same ground-truth labels. The loss function is a weighted sum $L = L_s + \alpha L_b + \beta L_t$, where $L_s$ and $L_b$ are binary cross-entropy losses applied to $P$ and $\hat{B}$, respectively, with hard negative mining to handle class imbalance, and $L_t$ is an L1 loss applied to the predicted threshold map $T$ within a border region around the text instances. The weights are $\alpha = 1.0$ and $\beta = 10$.
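
A hedged sketch of this combined loss, assuming PyTorch tensors and omitting the hard negative mining the paper applies to $L_s$ and $L_b$; `thresh_mask` is an assumed binary mask marking the border region where $L_t$ is supervised:

```python
import torch
import torch.nn.functional as F

def db_loss(prob_map, approx_binary, thresh_map,
            gt_shrunk, thresh_gt, thresh_mask,
            alpha: float = 1.0, beta: float = 10.0) -> torch.Tensor:
    """L = L_s + alpha * L_b + beta * L_t (hard negative mining omitted)."""
    l_s = F.binary_cross_entropy(prob_map, gt_shrunk)       # supervises P
    l_b = F.binary_cross_entropy(approx_binary, gt_shrunk)  # supervises B_hat
    # L1 loss on T, restricted to the border region around text instances.
    l_t = (torch.abs(thresh_map - thresh_gt) * thresh_mask).sum() \
          / thresh_mask.sum().clamp(min=1.0)
    return l_s + alpha * l_b + beta * l_t
```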

Label generation is crucial. For a text polygon $G$, the ground truth for the probability map is a shrunk polygon $G_s$. The shrinking offset $D$ is computed from the polygon's perimeter $L$ and area $A$ as $D = \frac{A(1 - r^2)}{L}$, where $r$ is the shrink ratio (0.4). The ground truth for the threshold map $T$ is generated in the border region between $G_s$ and a dilated polygon $G_d$, obtained by dilating $G$ with the same offset $D$. The threshold label within this border region is defined by the distance to the closest segment of the original polygon $G$.
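
A sketch of the shrinking step, assuming the `shapely` and `pyclipper` packages (polygon offsetting in the paper uses the Vatti clipping algorithm, for which pyclipper is a common binding); `shrink_polygon` is an illustrative name:

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_polygon(polygon: np.ndarray, r: float = 0.4) -> list:
    """Shrink text polygon G to G_s with offset D = A * (1 - r^2) / L."""
    poly = Polygon(polygon)
    d = poly.area * (1 - r ** 2) / poly.length  # .length is the perimeter L
    offset = pyclipper.PyclipperOffset()
    offset.AddPath(polygon.astype(np.int64).tolist(),
                   pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    # A negative offset shrinks G to G_s; +d would dilate G to G_d instead.
    return offset.Execute(-d)
```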

Deformable convolutions are incorporated into the backbone to handle the varying aspect ratios and shapes of text instances by providing a more flexible receptive field.
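
As an illustration of how such layers can be wired in, torchvision provides a (non-modulated) `DeformConv2d`, where a plain convolution predicts per-position sampling offsets; this is a sketch, not the authors' exact backbone modification:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution with learned sampling offsets."""

    def __init__(self, channels: int):
        super().__init__()
        # 2 offsets (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset_conv = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.deform_conv = DeformConv2d(channels, channels,
                                        kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)
        return self.deform_conv(x, offsets)
```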

During inference, the network predicts the probability map $P$. The approximate binary map $\hat{B}$ could also be used, but $P$ alone is typically preferred for efficiency, since the threshold prediction branch can then be discarded. The probability map $P$ is binarized with a constant threshold (e.g., 0.2), and connected components are found in the resulting binary map; these components represent the shrunk text regions. Finally, the shrunk regions are dilated back to approximate the original text instance shapes using the Vatti clipping algorithm, with dilation offset $D' = \frac{A' \times r'}{L'}$, where $A'$ and $L'$ are the area and perimeter of the shrunk region and $r'$ is an empirical ratio (1.5). This simple connected-components-and-dilation post-processing is significantly faster than the complex clustering used in other segmentation-based approaches.
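
A sketch of this post-processing, assuming OpenCV, shapely, and pyclipper; the constants mirror those quoted above (binarization threshold 0.2, $r' = 1.5$), and `boxes_from_prob_map` is an illustrative name:

```python
import cv2
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def boxes_from_prob_map(prob_map: np.ndarray, thresh: float = 0.2,
                        r_prime: float = 1.5) -> list:
    """Binarize P, find connected components, dilate each back by D'."""
    binary = (prob_map >= thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST,
                                   cv2.CHAIN_APPROX_SIMPLE)
    regions = []
    for contour in contours:
        pts = contour.reshape(-1, 2)
        if len(pts) < 4:  # too few points to form a text region
            continue
        shrunk = Polygon(pts)
        d_prime = shrunk.area * r_prime / shrunk.length  # D' = A' * r' / L'
        offset = pyclipper.PyclipperOffset()
        offset.AddPath(pts.tolist(), pyclipper.JT_ROUND,
                       pyclipper.ET_CLOSEDPOLYGON)
        regions.extend(offset.Execute(d_prime))  # positive offset dilates
    return regions
```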

Practical Implementation and Performance:

The paper demonstrates the effectiveness and efficiency of DB across various benchmark datasets (Total-Text, CTW1500, ICDAR 2015, MSRA-TD500, MLT-2017), covering curved, multi-oriented, and multi-language text.

  • Performance Gains: Ablation studies show that DB significantly improves F-measure on MSRA-TD500 and CTW1500 for both ResNet-18 and ResNet-50 backbones. For example, on MSRA-TD500 with ResNet-18, DB boosts F-measure from 77.4% to 81.1% (without deformable conv) or 78.9% to 82.8% (with deformable conv). Deformable convolution also contributes noticeable performance gains. Supervising the threshold map provides a further boost in accuracy.
  • Speed: A major practical advantage is the high inference speed achieved due to the simplified post-processing. With a ResNet-18 backbone, the method achieves real-time speeds, e.g., 62 FPS on MSRA-TD500 (736 input height) and 55 FPS on CTW1500 (1024 input height), while maintaining competitive accuracy. With ResNet-50, it achieves higher accuracy (e.g., 84.9% F-measure on MSRA-TD500) at still impressive speeds (32 FPS). The paper highlights that the speed is significantly faster than many prior state-of-the-art segmentation-based methods like PSENet.
  • Backbone Choice: The paper provides a clear trade-off between accuracy and speed by offering configurations with ResNet-18 (faster, slightly lower accuracy) and ResNet-50 (slower, higher accuracy). This allows practitioners to choose based on their specific application requirements.
  • Real-World Applications: The robustness across various text shapes and languages, combined with real-time speed, makes DB suitable for applications requiring on-device text detection or processing video streams, such as mobile OCR, autonomous driving, and real-time video analysis.

Limitations:

The paper notes one limitation: the method may struggle with "text inside text" cases where one text instance is entirely contained within another. This is a common challenge for segmentation-based methods relying on connected components of shrunk regions.

In summary, the DB paper presents a practical and effective approach to scene text detection by integrating binarization into the training process, enabling end-to-end optimization. This leads to a detector that is accurate for arbitrary shapes and exceptionally fast, making it highly applicable for real-time systems.