- The paper introduces Differentiable Binarization (DB), which replaces the non-differentiable thresholding step with a steep, sigmoid-like approximation driven by a learned per-pixel threshold map, making binarization end-to-end trainable.
- It demonstrates notable accuracy improvements and real-time speeds, reaching up to 62 FPS while achieving competitive or state-of-the-art F-measures across five diverse text benchmarks.
- The approach employs a dual-head network with adaptive threshold maps and deformable convolutions to robustly detect arbitrarily shaped scene text.
The paper "Real-time Scene Text Detection with Differentiable Binarization" (1911.08947) addresses the challenge of efficiently and accurately detecting scene text, particularly arbitrary shapes like curved text. While segmentation-based methods excel at capturing complex shapes, they traditionally suffer from complex and slow post-processing steps involving binarization and grouping. The core contribution of this paper is the introduction of Differentiable Binarization (DB), a technique that makes the binarization process end-to-end trainable within a deep learning framework, significantly simplifying post-processing and improving performance and speed.
The standard binarization process converts a probability map P (output of a segmentation network) into a binary map B using a fixed threshold t:
$$B_{i,j} = \begin{cases} 1 & \text{if } P_{i,j} \ge t \\ 0 & \text{otherwise} \end{cases}$$
This step function is non-differentiable, preventing it from being directly optimized during training with the segmentation network. DB replaces this with an approximate, differentiable function based on a sigmoid-like shape:
$$\hat{B}_{i,j} = \frac{1}{1 + e^{-k(P_{i,j} - T_{i,j})}}$$
Here, B^ is the approximate binary map, P is the probability map, T is a learned adaptive threshold map, and k is an amplifying factor (empirically set to 50) that steepens the curve, making the approximation closer to the step function. This function is differentiable, allowing gradients to flow back through the binarization process to update the network weights, including those predicting the adaptive threshold map T. The authors show that this differentiability and gradient amplification around the threshold facilitate training and lead to more distinctive predictions.
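In code, the contrast between the two is a one-liner; a minimal PyTorch sketch (function names are illustrative, not taken from the paper's release):

```python
import torch

def hard_binarize(P: torch.Tensor, t: float = 0.2) -> torch.Tensor:
    """Standard binarization: a hard step function whose gradient is zero
    almost everywhere, so it cannot be trained through."""
    return (P >= t).float()

def differentiable_binarize(P: torch.Tensor, T: torch.Tensor, k: float = 50.0) -> torch.Tensor:
    """DB: a steep sigmoid of (P - T). With k = 50 it closely tracks the
    step function while letting gradients flow to both P and T."""
    return torch.sigmoid(k * (P - T))  # = 1 / (1 + exp(-k * (P - T)))
```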
The proposed architecture consists of a feature pyramid backbone (like ResNet-18 or ResNet-50), followed by a neck that fuses features from different scales. This fused feature map F is then fed into two prediction heads: one predicting the probability map P and another predicting the threshold map T. The differentiable binarization module then takes P and T to produce the approximate binary map B^.
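A schematic sketch of the dual-head design in PyTorch, with the backbone and neck omitted; the head layers are simplified (the paper's heads also include batch normalization, and the layer sizes here are assumptions):

```python
import torch
import torch.nn as nn

class DBHeads(nn.Module):
    """Schematic dual-head module: the fused feature map feeds parallel
    probability and threshold heads, whose outputs the DB module combines."""
    def __init__(self, fused_ch: int = 256, k: float = 50.0):
        super().__init__()
        self.k = k

        def make_head() -> nn.Sequential:
            # conv followed by two 2x upsampling stages back toward input resolution
            return nn.Sequential(
                nn.Conv2d(fused_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 64, 2, stride=2), nn.ReLU(inplace=True),
                nn.ConvTranspose2d(64, 1, 2, stride=2), nn.Sigmoid(),
            )

        self.prob_head = make_head()    # predicts P
        self.thresh_head = make_head()  # predicts T

    def forward(self, fused: torch.Tensor):
        P = self.prob_head(fused)
        T = self.thresh_head(fused)
        B_hat = torch.sigmoid(self.k * (P - T))  # differentiable binarization
        return P, T, B_hat
```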
For training, supervision is applied to the probability map P, the approximate binary map B^, and the threshold map T, where P and B^ share the same ground-truth labels. The loss is a weighted sum $L = L_s + \alpha \times L_b + \beta \times L_t$, with α = 1.0 and β = 10. Ls and Lb are binary cross-entropy losses on P and B^ respectively, using hard negative mining (a 3:1 negative-to-positive ratio) to counter class imbalance. Lt is an L1 loss on the predicted threshold map T, applied within a border region around each text instance.
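A sketch of this objective under the definitions above; helper names, masks, and reductions are simplified, and only the mining ratio follows the paper:

```python
import torch
import torch.nn.functional as F

def balanced_bce(pred: torch.Tensor, gt: torch.Tensor, neg_ratio: float = 3.0):
    """BCE with hard negative mining: keep every positive pixel plus the
    hardest negatives, at the paper's 3:1 negative-to-positive ratio."""
    pos = gt > 0.5
    neg = ~pos
    n_pos = int(pos.sum())
    n_neg = min(int(neg.sum()), int(n_pos * neg_ratio))
    loss = F.binary_cross_entropy(pred, gt, reduction="none")
    pos_loss = loss[pos].sum()
    neg_loss = loss[neg].topk(n_neg).values.sum() if n_neg > 0 else loss.new_tensor(0.0)
    return (pos_loss + neg_loss) / (n_pos + n_neg + 1e-6)

def db_loss(P, B_hat, T, gt_shrunk, gt_thresh, border_mask,
            alpha: float = 1.0, beta: float = 10.0):
    """L = Ls + alpha * Lb + beta * Lt; P and B_hat share gt_shrunk as labels,
    and Lt is an L1 loss on T restricted to the border region."""
    Ls = balanced_bce(P, gt_shrunk)
    Lb = balanced_bce(B_hat, gt_shrunk)
    Lt = (torch.abs(T - gt_thresh) * border_mask).sum() / (border_mask.sum() + 1e-6)
    return Ls + alpha * Lb + beta * Lt
```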
Label generation is crucial. For a text polygon G, the ground truth for the probability map is a shrunk polygon Gs. The shrinking offset D is computed from the polygon's perimeter L and area A as $D = \frac{A(1 - r^2)}{L}$, where r is the shrink ratio (0.4). The ground truth for the threshold map T is generated in the border region between Gs and a dilated polygon Gd, where Gd is obtained by dilating G with the same offset D. Within this border region, the threshold label is defined by the distance to the closest segment of the original polygon G.
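A sketch of the shrinking step, assuming the pyclipper bindings of the Clipper library (an implementation of Vatti clipping); dilating with +D instead of -D yields Gd:

```python
import numpy as np
import cv2
import pyclipper

def shrink_polygon(polygon: np.ndarray, r: float = 0.4) -> np.ndarray:
    """Shrink a text polygon G (N x 2 points) to Gs with offset
    D = A(1 - r^2) / L, via polygon offsetting (Vatti clipping)."""
    pts = polygon.astype(np.float32)
    A = cv2.contourArea(pts)
    L = cv2.arcLength(pts, closed=True)
    D = A * (1 - r ** 2) / L
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(polygon.round().astype(int).tolist(), pyclipper.JT_ROUND,
                pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-D)  # negative offset shrinks; +D would produce Gd
    return np.array(shrunk[0]) if shrunk else np.empty((0, 2), dtype=int)
```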
Deformable convolutions are incorporated into the backbone to handle the varying aspect ratios and shapes of text instances by providing a more flexible receptive field.
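A minimal sketch of such a block using torchvision's DeformConv2d (the paper uses the modulated variant, which torchvision supports via an optional mask argument; the block structure here is an assumption):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """A 3x3 deformable convolution: a companion conv predicts per-position
    sampling offsets, giving the kernel a flexible receptive field."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 2 offsets (dy, dx) per kernel tap: 2 * 3 * 3 = 18 channels
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.conv = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x, self.offset(x))
```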
During inference, the network predicts the probability map P. The approximate binary map B^ could be used instead, but P is typically preferred for efficiency, since the threshold prediction branch can then be discarded. The probability map P is binarized with a constant threshold (e.g., 0.2), and connected components, which represent the shrunk text regions, are found in the resulting binary map. Finally, these shrunk regions are dilated back to approximate the original text instance shape using the Vatti clipping algorithm, with dilation offset $D' = \frac{A' \times r'}{L'}$, where A′ and L′ are the area and perimeter of the shrunk region and r′ is an empirical ratio (1.5). This simple connected-components-plus-dilation post-processing is significantly faster than the complex clustering methods used in other segmentation-based approaches.
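A minimal sketch of this post-processing, using OpenCV contour extraction as a stand-in for connected-component labeling and omitting score filtering (function name is illustrative):

```python
import numpy as np
import cv2
import pyclipper

def detect_polygons(P: np.ndarray, bin_thresh: float = 0.2, r_prime: float = 1.5):
    """Threshold the probability map, extract connected regions, then dilate
    each shrunk region back by D' = A' * r' / L' via Vatti clipping."""
    binary = (P >= bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        cnt = cnt.reshape(-1, 2)
        if len(cnt) < 3:
            continue
        A = cv2.contourArea(cnt)
        L = cv2.arcLength(cnt, closed=True)
        if L == 0:
            continue
        D = A * r_prime / L  # dilation offset for this shrunk region
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(cnt.tolist(), pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        dilated = pco.Execute(D)  # positive offset expands the region
        if dilated:
            boxes.append(np.array(dilated[0]))
    return boxes
```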
Practical Implementation and Performance:
The paper demonstrates the effectiveness and efficiency of DB across various benchmark datasets (Total-Text, CTW1500, ICDAR 2015, MSRA-TD500, MLT-2017), covering curved, multi-oriented, and multi-language text.
- Performance Gains: Ablation studies show that DB significantly improves F-measure on MSRA-TD500 and CTW1500 for both ResNet-18 and ResNet-50 backbones. For example, on MSRA-TD500 with ResNet-18, DB boosts F-measure from 77.4% to 81.1% (without deformable conv) or 78.9% to 82.8% (with deformable conv). Deformable convolution also contributes noticeable performance gains. Supervising the threshold map provides a further boost in accuracy.
- Speed: A major practical advantage is the high inference speed achieved due to the simplified post-processing. With a ResNet-18 backbone, the method achieves real-time speeds, e.g., 62 FPS on MSRA-TD500 (736 input height) and 55 FPS on CTW1500 (1024 input height), while maintaining competitive accuracy. With ResNet-50, it achieves higher accuracy (e.g., 84.9% F-measure on MSRA-TD500) at still impressive speeds (32 FPS). The paper highlights that the speed is significantly faster than many prior state-of-the-art segmentation-based methods like PSENet.
- Backbone Choice: The paper provides a clear trade-off between accuracy and speed by offering configurations with ResNet-18 (faster, slightly lower accuracy) and ResNet-50 (slower, higher accuracy). This allows practitioners to choose based on their specific application requirements.
- Real-World Applications: The robustness across various text shapes and languages, combined with real-time speed, makes DB suitable for applications requiring on-device text detection or processing video streams, such as mobile OCR, autonomous driving, and real-time video analysis.
Limitations:
The paper notes one limitation: the method may struggle with "text inside text" cases where one text instance is entirely contained within another. This is a common challenge for segmentation-based methods relying on connected components of shrunk regions.
In summary, the DB paper presents a practical and effective approach to scene text detection by integrating binarization into the training process, enabling end-to-end optimization. This leads to a detector that is accurate for arbitrary shapes and exceptionally fast, making it highly applicable for real-time systems.