DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer (2207.04491v2)

Published 10 Jul 2022 in cs.CV

Abstract: Recently, Transformer-based methods, which predict polygon points or Bezier curve control points for localizing texts, are popular in scene text detection. However, these methods built upon the detection transformer framework might achieve sub-optimal training efficiency and performance due to coarse positional query modeling. In addition, the point label form exploited in previous works implies the reading order of humans, which impedes the detection robustness from our observation. To address these challenges, this paper proposes a concise Dynamic Point Text DEtection TRansformer network, termed DPText-DETR. In detail, DPText-DETR directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, to improve the spatial inductive bias of non-local self-attention in Transformer, we present an Enhanced Factorized Self-Attention module which provides point queries within each instance with circular shape guidance. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on the detection robustness in real-world scenarios, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments prove the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks. The code and the Inverse-Text test set are available at https://github.com/ymy-k/DPText-DETR.

Analysis of "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer"

This paper introduces DPText-DETR, a model designed to improve scene text detection through dynamic point queries within a Transformer framework. It addresses two prevalent issues in contemporary Transformer-based text detectors: training inefficiency caused by coarse positional query modeling, and robustness problems caused by point label forms that encode human reading order.

Key Contributions

The primary contribution of this work is the explicit modeling of point queries in the proposed DPText-DETR network. The model generates positional queries directly from explicit point coordinates and refines those coordinates progressively across the decoder layers. In addition, an Enhanced Factorized Self-Attention (EFSA) module strengthens the spatial inductive bias of the Transformer by providing the point queries within each instance with circular shape guidance, retaining the advantages of Transformer architectures while mitigating the weak spatial prior of non-local self-attention.
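As a rough illustration of what explicit, progressively updated point queries could look like in a DETR-style decoder, consider the following PyTorch sketch. The sinusoidal encoding, the offset head, and all tensor shapes are assumptions made for exposition; they are not taken from the released implementation.

```python
import math
import torch
import torch.nn as nn

def point_pos_embed(points: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Sinusoidal embedding of normalized (x, y) points -> (..., dim) queries."""
    half = dim // 4
    freqs = 10000 ** (torch.arange(half, dtype=torch.float32) / half)
    phase = points.unsqueeze(-1) * 2 * math.pi / freqs   # (..., 2, half)
    return torch.cat([phase.sin(), phase.cos()], dim=-1).flatten(-2)

class OffsetHead(nn.Module):
    """Small MLP predicting per-point (dx, dy) refinements (assumed design)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, feats: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        return (points + self.mlp(feats)).clamp(0.0, 1.0)

# Dynamic point queries: each decoder layer derives position queries directly
# from the current coordinates, then refines the coordinates for the next layer.
points = torch.rand(16, 16, 2)       # (instances, control points, xy), normalized
feats = torch.randn(16, 16, 256)     # decoder content features (placeholder)
head = OffsetHead()
for _ in range(6):                   # e.g. six decoder layers, illustrative
    pos_queries = point_pos_embed(points)   # explicit coordinates -> queries
    # ... a real decoder layer would attend using pos_queries and update feats ...
    points = head(feats, points)            # progressive coordinate update
```

The key design choice this sketch tries to capture is that positional information is re-derived from the current coordinates at every layer, so the queries sharpen as the points converge on the text contour.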

Methodology and Implementation

DPText-DETR departs from prior detectors that derive positional queries only coarsely from polygon points or Bezier curve control points. It instead adopts explicit point query modeling, using the point coordinates themselves to form position queries and updating them dynamically layer by layer, in contrast to the static queries of conventional frameworks. The EFSA module then refines spatial relationships by augmenting self-attention over the points within each instance with a circular convolution, compensating for the limited spatial inductive bias of non-local self-attention.
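One plausible reading of the circular convolution inside EFSA, sketched below in PyTorch, is a 1-D convolution with circular padding over the control points of each instance, so that the first and last points of the closed contour are treated as neighbors. The class name, feature sizes, and the additive fusion with self-attention are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularPointConv(nn.Module):
    """Circular 1-D convolution over the control points of one text instance,
    a plausible reading of EFSA's circular shape guidance (assumption)."""
    def __init__(self, dim: int = 256, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size // 2
        self.conv = nn.Conv1d(dim, dim, kernel_size)  # pad circularly by hand below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (instances, num_points, dim) point features along the text contour
        x = x.transpose(1, 2)                                # (instances, dim, points)
        x = F.pad(x, (self.pad, self.pad), mode="circular")  # wrap around the polygon
        return self.conv(x).transpose(1, 2)                  # (instances, points, dim)

# Usage: fuse with ordinary intra-instance self-attention, e.g. by addition.
point_feats = torch.randn(8, 16, 256)     # 8 instances, 16 control points each
shape_guided = CircularPointConv()(point_feats)
```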

A new positional label form is also proposed: control points are ordered clockwise from a fixed starting position, removing any dependence on reading order and improving detection robustness. The Inverse-Text test set, 500 manually labeled images featuring inverse-like text instances, provides a new benchmark for evaluating this claim.
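To make the label-form idea concrete, here is a minimal NumPy sketch that reorders polygon vertices into a reading-order-free canonical form. The specific starting-point rule (the top-left-most vertex) is a simplifying assumption, not necessarily the paper's exact convention.

```python
import numpy as np

def to_canonical_order(points: np.ndarray) -> np.ndarray:
    """Reorder (N, 2) polygon vertices: clockwise on screen, starting near the
    top-left. The starting-point heuristic is an illustrative assumption."""
    x, y = points[:, 0], points[:, 1]
    # Shoelace signed area; with image coordinates (y grows downward),
    # a positive value means the vertices already run clockwise on screen.
    area = 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)
    if area < 0:
        points = points[::-1]                  # flip counter-clockwise input
    start = int(np.argmin(points[:, 0] + points[:, 1]))  # top-left-most vertex
    return np.roll(points, -start, axis=0)

poly = np.array([[40, 10], [10, 10], [10, 30], [40, 30]])  # arbitrary order
print(to_canonical_order(poly))   # [[10 10] [40 10] [40 30] [10 30]]
```

Because the ordering is derived purely from geometry, the same text region yields the same label whether the text reads left-to-right, upside down, or mirrored, which is exactly the dependence the paper seeks to remove.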

Experimental Results

The experiments show marked gains in training efficiency, robustness, and overall performance on popular benchmarks, including Total-Text, CTW1500, and ICDAR19 ArT, where DPText-DETR sets a new state of the art. The model also converges faster during training than comparable architectures, underlining its efficiency.

Implications and Future Directions

The implications of DPText-DETR for scene text detection are significant: by moving from bounding-box-based query generation to a point-based scheme, the paper opens pathways for future research into even more granular text representations within Transformers. The introduction of EFSA likewise suggests adaptations for other vision tasks where spatial attention is crucial.

A natural next step is to integrate the model into a complete text spotting system, yielding a holistic end-to-end detection and recognition pipeline. Another promising direction is applying the model to datasets in other languages and scripts to probe its robustness across linguistic contexts.

Conclusion

The paper makes substantial contributions to scene text detection, improving robustness and training efficiency through explicit point-query modeling and an enhanced attention mechanism within Transformers. DPText-DETR stands as a methodological advance with potential to broaden Transformer applications across computer vision, and its strong results invite further exploration and refinement in this area of research.

Authors
  1. Maoyuan Ye
  2. Jing Zhang
  3. Shanshan Zhao
  4. Juhua Liu
  5. Bo Du
  6. Dacheng Tao