Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Vision-Language Pre-Training for Boosting Scene Text Detectors (2204.13867v1)

Published 29 Apr 2022 in cs.CV, cs.AI, and cs.CL

Abstract: Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked LLMing (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the down-stream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (7)
  1. Sibo Song (13 papers)
  2. Jianqiang Wan (6 papers)
  3. Zhibo Yang (43 papers)
  4. Jun Tang (33 papers)
  5. Wenqing Cheng (12 papers)
  6. Xiang Bai (221 papers)
  7. Cong Yao (70 papers)
Citations (20)