- The paper demonstrates that TCM turns a pretrained CLIP model into an effective scene text detector without additional pretraining, improving the F-measure by 22% on average when only 10% of the labeled data is used.
- The method relies on cross-modal interaction, combining visual prompt learning with instance-language matching to recover fine-grained textual details from CLIP features.
- The approach adapts robustly across domains and improves results on standard benchmarks, making it well suited to low-data and out-of-distribution scenarios.
Turning a CLIP Model into a Scene Text Detector
The paper by Wenwen Yu et al., "Turning a CLIP Model into a Scene Text Detector," proposes TCM, a method that converts a Contrastive Language-Image Pretraining (CLIP) model into a scene text detector without additional pretraining. Given the vision-and-language knowledge captured by CLIP, the paper aims to harness these capabilities for effective text detection in natural scenes.
Traditional scene text detection typically requires extensive annotated data for successful model training. TCM addresses this limitation through strong few-shot performance: with only 10% of the labeled data, it improves the F-measure by an average of 22%. This is a significant departure from existing methods, which often depend on large-scale, fully annotated datasets.
The methodological core of TCM is a cross-modal interaction mechanism built on visual prompt learning and an instance-language matching strategy. Through cross-attention, TCM recovers fine-grained textual information from the outputs of the CLIP image encoder. In addition, a language prompt generator conditions the text embedding on each specific image, which helps steer the pretrained knowledge of CLIP's text encoder.
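To make this mechanism concrete, the sketch below shows one plausible way to wire the pieces together in PyTorch: a small MLP conditions the text embedding on the image's global feature, a cross-attention layer produces a visual prompt from the image tokens, and a cosine similarity between the prompted image tokens and the conditioned text embedding yields a text-region score map. The class and parameter names (CrossModalInteraction, lang_prompt_gen, the 512-dimensional features) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    """Simplified sketch of a TCM-style cross-modal interaction block.

    Module names, dimensions, and the exact wiring are illustrative
    assumptions rather than the authors' implementation.
    """

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        # Language prompt generator: conditions the text embedding on the
        # image's global feature (a hypothetical two-layer MLP).
        self.lang_prompt_gen = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )
        # Visual prompt generator: cross-attention in which the text
        # embedding queries the image feature map for fine-grained text cues.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img_feat_map, text_embed):
        # img_feat_map: (B, C, H, W) feature map from the CLIP image encoder
        # text_embed:   (B, C) embedding of the "text" concept from the text encoder
        B, C, H, W = img_feat_map.shape
        img_tokens = img_feat_map.flatten(2).transpose(1, 2)          # (B, H*W, C)
        global_feat = img_tokens.mean(dim=1)                          # (B, C)

        # Condition the language prompt on the current image.
        cond_text = text_embed + self.lang_prompt_gen(global_feat)    # (B, C)

        # Cross-attention: the text embedding attends over image tokens,
        # yielding a visual prompt that is added back to the image features.
        vis_prompt, _ = self.cross_attn(
            cond_text.unsqueeze(1), img_tokens, img_tokens)           # (B, 1, C)
        prompted = img_tokens + vis_prompt                            # broadcast add

        # Instance-language matching: cosine similarity between each image
        # token and the conditioned text embedding gives a text-region map.
        sim = F.normalize(prompted, dim=-1) @ F.normalize(
            cond_text, dim=-1).unsqueeze(-1)                          # (B, H*W, 1)
        score_map = torch.sigmoid(sim).view(B, 1, H, W)
        return prompted.transpose(1, 2).reshape(B, C, H, W), score_map
```

In a pipeline like TCM's, the prompted features and score map would feed an existing detection head; here they are simply returned to keep the sketch self-contained.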
The authors also highlight TCM's effectiveness in domain adaptation, demonstrating robust performance even when the training data is out-of-distribution relative to the test data. Combining a predefined language prompt with learnable prompts further enhances this adaptability. Empirically, integrating TCM with existing text detection methods such as DBNet, PAN, and FCENet yields consistent gains across benchmarks including ICDAR2015, MSRA-TD500, and CTW1500.
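The prompt design can be pictured as a fixed, predefined prompt (a word like "text") that supplies a stable anchor, plus a small set of learnable context vectors that absorbs dataset-specific variation. The sketch below, loosely in the spirit of CoOp-style prompt tuning, treats CLIP's token embedding table and text encoder as frozen black boxes; the class name, the n_ctx parameter, and the way the encoder is invoked are assumptions for illustration only.

```python
import torch
import torch.nn as nn


class PromptedTextEmbedding(nn.Module):
    """Sketch of pairing a predefined prompt with learnable context tokens.

    `frozen_token_embed` and `frozen_text_encoder` stand in for CLIP's text
    pipeline; all names and sizes are assumptions for illustration.
    """

    def __init__(self, frozen_token_embed, frozen_text_encoder,
                 predefined_ids, n_ctx=4, dim=512):
        super().__init__()
        self.token_embed = frozen_token_embed        # frozen embedding table
        self.text_encoder = frozen_text_encoder      # frozen text transformer
        self.register_buffer("predefined_ids", predefined_ids)  # e.g. token ids of "text"
        # Learnable context vectors prepended to the predefined prompt.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self):
        with torch.no_grad():
            word_embed = self.token_embed(self.predefined_ids)          # (L, dim)
        # [learnable context] + [predefined prompt tokens]
        prompt = torch.cat([self.ctx, word_embed], dim=0).unsqueeze(0)  # (1, n_ctx+L, dim)
        return self.text_encoder(prompt)             # (1, dim) conditioned text embedding
```

If, as assumed here, only the prompts and the lightweight generators are trained while the CLIP backbones stay frozen, the number of tunable parameters remains small, which is consistent with the few-shot behavior described above.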
Moreover, this approach avoids the exhaustive pretext tasks that characterize prior pretraining techniques, while still capitalizing on the semantic richness of the pretrained CLIP model. The evaluation spans several experimental settings, including few-shot learning and domain adaptation, and consistently highlights the model's strength in low-data regimes and on out-of-distribution datasets.
In conclusion, the paper presents a compelling case for the utility of TCM, offering significant improvements in the field of scene text detection by leveraging the latent capabilities of the CLIP model. This advance presages potential extensions into other related fields, such as scene text spotting or multimodal analytics, underscoring the versatility and practicality of integrating vision and language representations. Future research may explore further optimizations of interaction mechanisms or broader applications within AI-driven visual understanding tasks.