Focus, Distinguish, and Prompt: Unleashing CLIP for Efficient and Flexible Scene Text Retrieval (2408.00441v1)
Abstract: Scene text retrieval aims to find all images containing the query text from an image gallery. Current efforts tend to adopt an Optical Character Recognition (OCR) pipeline, which requires complicated text detection and/or recognition processes, resulting in inefficient and inflexible retrieval. In contrast, this work explores the intrinsic potential of Contrastive Language-Image Pre-training (CLIP) for OCR-free scene text retrieval. Through empirical analysis, we observe that the main challenges of CLIP as a text retriever are: 1) limited text perceptual scale, and 2) entangled visual-semantic concepts. To this end, a novel model termed FDP (Focus, Distinguish, and Prompt) is developed. FDP first focuses on scene text by shifting attention to the text area and probing the hidden text knowledge, and then divides the query text into content words and function words for processing, using a semantic-aware prompting scheme and a distracted queries assistance module. Extensive experiments show that FDP significantly enhances inference speed while achieving better or competitive retrieval accuracy compared to existing methods. Notably, on the IIIT-STR benchmark, FDP surpasses the state-of-the-art model by 4.37% while running 4 times faster. Furthermore, additional experiments under phrase-level and attribute-aware scene text retrieval settings validate FDP's particular advantages in handling diverse forms of query text. The source code will be publicly available at https://github.com/Gyann-z/FDP.
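Once images and queries live in CLIP's joint embedding space, the retrieval step the abstract describes reduces to ranking gallery images by similarity to the query-text embedding. A minimal sketch of that ranking step is below; the function name, array shapes, and toy embeddings are illustrative assumptions, not the paper's FDP implementation.

```python
import numpy as np

def retrieve(image_embeds: np.ndarray, text_embed: np.ndarray, top_k: int = 5):
    """Rank gallery images by cosine similarity to a query-text embedding.

    image_embeds: (N, D) array of precomputed CLIP image embeddings
                  (hypothetical; in practice produced by a CLIP image encoder).
    text_embed:   (D,) CLIP text embedding of the query word.
    Returns indices of the top_k most similar images, best match first.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    img = image_embeds / np.linalg.norm(image_embeds, axis=1, keepdims=True)
    txt = text_embed / np.linalg.norm(text_embed)
    sims = img @ txt                      # one similarity score per gallery image
    return np.argsort(-sims)[:top_k]      # indices sorted by descending similarity
```

An OCR-free retriever of this kind only embeds each gallery image once, which is why it can be much faster at query time than pipelines that run detection and recognition per image.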
Authors: Gangyan Zeng, Yuan Zhang, Jin Wei, Dongbao Yang, Peng Zhang, Yiwen Gao, Xugong Qin, Yu Zhou