TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model (2403.10047v1)
Abstract: Existing scene text spotters are designed to locate and transcribe text in images, yet achieving precise detection and accurate recognition simultaneously remains challenging. Inspired by the glimpse-focus spotting pipeline of human beings and the impressive performance of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection, just like human beings?", and if yes, 2) "Is the text block another viable granularity for scene text spotting, besides the word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we fine-tune a PLM on a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during pre-training, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incompletely detected texts. Combining the fine-tuned PLM with the text-block detection paradigm, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even LLMs.
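As a concrete illustration of the glimpse-focus pipeline described above, the following is a minimal Python sketch, not the paper's actual implementation: it assumes rough block-level rectangles coming from any coarse detector, and uses an off-the-shelf TrOCR model as a stand-in for the fine-tuned PLM recognizer. The function name `spot_text_blocks`, the checkpoint choice, and the example boxes are all hypothetical.

```python
# Minimal sketch of "glimpse" (rough block boxes) + "focus" (PLM-based recognition).
# NOT the paper's implementation; TrOCR is used here only as a stand-in recognizer.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def spot_text_blocks(image_path, block_boxes):
    """Transcribe each coarsely detected text block.

    block_boxes: list of (x0, y0, x1, y1) rectangles from a simple block-level
    detector; precise word- or character-level polygons are deliberately not required.
    """
    image = Image.open(image_path).convert("RGB")
    results = []
    for (x0, y0, x1, y1) in block_boxes:
        crop = image.crop((x0, y0, x1, y1))           # "glimpse": rough region only
        pixel_values = processor(images=crop, return_tensors="pt").pixel_values
        generated_ids = model.generate(pixel_values)  # "focus": language-model decoding
        text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
        results.append(((x0, y0, x1, y1), text))
    return results

# Example usage with hypothetical inputs:
# print(spot_text_blocks("scene.jpg", [(40, 60, 420, 180), (50, 300, 380, 360)]))
```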
Authors: Jiahao Lyu, Jin Wei, Gangyan Zeng, Zeng Li, Enze Xie, Wei Wang, Yu Zhou