Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
38 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting (2403.00303v2)

Published 1 Mar 2024 in cs.CV

Abstract: In recent years, text-image joint pre-training techniques have shown promising results in various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances with their corresponding text regions in images poses a challenge, as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish from the text in natural language) rather than a holistic understanding of the overall image content. In this paper, we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM, we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks. Additionally, we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks, allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (58)
  1. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  2. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 71–79, 2018.
  3. Efficient self-supervised vision pretraining with local masked reconstruction. arXiv preprint arXiv:2206.00790, 2022.
  4. Textdiffuser: Diffusion models as text painters. Advances in Neural Information Processing Systems, 36, 2024.
  5. Total-text: toward orientation robustness in scene text detection. International Journal on Document Analysis and Recognition (IJDAR), 23(1):31–52, 2020.
  6. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  7. Textdragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9076–9085, 2019.
  8. Synthetic data for text localisation in natural images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2315–2324, 2016.
  9. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  10. Mae: Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021a.
  11. Most: A multi-oriented scene text detector with localization refinement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8813–8822, 2021b.
  12. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5020–5029, 2018.
  13. Swintextspotter: Scene text spotting via better synergy between text detection and text recognition. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4593–4603, 2022.
  14. Estextspotter: Towards better scene text spotting with explicit synergy in transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19495–19505, 2023.
  15. Icdar 2015 competition on robust reading. In 2015 13th international conference on document analysis and recognition (ICDAR), pages 1156–1160. IEEE, 2015.
  16. Towards unified scene text spotting based on sequence generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15223–15232, 2023.
  17. Pp-ocrv3: More attempts for the improvement of ultra lightweight ocr system. arXiv preprint arXiv:2206.03001, 2022.
  18. Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI conference on artificial intelligence, 2017a.
  19. Textboxes: A fast text detector with a single deep neural network. In Proceedings of the AAAI conference on artificial intelligence, 2017b.
  20. Textboxes++: A single-shot oriented scene text detector. IEEE transactions on image processing, 27(8):3676–3690, 2018.
  21. Mask textspotter v3: Segmentation proposal network for robust scene text spotting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 706–722. Springer, 2020a.
  22. Real-time scene text detection with differentiable binarization. In Proceedings of the AAAI conference on artificial intelligence, pages 11474–11481, 2020b.
  23. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):919–931, 2022.
  24. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  25. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5676–5685, 2018.
  26. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90:337–345, 2019.
  27. Abcnet: Real-time scene text spotting with adaptive bezier-curve network. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9809–9818, 2020.
  28. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8048–8064, 2021.
  29. Spts v2: single-point scene text spotting. arXiv preprint arXiv:2301.01635, 2023.
  30. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European conference on computer vision (ECCV), pages 67–83, 2018.
  31. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):532–548, 2021.
  32. Arbitrary-oriented scene text detection via rotation proposals. IEEE transactions on multimedia, 20(11):3111–3122, 2018.
  33. Glyphdraw: Learning to draw chinese characters in image synthesis models coherently. arXiv preprint arXiv:2303.17870, 2023.
  34. Spts: single-point text spotting. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4272–4281, 2022.
  35. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  36. Ocr-vqgan: Taming text-within-image generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3689–3698, 2023.
  37. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  38. Detecting oriented text in natural images by linking segments. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2550–2558, 2017.
  39. Very deep convolutional networks for large-scale image recognition. 2014.
  40. Vision-language pre-training for boosting scene text detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15681–15691, 2022.
  41. Icdar 2019 competition on large-scale street view text with partial labeling-rrc-lsvt. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1557–1562. IEEE, 2019.
  42. Few could be better than all: Feature sampling and grouping for scene text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563–4572, 2022.
  43. Anytext: Multilingual visual text generation and editing. arXiv preprint arXiv:2311.03054, 2023.
  44. Attention is all you need. In NeurIPS, 2017.
  45. Self-attention based text knowledge mining for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5983–5992, 2021.
  46. Shape robust text detection with progressive scale expansion network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9336–9345, 2019a.
  47. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8440–8449, 2019b.
  48. Masked feature prediction for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14668–14678, 2022.
  49. Simmim: A simple framework for masked image modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9653–9663, 2022.
  50. Textfield: Learning a deep direction field for irregular scene text detection. IEEE Transactions on Image Processing, 28(11):5566–5579, 2019.
  51. Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740, 2020.
  52. Language matters: A weakly supervised vision-language pre-training approach for scene text detection and spotting. In European Conference on Computer Vision, pages 284–302. Springer, 2022.
  53. Class-aware mask-guided feature refinement for scene text recognition. Pattern Recognition, 149:110244, 2024.
  54. Deepsolo: Let transformer decoder with explicit points solo for text spotting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19348–19357, 2023.
  55. Turning a clip model into a scene text detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6978–6988, 2023a.
  56. Structextv2: Masked visual-textual prediction for document image pre-training. arXiv preprint arXiv:2303.00289, 2023b.
  57. East: an efficient and accurate scene text detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 5551–5560, 2017.
  58. Fourier contour embedding for arbitrary-shaped text detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3123–3131, 2021.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (5)
  1. Chen Duan (5 papers)
  2. Pei Fu (14 papers)
  3. Shan Guo (3 papers)
  4. Qianyi Jiang (7 papers)
  5. Xiaoming Wei (44 papers)
Citations (3)