
Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models (2311.16555v1)

Published 28 Nov 2023 in cs.CV

Abstract: Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and time-consuming. As a solution, researchers have widely adopted synthetic text images as a complementary resource to real text images during pre-training. Yet there is still room for synthetic datasets to enhance the performance of scene text detectors. We contend that one main limitation of existing generation methods is the insufficient integration of foreground text with the background. To alleviate this problem, we present the Diffusion Model based Text Generator (DiffText), a pipeline that utilizes the diffusion model to seamlessly blend foreground text regions with the background's intrinsic features. Additionally, we propose two strategies to generate visually coherent text with fewer spelling errors. With fewer text instances, our produced text images consistently surpass other synthetic data in aiding text detectors. Extensive experiments on detecting horizontal, rotated, curved, and line-level texts demonstrate the effectiveness of DiffText in producing realistic text images.

Authors (5)
  1. Ling Fu
  2. Zijie Wu
  3. Yingying Zhu
  4. Yuliang Liu
  5. Xiang Bai
