
Scene Text Image Super-resolution based on Text-conditional Diffusion Models (2311.09759v2)

Published 16 Nov 2023 in cs.CV

Abstract: Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images captured in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results reveal that text-conditional DMs notably surpass existing STISR methods. In particular, when the text in an LR text image is given as input, the text-conditional DMs produce higher-quality super-resolved text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, dedicated respectively to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirm that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.
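
The three-module framework described in the abstract can be pictured as a small data pipeline. The sketch below is illustrative only and is not the authors' implementation. The class name `PairSynthesizer`, the `Image` type, and the three `toy_*` functions are hypothetical stand-ins for the text-conditional DMs (nearest-neighbour upsampling and average pooling replace the real models). The order in which the modules are chained is also an assumption, since the abstract does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for an image: a 2D grid of grayscale values.
Image = List[List[float]]


@dataclass
class PairSynthesizer:
    """Chains three text-conditioned modules to produce an (LR, HR) pair.

    Each callable is a placeholder for one text-conditional model.
    The wiring order below is an assumption made for illustration.
    """
    synthesize: Callable[[str], Image]            # text -> text image
    super_resolve: Callable[[Image, str], Image]  # (image, text) -> HR image
    degrade: Callable[[Image, str], Image]        # (image, text) -> LR image

    def make_pair(self, text: str) -> Tuple[Image, Image]:
        base = self.synthesize(text)
        hr = self.super_resolve(base, text)
        lr = self.degrade(hr, text)
        return lr, hr


def toy_synthesize(text: str) -> Image:
    # Toy rendering: 4 rows, 4 columns per character.
    row = [(ord(ch) % 32) / 31.0 for ch in text for _ in range(4)]
    return [row[:] for _ in range(4)]


def toy_super_resolve(image: Image, text: str) -> Image:
    # Toy replacement for a super-resolution model: 2x nearest-neighbour upsampling.
    out: Image = []
    for row in image:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out


def toy_degrade(image: Image, text: str) -> Image:
    # Toy replacement for a degradation model: 2x2 average pooling.
    out: Image = []
    for r in range(0, len(image) - 1, 2):
        out.append([
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, len(image[r]) - 1, 2)
        ])
    return out


pipeline = PairSynthesizer(toy_synthesize, toy_super_resolve, toy_degrade)
lr, hr = pipeline.make_pair("AB")
```

Each resulting `(lr, hr)` pair shares the same text label, so a paired dataset for training an STISR method could be built by calling `make_pair` over a word list.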

Authors (3)
  1. Chihiro Noguchi (8 papers)
  2. Shun Fukuda (3 papers)
  3. Masao Yamanaka (4 papers)
Citations (6)