PEAN: A Diffusion-Based Prior-Enhanced Attention Network for Scene Text Image Super-Resolution (2311.17955v3)
Abstract: Scene text image super-resolution (STISR) aims to increase both the resolution and the readability of low-resolution scene text images, thus boosting the performance of the downstream recognition task. Two factors in scene text images, visual structure and semantic information, significantly affect recognition performance. To address both factors, this paper proposes a Prior-Enhanced Attention Network (PEAN). Specifically, an attention-based modulation module is used to understand scene text images by perceiving their local and global dependencies, regardless of the shape of the text. Meanwhile, a diffusion-based module is developed to enhance the text prior, offering better guidance for the SR network to generate SR images with higher semantic accuracy. Additionally, a multi-task learning paradigm is employed to optimize the network, enabling the model to generate legible SR images. As a result, PEAN establishes new state-of-the-art (SOTA) results on the TextZoom benchmark. Experiments are also conducted to analyze the importance of the enhanced text prior as a means of improving the performance of the SR network. Code is available at https://github.com/jdfxzzy/PEAN.
Authors: Zuoyan Zhao, Hui Xue, Pengfei Fang, Shipeng Zhu