Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition (2402.13643v1)
Abstract: Scene text recognition is a rapidly developing field that faces numerous challenges due to the complexity and diversity of scene text, including complex backgrounds, diverse fonts, flexible arrangements, and accidental occlusions. In this paper, we propose a novel approach called Class-Aware Mask-guided feature refinement (CAM) to address these challenges. Our approach introduces canonical class-aware glyph masks generated from a standard font to suppress background and text style noise, thereby enhancing feature discrimination. Additionally, we design a feature alignment and fusion module that incorporates the canonical mask guidance to further refine features for text recognition. By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion, ultimately leading to improved recognition performance. We first evaluate CAM on six standard text recognition benchmarks to demonstrate its effectiveness. Furthermore, CAM outperforms the state-of-the-art method by an average performance gain of 4.1% across six more challenging datasets, despite using a smaller model. Our study highlights the importance of incorporating canonical mask guidance and aligned feature refinement techniques for robust scene text recognition. The code is available at https://github.com/MelosY/CAM.
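As a rough illustration of what a "canonical class-aware glyph mask" could look like, the sketch below builds a small label map in which every glyph pixel stores the class index of its character and background pixels store 0. This is only a toy under stated assumptions, not the authors' pipeline: the 3x5 bitmap font `TOY_FONT`, the three-character `CHARSET`, and the function name `class_aware_mask` are all hypothetical stand-ins, and a real system would presumably rasterize glyphs from an actual standard font file.

```python
# Hypothetical 3x5 bitmap "font" standing in for a real standard font.
TOY_FONT = {
    "a": ["010", "101", "111", "101", "101"],
    "b": ["110", "101", "110", "101", "110"],
    "c": ["011", "100", "100", "100", "011"],
}
CHARSET = sorted(TOY_FONT)            # assumed character set: a, b, c
GLYPH_H, GLYPH_W, GAP = 5, 3, 1       # glyph size and spacing in pixels


def class_aware_mask(text):
    """Return a 2D list where each glyph pixel holds its character's class id.

    Class ids are 1-based (a=1, b=2, c=3); 0 marks background.
    """
    width = len(text) * GLYPH_W + max(len(text) - 1, 0) * GAP
    mask = [[0] * width for _ in range(GLYPH_H)]
    for pos, ch in enumerate(text):
        class_id = CHARSET.index(ch) + 1
        x0 = pos * (GLYPH_W + GAP)    # left edge of this glyph on the canvas
        for y, row in enumerate(TOY_FONT[ch]):
            for x, bit in enumerate(row):
                if bit == "1":
                    mask[y][x0 + x] = class_id
    return mask
```

Calling `class_aware_mask("cab")` yields a 5x11 grid whose nonzero values identify which character each foreground pixel belongs to. Because the mask depends only on the text label and the fixed font, it carries no background or style information, which is the property the abstract attributes to canonical masks.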
Authors: Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai