Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition (2402.15806v1)
Abstract: Scene text recognition (STR) is a challenging task that requires large-scale annotated data for training. However, collecting and labeling real text images is expensive and time-consuming, which limits the availability of real data. Therefore, most existing STR methods resort to synthetic data, which may introduce domain discrepancy and degrade the performance of STR models. To alleviate this problem, recent semi-supervised STR methods exploit unlabeled real data by enforcing character-level consistency regularization between weakly and strongly augmented views of the same image. However, these methods neglect word-level consistency, which is crucial for sequence recognition tasks. This paper proposes a novel semi-supervised learning method for STR that incorporates word-level consistency regularization from both visual and semantic aspects. Specifically, we devise a shortest path alignment module to align the sequential visual features of different views and minimize their distance. Moreover, we adopt a reinforcement learning framework to optimize the semantic similarity of the predicted strings in the embedding space. We conduct extensive experiments on several standard and challenging STR benchmarks and demonstrate the superiority of our proposed method over existing semi-supervised STR methods.
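To make the word-level visual consistency idea concrete, below is a minimal sketch, not the paper's implementation, of aligning the sequential visual features of a weakly and a strongly augmented view with a DTW-style shortest-path dynamic program and using the accumulated alignment cost as a consistency loss. All function names, tensor shapes, and the choice of Euclidean pairwise distance are illustrative assumptions; the reinforcement-learning branch that rewards semantic similarity of predicted strings is omitted here.

```python
# Sketch only: shortest-path (DTW-style) alignment between the frame-level
# features of two augmented views of the same unlabeled word image.
# Shapes and names are assumptions, not the authors' code.
import torch

def shortest_path_alignment_loss(feat_weak: torch.Tensor,
                                 feat_strong: torch.Tensor) -> torch.Tensor:
    """feat_weak: (T1, D), feat_strong: (T2, D) sequential visual features."""
    # Pairwise Euclidean distances between all time steps -> (T1, T2).
    cost = torch.cdist(feat_weak, feat_strong)
    T1, T2 = cost.shape
    # Accumulated cost of the cheapest monotonic alignment path.
    acc = torch.full((T1 + 1, T2 + 1), float("inf"), device=cost.device)
    acc[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            acc[i, j] = cost[i - 1, j - 1] + torch.min(
                torch.stack([acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]]))
    # Normalize by sequence length so words of different lengths are comparable.
    return acc[T1, T2] / max(T1, T2)

# Usage: features extracted from the two views of one unlabeled image.
loss = shortest_path_alignment_loss(torch.randn(25, 512), torch.randn(25, 512))
```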
Authors: Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai