Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition (2405.05841v2)
Abstract: In text recognition, self-supervised pre-training has emerged as an effective way to reduce dependence on expensive annotated real data. Previous studies primarily focus on local visual representations by leveraging masked image modeling or sequence contrastive learning, but they neglect to model the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct direction-specific pixel and feature signals from a symmetrically superimposed input. Specifically, we add the original image to its inverted view to create the symmetrically superimposed input. At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the original and inverted images under different augmentations to model semantic-level linguistic context and local character discrimination. By design, the superimposition disrupts both character shapes and linguistic rules, so the dual-level reconstruction forces the model to understand character shapes and linguistic information from the perspectives of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and a new state-of-the-art average word accuracy of 86.6% on the Union14M benchmarks. The code is available at https://github.com/FaltingsA/SSM.
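The core input construction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact inversion operation (horizontal flip is assumed here), normalization, and augmentation pipeline are assumptions not specified in the abstract.

```python
import numpy as np

def symmetric_superimpose(image: np.ndarray) -> dict:
    """Build a symmetrically superimposed input from a text image.

    `image` is an (H, W) or (H, W, C) array. The "inverted view" is
    assumed to be a horizontal flip; the average of the original and
    flipped views forms the superimposed input, while both views are
    kept as dual pixel-level reconstruction targets.
    """
    inverted = image[:, ::-1]  # assumed inversion: mirror along the width axis
    superimposed = (image.astype(np.float32) + inverted.astype(np.float32)) / 2.0
    return {
        "input": superimposed,          # fed to the encoder
        "targets": (image, inverted),   # direction-specific reconstruction targets
    }
```

Because the input is an average of a view and its mirror, character shapes overlap and the left-to-right linguistic order is destroyed, which is what the dual-level reconstruction objective must undo.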
Authors: Zuan Gao, Yuxin Wang, Yadong Qu, Boqiang Zhang, Zixiao Wang, Jianjun Xu, Hongtao Xie