Masked Vision-Language Transformers for Scene Text Recognition (2211.04785v1)

Published 9 Nov 2022 in cs.CV

Abstract: Scene text recognition (STR) enables computers to recognize and read text in a variety of real-world scenes. Recent STR models benefit from taking linguistic information into account in addition to visual cues. We propose Masked Vision-Language Transformers (MVLT), a novel model that captures both explicit and implicit linguistic information. Its encoder is a Vision Transformer, and its decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune the model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
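
The abstract mentions a masking-based pretraining stage but does not spell out the masking procedure. As a rough illustration only, the sketch below shows generic random patch masking of the kind commonly used in masked-image pretraining. The function names, the 16-patch input, and the 0.75 mask ratio are all hypothetical and are not taken from the paper or its repository.

```python
import random
from typing import List, Optional, Sequence


def random_patch_mask(num_patches: int, mask_ratio: float,
                      seed: Optional[int] = None) -> List[bool]:
    """Return one flag per patch; True means the patch is hidden.

    Generic illustration of random masking, NOT the MVLT authors' code.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    # Choose which patch indices to hide, without replacement.
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def visible_patches(patches: Sequence, mask: Sequence[bool]) -> list:
    """Keep only the patches that were not masked."""
    return [p for p, hidden in zip(patches, mask) if not hidden]


# Hypothetical usage: 16 patch placeholders, 75% of them hidden.
mask = random_patch_mask(num_patches=16, mask_ratio=0.75, seed=0)
kept = visible_patches(list(range(16)), mask)
```

In a setup like this, an encoder would see only the patches in `kept`, and a decoder would be trained to recover the hidden content. How MVLT combines image masking with text is described in the paper itself.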

Authors (5)

Jie Wu
Ying Peng
Shengming Zhang
Weigang Qi
Jian Zhang

Citations (3)

GitHub

GitHub - onealwj/MVLT: PyTorch implementation of BMVC2022 paper Masked Vision-Language Transformers for Scene Text Recognition (29 stars)
