
Masked Vision-Language Transformers for Scene Text Recognition

Published 9 Nov 2022 in cs.CV | (2211.04785v1)

Abstract: Scene text recognition (STR) enables computers to recognize and read text in a variety of real-world scenes. Recent STR models benefit from taking linguistic information into account in addition to visual cues. We propose Masked Vision-Language Transformers (MVLT), a novel model that captures both explicit and implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune the model and adopt an iterative correction method to improve performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
