MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining (2206.00311v3)

Published 1 Jun 2022 in cs.CV

Abstract: Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework. We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images, which allows us to learn strong visual representations. In contrast to introducing linguistic knowledge with an additional language model, we directly pre-train the sequence decoder. Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme. Significantly, the encoder is frozen during the pre-training phase of the sequence decoder. Experimental results demonstrate that our proposed method achieves superior performance on benchmark datasets, including Chinese and English text images.
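The abstract describes a two-stage recipe: (1) masked image modeling (MIM) pre-training of the feature encoder on unlabeled real text images, then (2) pre-training the sequence decoder on synthesized text images with the encoder frozen. The core corruption step in MIM-style pre-training is random patch masking; a minimal sketch follows (function name, mask token, and the mask ratio are illustrative assumptions, not the paper's exact settings):

```python
import random

def mask_patches(patches, mask_ratio=0.7, mask_token="[M]", seed=0):
    """Randomly mask a fraction of image patches, MIM-style.

    Returns the corrupted patch sequence and the sorted indices of the
    masked positions; during pre-training, the model is asked to
    reconstruct the original content only at those positions.
    The mask token and 0.7 ratio here are illustrative defaults.
    """
    rng = random.Random(seed)
    n = len(patches)
    n_mask = int(n * mask_ratio)
    masked_idx = sorted(rng.sample(range(n), n_mask))
    corrupted = list(patches)
    for i in masked_idx:
        corrupted[i] = mask_token  # replace the patch with a mask token
    return corrupted, masked_idx

# Example on a toy 8-patch sequence:
corrupted, idx = mask_patches(list("ABCDEFGH"), mask_ratio=0.5, seed=1)
```

The same masking idea carries over to stage two: text is rendered into synthesized text images, characters in the image are masked, and the (frozen-encoder) decoder learns to predict the missing characters, which is how the paper injects language-modeling capability without a separate language model.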

Authors (10)
  1. Pengyuan Lyu (19 papers)
  2. Chengquan Zhang (29 papers)
  3. Shanshan Liu (32 papers)
  4. Meina Qiao (1 paper)
  5. Yangliu Xu (2 papers)
  6. Liang Wu (138 papers)
  7. Kun Yao (32 papers)
  8. Junyu Han (53 papers)
  9. Errui Ding (156 papers)
  10. Jingdong Wang (236 papers)
Citations (37)