
Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (2212.09621v1)

Published 19 Dec 2022 in cs.CL and cs.CV

Abstract: Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, which can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that our Wukong-Reader has superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.
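
To make the idea of textline-region contrastive alignment concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over paired region and text embeddings. This is not the paper's exact objective: the abstract does not give the formulation, so the function name `symmetric_contrastive_loss`, the cosine-similarity scoring, and the `temperature` default of 0.07 are all assumptions chosen for illustration. It assumes row i of each input is a matched (visual region, textline text) pair and that no embedding is the zero vector.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single row of logits."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's objective).

    region_embs[i] and text_embs[i] are assumed to describe the same textline;
    every other pairing in the batch is treated as a negative.
    """
    regions = [_normalize(v) for v in region_embs]
    texts = [_normalize(v) for v in text_embs]
    n = len(regions)

    # Cosine similarity between every region and every text, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

    # Region-to-text direction: each region should pick out its own text.
    region_to_text = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-region direction: each text should pick out its own region.
    text_to_region = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    return 0.5 * (region_to_text + text_to_region)
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior a contrastive alignment objective relies on.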

Authors (12)
  1. Haoli Bai (24 papers)
  2. Zhiguang Liu (5 papers)
  3. Xiaojun Meng (23 papers)
  4. Wentao Li (40 papers)
  5. Shuang Liu (107 papers)
  6. Nian Xie (5 papers)
  7. Rongfu Zheng (1 paper)
  8. Liangwei Wang (11 papers)
  9. Lu Hou (50 papers)
  10. Jiansheng Wei (10 papers)
  11. Xin Jiang (242 papers)
  12. Qun Liu (230 papers)
Citations (9)