UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (2310.05126v1)

Published 8 Oct 2023 in cs.CV and cs.AI

Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of the MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Code and instruction-tuning datasets will be released.
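
The abstract mentions a shape-adaptive cropping module that lets a low-resolution encoder handle high-resolution images, but it gives no details. The following is only a rough sketch of the general idea, not the paper's implementation: it assumes the module picks a grid of crops whose shape roughly matches the image, and every name and parameter here (`choose_grid`, `crop_boxes`, `encoder_res`, `max_cells`) as well as the scoring rule is hypothetical.

```python
import math
from typing import List, Tuple


def choose_grid(width: int, height: int,
                encoder_res: int = 224, max_cells: int = 9) -> Tuple[int, int]:
    """Pick a (rows, cols) grid for an image (illustrative assumption only).

    Primary criterion: the grid's aspect ratio should match the image's.
    Tie-break: the number of cells should be close to how many
    encoder-sized tiles the image area would hold.
    """
    target_ratio = width / height
    ideal_cells = max(1, round(width * height / encoder_res ** 2))
    best_key, best_grid = None, (1, 1)
    for rows in range(1, max_cells + 1):
        for cols in range(1, max_cells // rows + 1):
            # Log-space error treats 2:1 and 1:2 mismatches symmetrically.
            ratio_err = abs(math.log((cols / rows) / target_ratio))
            key = (ratio_err, abs(rows * cols - ideal_cells))
            if best_key is None or key < best_key:
                best_key, best_grid = key, (rows, cols)
    return best_grid


def crop_boxes(width: int, height: int,
               rows: int, cols: int) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes that tile the image."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * width // cols, r * height // rows,
                          (c + 1) * width // cols, (r + 1) * height // rows))
    return boxes


if __name__ == "__main__":
    rows, cols = choose_grid(448, 224)
    print(rows, cols, crop_boxes(448, 224, rows, cols))
```

In a pipeline along these lines, each box would be cropped, resized to the encoder's input resolution, and encoded independently by the frozen vision encoder. How UReader actually selects the grid and combines the crops is described in the paper itself, not here.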

Authors (14)
  1. Jiabo Ye (17 papers)
  2. Anwen Hu (22 papers)
  3. Haiyang Xu (67 papers)
  4. Qinghao Ye (31 papers)
  5. Ming Yan (190 papers)
  6. Guohai Xu (21 papers)
  7. Chenliang Li (92 papers)
  8. Junfeng Tian (19 papers)
  9. Qi Qian (54 papers)
  10. Ji Zhang (176 papers)
  11. Qin Jin (94 papers)
  12. Liang He (202 papers)
  13. Xin Alex Lin (1 paper)
  14. Fei Huang (408 papers)
Citations (62)