mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (2307.02499v1)

Published 4 Jul 2023 in cs.CL and cs.AI

Abstract: Document understanding refers to automatically extracting, analyzing, and comprehending information from various types of digital documents, such as web pages. Existing Multimodal LLMs (MLLMs), including mPLUG-Owl, have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition, indicating their potential for OCR-free document understanding. Nevertheless, without in-domain training, these models tend to ignore fine-grained OCR features, such as sophisticated tables or large blocks of text, which are essential for OCR-free document understanding. In this paper, we propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. Our code, models, training data and evaluation set are available at https://github.com/X-PLUG/mPLUG-DocOwl.

Modularized Multimodal LLM for Document Understanding

The paper introduces mPLUG-DocOwl, an advanced extension of mPLUG-Owl, focusing on OCR-free document understanding. This work addresses the limitations of existing Multimodal LLMs (MLLMs), which struggle with intricate document features without in-domain training. To overcome these challenges, mPLUG-DocOwl adopts a modular architecture and a unified instruction tuning strategy, enhancing its capability in document understanding tasks without relying on OCR.

Methodology

mPLUG-DocOwl leverages a modular framework that integrates visual and textual knowledge while keeping separate modules for vision and language. Following mPLUG-Owl, the framework includes a visual abstractor that aligns visual features with the LLM's input space. Crucially, the model is jointly tuned on three kinds of data: language-only, general vision-and-language, and document instruction tuning datasets. This approach diversifies the model's capabilities and improves its zero-shot performance across tasks.
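
The visual abstractor can be pictured as a small set of learnable query tokens that cross-attend to the image encoder's patch features and are projected into the LLM's embedding space. The code below is a minimal, hedged sketch of that idea; the module name, dimensions, and number of queries are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Illustrative sketch: learnable query tokens cross-attend to frozen
    image-encoder features and are projected into the LLM embedding space.
    All hyperparameters here are assumptions for demonstration."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vis_dim) from the vision encoder
        b = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        # (batch, num_queries, llm_dim) visual tokens to prepend to text embeddings
        return self.proj(attended)

# Usage: the returned visual tokens would be concatenated with text token
# embeddings before being fed to the (frozen or LoRA-tuned) language model.
abstractor = VisualAbstractor()
visual_tokens = abstractor(torch.randn(2, 256, 1024))
print(visual_tokens.shape)  # torch.Size([2, 64, 4096])
```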

A significant contribution is the construction of an instruction tuning dataset covering tasks such as Visual Question Answering (VQA), Information Extraction (IE), and Natural Language Inference (NLI), all converted into a unified format compatible with the existing mPLUG-Owl architecture. Additionally, the paper introduces the LLMDoc evaluation set, specifically designed to assess instruction compliance and document understanding, distinguishing it from existing benchmarks.
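
To make the unified format concrete, here is a hedged sketch of how a DocVQA-style sample might be converted into a single image-plus-instruction record. The "Human:/AI:" template and the field names are assumptions modeled on mPLUG-Owl-style conversational prompts, not the paper's exact schema.

```python
# Hypothetical conversion of a VQA-style record into a unified instruction sample.
def to_instruction(sample: dict) -> dict:
    prompt = (
        "Human: <image>\n"           # image placeholder token (assumed)
        f"Human: {sample['question']}\n"
        "AI: "
    )
    return {
        "image": sample["image_path"],
        "prompt": prompt,
        "target": sample["answer"],
    }

example = {
    "image_path": "docvqa/train/doc_00001.png",  # hypothetical path
    "question": "What is the invoice total?",
    "answer": "$1,250.00",
}
print(to_instruction(example)["prompt"])
```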

Experimental Results

The research demonstrates that mPLUG-DocOwl attains superior OCR-free performance on several standard benchmarks and across diverse domains. Notably, the model shows strong results in document, table, chart, natural image, and webpage understanding. mPLUG-DocOwl consistently outperforms previous models like Dessurt, Donut, and Pix2Struct, evidencing its enhanced text understanding and layout comprehension abilities. In qualitative analyses, the model excels in understanding complex document layouts and extracting precise information from visual inputs.

Additionally, human evaluations on the LLMDoc set reflect mPLUG-DocOwl's capability to produce high-quality responses and handle complex interactions. The evaluation also indicates that, while mPLUG-DocOwl demonstrates marked improvements, challenges remain, particularly in tasks requiring common-sense reasoning and creative content generation.

Implications and Future Work

mPLUG-DocOwl's advances have significant implications for AI-driven document analysis and processing automation. Its modular design and integrated training regimen expand the applicability of MLLMs to real-world scenarios, where diverse document types and unstructured data prevail.

Future research may explore enhancing the model's reasoning and arithmetic capabilities, addressing current limitations in handling intricate semantic relationships and multi-step problem-solving. Furthermore, incorporating real-world feedback loops or continual learning mechanisms might bolster the model's adaptability to dynamically changing document environments.

Overall, mPLUG-DocOwl contributes a robust framework to the field of document understanding, setting a precedent for future developments in OCR-free multimodal models. Its performance improvements and methodological innovations lay a foundation for subsequent exploration in integrating advanced linguistic capabilities into document processing tasks.

References (37)
  1. VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
  2. DUE: end-to-end document understanding benchmark. In NeurIPS Datasets and Benchmarks, 2021.
  3. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
  4. TabFact: A large-scale dataset for table-based fact verification. In International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020.
  5. End-to-end document recognition and understanding with Dessurt. In ECCV Workshops (4), volume 13804 of Lecture Notes in Computer Science, pages 280–296. Springer, 2022.
  6. Question-controlled text-aware image captioning. In ACM Multimedia, pages 3097–3105. ACM, 2021.
  7. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  8. LayoutLMv3: Pre-training for document AI with unified text and image masking. In ACM Multimedia, pages 4083–4091. ACM, 2022.
  9. OCR-free document understanding transformer. In ECCV (28), volume 13688 of Lecture Notes in Computer Science, pages 498–517. Springer, 2022.
  10. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. CoRR, abs/2210.03347, 2022.
  11. mPLUG: Effective and efficient vision-language learning by cross-modal skip-connections. In EMNLP, pages 7241–7259. Association for Computational Linguistics, 2022.
  12. Visual instruction tuning. CoRR, abs/2304.08485, 2023a.
  13. On the hidden mystery of OCR in large multimodal models. arXiv preprint arXiv:2305.07895, 2023b.
  14. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In ACL (Findings), pages 2263–2279. Association for Computational Linguistics, 2022.
  15. DocVQA: A dataset for VQA on document images. In WACV, pages 2199–2208. IEEE, 2021.
  16. InfographicVQA. In WACV, pages 2582–2591. IEEE, 2022.
  17. OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
  18. P. Pasupat and P. Liang. Compositional semantic parsing on semi-structured tables. In ACL (1), pages 1470–1480. The Association for Computer Linguistics, 2015.
  19. BLOOM: A 176b-parameter open-access multilingual language model. CoRR, abs/2211.05100, 2022.
  20. Textcaps: A dataset for image captioning with reading comprehension. In ECCV (2), volume 12347 of Lecture Notes in Computer Science, pages 742–758. Springer, 2020.
  21. Towards VQA models that can read. In CVPR, pages 8317–8326. Computer Vision Foundation / IEEE, 2019.
  22. Kleister: Key information extraction datasets involving long documents with complex layouts. In ICDAR (1), volume 12821 of Lecture Notes in Computer Science, pages 564–579. Springer, 2021.
  23. S. Svetlichnaya. Deepform: Understand structured documents at scale, 2020.
  24. VisualMRC: Machine reading comprehension on document images. In AAAI, pages 13878–13888. AAAI Press, 2021.
  25. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19254–19264, 2023.
  26. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  27. LLaMA: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
  28. Vicuna: An open chatbot impressing GPT-4. https://github.com/lm-sys/FastChat, 2023.
  29. Self-Instruct: Aligning language models with self-generated instructions. CoRR, abs/2212.10560, 2022. doi: 10.48550/arXiv.2212.10560. URL https://doi.org/10.48550/arXiv.2212.10560.
  30. Visual ChatGPT: Talking, drawing and editing with visual foundation models. CoRR, abs/2303.04671, 2023.
  31. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. CoRR, abs/2304.01196, 2023a.
  32. mPLUG-2: A modularized multi-modal foundation model across text, image and video. CoRR, abs/2302.00402, 2023b.
  33. LayoutLM: Pre-training of text and layout for document image understanding. In R. Gupta, Y. Liu, J. Tang, and B. A. Prakash, editors, KDD ’20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, pages 1192–1200. ACM, 2020. doi: 10.1145/3394486.3403172. URL https://doi.org/10.1145/3394486.3403172.
  34. TAP: Text-aware pre-training for text-VQA and text-caption. In CVPR, pages 8751–8761. Computer Vision Foundation / IEEE, 2021.
  35. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. CoRR, abs/2303.11381, 2023.
  36. mPLUG-Owl: Modularization empowers large language models with multimodality. CoRR, abs/2304.14178, 2023.
  37. MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
Authors (13)
  1. Jiabo Ye (17 papers)
  2. Anwen Hu (22 papers)
  3. Haiyang Xu (67 papers)
  4. Qinghao Ye (31 papers)
  5. Ming Yan (190 papers)
  6. Yuhao Dan (4 papers)
  7. Chenlin Zhao (3 papers)
  8. Guohai Xu (21 papers)
  9. Chenliang Li (92 papers)
  10. Junfeng Tian (19 papers)
  11. Qian Qi (8 papers)
  12. Ji Zhang (176 papers)
  13. Fei Huang (408 papers)
Citations (89)