mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (2403.12895v1)

Published 19 Mar 2024 in cs.CV

Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal LLMs (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.

Unified Structure Learning for OCR-free Document Understanding with DocOwl 1.5

Introduction to Unified Structure Learning

In the quest to enhance the capabilities of Multimodal LLMs (MLLMs) in understanding text-rich document images without relying on Optical Character Recognition (OCR), this paper introduces Unified Structure Learning and presents DocOwl 1.5, a model that significantly improves upon the state-of-the-art. The principal innovation lies in the comprehensive approach to encoding structure information across different types of text-rich images, including documents, tables, charts, webpages, and natural images. Traditional MLLMs struggle with such images due to their reliance on visual encoders trained predominantly on natural image-text pairs, which do not optimally represent the textual and structural intricacies of document images.

Key Contributions

The contributions of this work are manifold:

  • Introduction of Unified Structure Learning which comprises structure-aware parsing tasks and multi-grained text localization tasks, covering a broad spectrum of document types.
  • Design of a highly effective vision-to-text module, termed H-Reducer, which efficiently processes high-resolution images while preserving vital layout information.
  • Construction of a novel dataset, DocStruct4M, specifically designed to support Unified Structure Learning, alongside a reasoning tuning dataset, DocReason25K, aimed at eliciting the model's detailed explanation capabilities.
  • Demonstrated superiority of DocOwl 1.5 over existing models, achieving state-of-the-art results on 10 visual document understanding benchmarks and improving on the previous best MLLMs with a 7B LLM by more than 10 points on 5 of the 10.

The Innovation of Unified Structure Learning

Unified Structure Learning is at the heart of DocOwl 1.5's advancements. It focuses on understanding not just the text but also the structure within text-rich images, through structure-aware parsing and multi-grained text localization across diverse domains. For structure-aware parsing, the model learns to interpret documents, tables, charts, webpages, and natural images by encoding structural cues such as line feeds and spaces, and by using an extended Markdown syntax to represent complex structures like tables and charts. In doing so, it enhances the model's comprehension of document layout beyond mere text recognition.
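
As a hedged illustration of the idea, the Python snippet below shows what structure-aware parsing targets could look like: whitespace mirrors the reading layout of a document, and a Markdown-like row/column syntax represents a table (or a chart rendered as a table). The concrete formats are assumptions for illustration, not the exact sequences used in DocStruct4M.

```python
# Hypothetical structure-aware parsing targets; the concrete formats are
# illustrative assumptions, not the paper's exact specification.

# Document parsing: line feeds and spaces preserve the reading layout.
document_target = (
    "ACME Corp.    Quarterly Report\n"
    "Revenue grew 12% compared to the previous quarter.\n"
)

# Table (or chart-as-table) parsing: a Markdown-like syntax keeps
# row and column structure explicit for the LLM.
table_target = (
    "| Quarter | Revenue | Growth |\n"
    "| Q1 | 1.2M | 8% |\n"
    "| Q2 | 1.4M | 12% |"
)

print(document_target)
print(table_target)
```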

The multi-grained text localization tasks enrich the model's precision in correlating text to its spatial context within images. This dual approach, bridging text recognition and structural understanding, equips the model to tackle a wide array of visual document understanding tasks.
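
To make the pairing concrete, here is a minimal sketch of what multi-grained text localization samples might look like: text grounding (predict the box for a given text span) and text recognition (read the text inside a given box) at different granularities. The field names, prompt wording, and normalized-coordinate convention are assumptions for illustration only.

```python
# Hypothetical training samples pairing text spans with bounding boxes at
# several granularities (word, phrase, line, block). Field names, prompts,
# and the normalized 0-1 coordinate convention are illustrative assumptions.
localization_samples = [
    {   # text grounding: given the text, predict its bounding box
        "task": "text_grounding",
        "granularity": "phrase",
        "prompt": "Where is the text 'Total Revenue' located?",
        "target": "[0.13, 0.21, 0.29, 0.24]",
    },
    {   # text recognition: given a bounding box, read the text inside it
        "task": "text_recognition",
        "granularity": "line",
        "prompt": "What text is inside the region [0.12, 0.54, 0.88, 0.57]?",
        "target": "All figures are reported in millions of USD.",
    },
]

for sample in localization_samples:
    print(sample["task"], "->", sample["target"])
```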

Architectural Advancements

DocOwl 1.5 leverages H-Reducer, a vision-to-text module crafted for balancing efficiency with the retention of spatial and layout information critical for high-resolution document image processing. Unlike traditional modules that either elongate visual feature sequences or compromise spatial information fidelity, H-Reducer employs convolution to aggregate horizontally adjacent visual features. This significantly reduces visual feature sequence lengths while maintaining the relative positional relationships essential for accurately interpreting text-rich documents.
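
The following PyTorch sketch conveys the core idea under stated assumptions: a 1×K convolution (K = 4 here) merges K horizontally adjacent patch features, shrinking the visual sequence by a factor of K while keeping each row intact, before projecting the result into the LLM embedding space. The dimensions, K, and layer choices are illustrative, not the released configuration.

```python
import torch
import torch.nn as nn


class HReducerSketch(nn.Module):
    """Minimal sketch of a horizontal feature reducer (assumed configuration).

    A (1, K) convolution with stride (1, K) merges K horizontally adjacent
    patch features, so the sequence length drops by a factor of K while the
    row-wise layout is preserved. Hidden sizes and K are illustrative.
    """

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096, k: int = 4):
        super().__init__()
        # Convolve only along the width: kernel and stride are (1, K).
        self.conv = nn.Conv2d(vis_dim, vis_dim, kernel_size=(1, k), stride=(1, k))
        # Project merged features into the LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # patch_feats: (batch, h * w, vis_dim) from the vision encoder
        b, n, c = patch_feats.shape
        assert n == h * w, "sequence length must equal h * w"
        x = patch_feats.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
        x = self.conv(x)                                     # (B, C, H, W // k)
        x = x.flatten(2).transpose(1, 2)                     # (B, H * W // k, C)
        return self.proj(x)                                  # (B, H * W // k, llm_dim)


# Example: a 32 x 32 grid of patch features (1024 tokens) becomes 256 tokens.
feats = torch.randn(1, 32 * 32, 1024)
reduced = HReducerSketch()(feats, h=32, w=32)
print(reduced.shape)  # torch.Size([1, 256, 4096])
```

The design choice worth noting is that merging only along the width keeps each text line's left-to-right order and the vertical ordering of lines, which is why the relative positional relationships survive the compression.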

Comprehensive Dataset Construction

The creation of the DocStruct4M and DocReason25K datasets is a pivotal step for training and evaluating OCR-free document understanding models. DocStruct4M supports Unified Structure Learning by offering a rich compilation of structure-aware sequences and multi-grained pairs of text and bounding boxes, spanning varied document types. Concurrently, DocReason25K refines the model's ability to generate detailed explanations by providing high-quality instruction tuning focused on reasoning within the document domain.
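
As a hedged sketch of how the two resources differ: DocStruct4M supplies parsing and localization targets like those shown earlier, whereas a DocReason25K-style record couples a question about a document image with a detailed, step-wise explanation as the answer. The record below is invented purely for illustration; the field names and content are assumptions, not actual dataset entries.

```python
# Hypothetical DocReason25K-style record: a question about a document image
# plus a detailed explanation as the answer. Field names and example content
# are illustrative assumptions, not real dataset entries.
reasoning_record = {
    "image": "example_report_page.png",  # assumed image path
    "question": "Which quarter had the highest growth, and why?",
    "answer": (
        "The table lists growth of 8% in Q1 and 12% in Q2. "
        "Since 12% is larger than 8%, Q2 had the highest growth."
    ),
}

print(reasoning_record["answer"])
```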

Empirical Validation and Theoretical Implications

DocOwl 1.5's empirical achievements underscore its unprecedented capabilities in visual document understanding tasks. Achieving a significant performance leap across 10 visual document understanding benchmarks, DocOwl 1.5 not only sets new performance standards but also highlights the efficacy of Unified Structure Learning in holistically parsing and understanding diverse document types without OCR dependency.

This research holds practical and theoretical implications, paving the way for enhanced document understanding that could redefine OCR-free MLLM applications in various domains. Looking ahead, it opens avenues for exploring novel multimodal learning strategies that could further bridge the gap between human-like understanding and AI in processing complex visual documents.

Conclusion

In summary, this work's innovative approach to Unified Structure Learning, coupled with the introduction of H-Reducer and the meticulous assembly of specialized datasets, propels DocOwl 1.5 to the forefront of OCR-free visual document understanding. It signifies a substantial advancement in the field, offering a robust foundation for future explorations aimed at further unraveling the intricacies of multimodal understanding in text-rich image contexts.

Authors (11)
  1. Anwen Hu
  2. Haiyang Xu
  3. Jiabo Ye
  4. Ming Yan
  5. Liang Zhang
  6. Bo Zhang
  7. Chen Li
  8. Ji Zhang
  9. Qin Jin
  10. Fei Huang
  11. Jingren Zhou