InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (2401.13313v1)

Published 24 Jan 2024 in cs.CV and cs.CL

Abstract: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and LLMs through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.

Introduction

The burgeoning field of Visual Document Understanding (VDU) calls for robust models capable of handling a diversity of document-related tasks. Recent research has therefore concentrated on improving models' ability to interpret the intricate relationships between textual and visual objects within documents. Despite this focus, building a universal model that transfers knowledge across document types, formats, and tasks remains a significant challenge: most visual instruction-tuning datasets and models focus primarily on scene images or cannot adapt to a wide array of VDU tasks. To bridge this gap, this work pairs human-written instructions with visual documents to drive model generalization to unseen VDU tasks.

InstructDoc Dataset and Model Advancement

The paper introduces InstructDoc, a pioneering dataset designed to foster zero-shot generalization on VDU tasks through instructions. InstructDoc covers 12 tasks drawn from 30 diverse datasets, all formulated within a uniform instruction schema. The schema requires models to exercise a complex set of competencies, such as grasping document layouts and interpreting visual representations of text and objects. Building on this dataset, the authors develop InstructDr, a model that connects document images, image encoders, and LLMs via a trainable bridging module called the Document-former. This module transforms documents into representations digestible by the LLM, enhancing zero-shot performance across VDU tasks when instructions are supplied.
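To make the idea of a unified instruction schema concrete, the following is a minimal sketch of how one instruction-formatted VDU example could be represented. The field names (instruction, document_images, ocr_tokens, and so on) and the sample values are illustrative assumptions, not the dataset's actual keys.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VDUExample:
    """One instruction-formatted VDU example.
    Field names are assumptions, not the official InstructDoc keys."""
    task: str                     # e.g. "document QA", "information extraction"
    dataset: str                  # source dataset, e.g. "DocVQA"
    instruction: str              # human-written instruction describing the task
    document_images: List[str]    # paths to one or more page images
    ocr_tokens: List[str] = field(default_factory=list)        # OCR words, if available
    ocr_boxes: List[List[int]] = field(default_factory=list)   # layout box per token
    query: str = ""               # task input, e.g. a question
    answer: str = ""              # target output text

# A hypothetical DocVQA-style example cast into the unified format.
example = VDUExample(
    task="document QA",
    dataset="DocVQA",
    instruction="Read the document image and answer the question using its text and layout.",
    document_images=["invoice_page1.png"],
    ocr_tokens=["Invoice", "Total:", "$1,250.00"],
    ocr_boxes=[[40, 30, 160, 60], [40, 400, 110, 430], [120, 400, 230, 430]],
    query="What is the total amount?",
    answer="$1,250.00",
)
print(example.instruction)
```

Casting every dataset into one such schema is what lets a single model be trained and evaluated across heterogeneous VDU tasks purely by swapping the instruction text.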

Architectural Innovations and Empirical Evaluations

InstructDr, through its Document-former, is adept at mapping visual and textual document features into a space interpretable by an LLM. Experimental results reveal that InstructDr significantly surpasses the zero-shot performance of current multimodal LLMs and outperforms ChatGPT in numerous VDU tasks when aided by instructions. Such outcomes underscore the efficacy of instructions in improving model generalization and robustness. The model's architecture also supports multi-page document comprehension by encoding multiple document images in parallel, thereby enabling intricate reasoning across pages.
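A minimal sketch of how such a bridging module could work, assuming a Q-Former-style design in which learnable query tokens cross-attend to frozen image-encoder features and are then projected into the LLM's embedding space. The class name, dimensions, and the way multiple pages are folded into the batch are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    """Illustrative Document-former-style bridge (not the authors' code):
    learnable queries attend over frozen per-page image features and are
    projected to the LLM embedding dimension as soft prompts."""
    def __init__(self, img_dim=1024, hidden_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, page_features):
        # page_features: (batch, pages, patches, img_dim) from a frozen image encoder
        b, p, n, _ = page_features.shape
        # Encode every page in parallel by folding pages into the batch dimension.
        feats = self.img_proj(page_features.reshape(b * p, n, -1))
        q = self.queries.unsqueeze(0).expand(b * p, -1, -1)
        out, _ = self.cross_attn(q, feats, feats)             # (b*p, num_queries, hidden)
        out = out.reshape(b, p * out.shape[1], -1)            # concatenate pages' queries
        return self.to_llm(out)                               # soft prompts for the LLM

# Toy usage: 2 documents, 3 pages each, 196 patches per page.
bridge = BridgingModule()
fake_features = torch.randn(2, 3, 196, 1024)
soft_prompts = bridge(fake_features)
print(soft_prompts.shape)  # torch.Size([2, 96, 4096])
```

Under this reading, the pooled query outputs from all pages are concatenated and prepended to the instruction tokens, which is what allows the frozen LLM to reason across pages without any change to its own weights.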

Critical Reflections and Future Prospects

Despite its merits, the research acknowledges limitations, including a dependency on OCR quality and the inability to account for correlations among multiple document-text pairs. Furthermore, the possibility of enriching the dataset with automated instruction generation and augmentation remains unexplored. In summary, InstructDoc and InstructDr mark a pivotal stride toward general-purpose VDU models that comprehend and execute tasks guided by natural language instructions, a contribution that arguably sets a new benchmark for subsequent work in the discipline.

Authors (5)
  1. Ryota Tanaka
  2. Taichi Iki
  3. Kyosuke Nishida
  4. Kuniko Saito
  5. Jun Suzuki
Citations (16)