InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (2401.13313v1)
Abstract: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and LLMs through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
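The abstract describes InstructDr as a trainable bridging module that connects a frozen document image encoder to a frozen LLM and is conditioned on instructions. Below is a minimal, hypothetical sketch of such a bridging module, in the spirit of Q-Former-style adapters (BLIP-2/InstructBLIP); the class name, dimensions, and use of learnable query tokens with cross-attention are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (NOT the authors' code) of a trainable bridging module that maps
# frozen image-encoder features into soft prompt tokens for a frozen LLM.
# All names and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class BridgingModule(nn.Module):
    """Trainable adapter between a frozen image encoder and a frozen LLM."""

    def __init__(self, vision_dim=1024, llm_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        # Learnable query tokens that gather document/visual information.
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        # Project frozen image-encoder features into the LLM embedding space.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Cross-attention: query tokens attend over projected image features.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, vision_dim) from the frozen encoder.
        v = self.vision_proj(image_feats)
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, v, v)
        # The returned tokens would be prepended to the instruction embeddings
        # fed to the frozen LLM; only this module's parameters are trained.
        return self.norm(out)

if __name__ == "__main__":
    bridge = BridgingModule()
    fake_feats = torch.randn(2, 257, 1024)  # e.g., ViT patch features (assumed shape)
    soft_prompts = bridge(fake_feats)
    print(soft_prompts.shape)               # torch.Size([2, 32, 4096])
```

Under this assumed design, generalization to new tasks comes from keeping the image encoder and LLM frozen while training only the lightweight adapter on instruction-formatted VDU data.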
Authors: Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki