Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning (2403.00816v3)
Abstract: Understanding the contents of multimodal documents is essential for accurately extracting relevant evidence and using it for reasoning. Existing document understanding models tend to generate a single-word or single-phrase answer directly, ignoring the source document's evidence and lacking interpretability. In this work, we address the lack of step-wise capabilities through data augmentation and extension. Specifically, we use Multi-modal LLMs (MLLMs), which have strong visual understanding and reasoning abilities, as data generators to produce step-wise question-and-answer pairs for document images, and a high-performance LLM as an error detector to filter out noisy data. This step-wise data generation pipeline is implemented using both template-based and few-shot methods. We then use the generated high-quality data to train DocAssistant, a human-like document understanding and reasoning model designed to solve complex questions that require reasoning or multi-hop question answering. Experimental results demonstrate the effectiveness and application value of step-wise generation, showing a 5% improvement on InfoVQA, which has complex layouts, and a 7% improvement on ChartQA, which requires complex reasoning, compared to directly generated answers. We hope our work highlights the potential of synthetic data and encourages further exploration of multi-modal document reasoning capabilities.
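The pipeline described in the abstract is essentially a generate-then-filter loop: an MLLM produces step-wise question/answer pairs for each document image, and a text-only LLM acts as an error detector that keeps only consistent samples for training. The Python sketch below illustrates one possible shape of that loop under stated assumptions; the prompts, the `mllm_generate`/`llm_judge` callables, and the `Question:`/`Step`/`Answer:` parsing format are illustrative placeholders, not the authors' released code or APIs.

```python
# Minimal sketch of a generate-then-filter pipeline for step-wise QA data.
# `mllm_generate` and `llm_judge` are assumed wrappers around a multimodal LLM
# and a text-only LLM; their interfaces are hypothetical.

from dataclasses import dataclass

@dataclass
class StepwiseSample:
    image_path: str      # source document image
    question: str        # question requiring reasoning / multi-hop QA
    steps: list[str]     # intermediate evidence-extraction and reasoning steps
    answer: str          # final short answer

GEN_PROMPT = (
    "Given this document image, write a question that requires multi-step "
    "reasoning, then answer it step by step: first quote the evidence from "
    "the document, then reason over it, then give the final answer."
)

JUDGE_PROMPT = (
    "You are an error detector. Check whether the reasoning steps are "
    "consistent with the quoted evidence and support the final answer. "
    "Reply only 'keep' or 'discard'."
)

def parse_stepwise(raw_text, image_path):
    """Rough parser: expects 'Question:', 'Step ...', and 'Answer:' markers."""
    question, steps, answer = None, [], None
    for line in (l.strip() for l in raw_text.splitlines() if l.strip()):
        low = line.lower()
        if low.startswith("question:"):
            question = line.split(":", 1)[1].strip()
        elif low.startswith("step"):
            steps.append(line)
        elif low.startswith("answer:"):
            answer = line.split(":", 1)[1].strip()
    if question and steps and answer:
        return StepwiseSample(image_path, question, steps, answer)
    return None

def build_stepwise_dataset(image_paths, mllm_generate, llm_judge):
    """Generate step-wise QA pairs with an MLLM, then filter them with an LLM judge."""
    dataset = []
    for path in image_paths:
        raw = mllm_generate(path, GEN_PROMPT)           # MLLM as data generator
        sample = parse_stepwise(raw, path)              # split into question/steps/answer
        if sample is None:
            continue                                    # drop malformed generations
        verdict = llm_judge(JUDGE_PROMPT, raw)          # LLM as error detector
        if verdict.strip().lower().startswith("keep"):  # keep only consistent samples
            dataset.append(sample)
    return dataset
```

The abstract mentions both template-based and few-shot variants of the generation step; in this sketch that amounts to swapping `GEN_PROMPT` for a few-shot prompt containing worked step-wise examples, while the judging and filtering stages stay the same.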
- Jinxu Zhang